tskit-dev / tstrait Goto Github PK
View Code? Open in Web Editor NEWQuantitative trait simulation for ARGs
Home Page: https://tskit.dev/software/tstrait
License: MIT License
Quantitative trait simulation for ARGs
Home Page: https://tskit.dev/software/tstrait
License: MIT License
In the documentation, add an example of
The current API assumes we provide num_causal_sites
. It might make more sense to turn this into causal_sites
and allow both integer values as well as numpy arrays. The case of an integer is the currently implemented behaviour. In case of an array, these should all be sites contained in the sites table. We would then no longer randomly pick sites.
Alternatively, it might make more sense to provide an example in the documentation on how to mask certain regions of the genome with tree sequences, in case we have knowledge of coding vs non-coding regions for example.
See job specification at https://github.com/tskit-dev/tskit/blob/main/.github/workflows/tests.yml#L9
I assume that tstrait
relies on (a) individuals being present in the tree sequence (in particular, each sample genome must belong to an individual) and (b) individuals being diploid?
It would be useful to clarify this somewhere, and state what happens e.g. if individuals are haploid, triploid, or whatever.
#27 is probably relevant here too.
We should be testing our results statistically by comparing them against other simulators. There's no reason we can do this at the small scale.
It shouldn't be too hard to use e.g. PhenotypeSimulator on some simple simulations, and qqplot the results as a comparison with our results.
So, we would do something like:
This will take some work to do, but is an important validation step.
The docs build is currently failing due to a variety of issues with the source:
/home/runner/work/tstrait/tstrait/tstrait/simulate_phenotype.py:docstring of tstrait.simulate_phenotype.sim_phenotype:4: CRITICAL: Unexpected section title.
Parameters
----------
/home/runner/work/tstrait/tstrait/tstrait/simulate_phenotype.py:docstring of tstrait.simulate_phenotype.sim_phenotype:23: CRITICAL: Unexpected section title.
Returns
-------
/home/runner/work/tstrait/tstrait/tstrait/simulate_phenotype.py:docstring of tstrait.simulate_phenotype.sim_phenotype:28: CRITICAL: Unexpected section title.
Raises
------
/home/runner/work/tstrait/tstrait/tstrait/simulate_phenotype.py:docstring of tstrait.simulate_phenotype.sim_phenotype:35: CRITICAL: Unexpected section title.
See Also
--------
/home/runner/work/tstrait/tstrait/tstrait/simulate_phenotype.py:docstring of tstrait.simulate_phenotype.sim_phenotype:40: CRITICAL: Unexpected section title.
Notes
-----
/home/runner/work/tstrait/tstrait/tstrait/simulate_phenotype.py:docstring of tstrait.simulate_phenotype.sim_phenotype:65: CRITICAL: Unexpected section title.
Examples
--------
WARNING: autodoc: failed to import class 'TraitModelAdditive' from module 'tstrait'; the following exception was raised:
Traceback (most recent call last):
File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/sphinx/util/inspect.py", line 380, in safe_getattr
return getattr(obj, name, *defargs)
AttributeError: module 'tstrait' has no attribute 'TraitModelAdditive'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/sphinx/ext/autodoc/importer.py", line 101, in import_object
obj = attrgetter(obj, mangled_name)
File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/sphinx/ext/autodoc/__init__.py", line 307, in get_attr
return autodoc_attrgetter(self.env.app, obj, name, *defargs)
File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/sphinx/ext/autodoc/__init__.py", line 2795, in autodoc_attrgetter
return safe_getattr(obj, name, *defargs)
File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/sphinx/util/inspect.py", line 396, in safe_getattr
raise AttributeError(name) from exc
AttributeError: TraitModelAdditive
WARNING: autodoc: failed to import class 'TraitModelAlleleFrequency' from module 'tstrait'; the following exception was raised:
Traceback (most recent call last):
File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/sphinx/util/inspect.py", line 380, in safe_getattr
return getattr(obj, name, *defargs)
AttributeError: module 'tstrait' has no attribute 'TraitModelAlleleFrequency'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/sphinx/ext/autodoc/importer.py", line 101, in import_object
obj = attrgetter(obj, mangled_name)
File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/sphinx/ext/autodoc/__init__.py", line 307, in get_attr
return autodoc_attrgetter(self.env.app, obj, name, *defargs)
File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/sphinx/ext/autodoc/__init__.py", line 2795, in autodoc_attrgetter
return safe_getattr(obj, name, *defargs)
File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/sphinx/util/inspect.py", line 396, in safe_getattr
raise AttributeError(name) from exc
AttributeError: TraitModelAlleleFrequency
WARNING: autodoc: failed to import class 'Result' from module 'tstrait'; the following exception was raised:
Traceback (most recent call last):
File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/sphinx/util/inspect.py", line 380, in safe_getattr
return getattr(obj, name, *defargs)
AttributeError: module 'tstrait' has no attribute 'Result'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/sphinx/ext/autodoc/importer.py", line 101, in import_object
obj = attrgetter(obj, mangled_name)
File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/sphinx/ext/autodoc/__init__.py", line 307, in get_attr
return autodoc_attrgetter(self.env.app, obj, name, *defargs)
File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/sphinx/ext/autodoc/__init__.py", line 2795, in autodoc_attrgetter
return safe_getattr(obj, name, *defargs)
File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/sphinx/util/inspect.py", line 396, in safe_getattr
raise AttributeError(name) from exc
AttributeError: Result
/home/runner/work/tstrait/tstrait/tstrait/simulate_phenotype.py:docstring of tstrait.simulate_phenotype.PhenotypeResult:4: CRITICAL: Unexpected section title.
Parameters
----------
/home/runner/work/tstrait/tstrait/tstrait/simulate_phenotype.py:docstring of tstrait.simulate_phenotype.PhenotypeResult:11: CRITICAL: Unexpected section title.
See Also
--------
/home/runner/work/tstrait/tstrait/tstrait/simulate_phenotype.py:docstring of tstrait.simulate_phenotype.PhenotypeResult:15: CRITICAL: Unexpected section title.
Examples
--------
WARNING: autodoc: failed to import class 'GenotypeResult' from module 'tstrait'; the following exception was raised:
Traceback (most recent call last):
File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/sphinx/util/inspect.py", line 380, in safe_getattr
return getattr(obj, name, *defargs)
AttributeError: module 'tstrait' has no attribute 'GenotypeResult'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/sphinx/ext/autodoc/importer.py", line 101, in import_object
obj = attrgetter(obj, mangled_name)
File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/sphinx/ext/autodoc/__init__.py", line 307, in get_attr
return autodoc_attrgetter(self.env.app, obj, name, *defargs)
File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/sphinx/ext/autodoc/__init__.py", line 2795, in autodoc_attrgetter
return safe_getattr(obj, name, *defargs)
File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/sphinx/util/inspect.py", line 396, in safe_getattr
raise AttributeError(name) from exc
AttributeError: GenotypeResult
/home/runner/work/tstrait/tstrait/docs/model.md: WARNING: Executing notebook failed: CellExecutionError [mystnb.exec]
/home/runner/work/tstrait/tstrait/docs/model.md: WARNING: Notebook exception traceback saved in: /home/runner/work/tstrait/tstrait/docs/_build/html/reports/model.err.log [mystnb.exec]
/home/runner/work/tstrait/tstrait/docs/quickstart.md: WARNING: Executing notebook failed: CellExecutionError [mystnb.exec]
/home/runner/work/tstrait/tstrait/docs/quickstart.md: WARNING: Notebook exception traceback saved in: /home/runner/work/tstrait/tstrait/docs/_build/html/reports/quickstart.err.log [mystnb.exec]
/home/runner/work/tstrait/tstrait/docs/simulation.md: WARNING: Executing notebook failed: CellExecutionError [mystnb.exec]
/home/runner/work/tstrait/tstrait/docs/simulation.md: WARNING: Notebook exception traceback saved in: /home/runner/work/tstrait/tstrait/docs/_build/html/reports/simulation.err.log [mystnb.exec]
looking for now-outdated files... none found
pickling environment... done
checking consistency... done
preparing documents... done
writing output... [ 16%] api
writing output... [ 33%] installation
writing output... [ 50%] intro
writing output... [ 66%] model
writing output... [ 83%] quickstart
/home/runner/work/tstrait/tstrait/tstrait/simulate_phenotype.py:docstring of tstrait.simulate_phenotype.sim_phenotype:16: WARNING: undefined label: frequency_dependence
writing output... [100%] simulation
/home/runner/work/tstrait/tstrait/tstrait/simulate_phenotype.py:docstring of tstrait.simulate_phenotype.sim_phenotype:41: WARNING: py:class reference target not found: pandas.DataFrame
/home/runner/work/tstrait/tstrait/tstrait/simulate_phenotype.py:docstring of tstrait.simulate_phenotype.sim_phenotype:61: WARNING: undefined label: phenotype_model
/home/runner/work/tstrait/tstrait/tstrait/simulate_phenotype.py:docstring of tstrait.simulate_phenotype.sim_phenotype:66: WARNING: undefined label: quickstart
/home/runner/work/tstrait/tstrait/tstrait/simulate_phenotype.py:docstring of tstrait.simulate_phenotype.PhenotypeResult:16: WARNING: undefined label: effect_size_output
/home/runner/work/tstrait/tstrait/tstrait/simulate_phenotype.py:docstring of tstrait.simulate_phenotype.PhenotypeResult:16: WARNING: undefined label: phenotype_output
/home/runner/work/tstrait/tstrait/docs/model.md:36: WARNING: py:class reference target not found: TraitModelAdditive
/home/runner/work/tstrait/tstrait/docs/model.md:40: WARNING: py:class reference target not found: TraitModelAdditive
/home/runner/work/tstrait/tstrait/docs/model.md:67: WARNING: py:class reference target not found: TraitModelAlleleFrequency
/home/runner/work/tstrait/tstrait/docs/model.md:79: WARNING: py:class reference target not found: TraitModelAlleleFrequency
/home/runner/work/tstrait/tstrait/docs/model.md:81: WARNING: py:class reference target not found: TraitModelAlleleFrequency
/home/runner/work/tstrait/tstrait/docs/simulation.md:55: WARNING: py:class reference target not found: Result
/home/runner/work/tstrait/tstrait/docs/simulation.md:58: WARNING: py:class reference target not found: GenotypeResult
/home/runner/work/tstrait/tstrait/docs/simulation.md:103: WARNING: py:class reference target not found: GenotypeResult
generating indices... genindex done
highlighting module code... [100%] tstrait.simulate_phenotype
writing additional pages... search done
copying static files... done
copying extra files... done
dumping search index in English (code: en)... done
dumping object inventory... done
[etoc] missing index.html written as redirect to 'intro.html'
build finished with problems, 33 warnings.
===============================================================================
Building your book, returns a non-zero exit code (1). Look above for the cause.
===============================================================================
Error occured; showing saved reports
make: *** [Makefile:8: dev] Error 1
Traceback (most recent call last):
File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/jupyter_cache/executors/utils.py", line 58, in single_nb_execution
executenb(
File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/nbclient/client.py", line 1305, in execute
return NotebookClient(nb=nb, resources=resources, km=km, **kwargs).execute()
File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/jupyter_core/utils/__init__.py", line 166, in wrapped
return loop.run_until_complete(inner)
File "/opt/hostedtoolcache/Python/3.10.12/x64/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
return future.result()
File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/nbclient/client.py", line 705, in async_execute
await self.async_execute_cell(
File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/nbclient/client.py", line 1058, in async_execute_cell
await self._check_raise_for_error(cell, cell_index, exec_reply)
File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/nbclient/client.py", line 914, in _check_raise_for_error
raise CellExecutionError.from_cell_and_msg(cell, exec_reply_content)
nbclient.exceptions.CellExecutionError: An error occurred while executing the following cell:
------------------
import msprime
import tstrait
import matplotlib.pyplot as plt
num_ind = 500
ts = msprime.sim_ancestry(num_ind, sequence_length=1_000_000, recombination_rate=1e-8,
population_size=10**4, random_seed=1)
ts = msprime.sim_mutations(ts, rate=1e-8, random_seed=1)
model = tstrait.TraitModelAdditive(trait_mean=0, trait_var=1)
sim_result = tstrait.sim_phenotype(ts, num_causal=1000, model=model, h2=0.3, random_seed=1)
plt.scatter(sim_result.genotype.allele_frequency, sim_result.genotype.effect_size)
plt.xlabel("Allele frequency")
plt.ylabel("Effect size")
plt.axhline(y=0, color='r', linestyle='-')
plt.title("TraitModelAdditive")
plt.show()
------------------
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In[1], line 10
6 ts = msprime.sim_ancestry(num_ind, sequence_length=1_000_000, recombination_rate=1e-8,
7 population_size=10**4, random_seed=1)
8 ts = msprime.sim_mutations(ts, rate=1e-8, random_seed=1)
---> 10 model = tstrait.TraitModelAdditive(trait_mean=0, trait_var=1)
11 sim_result = tstrait.sim_phenotype(ts, num_causal=1000, model=model, h2=0.3, random_seed=1)
13 plt.scatter(sim_result.genotype.allele_frequency, sim_result.genotype.effect_size)
AttributeError: module 'tstrait' has no attribute 'TraitModelAdditive'
Traceback (most recent call last):
File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/jupyter_cache/executors/utils.py", line 58, in single_nb_execution
executenb(
File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/nbclient/client.py", line 1305, in execute
return NotebookClient(nb=nb, resources=resources, km=km, **kwargs).execute()
File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/jupyter_core/utils/__init__.py", line 166, in wrapped
return loop.run_until_complete(inner)
File "/opt/hostedtoolcache/Python/3.10.12/x64/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
return future.result()
File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/nbclient/client.py", line 705, in async_execute
await self.async_execute_cell(
File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/nbclient/client.py", line 1058, in async_execute_cell
await self._check_raise_for_error(cell, cell_index, exec_reply)
File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/nbclient/client.py", line 914, in _check_raise_for_error
raise CellExecutionError.from_cell_and_msg(cell, exec_reply_content)
nbclient.exceptions.CellExecutionError: An error occurred while executing the following cell:
------------------
import msprime
import tstrait
import matplotlib.pyplot as plt
num_ind = 2000
ts = msprime.sim_ancestry(num_ind, sequence_length=1_000_000, recombination_rate=1e-8,
population_size=10**4, random_seed=1)
ts = msprime.sim_mutations(ts, rate=1e-8, random_seed=1)
model = tstrait.TraitModelAlleleFrequency(trait_mean=0, trait_var=1, alpha=-0.3)
sim_result = tstrait.sim_phenotype(ts, num_causal=1000, model=model, h2=0.3, random_seed=1)
------------------
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In[1], line 10
6 ts = msprime.sim_ancestry(num_ind, sequence_length=1_000_000, recombination_rate=1e-8,
7 population_size=10**4, random_seed=1)
8 ts = msprime.sim_mutations(ts, rate=1e-8, random_seed=1)
---> 10 model = tstrait.TraitModelAlleleFrequency(trait_mean=0, trait_var=1, alpha=-0.3)
11 sim_result = tstrait.sim_phenotype(ts, num_causal=1000, model=model, h2=0.3, random_seed=1)
AttributeError: module 'tstrait' has no attribute 'TraitModelAlleleFrequency'
Traceback (most recent call last):
File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/jupyter_cache/executors/utils.py", line 58, in single_nb_execution
executenb(
File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/nbclient/client.py", line 1305, in execute
return NotebookClient(nb=nb, resources=resources, km=km, **kwargs).execute()
File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/jupyter_core/utils/__init__.py", line 166, in wrapped
return loop.run_until_complete(inner)
File "/opt/hostedtoolcache/Python/3.10.12/x64/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
return future.result()
File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/nbclient/client.py", line 705, in async_execute
await self.async_execute_cell(
File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/nbclient/client.py", line 1058, in async_execute_cell
await self._check_raise_for_error(cell, cell_index, exec_reply)
File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/nbclient/client.py", line 914, in _check_raise_for_error
raise CellExecutionError.from_cell_and_msg(cell, exec_reply_content)
nbclient.exceptions.CellExecutionError: An error occurred while executing the following cell:
------------------
import msprime
import tstrait
num_ind = 5
ts = msprime.sim_ancestry(num_ind, sequence_length=1_000_000, recombination_rate=1e-8,
population_size=10**4, random_seed=1)
ts = msprime.sim_mutations(ts, rate=1e-8, random_seed=1)
model = tstrait.TraitModelAlleleFrequency(trait_mean=0, trait_var=1, alpha=-0.3)
sim_result = tstrait.sim_phenotype(ts, num_causal=3, model=model, h2=0.3, random_seed=1)
------------------
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In[1], line 9
5 ts = msprime.sim_ancestry(num_ind, sequence_length=1_000_000, recombination_rate=1e-8,
6 population_size=10**4, random_seed=1)
7 ts = msprime.sim_mutations(ts, rate=1e-8, random_seed=1)
----> 9 model = tstrait.TraitModelAlleleFrequency(trait_mean=0, trait_var=1, alpha=-0.3)
10 sim_result = tstrait.sim_phenotype(ts, num_causal=3, model=model, h2=0.3, random_seed=1)
AttributeError: module 'tstrait' has no attribute 'TraitModelAlleleFrequency'
Error: Process completed with exit code 2.
Steps to do after pheotype simulation:
We're not getting footers with the version information on the docs build.
See tskit-dev/msprime#2201 for related issue.
When writing this up, I found "causal allele" much easier to explain, and on reflection, I don't see any reason to use causal_state
(other than to mirror the tskit terminology). I think it's much clearer what our model is if we use "causal allele" throughout.
Also go through the documentation and remove references to "causal mutation", as this is incorrect (we can have multiple causal mutations to the causal allele).
Divide the trait mean by the number of causal sites.
Steps required:
simulate_phenotype.py
The following information should be added to the documentation once the Python codes are finished:
Need to create a conda-forge package
tstrait currently chooses causal sites at random among all the sites in the tree sequence data. We can change the Python API and provide user control of which sites could be causal. For example, the users can select the minimum and maximum allele frequency of causal sites, such that tstrait doesn't choose singletons.
stdpopsim has some functionalities that could be used for this.
It would be difficult for users to specify their model by using the TraitModel
superclass. It would be nice if the user can specify their model into TraitModel.
Some suggestions:
Alternatively, we can write how the users can create a TraitModel object in the tstrait documentation, but then it would be necessary for the users to have some background knowledege in Python classes.
Need to update the CI infrastructure to push documentation to tskit.dev
Needs @benjeffery I think
The current code does not detect an error if the causal site ID in the genetic value dataframe is not included in the tree sequence data. The error is only given at site = self.ts.site(single_id)
in sim_phenotype.py
, so it will be better to raise an error here if an error is caused here. Shall we put try:
here?
The current code randomly selects causal mutations. To consider the mutation model with multiple alleles, it would be necessary to do a similar thing with causal sites instead. The next steps will be:
After these steps are completed, we can go to the next step of simulating effect sizes of tree sequence data with multiple alleles.
Dominance could be added to tstrait, as we obtain the genotype of individuals in the intermediate step of the simulation. We can let tstrait do:
instead of simply adding effect size * number of causal mutations
to the genetic values of individuals.
I assume that the following steps will be required:
Xarray is very powerful, and gives a path to scaling out to things like Dask, chunked arrays, etc. It's also the basic element in sgkit, which makes me think it's a missed opportunity if tstrait doesn't go for the same basic technology from the outset.
I think all we'd have to do really is to wrap Pandas dataframes we're currently returning with Xarray and to label the dimensions, and this would give us some great future proofing.
Again, while we're making changes, perhaps it's best to just make this one?
Write modular code, so that it would be simpler to implement other functions later
We need to audit the sphinx extensions listed in extra_extensions
in the _config.yml
and the requirements/CI-docs/requirements.txt
to only use things that are actively used in the docs.
I guess there's some extra changes here after moving the to jupyter notebook backend (rather than IPython?) @daikitag?
First issue ;)
Let's write some unit tests:
First things that need to be covered are:
There's a few places we're using # pragma: no cover
where we could/should be testing.
Review these and replace with crafted test cases where approporate.
# pragma: no cover
should only be used in cases where it's not possible to test, like in a numba jitted function.
Statistical tests for the trait model can be written now. It should be done after the unit tests.
Using codecov - @GertjanBisschop should have access to codecov if he logs in using github. Let me know if not.
As this is pure-python I think you could get away with linux-only testing for now. That will remove the need for conda in the setup.
Create a function that simulates phenotypes based on the genotype matrix determined from the tree sequence, so that we can compare the times
The markdown for explaning the codes can be written now. Specifically, the following should be addressed:
In tstrait version 0.0.1, the simulated effect sizes are divided by the number of causal sites, which will dramatically shrink the phenotypic variance when the number of causal sites is large.
Example of a trait simulation by using tstrait version 0.0.1.
Many thanks to Alison for detecting this major bug.
_config.yml
. It is slightly different compared with tskit
or msprime
.build.sh
README.md
setup.cfg
setup.cfg
dOxygen
in docs like tskit
?tskit
. Should we create a similar file?tsinfer
and tsdate
)I think it would be clearer if we had
result = sim_phenotype()
result.traits # output of sim_trait
result.phenotypes # output of sim_phenotype
assuming that we do the refactoring in #103
Next steps in phenotype simulation framework:
We would for example want to verify statistically whether we get the correct distribution of the simulated genotypic values for all individuals. For each individual this is the sum of many beta-values. So we know what this distribution should look like (without any noise added). We can use a qq-plot to check this distribution (using a high enough number of snp's) to check this against the expectation.
Very rarely, a site can have 0 or 1 allele frequency. It would be important to fix the issue.
pip install -U jsonschema[format-nongpl]
must be used to use tskit-book-theme
.
Simulation of multiple traits with pleiotropy.
Input:
Trait model:
Phenotype simulation (can set a different function for pleiotropy):
Steps required:
I think we've made a minor architectural mistake in how we put together the modular components of a simulation which we should fix while we can. Currently, we have
df_trait = sim_trait(ts, num_causal, model, random_seed)
genetic_result = sim_genetic(ts, trait_df, alpha, random_seed)
where df_trait looks like this:
site_id effect_size trait_id
3 0.140148 0
and genetic_result has two parts, effect_size
which looks like:
site_id effect_size trait_id causal_state allele_frequency
3 0.140148 0 G 0.2
and .genetic
which looks like
trait_id individual_id genetic_value
0 0 0.897513
0 1 0.793933
0 2 0.140148
0 3 0.378682
0 4 0.378682
While writing this up, I found it very confusing. Why are we doing it this way? Wouldn't it be much simpler to do this:
df_trait = sim_trait(ts, num_causal, model, alpha, random_seed)
and df_trait
looks like
site_id effect_size trait_id causal_state allele_frequency
3 0.140148 0 G 0.2
and then we do
df_genetic = genetic_value(ts, df_trait)
and returns
trait_id individual_id genetic_value
0 0 0.897513
0 1 0.793933
0 2 0.140148
0 3 0.378682
0 4 0.378682
So, we do the choice of causal state and correction of the effect size by allele frequency within sim_trait
, and then the genetic_value
function simply does all the propagating.
The only reason I can see for doing it in the current way is to avoid going through the trees multiple times, but I feel that this is a premature optimisation and that clarity of the modular API is more important.
Thoughts @GertjanBisschop, @daikitag?
In #52 we added the sim_trait(ts, num_causal)
function which returns a pandas data frame with num_causal rows. It seems to me that we will often want to simulate multiple traits, which perhaps we should try to support natively. I'm guessing (and correct me if I'm wrong!) that people will do things like this:
num_traits = 100
# Choose number of causal sites uniformly for each trait
num_causal = np.random.uniform(10, 100, size=num_traits)
for j in range(num_traits):
df_trait = tstrait.sim_trait(ts, num_causal[j])
# df_trait has cols site_id, causal_state, and effect_size
X = tstrait.sim_phenotypes(ts, df_trait)
# Do something with X
A better pattern (and one more suitable for interoperability with things like SLiM etc might be:
num_traits = 100
# Choose number of causal sites uniformly for each trait
num_causal = np.random.uniform(10, 100, size=num_traits)
df_traits = tstrait.sim_traits(ts, num_causal)
# df_traits has cols trait_id, site_id, causal_state, and effect_size
for trait_id, X in enumerate(tstrait.sim_phenotypes(ts, df_traits)):
# Do something with X
The gain here is that we can communicate multiple traits at once - otherwise, it might get quite clunky when trying to work with many traits. The sim_phenotypes
function returns an iterator over the traits and the generated phenotypes for each trait (much like msprime does for random seeds).
Thoughts?
sites = np.random.choice([s.id for s in tree.sites()], size=num_sites, replace=False)
is very time consuming. I'm currently using rng.choice(range(num_sites), size=num_causal, replace=False)
, but it won't work when some site IDs are missingsim_env:
def sim_env(genetic_df, *, h2=None, random_seed=None):
It's EOL'd so no point in making our initial version compatible.
Needs to be done before #30
Steps that need to be taken before merging the Jerome branch to main
phenotype.py
fileOriginally here: #93
Traceback (most recent call last):
File "/home/rkerr/dataplan-2/tbasim/Python/demography_from_tree_example.py", line 280, in <module>
trait_df = tstrait.sim_trait(ts_new, num_causal=nqtl, model=model, random_seed=1234)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/rkerr/.local/lib/python3.11/site-packages/tstrait/simulate_effect_size.py", line 137, in sim_trait
trait_df = simulator._sim_causal_mutation()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/rkerr/.local/lib/python3.11/site-packages/tstrait/simulate_effect_size.py", line 57, in _sim_causal_mutation
site_id_array = self._choose_causal_site()
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/rkerr/.local/lib/python3.11/site-packages/tstrait/simulate_effect_size.py", line 38, in _choose_causal_site
site_id = self.rng.choice(
^^^^^^^^^^^^^^^^
File "_generator.pyx", line 847, in numpy.random._generator.Generator.choice
TypeError: expected a sequence of integers or a single integer, got '400.0'
Can you give us a minimum working example here please @richard-Hobart so we can reproduce?
A declarative, efficient, and flexible JavaScript library for building user interfaces.
๐ Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. ๐๐๐
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google โค๏ธ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.