tskit-dev / tstrait Goto Github PK

View Code? Open in Web Editor NEW

6.0 8.0 5.0 1.67 MB

Quantitative trait simulation for ARGs

Home Page: https://tskit.dev/software/tstrait

License: MIT License

Python 96.35% R 3.65%

args tree-sequences tskit

tstrait's Introduction

tstrait

tstrait is a quantitative trait simulator of tskit tree sequences.

Documentation: https://tskit.dev/tstrait/docs/
Source code: https://github.com/tskit-dev/tstrait
Tree sequence documentation: https://tskit.dev/learn/

It provides:

Simulation of quantitative traits
Simulation of genetic effect sizes from various distributions
Simulation of pleiotropic phenotypes
Simulation of environmental noise
Computation of individual genetic values given the effect sizes of causal mutations

tstrait's People

Contributors

Stargazers

Watchers

Forkers

gertjanbisschop jeromekelleher benjeffery hyanwong daikitag

tstrait's Issues

Create Test PyPI and PyPI packages

Needs to be done before #30

Phenotype Simulation

Next steps in phenotype simulation framework:

Efficient tree traversal algorithm to determine the allele frequency
Statistical tests against expected mean/variance
Making the phenotypic simulation code faster through Numba/C codes (The current tree traversal algorithm code is very slow to implement)
Dominance/epistasis (If possible, no need in GWAS simulation framework, but adding dominance effects is simple)

Add documentation to tskit.dev

Need to update the CI infrastructure to push documentation to tskit.dev

Needs @benjeffery I think

Fix docs build

The docs build is currently failing due to a variety of issues with the source:

/home/runner/work/tstrait/tstrait/tstrait/simulate_phenotype.py:docstring of tstrait.simulate_phenotype.sim_phenotype:4: CRITICAL: Unexpected section title.

Parameters
----------
/home/runner/work/tstrait/tstrait/tstrait/simulate_phenotype.py:docstring of tstrait.simulate_phenotype.sim_phenotype:23: CRITICAL: Unexpected section title.

Returns
-------
/home/runner/work/tstrait/tstrait/tstrait/simulate_phenotype.py:docstring of tstrait.simulate_phenotype.sim_phenotype:28: CRITICAL: Unexpected section title.

Raises
------
/home/runner/work/tstrait/tstrait/tstrait/simulate_phenotype.py:docstring of tstrait.simulate_phenotype.sim_phenotype:35: CRITICAL: Unexpected section title.

See Also
--------
/home/runner/work/tstrait/tstrait/tstrait/simulate_phenotype.py:docstring of tstrait.simulate_phenotype.sim_phenotype:40: CRITICAL: Unexpected section title.

Notes
-----
/home/runner/work/tstrait/tstrait/tstrait/simulate_phenotype.py:docstring of tstrait.simulate_phenotype.sim_phenotype:65: CRITICAL: Unexpected section title.

Examples
--------
WARNING: autodoc: failed to import class 'TraitModelAdditive' from module 'tstrait'; the following exception was raised:
Traceback (most recent call last):
  File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/sphinx/util/inspect.py", line 380, in safe_getattr
    return getattr(obj, name, *defargs)
AttributeError: module 'tstrait' has no attribute 'TraitModelAdditive'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/sphinx/ext/autodoc/importer.py", line 101, in import_object
    obj = attrgetter(obj, mangled_name)
  File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/sphinx/ext/autodoc/__init__.py", line 307, in get_attr
    return autodoc_attrgetter(self.env.app, obj, name, *defargs)
  File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/sphinx/ext/autodoc/__init__.py", line 2795, in autodoc_attrgetter
    return safe_getattr(obj, name, *defargs)
  File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/sphinx/util/inspect.py", line 396, in safe_getattr
    raise AttributeError(name) from exc
AttributeError: TraitModelAdditive

WARNING: autodoc: failed to import class 'TraitModelAlleleFrequency' from module 'tstrait'; the following exception was raised:
Traceback (most recent call last):
  File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/sphinx/util/inspect.py", line 380, in safe_getattr
    return getattr(obj, name, *defargs)
AttributeError: module 'tstrait' has no attribute 'TraitModelAlleleFrequency'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/sphinx/ext/autodoc/importer.py", line 101, in import_object
    obj = attrgetter(obj, mangled_name)
  File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/sphinx/ext/autodoc/__init__.py", line 307, in get_attr
    return autodoc_attrgetter(self.env.app, obj, name, *defargs)
  File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/sphinx/ext/autodoc/__init__.py", line 2795, in autodoc_attrgetter
    return safe_getattr(obj, name, *defargs)
  File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/sphinx/util/inspect.py", line 396, in safe_getattr
    raise AttributeError(name) from exc
AttributeError: TraitModelAlleleFrequency

WARNING: autodoc: failed to import class 'Result' from module 'tstrait'; the following exception was raised:
Traceback (most recent call last):
  File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/sphinx/util/inspect.py", line 380, in safe_getattr
    return getattr(obj, name, *defargs)
AttributeError: module 'tstrait' has no attribute 'Result'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/sphinx/ext/autodoc/importer.py", line 101, in import_object
    obj = attrgetter(obj, mangled_name)
  File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/sphinx/ext/autodoc/__init__.py", line 307, in get_attr
    return autodoc_attrgetter(self.env.app, obj, name, *defargs)
  File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/sphinx/ext/autodoc/__init__.py", line 2795, in autodoc_attrgetter
    return safe_getattr(obj, name, *defargs)
  File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/sphinx/util/inspect.py", line 396, in safe_getattr
    raise AttributeError(name) from exc
AttributeError: Result

/home/runner/work/tstrait/tstrait/tstrait/simulate_phenotype.py:docstring of tstrait.simulate_phenotype.PhenotypeResult:4: CRITICAL: Unexpected section title.

Parameters
----------
/home/runner/work/tstrait/tstrait/tstrait/simulate_phenotype.py:docstring of tstrait.simulate_phenotype.PhenotypeResult:11: CRITICAL: Unexpected section title.

See Also
--------
/home/runner/work/tstrait/tstrait/tstrait/simulate_phenotype.py:docstring of tstrait.simulate_phenotype.PhenotypeResult:15: CRITICAL: Unexpected section title.

Examples
--------
WARNING: autodoc: failed to import class 'GenotypeResult' from module 'tstrait'; the following exception was raised:
Traceback (most recent call last):
  File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/sphinx/util/inspect.py", line 380, in safe_getattr
    return getattr(obj, name, *defargs)
AttributeError: module 'tstrait' has no attribute 'GenotypeResult'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/sphinx/ext/autodoc/importer.py", line 101, in import_object
    obj = attrgetter(obj, mangled_name)
  File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/sphinx/ext/autodoc/__init__.py", line 307, in get_attr
    return autodoc_attrgetter(self.env.app, obj, name, *defargs)
  File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/sphinx/ext/autodoc/__init__.py", line 2795, in autodoc_attrgetter
    return safe_getattr(obj, name, *defargs)
  File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/sphinx/util/inspect.py", line 396, in safe_getattr
    raise AttributeError(name) from exc
AttributeError: GenotypeResult

/home/runner/work/tstrait/tstrait/docs/model.md: WARNING: Executing notebook failed: CellExecutionError [mystnb.exec]
/home/runner/work/tstrait/tstrait/docs/model.md: WARNING: Notebook exception traceback saved in: /home/runner/work/tstrait/tstrait/docs/_build/html/reports/model.err.log [mystnb.exec]
/home/runner/work/tstrait/tstrait/docs/quickstart.md: WARNING: Executing notebook failed: CellExecutionError [mystnb.exec]
/home/runner/work/tstrait/tstrait/docs/quickstart.md: WARNING: Notebook exception traceback saved in: /home/runner/work/tstrait/tstrait/docs/_build/html/reports/quickstart.err.log [mystnb.exec]
/home/runner/work/tstrait/tstrait/docs/simulation.md: WARNING: Executing notebook failed: CellExecutionError [mystnb.exec]
/home/runner/work/tstrait/tstrait/docs/simulation.md: WARNING: Notebook exception traceback saved in: /home/runner/work/tstrait/tstrait/docs/_build/html/reports/simulation.err.log [mystnb.exec]
looking for now-outdated files... none found
pickling environment... done
checking consistency... done
preparing documents... done
writing output... [ 16%] api                                                   
writing output... [ 33%] installation                                          
writing output... [ 50%] intro                                                 
writing output... [ 66%] model                                                 
writing output... [ 83%] quickstart                                            
/home/runner/work/tstrait/tstrait/tstrait/simulate_phenotype.py:docstring of tstrait.simulate_phenotype.sim_phenotype:16: WARNING: undefined label: frequency_dependence
writing output... [100%] simulation                                            
/home/runner/work/tstrait/tstrait/tstrait/simulate_phenotype.py:docstring of tstrait.simulate_phenotype.sim_phenotype:41: WARNING: py:class reference target not found: pandas.DataFrame
/home/runner/work/tstrait/tstrait/tstrait/simulate_phenotype.py:docstring of tstrait.simulate_phenotype.sim_phenotype:61: WARNING: undefined label: phenotype_model
/home/runner/work/tstrait/tstrait/tstrait/simulate_phenotype.py:docstring of tstrait.simulate_phenotype.sim_phenotype:66: WARNING: undefined label: quickstart
/home/runner/work/tstrait/tstrait/tstrait/simulate_phenotype.py:docstring of tstrait.simulate_phenotype.PhenotypeResult:16: WARNING: undefined label: effect_size_output
/home/runner/work/tstrait/tstrait/tstrait/simulate_phenotype.py:docstring of tstrait.simulate_phenotype.PhenotypeResult:16: WARNING: undefined label: phenotype_output
/home/runner/work/tstrait/tstrait/docs/model.md:36: WARNING: py:class reference target not found: TraitModelAdditive
/home/runner/work/tstrait/tstrait/docs/model.md:40: WARNING: py:class reference target not found: TraitModelAdditive
/home/runner/work/tstrait/tstrait/docs/model.md:67: WARNING: py:class reference target not found: TraitModelAlleleFrequency
/home/runner/work/tstrait/tstrait/docs/model.md:79: WARNING: py:class reference target not found: TraitModelAlleleFrequency
/home/runner/work/tstrait/tstrait/docs/model.md:81: WARNING: py:class reference target not found: TraitModelAlleleFrequency
/home/runner/work/tstrait/tstrait/docs/simulation.md:55: WARNING: py:class reference target not found: Result
/home/runner/work/tstrait/tstrait/docs/simulation.md:58: WARNING: py:class reference target not found: GenotypeResult
/home/runner/work/tstrait/tstrait/docs/simulation.md:103: WARNING: py:class reference target not found: GenotypeResult
generating indices... genindex done
highlighting module code... [100%] tstrait.simulate_phenotype                  
writing additional pages... search done
copying static files... done
copying extra files... done
dumping search index in English (code: en)... done
dumping object inventory... done
[etoc] missing index.html written as redirect to 'intro.html'
build finished with problems, 33 warnings.

===============================================================================

Building your book, returns a non-zero exit code (1). Look above for the cause.

===============================================================================

Error occured; showing saved reports
make: *** [Makefile:8: dev] Error 1
Traceback (most recent call last):
  File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/jupyter_cache/executors/utils.py", line 58, in single_nb_execution
    executenb(
  File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/nbclient/client.py", line 1305, in execute
    return NotebookClient(nb=nb, resources=resources, km=km, **kwargs).execute()
  File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/jupyter_core/utils/__init__.py", line 166, in wrapped
    return loop.run_until_complete(inner)
  File "/opt/hostedtoolcache/Python/3.10.12/x64/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/nbclient/client.py", line 705, in async_execute
    await self.async_execute_cell(
  File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/nbclient/client.py", line 1058, in async_execute_cell
    await self._check_raise_for_error(cell, cell_index, exec_reply)
  File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/nbclient/client.py", line 914, in _check_raise_for_error
    raise CellExecutionError.from_cell_and_msg(cell, exec_reply_content)
nbclient.exceptions.CellExecutionError: An error occurred while executing the following cell:
------------------
import msprime
import tstrait
import matplotlib.pyplot as plt

num_ind = 500
ts = msprime.sim_ancestry(num_ind, sequence_length=1_000_000, recombination_rate=1e-8,
                          population_size=10**4, random_seed=1)
ts = msprime.sim_mutations(ts, rate=1e-8, random_seed=1)

model = tstrait.TraitModelAdditive(trait_mean=0, trait_var=1)
sim_result = tstrait.sim_phenotype(ts, num_causal=1000, model=model, h2=0.3, random_seed=1)

plt.scatter(sim_result.genotype.allele_frequency, sim_result.genotype.effect_size)
plt.xlabel("Allele frequency")
plt.ylabel("Effect size")
plt.axhline(y=0, color='r', linestyle='-')
plt.title("TraitModelAdditive")
plt.show()
------------------


---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[1], line 10
      6 ts = msprime.sim_ancestry(num_ind, sequence_length=1_000_000, recombination_rate=1e-8,
      7                           population_size=10**4, random_seed=1)
      8 ts = msprime.sim_mutations(ts, rate=1e-8, random_seed=1)
---> 10 model = tstrait.TraitModelAdditive(trait_mean=0, trait_var=1)
     11 sim_result = tstrait.sim_phenotype(ts, num_causal=1000, model=model, h2=0.3, random_seed=1)
     13 plt.scatter(sim_result.genotype.allele_frequency, sim_result.genotype.effect_size)

AttributeError: module 'tstrait' has no attribute 'TraitModelAdditive'

Traceback (most recent call last):
  File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/jupyter_cache/executors/utils.py", line 58, in single_nb_execution
    executenb(
  File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/nbclient/client.py", line 1305, in execute
    return NotebookClient(nb=nb, resources=resources, km=km, **kwargs).execute()
  File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/jupyter_core/utils/__init__.py", line 166, in wrapped
    return loop.run_until_complete(inner)
  File "/opt/hostedtoolcache/Python/3.10.12/x64/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/nbclient/client.py", line 705, in async_execute
    await self.async_execute_cell(
  File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/nbclient/client.py", line 1058, in async_execute_cell
    await self._check_raise_for_error(cell, cell_index, exec_reply)
  File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/nbclient/client.py", line 914, in _check_raise_for_error
    raise CellExecutionError.from_cell_and_msg(cell, exec_reply_content)
nbclient.exceptions.CellExecutionError: An error occurred while executing the following cell:
------------------
import msprime
import tstrait
import matplotlib.pyplot as plt

num_ind = 2000
ts = msprime.sim_ancestry(num_ind, sequence_length=1_000_000, recombination_rate=1e-8,
                          population_size=10**4, random_seed=1)
ts = msprime.sim_mutations(ts, rate=1e-8, random_seed=1)

model = tstrait.TraitModelAlleleFrequency(trait_mean=0, trait_var=1, alpha=-0.3)
sim_result = tstrait.sim_phenotype(ts, num_causal=1000, model=model, h2=0.3, random_seed=1)
------------------


---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[1], line 10
      6 ts = msprime.sim_ancestry(num_ind, sequence_length=1_000_000, recombination_rate=1e-8,
      7                           population_size=10**4, random_seed=1)
      8 ts = msprime.sim_mutations(ts, rate=1e-8, random_seed=1)
---> 10 model = tstrait.TraitModelAlleleFrequency(trait_mean=0, trait_var=1, alpha=-0.3)
     11 sim_result = tstrait.sim_phenotype(ts, num_causal=1000, model=model, h2=0.3, random_seed=1)

AttributeError: module 'tstrait' has no attribute 'TraitModelAlleleFrequency'

Traceback (most recent call last):
  File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/jupyter_cache/executors/utils.py", line 58, in single_nb_execution
    executenb(
  File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/nbclient/client.py", line 1305, in execute
    return NotebookClient(nb=nb, resources=resources, km=km, **kwargs).execute()
  File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/jupyter_core/utils/__init__.py", line 166, in wrapped
    return loop.run_until_complete(inner)
  File "/opt/hostedtoolcache/Python/3.10.12/x64/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/nbclient/client.py", line 705, in async_execute
    await self.async_execute_cell(
  File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/nbclient/client.py", line 1058, in async_execute_cell
    await self._check_raise_for_error(cell, cell_index, exec_reply)
  File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/nbclient/client.py", line 914, in _check_raise_for_error
    raise CellExecutionError.from_cell_and_msg(cell, exec_reply_content)
nbclient.exceptions.CellExecutionError: An error occurred while executing the following cell:
------------------
import msprime
import tstrait

num_ind = 5
ts = msprime.sim_ancestry(num_ind, sequence_length=1_000_000, recombination_rate=1e-8,
                          population_size=10**4, random_seed=1)
ts = msprime.sim_mutations(ts, rate=1e-8, random_seed=1)

model = tstrait.TraitModelAlleleFrequency(trait_mean=0, trait_var=1, alpha=-0.3)
sim_result = tstrait.sim_phenotype(ts, num_causal=3, model=model, h2=0.3, random_seed=1)
------------------


---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[1], line 9
      5 ts = msprime.sim_ancestry(num_ind, sequence_length=1_000_000, recombination_rate=1e-8,
      6                           population_size=10**4, random_seed=1)
      7 ts = msprime.sim_mutations(ts, rate=1e-8, random_seed=1)
----> 9 model = tstrait.TraitModelAlleleFrequency(trait_mean=0, trait_var=1, alpha=-0.3)
     10 sim_result = tstrait.sim_phenotype(ts, num_causal=3, model=model, h2=0.3, random_seed=1)

AttributeError: module 'tstrait' has no attribute 'TraitModelAlleleFrequency'

Error: Process completed with exit code 2.

Allele frequency 0 and 1

Very rarely, a site can have 0 or 1 allele frequency. It would be important to fix the issue.

Validate against simplePHENOTYPES

https://cran.r-project.org/web/packages/simplePHENOTYPES/simplePHENOTYPES.pdf

Documentation

pip install -U jsonschema[format-nongpl] must be used to use tskit-book-theme.

We would for example want to verify statistically whether we get the correct distribution of the simulated genotypic values for all individuals. For each individual this is the sum of many beta-values. So we know what this distribution should look like (without any noise added). We can use a qq-plot to check this distribution (using a high enough number of snp's) to check this against the expectation.

Unit test + flake8/black

Steps that need to be taken before merging the Jerome branch to main

Write unit test for phenotype.py file
Use flake8 and black

causal sites

The current API assumes we provide num_causal_sites. It might make more sense to turn this into causal_sites and allow both integer values as well as numpy arrays. The case of an integer is the currently implemented behaviour. In case of an array, these should all be sites contained in the sites table. We would then no longer randomly pick sites.

Alternatively, it might make more sense to provide an example in the documentation on how to mask certain regions of the genome with tree sequences, in case we have knowledge of coding vs non-coding regions for example.

Add testing

As this is pure-python I think you could get away with linux-only testing for now. That will remove the need for conda in the setup.

Causal sites

tstrait currently chooses causal sites at random among all the sites in the tree sequence data. We can change the Python API and provide user control of which sites could be causal. For example, the users can select the minimum and maximum allele frequency of causal sites, such that tstrait doesn't choose singletons.
stdpopsim has some functionalities that could be used for this.

Move causal state selection to ``sim_trait`` and make ``sim_genetic`` deterministic (and rename)

I think we've made a minor architectural mistake in how we put together the modular components of a simulation which we should fix while we can. Currently, we have

df_trait = sim_trait(ts, num_causal, model, random_seed)
genetic_result = sim_genetic(ts, trait_df, alpha, random_seed)

where df_trait looks like this:

site_id	effect_size	trait_id
3	0.140148	0

and genetic_result has two parts, effect_size which looks like:

site_id	effect_size	trait_id	causal_state	allele_frequency
3	0.140148	0	G	0.2

and .genetic which looks like

trait_id	individual_id	genetic_value
0	0	0.897513
0	1	0.793933
0	2	0.140148
0	3	0.378682
0	4	0.378682

While writing this up, I found it very confusing. Why are we doing it this way? Wouldn't it be much simpler to do this:

df_trait = sim_trait(ts, num_causal, model, alpha, random_seed)

and df_trait looks like

site_id	effect_size	trait_id	causal_state	allele_frequency
3	0.140148	0	G	0.2

and then we do

df_genetic = genetic_value(ts, df_trait)

and returns

trait_id	individual_id	genetic_value
0	0	0.897513
0	1	0.793933
0	2	0.140148
0	3	0.378682
0	4	0.378682

So, we do the choice of causal state and correction of the effect size by allele frequency within sim_trait, and then the genetic_value function simply does all the propagating.

The only reason I can see for doing it in the current way is to avoid going through the trees multiple times, but I feel that this is a premature optimisation and that clarity of the modular API is more important.

Thoughts @GertjanBisschop, @daikitag?

Incorrect variance with multiple causal sites

In tstrait version 0.0.1, the simulated effect sizes are divided by the number of causal sites, which will dramatically shrink the phenotypic variance when the number of causal sites is large.

Example of a trait simulation by using tstrait version 0.0.1.

Many thanks to Alison for detecting this major bug.

Dominance

Dominance could be added to tstrait, as we obtain the genotype of individuals in the intermediate step of the simulation. We can let tstrait do:

0 mutation: Nothing
1 mutation: Add effect size + dominance effect
2 mutation: Add effect size * 2

instead of simply adding effect size * number of causal mutations to the genetic values of individuals.

I assume that the following steps will be required:

Create a new function for simulating individuals with dominance, as the computational speed of tstrait will drop with dominance
Create DominanceModel to control how dominance effects are being simulated
Statistical test to test dominance

Change return type from Pandas dataframe to Xarray dataset?

Xarray is very powerful, and gives a path to scaling out to things like Dask, chunked arrays, etc. It's also the basic element in sgkit, which makes me think it's a missed opportunity if tstrait doesn't go for the same basic technology from the outset.

I think all we'd have to do really is to wrap Pandas dataframes we're currently returning with Xarray and to label the dimensions, and this would give us some great future proofing.

Again, while we're making changes, perhaps it's best to just make this one?

https://docs.xarray.dev/en/stable/index.html

https://github.com/tskit-dev/tstrait/issues/new

Add conda distribution

Modular code

Write modular code, so that it would be simpler to implement other functions later

Change "causal_state" to "causal_allele"?

When writing this up, I found "causal allele" much easier to explain, and on reflection, I don't see any reason to use causal_state (other than to mirror the tskit terminology). I think it's much clearer what our model is if we use "causal allele" throughout.

Also go through the documentation and remove references to "causal mutation", as this is incorrect (we can have multiple causal mutations to the causal allele).

Enable causal sites to be selected

The current code randomly selects causal mutations. To consider the mutation model with multiple alleles, it would be necessary to do a similar thing with causal sites instead. The next steps will be:

Select causal site
Obtain mutation ID in the causal site (assuming infinite sites model)
Randomly select 1 mutation in the site that is different from the ancestral state (Not assuming infinite sites model)

After these steps are completed, we can go to the next step of simulating effect sizes of tree sequence data with multiple alleles.

Issues with the current code

Obtaining site ID like sites = np.random.choice([s.id for s in tree.sites()], size=num_sites, replace=False) is very time consuming. I'm currently using rng.choice(range(num_sites), size=num_causal, replace=False), but it won't work when some site IDs are missing
Are there any fast ways to obtain individual IDs?
Converting genotypic values of nodes to genotypic values of individuals
Output of the code. Should the output of the simulation be dataframes, or should it be a Python object and the user can use the code inside the package to obtain dataframes?

Add mergify

See https://github.com/tskit-dev/tskit/blob/main/.mergify.yml

Steps to make tstrait as a library

Pleiotropy

Simulation of multiple traits with pleiotropy.

Input:

Trait model:

Mean vector
Covariance matrix

Phenotype simulation (can set a different function for pleiotropy):

Degree of pleiotropy
Heritability vector
TraitModel vector (one for pleiotropy, another for individual ones)

Steps required:

Create a new TraitModel to simulate traits from a multivariate normal distribution
Create a new function to simulate multiple traits
Statistical tests for pleiotropy

Remove tests and support for Python 3.7

It's EOL'd so no point in making our initial version compatible.

Review # pragma: no cover usage

There's a few places we're using # pragma: no cover where we could/should be testing.

Review these and replace with crafted test cases where approporate.

# pragma: no cover should only be used in cases where it's not possible to test, like in a numba jitted function.

Add linting to CI

See job specification at https://github.com/tskit-dev/tskit/blob/main/.github/workflows/tests.yml#L9

GWAS Summary Statistics

Steps to do after pheotype simulation:

Code to exclude the effects of SNPs having small frequency
Tree traversal algorithm to conduct simple linear regression analysis against each SNPs to determine the p-value and effect sizes
Making the code faster through numba/C
Genotype imputation (Maybe)

Add coverage

Using codecov - @GertjanBisschop should have access to codecov if he logs in using github. Let me know if not.

Markdown introduction

The markdown for explaning the codes can be written now. Specifically, the following should be addressed:

The basic way of using the code
Explaining each models
The output
How we can obtain further output by using msprime

Document how ploidy works

I assume that tstrait relies on (a) individuals being present in the tree sequence (in particular, each sample genome must belong to an individual) and (b) individuals being diploid?

It would be useful to clarify this somewhere, and state what happens e.g. if individuals are haploid, triploid, or whatever.

#27 is probably relevant here too.

Trait mean

Divide the trait mean by the number of causal sites.

Steps required:

Change trait model in simulate_phenotype.py
Change statistical tests and unit tests to account for the new trait mean

Trait model

It would be difficult for users to specify their model by using the TraitModel superclass. It would be nice if the user can specify their model into TraitModel.

Some suggestions:

Put distribution as an input of TraitModel
The user writes the function to simulate effect sizes and feed it into the TraitModel

Alternatively, we can write how the users can create a TraitModel object in the tstrait documentation, but then it would be necessary for the users to have some background knowledege in Python classes.

Conda package

Need to create a conda-forge package

Simulating multiple traits

In #52 we added the sim_trait(ts, num_causal) function which returns a pandas data frame with num_causal rows. It seems to me that we will often want to simulate multiple traits, which perhaps we should try to support natively. I'm guessing (and correct me if I'm wrong!) that people will do things like this:

num_traits = 100
# Choose number of causal sites uniformly for each trait
num_causal = np.random.uniform(10, 100, size=num_traits)
for j in range(num_traits):
    df_trait = tstrait.sim_trait(ts, num_causal[j])
    # df_trait has cols  site_id, causal_state, and effect_size
    X = tstrait.sim_phenotypes(ts, df_trait)
   # Do something with X

A better pattern (and one more suitable for interoperability with things like SLiM etc might be:

num_traits = 100
# Choose number of causal sites uniformly for each trait
num_causal = np.random.uniform(10, 100, size=num_traits)
df_traits = tstrait.sim_traits(ts, num_causal)
# df_traits has cols trait_id,  site_id, causal_state, and effect_size
for trait_id, X in enumerate(tstrait.sim_phenotypes(ts, df_traits)):
    # Do something with X

The gain here is that we can communicate multiple traits at once - otherwise, it might get quite clunky when trying to work with many traits. The sim_phenotypes function returns an iterator over the traits and the generated phenotypes for each trait (much like msprime does for random seeds).

Thoughts?

Audit docs requirements

We need to audit the sphinx extensions listed in extra_extensions in the _config.yml and the requirements/CI-docs/requirements.txt to only use things that are actively used in the docs.

I guess there's some extra changes here after moving the to jupyter notebook backend (rather than IPython?) @daikitag?

Detect site ID

The current code does not detect an error if the causal site ID in the genetic value dataframe is not included in the tree sequence data. The error is only given at site = self.ts.site(single_id) in sim_phenotype.py, so it will be better to raise an error here if an error is caused here. Shall we put try: here?

Docs versioning not working

We're not getting footers with the version information on the docs build.

See tskit-dev/msprime#2201 for related issue.

Change names of fields in return type of sim_phenotype

I think it would be clearer if we had

result = sim_phenotype()
result.traits  #  output of sim_trait
result.phenotypes # output of sim_phenotype

assuming that we do the refactoring in #103

Genotype matrix

Create a function that simulates phenotypes based on the genotype matrix determined from the tree sequence, so that we can compare the times

Add basic unit tests

First issue ;)
Let's write some unit tests:
First things that need to be covered are:

Check whether all edges cases (we can think of) are caught and whether the correct error is raised. (e.g. when the number of sites we request is smaller than the number of mutations on the tree).
Check whether the output of each function has all the right properties: dimensions of arrays, ...

Statistical tests

Statistical tests for the trait model can be written now. It should be done after the unit tests.

Update signatures of sim_phenotype and sim_env

h2 is a named only argument, default=None
alpha default = None

sim_env:

def sim_env(genetic_df, *, h2=None, random_seed=None):

Add statistical tests against existing simulator

We should be testing our results statistically by comparing them against other simulators. There's no reason we can do this at the small scale.

It shouldn't be too hard to use e.g. PhenotypeSimulator on some simple simulations, and qqplot the results as a comparison with our results.

So, we would do something like:

Simulate a small tree sequence using msprime for say, 10 samples, and then for n replicates each do:
Compute phenotypes for each of the 10 individuals under a given model to get a distribution per-individual using tstrait
Export the data to VCF and do the same thing with an external too/ We can then compare the per-individual distributions as qqplots, which should be meaningful.

This will take some work to do, but is an important validation step.

Error in rng.choice

Originally here: #93

Traceback (most recent call last):
  File "/home/rkerr/dataplan-2/tbasim/Python/demography_from_tree_example.py", line 280, in <module>
    trait_df = tstrait.sim_trait(ts_new, num_causal=nqtl, model=model, random_seed=1234)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rkerr/.local/lib/python3.11/site-packages/tstrait/simulate_effect_size.py", line 137, in sim_trait
    trait_df = simulator._sim_causal_mutation()
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rkerr/.local/lib/python3.11/site-packages/tstrait/simulate_effect_size.py", line 57, in _sim_causal_mutation
    site_id_array = self._choose_causal_site()
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rkerr/.local/lib/python3.11/site-packages/tstrait/simulate_effect_size.py", line 38, in _choose_causal_site
    site_id = self.rng.choice(
              ^^^^^^^^^^^^^^^^
  File "_generator.pyx", line 847, in numpy.random._generator.Generator.choice
TypeError: expected a sequence of integers or a single integer, got '400.0'

Can you give us a minimum working example here please @richard-Hobart so we can reproduce?

Pleiotropy
Dominance
TraitModel (Indicate how users can set their own model)
Causal sites
Multiple allele (This is an advantage of tstrait, but it is not mentioned in the documentation)

Build wheels

See for example https://github.com/tskit-dev/tszip/blob/main/.github/workflows/wheels.yml