Giter Site home page Giter Site logo

tstrait's People

Contributors

benjeffery avatar daikitag avatar gertjanbisschop avatar hyanwong avatar jeromekelleher avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar

tstrait's Issues

Documentation: h2 = 1

In the documentation, add an example of $h^2=1$, such that the users can only obtain the genetic values of individuals. Then, describe how users can simulate environmental noise and add effects from covariates, such as gender and age.

causal sites

The current API assumes we provide num_causal_sites. It might make more sense to turn this into causal_sites and allow both integer values as well as numpy arrays. The case of an integer is the currently implemented behaviour. In case of an array, these should all be sites contained in the sites table. We would then no longer randomly pick sites.

Alternatively, it might make more sense to provide an example in the documentation on how to mask certain regions of the genome with tree sequences, in case we have knowledge of coding vs non-coding regions for example.

Document how ploidy works

I assume that tstrait relies on (a) individuals being present in the tree sequence (in particular, each sample genome must belong to an individual) and (b) individuals being diploid?

It would be useful to clarify this somewhere, and state what happens e.g. if individuals are haploid, triploid, or whatever.

#27 is probably relevant here too.

Add statistical tests against existing simulator

We should be testing our results statistically by comparing them against other simulators. There's no reason we can do this at the small scale.

It shouldn't be too hard to use e.g. PhenotypeSimulator on some simple simulations, and qqplot the results as a comparison with our results.

So, we would do something like:

  1. Simulate a small tree sequence using msprime for say, 10 samples, and then for n replicates each do:
  2. Compute phenotypes for each of the 10 individuals under a given model to get a distribution per-individual using tstrait
  3. Export the data to VCF and do the same thing with an external too/ We can then compare the per-individual distributions as qqplots, which should be meaningful.

This will take some work to do, but is an important validation step.

Fix docs build

The docs build is currently failing due to a variety of issues with the source:

/home/runner/work/tstrait/tstrait/tstrait/simulate_phenotype.py:docstring of tstrait.simulate_phenotype.sim_phenotype:4: CRITICAL: Unexpected section title.

Parameters
----------
/home/runner/work/tstrait/tstrait/tstrait/simulate_phenotype.py:docstring of tstrait.simulate_phenotype.sim_phenotype:23: CRITICAL: Unexpected section title.

Returns
-------
/home/runner/work/tstrait/tstrait/tstrait/simulate_phenotype.py:docstring of tstrait.simulate_phenotype.sim_phenotype:28: CRITICAL: Unexpected section title.

Raises
------
/home/runner/work/tstrait/tstrait/tstrait/simulate_phenotype.py:docstring of tstrait.simulate_phenotype.sim_phenotype:35: CRITICAL: Unexpected section title.

See Also
--------
/home/runner/work/tstrait/tstrait/tstrait/simulate_phenotype.py:docstring of tstrait.simulate_phenotype.sim_phenotype:40: CRITICAL: Unexpected section title.

Notes
-----
/home/runner/work/tstrait/tstrait/tstrait/simulate_phenotype.py:docstring of tstrait.simulate_phenotype.sim_phenotype:65: CRITICAL: Unexpected section title.

Examples
--------
WARNING: autodoc: failed to import class 'TraitModelAdditive' from module 'tstrait'; the following exception was raised:
Traceback (most recent call last):
  File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/sphinx/util/inspect.py", line 380, in safe_getattr
    return getattr(obj, name, *defargs)
AttributeError: module 'tstrait' has no attribute 'TraitModelAdditive'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/sphinx/ext/autodoc/importer.py", line 101, in import_object
    obj = attrgetter(obj, mangled_name)
  File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/sphinx/ext/autodoc/__init__.py", line 307, in get_attr
    return autodoc_attrgetter(self.env.app, obj, name, *defargs)
  File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/sphinx/ext/autodoc/__init__.py", line 2795, in autodoc_attrgetter
    return safe_getattr(obj, name, *defargs)
  File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/sphinx/util/inspect.py", line 396, in safe_getattr
    raise AttributeError(name) from exc
AttributeError: TraitModelAdditive

WARNING: autodoc: failed to import class 'TraitModelAlleleFrequency' from module 'tstrait'; the following exception was raised:
Traceback (most recent call last):
  File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/sphinx/util/inspect.py", line 380, in safe_getattr
    return getattr(obj, name, *defargs)
AttributeError: module 'tstrait' has no attribute 'TraitModelAlleleFrequency'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/sphinx/ext/autodoc/importer.py", line 101, in import_object
    obj = attrgetter(obj, mangled_name)
  File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/sphinx/ext/autodoc/__init__.py", line 307, in get_attr
    return autodoc_attrgetter(self.env.app, obj, name, *defargs)
  File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/sphinx/ext/autodoc/__init__.py", line 2795, in autodoc_attrgetter
    return safe_getattr(obj, name, *defargs)
  File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/sphinx/util/inspect.py", line 396, in safe_getattr
    raise AttributeError(name) from exc
AttributeError: TraitModelAlleleFrequency

WARNING: autodoc: failed to import class 'Result' from module 'tstrait'; the following exception was raised:
Traceback (most recent call last):
  File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/sphinx/util/inspect.py", line 380, in safe_getattr
    return getattr(obj, name, *defargs)
AttributeError: module 'tstrait' has no attribute 'Result'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/sphinx/ext/autodoc/importer.py", line 101, in import_object
    obj = attrgetter(obj, mangled_name)
  File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/sphinx/ext/autodoc/__init__.py", line 307, in get_attr
    return autodoc_attrgetter(self.env.app, obj, name, *defargs)
  File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/sphinx/ext/autodoc/__init__.py", line 2795, in autodoc_attrgetter
    return safe_getattr(obj, name, *defargs)
  File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/sphinx/util/inspect.py", line 396, in safe_getattr
    raise AttributeError(name) from exc
AttributeError: Result

/home/runner/work/tstrait/tstrait/tstrait/simulate_phenotype.py:docstring of tstrait.simulate_phenotype.PhenotypeResult:4: CRITICAL: Unexpected section title.

Parameters
----------
/home/runner/work/tstrait/tstrait/tstrait/simulate_phenotype.py:docstring of tstrait.simulate_phenotype.PhenotypeResult:11: CRITICAL: Unexpected section title.

See Also
--------
/home/runner/work/tstrait/tstrait/tstrait/simulate_phenotype.py:docstring of tstrait.simulate_phenotype.PhenotypeResult:15: CRITICAL: Unexpected section title.

Examples
--------
WARNING: autodoc: failed to import class 'GenotypeResult' from module 'tstrait'; the following exception was raised:
Traceback (most recent call last):
  File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/sphinx/util/inspect.py", line 380, in safe_getattr
    return getattr(obj, name, *defargs)
AttributeError: module 'tstrait' has no attribute 'GenotypeResult'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/sphinx/ext/autodoc/importer.py", line 101, in import_object
    obj = attrgetter(obj, mangled_name)
  File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/sphinx/ext/autodoc/__init__.py", line 307, in get_attr
    return autodoc_attrgetter(self.env.app, obj, name, *defargs)
  File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/sphinx/ext/autodoc/__init__.py", line 2795, in autodoc_attrgetter
    return safe_getattr(obj, name, *defargs)
  File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/sphinx/util/inspect.py", line 396, in safe_getattr
    raise AttributeError(name) from exc
AttributeError: GenotypeResult

/home/runner/work/tstrait/tstrait/docs/model.md: WARNING: Executing notebook failed: CellExecutionError [mystnb.exec]
/home/runner/work/tstrait/tstrait/docs/model.md: WARNING: Notebook exception traceback saved in: /home/runner/work/tstrait/tstrait/docs/_build/html/reports/model.err.log [mystnb.exec]
/home/runner/work/tstrait/tstrait/docs/quickstart.md: WARNING: Executing notebook failed: CellExecutionError [mystnb.exec]
/home/runner/work/tstrait/tstrait/docs/quickstart.md: WARNING: Notebook exception traceback saved in: /home/runner/work/tstrait/tstrait/docs/_build/html/reports/quickstart.err.log [mystnb.exec]
/home/runner/work/tstrait/tstrait/docs/simulation.md: WARNING: Executing notebook failed: CellExecutionError [mystnb.exec]
/home/runner/work/tstrait/tstrait/docs/simulation.md: WARNING: Notebook exception traceback saved in: /home/runner/work/tstrait/tstrait/docs/_build/html/reports/simulation.err.log [mystnb.exec]
looking for now-outdated files... none found
pickling environment... done
checking consistency... done
preparing documents... done
writing output... [ 16%] api                                                   
writing output... [ 33%] installation                                          
writing output... [ 50%] intro                                                 
writing output... [ 66%] model                                                 
writing output... [ 83%] quickstart                                            
/home/runner/work/tstrait/tstrait/tstrait/simulate_phenotype.py:docstring of tstrait.simulate_phenotype.sim_phenotype:16: WARNING: undefined label: frequency_dependence
writing output... [100%] simulation                                            
/home/runner/work/tstrait/tstrait/tstrait/simulate_phenotype.py:docstring of tstrait.simulate_phenotype.sim_phenotype:41: WARNING: py:class reference target not found: pandas.DataFrame
/home/runner/work/tstrait/tstrait/tstrait/simulate_phenotype.py:docstring of tstrait.simulate_phenotype.sim_phenotype:61: WARNING: undefined label: phenotype_model
/home/runner/work/tstrait/tstrait/tstrait/simulate_phenotype.py:docstring of tstrait.simulate_phenotype.sim_phenotype:66: WARNING: undefined label: quickstart
/home/runner/work/tstrait/tstrait/tstrait/simulate_phenotype.py:docstring of tstrait.simulate_phenotype.PhenotypeResult:16: WARNING: undefined label: effect_size_output
/home/runner/work/tstrait/tstrait/tstrait/simulate_phenotype.py:docstring of tstrait.simulate_phenotype.PhenotypeResult:16: WARNING: undefined label: phenotype_output
/home/runner/work/tstrait/tstrait/docs/model.md:36: WARNING: py:class reference target not found: TraitModelAdditive
/home/runner/work/tstrait/tstrait/docs/model.md:40: WARNING: py:class reference target not found: TraitModelAdditive
/home/runner/work/tstrait/tstrait/docs/model.md:67: WARNING: py:class reference target not found: TraitModelAlleleFrequency
/home/runner/work/tstrait/tstrait/docs/model.md:79: WARNING: py:class reference target not found: TraitModelAlleleFrequency
/home/runner/work/tstrait/tstrait/docs/model.md:81: WARNING: py:class reference target not found: TraitModelAlleleFrequency
/home/runner/work/tstrait/tstrait/docs/simulation.md:55: WARNING: py:class reference target not found: Result
/home/runner/work/tstrait/tstrait/docs/simulation.md:58: WARNING: py:class reference target not found: GenotypeResult
/home/runner/work/tstrait/tstrait/docs/simulation.md:103: WARNING: py:class reference target not found: GenotypeResult
generating indices... genindex done
highlighting module code... [100%] tstrait.simulate_phenotype                  
writing additional pages... search done
copying static files... done
copying extra files... done
dumping search index in English (code: en)... done
dumping object inventory... done
[etoc] missing index.html written as redirect to 'intro.html'
build finished with problems, 33 warnings.

===============================================================================

Building your book, returns a non-zero exit code (1). Look above for the cause.

===============================================================================

Error occured; showing saved reports
make: *** [Makefile:8: dev] Error 1
Traceback (most recent call last):
  File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/jupyter_cache/executors/utils.py", line 58, in single_nb_execution
    executenb(
  File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/nbclient/client.py", line 1305, in execute
    return NotebookClient(nb=nb, resources=resources, km=km, **kwargs).execute()
  File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/jupyter_core/utils/__init__.py", line 166, in wrapped
    return loop.run_until_complete(inner)
  File "/opt/hostedtoolcache/Python/3.10.12/x64/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/nbclient/client.py", line 705, in async_execute
    await self.async_execute_cell(
  File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/nbclient/client.py", line 1058, in async_execute_cell
    await self._check_raise_for_error(cell, cell_index, exec_reply)
  File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/nbclient/client.py", line 914, in _check_raise_for_error
    raise CellExecutionError.from_cell_and_msg(cell, exec_reply_content)
nbclient.exceptions.CellExecutionError: An error occurred while executing the following cell:
------------------
import msprime
import tstrait
import matplotlib.pyplot as plt

num_ind = 500
ts = msprime.sim_ancestry(num_ind, sequence_length=1_000_000, recombination_rate=1e-8,
                          population_size=10**4, random_seed=1)
ts = msprime.sim_mutations(ts, rate=1e-8, random_seed=1)

model = tstrait.TraitModelAdditive(trait_mean=0, trait_var=1)
sim_result = tstrait.sim_phenotype(ts, num_causal=1000, model=model, h2=0.3, random_seed=1)

plt.scatter(sim_result.genotype.allele_frequency, sim_result.genotype.effect_size)
plt.xlabel("Allele frequency")
plt.ylabel("Effect size")
plt.axhline(y=0, color='r', linestyle='-')
plt.title("TraitModelAdditive")
plt.show()
------------------


---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[1], line 10
      6 ts = msprime.sim_ancestry(num_ind, sequence_length=1_000_000, recombination_rate=1e-8,
      7                           population_size=10**4, random_seed=1)
      8 ts = msprime.sim_mutations(ts, rate=1e-8, random_seed=1)
---> 10 model = tstrait.TraitModelAdditive(trait_mean=0, trait_var=1)
     11 sim_result = tstrait.sim_phenotype(ts, num_causal=1000, model=model, h2=0.3, random_seed=1)
     13 plt.scatter(sim_result.genotype.allele_frequency, sim_result.genotype.effect_size)

AttributeError: module 'tstrait' has no attribute 'TraitModelAdditive'

Traceback (most recent call last):
  File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/jupyter_cache/executors/utils.py", line 58, in single_nb_execution
    executenb(
  File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/nbclient/client.py", line 1305, in execute
    return NotebookClient(nb=nb, resources=resources, km=km, **kwargs).execute()
  File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/jupyter_core/utils/__init__.py", line 166, in wrapped
    return loop.run_until_complete(inner)
  File "/opt/hostedtoolcache/Python/3.10.12/x64/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/nbclient/client.py", line 705, in async_execute
    await self.async_execute_cell(
  File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/nbclient/client.py", line 1058, in async_execute_cell
    await self._check_raise_for_error(cell, cell_index, exec_reply)
  File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/nbclient/client.py", line 914, in _check_raise_for_error
    raise CellExecutionError.from_cell_and_msg(cell, exec_reply_content)
nbclient.exceptions.CellExecutionError: An error occurred while executing the following cell:
------------------
import msprime
import tstrait
import matplotlib.pyplot as plt

num_ind = 2000
ts = msprime.sim_ancestry(num_ind, sequence_length=1_000_000, recombination_rate=1e-8,
                          population_size=10**4, random_seed=1)
ts = msprime.sim_mutations(ts, rate=1e-8, random_seed=1)

model = tstrait.TraitModelAlleleFrequency(trait_mean=0, trait_var=1, alpha=-0.3)
sim_result = tstrait.sim_phenotype(ts, num_causal=1000, model=model, h2=0.3, random_seed=1)
------------------


---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[1], line 10
      6 ts = msprime.sim_ancestry(num_ind, sequence_length=1_000_000, recombination_rate=1e-8,
      7                           population_size=10**4, random_seed=1)
      8 ts = msprime.sim_mutations(ts, rate=1e-8, random_seed=1)
---> 10 model = tstrait.TraitModelAlleleFrequency(trait_mean=0, trait_var=1, alpha=-0.3)
     11 sim_result = tstrait.sim_phenotype(ts, num_causal=1000, model=model, h2=0.3, random_seed=1)

AttributeError: module 'tstrait' has no attribute 'TraitModelAlleleFrequency'

Traceback (most recent call last):
  File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/jupyter_cache/executors/utils.py", line 58, in single_nb_execution
    executenb(
  File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/nbclient/client.py", line 1305, in execute
    return NotebookClient(nb=nb, resources=resources, km=km, **kwargs).execute()
  File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/jupyter_core/utils/__init__.py", line 166, in wrapped
    return loop.run_until_complete(inner)
  File "/opt/hostedtoolcache/Python/3.10.12/x64/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/nbclient/client.py", line 705, in async_execute
    await self.async_execute_cell(
  File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/nbclient/client.py", line 1058, in async_execute_cell
    await self._check_raise_for_error(cell, cell_index, exec_reply)
  File "/home/runner/work/tstrait/tstrait/venv/lib/python3.10/site-packages/nbclient/client.py", line 914, in _check_raise_for_error
    raise CellExecutionError.from_cell_and_msg(cell, exec_reply_content)
nbclient.exceptions.CellExecutionError: An error occurred while executing the following cell:
------------------
import msprime
import tstrait

num_ind = 5
ts = msprime.sim_ancestry(num_ind, sequence_length=1_000_000, recombination_rate=1e-8,
                          population_size=10**4, random_seed=1)
ts = msprime.sim_mutations(ts, rate=1e-8, random_seed=1)

model = tstrait.TraitModelAlleleFrequency(trait_mean=0, trait_var=1, alpha=-0.3)
sim_result = tstrait.sim_phenotype(ts, num_causal=3, model=model, h2=0.3, random_seed=1)
------------------


---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[1], line 9
      5 ts = msprime.sim_ancestry(num_ind, sequence_length=1_000_000, recombination_rate=1e-8,
      6                           population_size=10**4, random_seed=1)
      7 ts = msprime.sim_mutations(ts, rate=1e-8, random_seed=1)
----> 9 model = tstrait.TraitModelAlleleFrequency(trait_mean=0, trait_var=1, alpha=-0.3)
     10 sim_result = tstrait.sim_phenotype(ts, num_causal=3, model=model, h2=0.3, random_seed=1)

AttributeError: module 'tstrait' has no attribute 'TraitModelAlleleFrequency'

Error: Process completed with exit code 2.

GWAS Summary Statistics

Steps to do after pheotype simulation:

  • Code to exclude the effects of SNPs having small frequency
  • Tree traversal algorithm to conduct simple linear regression analysis against each SNPs to determine the p-value and effect sizes
  • Making the code faster through numba/C
  • Genotype imputation (Maybe)

Change "causal_state" to "causal_allele"?

When writing this up, I found "causal allele" much easier to explain, and on reflection, I don't see any reason to use causal_state (other than to mirror the tskit terminology). I think it's much clearer what our model is if we use "causal allele" throughout.

Also go through the documentation and remove references to "causal mutation", as this is incorrect (we can have multiple causal mutations to the causal allele).

Trait mean

Divide the trait mean by the number of causal sites.

Steps required:

  • Change trait model in simulate_phenotype.py
  • Change statistical tests and unit tests to account for the new trait mean

Documentation

The following information should be added to the documentation once the Python codes are finished:

  • Pleiotropy
  • Dominance
  • TraitModel (Indicate how users can set their own model)
  • Causal sites
  • Multiple allele (This is an advantage of tstrait, but it is not mentioned in the documentation)

Causal sites

tstrait currently chooses causal sites at random among all the sites in the tree sequence data. We can change the Python API and provide user control of which sites could be causal. For example, the users can select the minimum and maximum allele frequency of causal sites, such that tstrait doesn't choose singletons.
stdpopsim has some functionalities that could be used for this.

Trait model

It would be difficult for users to specify their model by using the TraitModel superclass. It would be nice if the user can specify their model into TraitModel.

Some suggestions:

  • Put distribution as an input of TraitModel
  • The user writes the function to simulate effect sizes and feed it into the TraitModel

Alternatively, we can write how the users can create a TraitModel object in the tstrait documentation, but then it would be necessary for the users to have some background knowledege in Python classes.

Detect site ID

The current code does not detect an error if the causal site ID in the genetic value dataframe is not included in the tree sequence data. The error is only given at site = self.ts.site(single_id) in sim_phenotype.py, so it will be better to raise an error here if an error is caused here. Shall we put try: here?

Enable causal sites to be selected

The current code randomly selects causal mutations. To consider the mutation model with multiple alleles, it would be necessary to do a similar thing with causal sites instead. The next steps will be:

  1. Select causal site
  2. Obtain mutation ID in the causal site (assuming infinite sites model)
  3. Randomly select 1 mutation in the site that is different from the ancestral state (Not assuming infinite sites model)

After these steps are completed, we can go to the next step of simulating effect sizes of tree sequence data with multiple alleles.

Dominance

Dominance could be added to tstrait, as we obtain the genotype of individuals in the intermediate step of the simulation. We can let tstrait do:

  • 0 mutation: Nothing
  • 1 mutation: Add effect size + dominance effect
  • 2 mutation: Add effect size * 2

instead of simply adding effect size * number of causal mutations to the genetic values of individuals.

I assume that the following steps will be required:

  • Create a new function for simulating individuals with dominance, as the computational speed of tstrait will drop with dominance
  • Create DominanceModel to control how dominance effects are being simulated
  • Statistical test to test dominance

Change return type from Pandas dataframe to Xarray dataset?

Xarray is very powerful, and gives a path to scaling out to things like Dask, chunked arrays, etc. It's also the basic element in sgkit, which makes me think it's a missed opportunity if tstrait doesn't go for the same basic technology from the outset.

I think all we'd have to do really is to wrap Pandas dataframes we're currently returning with Xarray and to label the dimensions, and this would give us some great future proofing.

Again, while we're making changes, perhaps it's best to just make this one?

https://docs.xarray.dev/en/stable/index.html

https://github.com/tskit-dev/tstrait/issues/new

Modular code

Write modular code, so that it would be simpler to implement other functions later

Audit docs requirements

We need to audit the sphinx extensions listed in extra_extensions in the _config.yml and the requirements/CI-docs/requirements.txt to only use things that are actively used in the docs.

I guess there's some extra changes here after moving the to jupyter notebook backend (rather than IPython?) @daikitag?

Add basic unit tests

First issue ;)
Let's write some unit tests:
First things that need to be covered are:

  • Check whether all edges cases (we can think of) are caught and whether the correct error is raised. (e.g. when the number of sites we request is smaller than the number of mutations on the tree).
  • Check whether the output of each function has all the right properties: dimensions of arrays, ...

Review # pragma: no cover usage

There's a few places we're using # pragma: no cover where we could/should be testing.

Review these and replace with crafted test cases where approporate.

# pragma: no cover should only be used in cases where it's not possible to test, like in a numba jitted function.

Statistical tests

Statistical tests for the trait model can be written now. It should be done after the unit tests.

Add testing

As this is pure-python I think you could get away with linux-only testing for now. That will remove the need for conda in the setup.

Genotype matrix

Create a function that simulates phenotypes based on the genotype matrix determined from the tree sequence, so that we can compare the times

Markdown introduction

The markdown for explaning the codes can be written now. Specifically, the following should be addressed:

  • The basic way of using the code
  • Explaining each models
  • The output
  • How we can obtain further output by using msprime

Incorrect variance with multiple causal sites

In tstrait version 0.0.1, the simulated effect sizes are divided by the number of causal sites, which will dramatically shrink the phenotypic variance when the number of causal sites is large.

Example of a trait simulation by using tstrait version 0.0.1.

image

Many thanks to Alison for detecting this major bug.

Steps to make tstrait as a library

  • _config.yml. It is slightly different compared with tskit or msprime.
  • Makefile (Is it correct?)
  • build.sh
  • README.md
  • setup.cfg
  • License in setup.cfg
  • Do we have to put dOxygen in docs like tskit?
  • .flake8 in tskit. Should we create a similar file?
  • Logo (Like tsinfer and tsdate)
  • .pre-commit-config.yaml might not be working when I commit to github.

Phenotype Simulation

Next steps in phenotype simulation framework:

  • Efficient tree traversal algorithm to determine the allele frequency
  • Statistical tests against expected mean/variance
  • Making the phenotypic simulation code faster through Numba/C codes (The current tree traversal algorithm code is very slow to implement)
  • Dominance/epistasis (If possible, no need in GWAS simulation framework, but adding dominance effects is simple)

Add statistical tests

We would for example want to verify statistically whether we get the correct distribution of the simulated genotypic values for all individuals. For each individual this is the sum of many beta-values. So we know what this distribution should look like (without any noise added). We can use a qq-plot to check this distribution (using a high enough number of snp's) to check this against the expectation.

Allele frequency 0 and 1

Very rarely, a site can have 0 or 1 allele frequency. It would be important to fix the issue.

Documentation

pip install -U jsonschema[format-nongpl] must be used to use tskit-book-theme.

Pleiotropy

Simulation of multiple traits with pleiotropy.

Input:

Trait model:

  • Mean vector
  • Covariance matrix

Phenotype simulation (can set a different function for pleiotropy):

  • Degree of pleiotropy
  • Heritability vector
  • TraitModel vector (one for pleiotropy, another for individual ones)

Steps required:

  • Create a new TraitModel to simulate traits from a multivariate normal distribution
  • Create a new function to simulate multiple traits
  • Statistical tests for pleiotropy

Move causal state selection to ``sim_trait`` and make ``sim_genetic`` deterministic (and rename)

I think we've made a minor architectural mistake in how we put together the modular components of a simulation which we should fix while we can. Currently, we have

df_trait = sim_trait(ts, num_causal, model, random_seed)
genetic_result = sim_genetic(ts, trait_df, alpha, random_seed)

where df_trait looks like this:

site_id	effect_size	trait_id
3	0.140148	0

and genetic_result has two parts, effect_size which looks like:

site_id	effect_size	trait_id	causal_state	allele_frequency
3	0.140148	0	G	0.2

and .genetic which looks like

trait_id	individual_id	genetic_value
0	0	0.897513
0	1	0.793933
0	2	0.140148
0	3	0.378682
0	4	0.378682

While writing this up, I found it very confusing. Why are we doing it this way? Wouldn't it be much simpler to do this:

df_trait = sim_trait(ts, num_causal, model, alpha, random_seed)

and df_trait looks like

site_id	effect_size	trait_id	causal_state	allele_frequency
3	0.140148	0	G	0.2

and then we do

df_genetic = genetic_value(ts, df_trait)

and returns

trait_id	individual_id	genetic_value
0	0	0.897513
0	1	0.793933
0	2	0.140148
0	3	0.378682
0	4	0.378682

So, we do the choice of causal state and correction of the effect size by allele frequency within sim_trait, and then the genetic_value function simply does all the propagating.

The only reason I can see for doing it in the current way is to avoid going through the trees multiple times, but I feel that this is a premature optimisation and that clarity of the modular API is more important.

Thoughts @GertjanBisschop, @daikitag?

Simulating multiple traits

In #52 we added the sim_trait(ts, num_causal) function which returns a pandas data frame with num_causal rows. It seems to me that we will often want to simulate multiple traits, which perhaps we should try to support natively. I'm guessing (and correct me if I'm wrong!) that people will do things like this:

num_traits = 100
# Choose number of causal sites uniformly for each trait
num_causal = np.random.uniform(10, 100, size=num_traits)
for j in range(num_traits):
    df_trait = tstrait.sim_trait(ts, num_causal[j])
    # df_trait has cols  site_id, causal_state, and effect_size
    X = tstrait.sim_phenotypes(ts, df_trait)
   # Do something with X

A better pattern (and one more suitable for interoperability with things like SLiM etc might be:

num_traits = 100
# Choose number of causal sites uniformly for each trait
num_causal = np.random.uniform(10, 100, size=num_traits)
df_traits = tstrait.sim_traits(ts, num_causal)
# df_traits has cols trait_id,  site_id, causal_state, and effect_size
for trait_id, X in enumerate(tstrait.sim_phenotypes(ts, df_traits)):
    # Do something with X

The gain here is that we can communicate multiple traits at once - otherwise, it might get quite clunky when trying to work with many traits. The sim_phenotypes function returns an iterator over the traits and the generated phenotypes for each trait (much like msprime does for random seeds).

Thoughts?

Issues with the current code

  • Obtaining site ID like sites = np.random.choice([s.id for s in tree.sites()], size=num_sites, replace=False) is very time consuming. I'm currently using rng.choice(range(num_sites), size=num_causal, replace=False), but it won't work when some site IDs are missing
  • Are there any fast ways to obtain individual IDs?
  • Converting genotypic values of nodes to genotypic values of individuals
  • Output of the code. Should the output of the simulation be dataframes, or should it be a Python object and the user can use the code inside the package to obtain dataframes?

Unit test + flake8/black

Steps that need to be taken before merging the Jerome branch to main

  • Write unit test for phenotype.py file
  • Use flake8 and black

Error in rng.choice

Originally here: #93

Traceback (most recent call last):
  File "/home/rkerr/dataplan-2/tbasim/Python/demography_from_tree_example.py", line 280, in <module>
    trait_df = tstrait.sim_trait(ts_new, num_causal=nqtl, model=model, random_seed=1234)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rkerr/.local/lib/python3.11/site-packages/tstrait/simulate_effect_size.py", line 137, in sim_trait
    trait_df = simulator._sim_causal_mutation()
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rkerr/.local/lib/python3.11/site-packages/tstrait/simulate_effect_size.py", line 57, in _sim_causal_mutation
    site_id_array = self._choose_causal_site()
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rkerr/.local/lib/python3.11/site-packages/tstrait/simulate_effect_size.py", line 38, in _choose_causal_site
    site_id = self.rng.choice(
              ^^^^^^^^^^^^^^^^
  File "_generator.pyx", line 847, in numpy.random._generator.Generator.choice
TypeError: expected a sequence of integers or a single integer, got '400.0'

Can you give us a minimum working example here please @richard-Hobart so we can reproduce?

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.