
sourcetracker2's Introduction

SourceTracker2


SourceTracker was originally described in Knights et al., 2011. If you use this package, please cite the original SourceTracker paper linked above pending publication of SourceTracker 2.

API vs. CLI

There are two ways to access the SourceTracker 2 functionality: via the command line (CLI) or the Python API. Users seeking to replicate the functionality of SourceTracker 1 should use the command line interface (sourcetracker2 gibbs). Programmatic users are encouraged to use the API (exposed via gibbs). The help documentation is broken down into sections with separate subsections for API and CLI usage.

File Formats

Command Line

For descriptions of all file formats and options, please see the help documentation, available with the command sourcetracker2 gibbs --help.

This script requires a feature X sample contingency table (traditionally an OTU table) and sample metadata (traditionally a mapping file).

The feature X sample table is an n X m table (n rows, m columns) with feature IDs in the first column and sample IDs in the first row. The values in each cell of the table must be integer counts.

The sample metadata file is an s X k table (s rows, k columns) with sample IDs in the first column and metadata headers in the first row. The values in each cell of the table can be any type of data, and represent information about each sample.

Any feature table that can be read by the biom-format >= 2.1.4 package will be acceptable input. For example file formats please look at the test mapping file and feature (OTU) tables we have included here.
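
Before running the CLI it can help to confirm that both inputs load cleanly. A minimal sketch; otu_table.biom and map.txt are the tiny-test files referenced in the usage examples below, and biom.load_table is the standard biom-format loader:

import biom
import pandas as pd

# Load the feature (OTU) table; any format readable by biom-format >= 2.1.4 works.
table = biom.load_table('otu_table.biom')
print(table.shape)  # (number of features, number of samples)

# Load the sample metadata (mapping file); sample IDs are in the first column.
metadata = pd.read_csv('map.txt', sep='\t', index_col=0)
print(metadata.columns.tolist())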

API

For descriptions of the requirements, please see the documentation in the gibbs function. This function wraps the main workhorse function gibbs_sampler and exposes all the parameters necessary to control the behavior of the Gibbs sampling, as well as the parallel execution.

A superficial but important difference from the CLI framework is that, internally, SourceTracker 2 represents all tables as sample X feature (samples are rows, columns are features). This reflects choices in Dan's original code, and it eases metadata-based subsetting of tables. The API functions expect data in sample X feature format.
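
For programmatic use, the tables are passed as pandas DataFrames in sample X feature orientation. A minimal sketch, assuming gibbs is importable from the top-level sourcetracker package; check its docstring for the exact signature, defaults, and return values:

import pandas as pd
from sourcetracker import gibbs

# Both tables are oriented sample X feature: rows are samples, columns are features.
sources = pd.DataFrame([[10, 5, 0], [0, 7, 3]],
                       index=['soil', 'water'],
                       columns=['otu1', 'otu2', 'otu3'])
sinks = pd.DataFrame([[4, 4, 2]],
                     index=['sink1'],
                     columns=['otu1', 'otu2', 'otu3'])

# The first two outputs are DataFrames of mean mixing proportions and their
# standard deviations (see the Output section below); check the docstring
# for the full list of return values.
results = gibbs(sources, sinks)
print(results[0])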

Preprocessing

Command line

Input feature data should be counts. If non-count data (e.g. the count of feature i in sample j was 4.63) is passed, the data will be rounded to the nearest integer.

Rarefaction is performed by default at 1000 seqs/sample for both sinks and sources. This is done to prevent samples with more counts from dominating the contributions. Rarefaction depth can be set (or disabled entirely) with the --source_rarefaction_depth and --sink_rarefaction_depth parameters. Source samples that are collapsed are rarefied after collapsing.

Samples which are not present in both the input feature table and the metadata are excluded from the analysis.

Samples which come from the same source environment are 'collapsed', meaning their mean value for each feature is computed and used in the analysis. See the 'Theory' section below for a discussion of this approach.
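
The collapsing step is essentially a group-by/mean over the source samples. A rough pandas sketch; the 'Env'-style grouping column and the sample names here are made up for illustration:

import pandas as pd

# Sample X feature counts for four source samples.
sources = pd.DataFrame(
    [[10, 0, 2], [8, 1, 3], [0, 5, 5], [1, 6, 4]],
    index=['soil1', 'soil2', 'gut1', 'gut2'],
    columns=['otu1', 'otu2', 'otu3'])

# Metadata column mapping each sample to its source environment.
env = pd.Series(['soil', 'soil', 'gut', 'gut'],
                index=sources.index, name='Env')

# Collapse samples from the same environment by taking the mean of each feature.
collapsed = sources.groupby(env).mean()
print(collapsed)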

API

The gibbs function does not perform collapsing or rarefaction. Validity checks are performed on the data passed to gibbs. If sources and sinks DataFrames are passed, their columns are checked and must match exactly (same features, in identical order).
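
If your sources and sinks tables were built separately, you may need to align their feature columns yourself before calling gibbs. A minimal sketch; whether missing features should be zero-filled is your call, gibbs itself only requires that the columns match:

import pandas as pd

def align_features(sources: pd.DataFrame, sinks: pd.DataFrame):
    """Return copies of sources/sinks with identical, identically ordered columns.

    Features missing from one table are added as zero-count columns.
    """
    all_features = sorted(set(sources.columns) | set(sinks.columns))
    sources = sources.reindex(columns=all_features, fill_value=0)
    sinks = sinks.reindex(columns=all_features, fill_value=0)
    return sources, sinks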

Output

Command line

There are two default output files, mixing_proportions.txt and mixing_proportions_stds.txt. mixing_proportions.txt is a tab-separated contingency table with sinks as rows and sources as columns. The values in the table are the mean fractional contributions of each source to each sink. mixing_proportions_stds.txt has the same format, but contains the standard deviation of each fractional contribution.
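
For downstream analysis these outputs can be loaded directly. A minimal sketch, assuming the default output file names described above and an example1/ output directory as in the usage examples below:

import pandas as pd

# Sinks are rows; sources (plus 'Unknown') are columns.
mix = pd.read_csv('example1/mixing_proportions.txt', sep='\t', index_col=0)
std = pd.read_csv('example1/mixing_proportions_stds.txt', sep='\t', index_col=0)

# Mean proportion and standard deviation of each source's contribution to one sink.
print(mix.iloc[0])
print(std.iloc[0])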

Optionally, you can create per-sink feature X sample tables with the --per_sink_feature_assignments flag. The per-sink feature tables are labeled with the name of the sink. For example, if we have a sink called 'hand_sample3' the output feature table would be 'hand_sample3.feature_table.txt'. These tables record the origin source of each sink sequence (count of a feature).

API

The outputs of the gibbs function are identical to the command line outputs, just in dataframe form.

Documentation

This script replicates and extends the functionality of Dan Knights's SourceTracker R package.

A major improvement in this version of SourceTracker is the ability to run it in parallel. Currently, parallelization across a single machine is supported for both estimation of source proportions and leave-one-out source class prediction. The speedup from parallelization should be roughly proportional to the number of jobs passed. For instance, passing --jobs 4 will decrease runtime approximately 4X (somewhat less due to overhead). Note that you can only use as many jobs as you have sink samples. For instance, passing 10 jobs with only 5 sink samples will not execute any faster than passing 5 jobs, since there is a limit of one sink per job. Said another way, a single sink sample cannot be split across multiple jobs.

Installation

SourceTracker2 is Python 3 software. The easiest way to install it is using Anaconda. If you don't already have Anaconda installed, you can install it following their install instructions.

To install SourceTracker 2 using Anaconda, run the following commands:

conda create -n st2 -c conda-forge python=3.8 numpy scipy scikit-bio h5py hdf5 seaborn biom-format
conda activate st2
pip install sourcetracker

If you wish to use the latest features in the development version of sourcetracker2 you can install it directly from the github repository.

Note: Only run this if you wish to use the development version

pip install https://github.com/caporaso-lab/sourcetracker2/archive/master.zip

To test that your installation was successful, try the following command:

sourcetracker2 gibbs --help

When you open a new terminal, you'll always need to run:

source activate st2

to activate your SourceTracker 2 environment.

Theory

This document describes some of the basic theory behind SourceTracker 2. For more theory and a visual walkthrough, please see the Jupyter notebook.

There are two main ways to use this script:
(1) Estimating the proportions of different (microbial) sources in a sample of a (microbial) sink.
(2) Using a leave-one-out (LOO) strategy, predicting the metadata class of a given (microbial) sample.

The main functionality, (1), is the estimation of the proportions of sources in a given sink. A source and a sink are both vectors of feature abundances. A source is typically built from multiple samples that come from an environment of interest; a sink is usually a single sample.

As an example, consider the classic balls in urns problem. There are three urns, each of which contains a set of differently colored balls in different proportions. Imagine that you reach into Urn1 and remove u_1 balls without looking at the colors. You reach into Urn2 and remove u_2 balls, again without looking at the colors. Finally, you reach into Urn3 and remove u_3 balls, without looking at the colors. You then mix your individual samples (u_1, u_2, and u_3) and produce one mixed sample whose color counts you survey.

          Urn1   Urn2   Urn3   Sample
Blue         3      6    100       26
Red         55     12     30        9
Green       10      0      1        1
...
Orange      79     18      0       50

Your goal is to recover the numbers u_1, u_2, and u_3 using only the colors of your mixed sample and the proportions of each color in the sources (the urns).

The Gibbs sampler is a method for estimating u_1, u_2, and u_3. In SourceTracker2, the Gibbs sampler would be used to estimate the source proportions (u_1, u_2, and u_3) plus the proportion of balls in the sample that came from an unknown source. In the urn example there is no unknown source; all of the balls came from one of the three urns. In a real set of microbial samples, however, it is common that the source samples assayed are not the only sources of microbes found in the sink sample (air, skin, soil, or other microbial sources that were not included).
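
A toy simulation makes the estimation target concrete. This is a minimal sketch of the urn setup only, not the SourceTracker 2 algorithm; the urn compositions and draw sizes are made up:

import numpy as np

rng = np.random.default_rng(42)

# Color proportions for each urn (rows: urns, columns: colors).
urns = np.array([[0.1, 0.6, 0.3],
                 [0.5, 0.2, 0.3],
                 [0.8, 0.1, 0.1]])

# Hidden numbers of balls drawn from each urn.
u = np.array([20, 50, 30])

# The observed mixed sample: color counts summed over all urns.
sample = sum(rng.multinomial(n, p) for n, p in zip(u, urns))
print(sample)       # observed color counts of the mixed sample
print(u / u.sum())  # the mixing proportions the Gibbs sampler tries to recover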

In practice, researchers often take multiple samples from a given source environment (e.g. to learn the true distribution of features in the source). It is desirable to 'collapse' samples from the same source into one representative source sample. This is mainly for interpretability. Consider the urn example above. In practice we would not know the exact contents of any of the urns. The only way to learn the actual source distributions would be to sample them repeatedly. Combining n samples from urn 1 into a single source would make the estimate of the urn's true proportions of different colors more accurate and would make interpreting the results easier; there would be only 3 source proportions, plus the unknown, to interpret rather than 3n + 1 (assuming n samples from each urn). Please read about (2) below to understand an important limitation of this collapsing process.

A second function of this script is (2), the prediction of the metadata class of a sample based on the feature abundances in all samples and the metadata classes of all samples.

In practice, this function is useful for checking whether the source groupings you have computed are good groupings. As an example, imagine that you are baking bread and want to know where the microbes are coming from in your dough. You think there are three main sources: flour, water, hands. You take 10 samples from each of those environments (10 different bags of flour, 10 samples from water, 10 samples from your hands on different days). For computing source proportions, you would normally collapse each of the 10 samples from the given class into one source (so you'd end up with a 'Flour', 'Water', and 'Hand' source). However, if the flour you use comes from different facilities, it is likely that the samples will have very different microbial compositions. If this is the case, collapsing the flour samples into a single source would be inappropriate, since there are at least two different sources of microbes from the flour. To check the homogeneity of your source classifications, you can use the LOO strategy to make sure that all sources within each class look the same.

Usage examples

These usage examples expect that you are in the directory
sourcetracker2/data/tiny-test/

Calculate the proportion of each source in each sink
sourcetracker2 gibbs -i otu_table.biom -m map.txt -o example1/

Calculate the proportion of each source in each sink using an alternate sample metadata mapping file where samples are described differently.
sourcetracker2 gibbs -i otu_table.biom -m alt-map.txt -o example2/ --source_sink_column source-or-sink --source_column_value src --sink_column_value snk --source_category_column sample-type

Calculate the class label (i.e. 'Env') of each source using a leave-one-out strategy
sourcetracker2 gibbs -i otu_table.biom -m map.txt --loo -o example3/

Calculate the proportion of each source in each sink, using 100 burnins
sourcetracker2 gibbs -i otu_table.biom -m map.txt -o example4/ --burnin 100

Calculate the proportion of each source in each sink, using a sink rarefaction depth of 1700 and source rarefaction depth of 1500
sourcetracker2 gibbs -i otu_table.biom -m map.txt -o example5/ --sink_rarefaction_depth 1700 --source_rarefaction_depth 1500

Calculate the proportion of each source in each sink, parallel with 5 jobs
sourcetracker2 gibbs -i otu_table.biom -m map.txt -o example6/ --jobs 5

Calculate the proportion of each source in each sink, with 5 jobs, and write the per-sink feature tables (what SourceTracker 1 called 'full output'). These are feature X sample tables indicating the origin source of each sequence (each count of a feature).
sourcetracker2 gibbs -i otu_table.biom -m map.txt -o example7/ --jobs 5 --per_sink_feature_assignments

Miscellaneous

The current implementation of SourceTracker 2 does not contain functionality for auto-tuning of the parameters (alpha1, alpha2, etc.).

For auto-tuning functionality, please see the original R code.

sourcetracker2's People

Contributors

adityabandla, andy6a, cameronmartino, cherman2, gregcaporaso, jakereps, johnchase, kant, lkursell, mortonjt, oddant1, wdwvt1


sourcetracker2's Issues

refactor discussion / what I think we need before release

@gregcaporaso @justin212k @lkursell @johnchase @nscott1 @mjk1000

Based on working with the code this weekend to try and get @johnchase and @lkursell a simple API for launching jobs I have come to the conclusion that the code has accumulated some cruft. Cruft = technical debt that will come due at some time, so I think we need to consider refactoring. I am not advocating something major, but a set of suggestions that @lkursell and @johnchase have already essentially asked for.

Below are my thoughts, please let me know what you think

The things I think would help, and that I plan on doing once we modify/come to agreement:

  1. Refactoring all functions other than gibbs_sampler to use pd.DataFrames. A significant amount of code in sourcetracker/sourcetracker.py and sourcetracker/cli/gibbs.py is spent ensuring that the arrays representing source data, sink data, and the mapping file data are in the proper order, etc. DataFrames can help remove much of this code (@lkursell and @johnchase pointed this out). This could include converting input biom tables to a DataFrame, as it would make some operations slightly easier. This isn't necessary, but it's a possibility for a little extra gain.
  2. Rewriting ConditionalProbability to be clearer and more easily extensible. In our efforts to eke out speed improvements I wrote this class so that it would precompute everything that could be precomputed. The speed gains from this are minor based on my current understanding, and it makes understanding what's happening much less straightforward. In addition, as we move to add other layers of probability (e.g. what @lkursell discussed in his last email), we will likely want to add them in this class. Without refactoring, it's going to be hell trying to do this.
  3. Visualization needs to come with a basic call, I think (though perhaps leaving it in the IPython notebook is okay). It will take very little work to create a set of simple visualizations, and this will make people much more likely to switch to this code.

What I need help with

  1. I'd love to get rid of the functools.partial nonsense that we are using to enable passing jobs to an ipyparallel client. I don't think we should need to do this and it would get rid of a fair amount of code, but I don't understand why we need to do this in the first place. If you try and pass a function to the engines that is not wrapped by partial you get errors indicating the local namespace doesn't contain the variables you are trying to pass.

Things I'd like to get a consensus on

  1. Currently we are asking each engine in a cluster to operate on a single sink and write the results to an intermediate file. We did this to allow easy recovery from a failure (if you are running ST2 on 500 samples and it quits at sample 485 you'd be able to save that 97% of the work) but it means more functions and more intermediate text files. Do people think it would be better to just store data in memory and write out a final result when every computation had been completed? @lkursell has suggested that the in-memory approach might be good as he has had no mid-run failures in his entire time using this code.

`gibbs` unintuitively fails if source and sink tables don't exactly match features

hat-tip to @johnchase for pointing this out.

If the input dataframes for _gibbs have different numbers of features (different columns) the function will fail with an IndexError like the following

IndexError: index 11751 is out of bounds for axis 0 with size 11290

_gibbs should check that there are the same number of features in both sources_df and sinks_df. An additional check would be that the identities of those features are the same.
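
A check along these lines could run before any sampling starts. A minimal sketch of the proposed validation, using the sample X feature orientation of the API; this is not existing code:

import pandas as pd

def validate_features(sources_df: pd.DataFrame, sinks_df: pd.DataFrame) -> None:
    """Raise a clear error if the two tables do not share identical feature columns."""
    if sources_df.shape[1] != sinks_df.shape[1]:
        raise ValueError('sources and sinks have different numbers of features: '
                         '%d vs %d' % (sources_df.shape[1], sinks_df.shape[1]))
    if not sources_df.columns.equals(sinks_df.columns):
        raise ValueError('sources and sinks must contain the same features '
                         'in the same order.')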

Unexpected behavior with null values in api call

The private api call _gibbs has unexpected behavior when null values are present in the source table.

For example

import numpy as np
import pandas as pd

sink = pd.DataFrame([[1, 2], [1, 0]], index=['x', 'y'])
source = pd.DataFrame([[1, 2], [np.nan, np.nan]], index=['v', 'z'])

mix, std = _gibbs(source, sink)
mix

    v   z   Unknown
x   1.0 0.0 0.0
y   1.0 0.0 0.0

The simplest and probably easiest solution is to return an error if null values are present in the source or sink tables.
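
Such a guard could look roughly like this (a minimal sketch of the proposed check, not existing code):

import pandas as pd

def check_no_nulls(sources: pd.DataFrame, sinks: pd.DataFrame) -> None:
    """Raise an informative error if any null values are present."""
    for name, df in (('sources', sources), ('sinks', sinks)):
        if df.isnull().values.any():
            bad = df.index[df.isnull().any(axis=1)].tolist()
            raise ValueError('null values found in %s table (samples: %s); '
                             'remove or impute them before running gibbs.'
                             % (name, bad))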

Create SourceTracker2 api

It would be really handy to have a SourceTracker2 api, something that could be run in a notebook or easily integrated into a larger analysis pipeline. I'm thinking it would look something like the following:

def sourcetracker(source_ids, sink_ids, table, std=False, cluster=None, loo=False,
                  alpha1=0.001, alpha2=0.0, beta1=0.0, draws=0.0,
                  burnin=0.0, delay=0.0):
    '''doc string

    Parameters
    ----------
    source_ids : list
        Sample IDs that should be used as the sources; must correspond to IDs in the OTU table.
    sink_ids : list
        Sink IDs...
    table : pd.DataFrame
        Pandas DataFrame where columns are observations and rows are samples.
    cluster : ipyparallel.Client
        User-created parallel client object.

    The rest of the parameters should correspond to the CLI call.

    Returns
    -------
    mixing_proportions : pd.DataFrame
        A pandas DataFrame of the mixing proportions.
    '''
    # a bunch of code...
    return mixing_proportions

The overall workflow could look something like this:

$ ipcluster start -n 4
from sourcetracker2 import sourcetracker
import ipyparallel as ipp
c = ipp.Client()

mix_proportions_df = sourcetracker(source_ids, sink_ids, otu_table, cluster=c)

Adding functionality to SourceTracker 2

One of the best things Dan added in later versions of ST1 is that it could tell you which specific OTUs were characteristic of a particular environment. It would be very useful if we could add this functionality, or add it to the instructions if it is easy to do.

_gibbs fails on dataframes with floats

_gibbs fails with:
IndexError: index 6283 is out of bounds for axis 0 with size 6283
When there are floats in the sink_df. The error occurs when trying to create the order in which to walk through the sequences.

Probably want to add a check so that a less cryptic error message is given to the user.

Create QIIME2 plugin

Create a plugin for Qiime2. Try and use old q2d2 metadata filtering to select sinks and sources.

delay parameter

If the delay parameter in the gibbs_sampler function is set to 1, the function will never take draws (x - y mod 1 always = 0).

This is a legacy from the original SourceTracker code. We need to establish whether there is any reason that a delay of 1 is unacceptable, and rewrite the checker if not.

Add "method 5"

From @lkursell on March 21, 2016 22:24

Although a more descriptive term for how this works would be handy, like 'sequence_density', such that sourcetracker2 seq_density -i table.biom -m mf.txt -o out/ is callable

Copied from original issue: biota/sourcetracker2_internal#27

do we really need gibbs to be a subcommand?

Since it's the only one, would it be more intuitive if we just called sourcetracker2 rather than sourcetracker2 gibbs? Or are we planning to add others? If we do keep it, for users, the command gibbs probably isn't the most useful. Maybe track (or something along those lines) would be more intuitive (i.e., you don't need to know implementation details of the methods to understand the command name)?

Visualization of mixing probabilities

From @lkursell on January 29, 2016 22:23

Want a quick function, or --flag, to pass so that the results can be visualized. Bar chart is likely most helpful. Might have some upper limits on number of sinks/sources for which the figures make sense.

Copied from original issue: biota/sourcetracker2_internal#8

Display verbose progress

From @lkursell on February 12, 2016 22:39

Displaying ST2's progress, like the original does, is very helpful for making sure commands are getting executed correctly, especially given the time required for jobs to finish, and it is a very convenient way to track progress.

Copied from original issue: biota/sourcetracker2_internal#14

ipcluster can remain open in some circumstances

From @wdwvt1 on February 27, 2016 7:40

If a user deletes the output folder that has been created before all the jobs launched by IPcluster have finished, the cluster will fail to shut itself down. This could cause serious resource leaks.

Copied from original issue: biota/sourcetracker2_internal#23

better error message when passed an existing directory

From @gregcaporaso on March 1, 2016 22:32

Existing directories aren't allowed as parameters to -o, but the error message doesn't tell the user that the directory already exists, just that it's a directory.

$ sourcetracker2 gibbs -o sourcetracker
Usage: sourcetracker2 gibbs [OPTIONS]

Error: Invalid value for "-o" / "--output_dir": Path "sourcetracker" is a directory.

@wdwvt1 asked me to help out with this as he hasn't had success sorting it out.

Copied from original issue: biota/sourcetracker2_internal#25

speed improvement with dictionary indexing

The innermost loop of the gibbs function must access different slices of an array millions of times during a single run. These slices are columns, their order of selection is random (each pass of the loop selects a different column), and they may only be selected one at a time.

If we could speed up how quickly numpy returns the slices, we might be able to save some time. As a test of this idea I used the following code to simulate drawing from a table with 100 sinks, 2000 features, and a number of inner-loop passes (slices taken) between 2**5 and 2**20 (32 to ~1,000,000).

The three schemes I tried are: an input array in row-major format (C-order), an input array in column-major format (F-order), and a dictionary with column indices as keys and 1D arrays of the row values as values.
[figure_1: slice-access time for the C-order, F-order, and dictionary indexing schemes]

The results show that the dictionary indexing method takes approximately 90% of the time of the F-order method.

_dict[:, 0] / forder[:, 0]
array([ 0.84039466,  1.01139224,  0.92655346,  0.88919508,  0.83832869,
        0.86242302,  0.87799022,  0.92113187,  0.84468985,  0.89854044,
        0.90343308,  0.89311444,  0.89205119,  0.89685498,  0.9020272 ,
        0.90483933])

This approach seems promising to me. Pending new profiling results we should see if rewriting the internals to save 10% on indexing is worth it.
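
The benchmark can be reproduced roughly as follows (a minimal standalone sketch; the shapes and draw counts are placeholders, not the exact script used above):

import time
import numpy as np

n_rows, n_cols, n_draws = 100, 2000, 2**16
rng = np.random.default_rng(0)

c_order = np.ascontiguousarray(rng.random((n_rows, n_cols)))
f_order = np.asfortranarray(c_order)
as_dict = {j: np.ascontiguousarray(c_order[:, j]) for j in range(n_cols)}
cols = rng.integers(0, n_cols, size=n_draws)

def bench(getter):
    # Time random single-column access, mimicking the inner loop's access pattern.
    start = time.perf_counter()
    for j in cols:
        getter(j)
    return time.perf_counter() - start

print('C-order   :', bench(lambda j: c_order[:, j]))
print('F-order   :', bench(lambda j: f_order[:, j]))
print('dict index:', bench(lambda j: as_dict[j]))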

install fails on linux 64

A QIIME forum user reported an error installing because conda couldn't find scikitbio=0.4.3. Conda auto-suggested scikit-bio instead, which I think should work, but need to investigate this failure.

@gregcaporaso - is this a problem with our install documentation or does that user need to just add e.g. the bioconda channel?

add functions for comparing sourcetracker results

The upcoming SourceTracker 2 comparisons are going to require a function (e.g., compare_sourcetracker_results) that takes an observed_composition and expected_composition of a sample, and returns a single value indicating their similarity. This function should also take a metric parameter which parameterizes how similarity is computed. This is similar in nature to the scikit-bio beta_diversity driver function.

Some of the metrics that we'd want to be able to use are:

  • spearman correlation
  • pearson correlation
  • F-measure (this would be a qualitative metric, where true positives mean that an observed source is an actual source, ...)

I envision observed_composition and expected_composition being dicts mapping source environment type to proportion.

This function would create a pairwise similarity, so we'd also want to be able to summarize many pairwise similarities (e.g., for multiple different samples). We could start with np.median for this, and get more fancy when we need to.
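
One possible shape for such a function (a minimal sketch of the proposal; compare_sourcetracker_results is not an existing API, and the F-measure variant is omitted here):

import numpy as np
from scipy.stats import pearsonr, spearmanr

def compare_sourcetracker_results(observed_composition, expected_composition,
                                  metric='spearman'):
    """Compare observed vs. expected source proportions for one sample.

    Both inputs are dicts mapping source environment type to proportion.
    """
    envs = sorted(set(observed_composition) | set(expected_composition))
    obs = np.array([observed_composition.get(e, 0.0) for e in envs])
    exp = np.array([expected_composition.get(e, 0.0) for e in envs])
    if metric == 'spearman':
        return spearmanr(obs, exp).correlation
    elif metric == 'pearson':
        return pearsonr(obs, exp)[0]
    raise ValueError('unknown metric: %s' % metric)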

@wdwvt1, you had some other ideas about metrics that would be useful for this. Could you add those here?

I could add this pretty quickly if this sounds like the right plan. It's kind of holding me up on an internal biota project right now. We could start with this being a private internal API and explore making it public another time.

cc @johnchase @wdwvt1 @lkursell

More collapse methods for source samples

From @lkursell on March 24, 2016 17:47

Currently, ST2 adds together sample OTU counts for samples that belong to the same source. This can cause overestimation of contributions from Source environments that have more samples.

One solution would be to be able to pass a --collapse_sources and --collapse_sinks flag with a --collapse_method such as mean or sum. Currently this can be done with the collapse_samples.py script, although that necessitates changes in mapping files and sample names.

Copied from original issue: biota/sourcetracker2_internal#28

documentation errors

Accumulate documentation errors here until enough to justify a PR.

_cli/gibbs.py: line 109 missing a closing parenthesis.

improve script testing approach

In #36 I added the README commands to the travis tests by hard-coding them in .travis.yml. We should come up with a better approach for that, possibly using click's testing framework. It is important that these are tested though, as that's what users will interact with first.

initial (2.0.1) alpha release

Address comments in #38 before building this release (that was accidentally merged before the comments were addressed, but the comments were minor).

Full output

Include the full output of the average count of each bug from each source in the sink sample

"Python not installed as framework"

When I created a python 3 environment with anaconda and tried to import from sourcetracker.sourcetracker, I got:

**RuntimeError**: Python is not installed as a framework. The Mac OS X backend will not be able to function correctly if Python is not installed as a framework. See the Python documentation for more information on installing Python as a framework on Mac OS X. Please either reinstall Python as a framework, or try one of the other backends.

This SO post had a solution: adding backend: TkAgg to ~/.matplotlib/matplotlibrc.

Curious if others run into this problem. If so, we should update the documentation.

possible bug

From @wdwvt1 on January 30, 2016 0:15

Unsure if this is a bug - but if the number of unknown sequences reaches 0 and alpha2 is set to 0, then lines 370-372 of sourcetracker.py might raise a zero division error.

We have not verified this in tests (can't get the unknown that low) but it deserves investigation.

Copied from original issue: biota/sourcetracker2_internal#9

resolve different default parameters

We have three sets of default parameters floating around. The CLI defaults for ST2, the CLI defaults for ST1, and another set that are introduced in the API documentation here (the "API doc defaults").

The "API doc defaults" are orders of magnitude faster. In our upcoming parameter evaluation, we should decide on which of these we want to use under which circumstances (e.g., is there a fast setting and a more accurate setting?), and more clearly document that. We should also review the findings presented in Henry et al as that is relevant here.

MemoryError with rarefaction

In my use case, I'm dealing with metabolite counts on the order of 10^9 counts for a single molecule in a single sample. When I try running sourcetracker2, I'm getting an out-of-memory error, as follows.

Traceback (most recent call last):
  File "/home/mortonjt/miniconda/envs/st2/bin/sourcetracker2", line 11, in <module>
    sys.exit(cli())
  File "/home/mortonjt/miniconda/envs/st2/lib/python3.5/site-packages/click/core.py", line 716, in __call__
    return self.main(*args, **kwargs)
  File "/home/mortonjt/miniconda/envs/st2/lib/python3.5/site-packages/click/core.py", line 696, in main
    rv = self.invoke(ctx)
  File "/home/mortonjt/miniconda/envs/st2/lib/python3.5/site-packages/click/core.py", line 1060, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/mortonjt/miniconda/envs/st2/lib/python3.5/site-packages/click/core.py", line 889, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/mortonjt/miniconda/envs/st2/lib/python3.5/site-packages/click/core.py", line 534, in invoke
    return callback(*args, **kwargs)
  File "/home/mortonjt/miniconda/envs/st2/lib/python3.5/site-packages/sourcetracker/_cli/gibbs.py", line 182, in gibbs
    sink_rarefaction_depth)
  File "/home/mortonjt/miniconda/envs/st2/lib/python3.5/site-packages/sourcetracker/_sourcetracker.py", line 184, in subsample_sources_sinks
    replace=False)
  File "/home/mortonjt/miniconda/envs/st2/lib/python3.5/site-packages/skbio/stats/_subsample.py", line 255, in subsample_counts
    counts_sum)
  File "skbio/stats/__subsample.pyx", line 22, in skbio.stats.__subsample._subsample_counts_without_replacement (skbio/stats/__subsample.c:1394)
MemoryError

Now, when there are this many counts, it is probably good enough to sample with replacement rather than without replacement. The easiest fix would be to add an argument to allow replace=True here and here

It's also worthwhile re-evaluating the rarefaction benchmarks for SourceTracker. It's possible that removing subsampling completely could actually improve the accuracy.
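
Subsampling with replacement sidesteps the memory blow-up because it never materializes the expanded count vector (a minimal numpy sketch of the proposed workaround, not the current skbio call):

import numpy as np

def subsample_with_replacement(counts, depth, rng=None):
    """Rarefy a 1D vector of counts to `depth` by sampling with replacement."""
    rng = np.random.default_rng() if rng is None else rng
    counts = np.asarray(counts, dtype=float)
    probs = counts / counts.sum()
    return rng.multinomial(depth, probs)

# Works even when individual counts are ~1e9.
huge = np.array([3e9, 1e9, 5e8, 2e7])
print(subsample_with_replacement(huge, depth=1000))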

update readme

Multiple issues here:

  • The urn example needs to be transposed.
  • A better description of the algorithm (this can happen in an ipynb, the current one is not good enough).
  • A discussion of runtimes and including how burnins/restarts etc work so people understand what parameters control the overall runtime.

Investigate memoization for ConditionalProbability.calculate_cp_slice

From @wdwvt1 on January 5, 2016 16:59

@justin212k had an excellent suggestion to use memoization for storing calls in some of the inner loops. Based on discussion with @justin212k and @gregcaporaso, we think that memoization should occur on the function ConditionalProbability.calculate_cp_slice, since this function is deterministic and merely sets the probabilities that np.random.choice draws from.

We can also use the improved lru_cache function in python3 to inspect the number of hits to the cache which will give us a good estimate of what cache size to use, how much we can save with the cache, etc.

Things to note when we implement this:

  1. No alteration to the test code should be necessary since this isn't changing the number of calls to the PRNG.
  2. The documentation here suggests using a cache size of 2^N - I imagine to avoid resizing or allocating memory problems.

The general cost of the memoization approach for N iterations of the innermost loop will be (this is my guess, please correct me if I am wrong):

  • N lookups at O(1), plus growth of the dictionary up to its maximum size
  • Assignment to the dictionary N - x times, where x is the number of times the result is already in the cache
  • Overhead associated with keeping the cache

The savings will be:

  • Calculations of the full joint probability vector x times.
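
In Python 3 this could be as simple as wrapping the call in functools.lru_cache and inspecting the hit rate afterwards (a minimal sketch on a stand-in function, since the real calculate_cp_slice takes array arguments that would first need hashable keys):

from functools import lru_cache

# Stand-in for ConditionalProbability.calculate_cp_slice: deterministic, so its
# results can be memoized. Arguments must be hashable (e.g. tuples, not arrays).
@lru_cache(maxsize=2**14)
def cp_slice(t, unknown_at_t, unknown_sum, envcounts):
    return tuple((c + unknown_at_t) / (unknown_sum + 1.0) for c in envcounts)

cp_slice(3, 1, 10, (5, 2, 7))
cp_slice(3, 1, 10, (5, 2, 7))  # second call with identical args hits the cache
print(cp_slice.cache_info())   # reports hits/misses to help choose a cache size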

Copied from original issue: biota/sourcetracker2_internal#2

42% runtime reduction by replacing np.random.choice

ST2 is slower than ST1 when fewer than ~2500 features are present in the table. Part of this comes from the unreasonably slow np.random.choice. Fully 55% of the time is spent rescaling the joint probability values and choosing a new index as revealed by this line profiler run (line 569).

Function: gibbs_sampler at line 436

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
 #ommitted
   481                                               """
   482                                               # Basic bookkeeping information we will use throughout the function.
   483         6            7      1.2      0.0      num_sources = cp.V
   484         6            3      0.5      0.0      num_features = cp.tau
   485         6           24      4.0      0.0      source_indices = np.arange(num_sources)
   486         6           22      3.7      0.0      sink = sink.astype(np.int32)
   487         6           43      7.2      0.0      sink_sum = sink.sum()
   488                                           
   489                                               # Calculate the number of passes that need to be conducted.
   490         6            6      1.0      0.0      total_draws = restarts * draws_per_restart
   491         6            6      1.0      0.0      total_passes = burnin + (draws_per_restart - 1) * delay + 1
   492                                           
   493                                               # Results containers.
   494         6           25      4.2      0.0      final_envcounts = np.zeros((total_draws, num_sources), dtype=np.int32)
   495         6          492     82.0      0.0      final_env_assignments = np.zeros((total_draws, sink_sum), dtype=np.int32)
   496         6           83     13.8      0.0      final_taxon_assignments = np.zeros((total_draws, sink_sum), dtype=np.int32)
   497                                           
   498                                               # Sequences from the sink will be randomly assigned a source environment
   499                                               # and then reassigned based on an increasingly accurate set of
   500                                               # probabilities. The order in which the sequences are selected for
   501                                               # reassignment must be random to avoid a systematic bias where the
   502                                               # sequences occuring later in the taxon_sequence book-keeping vector
   503                                               # receive more accurate reassignments by virtue of more updates to the
   504                                               # probability model. 'order' will be shuffled each pass, but can be
   505                                               # instantiated here to avoid unnecessary duplication.
   506         6          202     33.7      0.0      order = np.arange(sink_sum, dtype=np.int32)
   507                                           
   508                                               # Create a bookkeeping vector that keeps track of each sequence in the
   509                                               # sink. Each one will be randomly assigned an environment, and then
   510                                               # reassigned based on the increasinly accurate distribution. sink[i] i's
   511                                               # will be placed in the `taxon_sequence` vector to allow each individual
   512                                               # count to be removed and reassigned.
   513         6          887    147.8      0.0      taxon_sequence = np.repeat(np.arange(num_features), sink).astype(np.int32)
   514                                           
   515                                               # Update the conditional probability class now that we have the sink sum.
   516         6           21      3.5      0.0      cp.set_n(sink_sum)
   517         6          211     35.2      0.0      cp.precalculate()
   518                                           
   519                                               # Several bookkeeping variables that are used within the for loops.
   520         6            3      0.5      0.0      drawcount = 0
   521         6            6      1.0      0.0      unknown_idx = num_sources - 1
   522                                           
   523        36           47      1.3      0.0      for restart in range(restarts):
   524                                                   # Generate random source assignments for each sequence in the sink
   525                                                   # using a uniform distribution.
   526                                                   seq_env_assignments, envcounts = \
   527        30         7735    257.8      0.0              generate_environment_assignments(sink_sum, num_sources)
   528                                           
   529                                                   # Initially, the count of each taxon in the 'unknown' source should be
   530                                                   # 0.
   531        30          160      5.3      0.0          unknown_vector = np.zeros(num_features, dtype=np.int32)
   532        30           20      0.7      0.0          unknown_sum = 0
   533                                           
   534                                                   # If a sequence's random environmental assignment is to the 'unknown'
   535                                                   # environment we alter the training data to include those sequences
   536                                                   # in the 'unknown' source.
   537    797765       657500      0.8      0.1          for e, t in zip(seq_env_assignments, taxon_sequence):
   538    797735       629758      0.8      0.1              if e == unknown_idx:
   539    199601       514084      2.6      0.1                  unknown_vector[t] += 1
   540    199601       149597      0.7      0.0                  unknown_sum += 1
   541                                           
   542       660          762      1.2      0.0          for rep in range(1, total_passes + 1):
   543                                                       # Iterate through sequences in a random order so that no
   544                                                       # systematic bias is introduced based on position in the taxon
   545                                                       # vector (i.e. taxa appearing at the end of the vector getting
   546                                                       # better estimates of the probability).
   547       630       507358    805.3      0.1              np.random.shuffle(order)
   548                                           
   549  16753065     14481942      0.9      1.7              for seq_index in order:
   550  16752435     15050110      0.9      1.8                  e = seq_env_assignments[seq_index]
   551  16752435     14324415      0.9      1.7                  t = taxon_sequence[seq_index]
   552                                           
   553                                                           # Remove the ith sequence and update the probability
   554                                                           # associated with that environment.
   555  16752435     20047543      1.2      2.4                  envcounts[e] -= 1
   556  16752435     13817773      0.8      1.6                  if e == unknown_idx:
   557   6568698     20939355      3.2      2.5                      unknown_vector[t] -= 1
   558   6568698      5282662      0.8      0.6                      unknown_sum -= 1
   559                                           
   560                                                           # Calculate the new joint probability vector based on the
   561                                                           # removal of the ith sequence. Scale this probability vector
   562                                                           # for use by np.random.choice.
   563  16752435     16877702      1.0      2.0                  jp = cp.calculate_cp_slice(t, unknown_vector[t], unknown_sum,
   564  16752435    162371578      9.7     19.4                                             envcounts)
   565                                           
   566                                                           # Reassign the sequence to a new source environment and
   567                                                           # update counts for each environment and the unknown source
   568                                                           # if necessary.
   569  16752435    466547414     27.8     55.6                  new_e_idx = np.random.choice(source_indices, p=jp / jp.sum())
   570                                                           #new_e_idx = np.random.multinomial(n=1, pvals=jp/jp.sum()).argmax()
   571                                           
   572                                           
   573  16752435     19320519      1.2      2.3                  seq_env_assignments[seq_index] = new_e_idx
   574  16752435     23089569      1.4      2.8                  envcounts[new_e_idx] += 1
   575                                           
   576  16752435     14819348      0.9      1.8                  if new_e_idx == unknown_idx:
   577   6706687     23665932      3.5      2.8                      unknown_vector[t] += 1
   578   6706687      5595877      0.8      0.7                      unknown_sum += 1
   579                                           
   580       630          560      0.9      0.0              if rep > burnin and ((rep - (burnin + 1)) % delay) == 0:
   581                                                           # Update envcounts array with the assigned envs.
   582        30           75      2.5      0.0                  final_envcounts[drawcount] = envcounts
   583                                           
   584                                                           # Assign vectors necessary for feature table reconstruction.
   585        30         2088     69.6      0.0                  final_env_assignments[drawcount] = seq_env_assignments
   586        30         1653     55.1      0.0                  final_taxon_assignments[drawcount] = taxon_sequence
   587                                           
   588                                                           # We've made a draw, update this index so that the next
   589                                                           # iteration will be placed in the correct index of results.
   590        30           28      0.9      0.0                  drawcount += 1
   591                                           
   592         6            5      0.8      0.0      return (final_envcounts, final_env_assignments, final_taxon_assignments)

If I use the commented-out multinomial code (line 570) as suggested here, I reduce the runtime by 42% (487s vs 838s). That's a substantial improvement from a single line of code. The only thing stopping this from going in is that np.random.choice makes a different number of calls to the PRNG than np.random.multinomial, so the seeded tests all fail.

The question is: if we are going to remove the seeded tests and recalculate them, is it worth it to also implement a faster algorithm (e.g. the alias method that @wasade suggested)? There is a call to implement the alias method in numpy here, but no plan.
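
The proposed swap is a one-line change at the sampling step; a quick standalone comparison of the two draws looks roughly like this (a minimal sketch, not the profiled SourceTracker 2 code):

import time
import numpy as np

n_sources, n_iters = 10, 50_000
source_indices = np.arange(n_sources)
jp = np.random.random(n_sources)

start = time.perf_counter()
for _ in range(n_iters):
    np.random.choice(source_indices, p=jp / jp.sum())
print('np.random.choice     :', time.perf_counter() - start)

start = time.perf_counter()
for _ in range(n_iters):
    np.random.multinomial(n=1, pvals=jp / jp.sum()).argmax()
print('multinomial + argmax :', time.perf_counter() - start)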

refactor `gibbs_sampler`

For simulations to determine convergence rates of gibbs_sampler, I've been setting burnins=0, delay=1, draws_per_restart=X where X is the number of passes I want to make. I think this should be the default way we call gibbs_sampler, by just giving it a max number of passes. So, it'd look like:

def gibbs_sampler(cp, max_passes):

Then, an API user can take the returned data and use any parameters they like for the delay, burnins, etc. This removes draws/restart as a parameter, but that could be replicated by just repeating the call.

add Jensen-Shannon calculator for sources

Users will want to know the Jensen-Shannon divergence of their sources. The manuscript will allow users to relate the JSD between the sources and the accuracy/precision of the mixing proportions they get back. This is not a novel idea, based on the work in Dan's original paper.

centralize metadata parsing

It would be good to have a parse_sample_metadata function, and then use that instead of calls like sample_metadata = pd.read_table(mapping_fp, index_col=0). This call will be problematic in some cases that we've run into (here's how it should be done), and we want to be able to fix that in one place rather than have the file be potentially parsed differently in different parts of the code base.
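
A centralized parser could look roughly like this (a minimal sketch of the proposed parse_sample_metadata helper; reading all values as strings is one way to avoid dtype surprises, but the exact requirements would need to be agreed on):

import pandas as pd

def parse_sample_metadata(mapping_fp):
    """Parse a tab-separated sample metadata (mapping) file.

    Sample IDs are taken from the first column; all values are read as strings
    so downstream code sees consistent types.
    """
    metadata = pd.read_csv(mapping_fp, sep='\t', index_col=0, dtype=str)
    metadata.index = metadata.index.astype(str)
    return metadata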


test addition

@gregcaporaso and I pair-reviewed and decided there are a couple of key places where test coverage should be improved.

  • Addition of tests of the entire package against SourceTracker 1. We can do this simply by building several tables, running with ST1, and comparing the distribution of proportions to the distribution of proportions we'd get from ST2. This will be a stochastic test, and we will have to set our acceptable error rate (alpha=.05) where tests will fail but no significant change has taken place. Something like the following.
import numpy as np
from scipy.stats import ttest_1samp

st1_ms = np.array([[.3, .4, .3], [.2, .3, .5]])
st2_ms = gibbs(source, sinks)

obs_t, obs_p = ttest_1samp(st2_ms, st1_ms.mean(0))
p_threshold = .05
self.assertTrue(obs_p > p_threshold)
  • Addition of tests for gibbs and gibbs_loo (these are the largest sources of untested code in the entire package). We'll have to make a hand rolled example that we run through from start to finish.
  • Consistency tests for gibbs and gibbs_loo - these will just test that PRNG-seeded examples didn't change when a code change happens. These do not test accuracy, just consistency.

memory leak

There is a memory leak that has become apparent in long-running simulations. The memory usage of python when running _gibbs steadily increases despite there being no clear reason for doing so (the memory is preallocated for results storage in gibbs_sampler).

I believe the error has to do with one of the following:

  1. numpy array copies that occur and are not garbage collected link1, link2.
  2. pandas dataframes not being correctly garbage collected (might be the same bug as 1) link1, link2, link3.
  3. Memory leak when ipyparallel is running link1.

Based on the threads I have read (those linked above) I am guessing that either a bunch of array copies are occurring that are not getting collected, or there is some interaction between the cluster and multiple sinks.

Small doc changes

From @lkursell on February 22, 2016 22:50

Would be nice to put in the default values for all the params. None are listed currently. Also, update the doc / install instructions to be specific to the QIIME AWS AMI.

Install Anaconda:
$ bash ~/Anaconda3-2.5.0-MacOSX-x86_64.sh
(run this from wherever you downloaded the Anaconda .sh file)

Create the Python 3 env:
$ conda create -n py3 python=3 numpy scipy h5py hdf5

Get SourceTracker2:
$ git clone https://github.com/biota/sourcetracker2.git

Make sure to put the SourceTracker2 directory in a user-specific directory, rather than a root directory, or it'll cause problems!

Install SourceTracker2 (from inside the sourcetracker2 directory):
$ python setup.py install

Installed successfully!

Copied from original issue: biota/sourcetracker2_internal#21
