binpro / concoct
Clustering cONtigs with COverage and ComposiTion
License: Other
Responsibilities are equally important for the short contigs that are not used for building the cluster model.
Change it from 100 to 1000; that is more suitable for big datasets.
Output the progress of the clustering, either to a file or to stdout, and either by default or as an option.
The length column needs to be removed with an additional call to cut; otherwise the concoct run uses 'length' as an additional sample, which corrupts the results.
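If pandas is available, the same cleanup can be scripted; a minimal sketch, assuming the column is literally named 'length', the table is tab-separated, and the file names shown here are placeholders:

```python
import pandas as pd

# Assumed file names and tab separator; adjust to the actual coverage table.
cov = pd.read_csv("coverage_table.tsv", sep="\t", index_col=0)
cov = cov.drop(columns=["length"])  # keep only the per-sample coverage columns
cov.to_csv("coverage_table_nolength.tsv", sep="\t")
```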
Something in one of the recent revisions has stopped the GMM clustering from working correctly. It might have been the introduction of scaling. I have reverted to using the last version from July. In the future, any revisions should be tested on an example data set for which we know the correct output.
Feature that might be more efficient.
@chrisquince needs the PCA transformation matrix in order to back-transform the result. We need to look into the PCA options scikit-learn offers.
I need the PCA-transformed variances output in addition to the PCA-transformed means, to plot the cluster ellipses. We should probably put all the variances and means in separate output directories to prevent cluttering the working directory.
For big datasets the output file 'original_data_gt.csv' can become very large, especially if a large k is used for the compositional data. I think it would be useful to have a parameter to disable this output.
The baseline is that CONCOCT should not output a clustering where the algorithm did not converge.
We need CONCOCT to output the PCA transform so that we can transform cluster means etc. back out of the PCA space. Alternatively, we could output PCA-transformed means and variances.
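A minimal sketch of what this could look like with scikit-learn, where the transform is fully described by pca.components_ and pca.mean_ (the random data is just a stand-in for the joined profile matrix):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(11)
data = rng.rand(100, 20)              # stand-in for the joined profile matrix

pca = PCA(n_components=3).fit(data)
Z = pca.transform(data)               # data in PCA space

# components_ (shape: n_components x n_features) and mean_ are exactly what
# CONCOCT would need to write out to make back-transformation possible.
W, mu = pca.components_, pca.mean_
back = pca.inverse_transform(Z)       # same as Z.dot(W) + mu
assert np.allclose(back, Z.dot(W) + mu)
```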
When running with MPI, only the master thread logs to the log file. There are some ways to fix this, though I don't think all processes can log to the same file, since they can be on different nodes. Some ideas:
https://groups.google.com/forum/#!topic/mpi4py/SaNzc8bdj6U
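One possible pattern, sketched with mpi4py (assuming that is the MPI binding in use): give each rank its own log file, which sidesteps concurrent writes across nodes entirely, at the cost of collating the files afterwards:

```python
import logging
from mpi4py import MPI

rank = MPI.COMM_WORLD.Get_rank()

# One file per rank: safe even when ranks run on different nodes.
logging.basicConfig(
    filename="concoct_rank%d.log" % rank,
    level=logging.INFO,
    format="%(asctime)s rank-" + str(rank) + " %(message)s",
)
logging.info("clustering started")
```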
Where should we put it? This does not include implementing the C-Python binding, just moving the actual repo to a suitable location.
So I think I have found another minor bug. I am running CONCOCT for different sample numbers on the mock, mostly with default parameters. Usually the number of contigs with more than 1000 kmers is 11,002, but on some data sets it finds 11,004. I have looked at the two added contigs and they have a length of 867 bp, so they should not pass the filtering step. Can you check that the size filtering is being done correctly?
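To narrow it down, the filtering can be reproduced outside CONCOCT; a sketch using Biopython, where the threshold and the >= convention are assumptions to compare against:

```python
from Bio import SeqIO

MIN_LENGTH = 1000  # assumed filter threshold
kept = [rec.id for rec in SeqIO.parse("composition.fa", "fasta")
        if len(rec.seq) >= MIN_LENGTH]
print("%d contigs pass the length filter" % len(kept))
```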
Error: Unsatisfiable package specifications
Hint: the following combinations of packages create a conflict with the
remaining packages:
I suggest we keep the scripts used for evaluation separate from the rest of the code, since this could otherwise become very confusing. These scripts could go either into the tests folder or into a subdirectory of the scripts dir.
The help message and the requirements need a touch up.
The call to the C module never finishes if only one contig is left after length filtering. A solution is to not allow clustering when only one contig remains after filtering.
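A sketch of such a guard (the function name and message are illustrative):

```python
def require_enough_contigs(contigs, min_contigs=2):
    # Refuse to cluster when fewer than two contigs survive the length
    # filtering, instead of handing a degenerate input to the C module.
    if len(contigs) < min_contigs:
        raise SystemExit(
            "Only %d contig(s) left after length filtering; at least %d "
            "are required for clustering." % (len(contigs), min_contigs))
```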
The CONCOCT script should live in the scripts folder and be named concoct
To me it feels more linux-y to call the program so:
concoct coverage composition
rather than
CONCOCT coverage composition
Also, the CONCOCT file is technically not part of the concoct Python module (the module is what lives in the folder CONCOCT/concoct) but a script that executes functions from the concoct module. Therefore I feel it should be moved to the scripts folder.
Please comment on this if you have objections. Otherwise I will make these changes tomorrow.
The calculation of coverage is incorrect because the contig_lengths vector used is wrong.
Currently, all cluster numbers are tried out in parallel and the best one is redone. Instead, the clustering/GMM that corresponds to the minimal BIC value should be kept through the parallel step.
This issue is worst for large datasets, where the number of cluster numbers to fit is about the same as the number of processors used.
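A minimal sketch of keeping the fitted model from the parallel step rather than refitting the winner; joblib and the current scikit-learn GaussianMixture API are assumptions about the implementation, and the random data is a stand-in:

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.mixture import GaussianMixture

X = np.random.RandomState(11).rand(500, 5)   # stand-in for the transformed data

def fit_one(k):
    gmm = GaussianMixture(n_components=k, random_state=11).fit(X)
    return gmm.bic(X), k, gmm                # return the model, not just the score

# Fit every cluster number in parallel and keep the model with minimal BIC,
# so the winning clustering never has to be redone.
results = Parallel(n_jobs=-1)(delayed(fit_one)(k) for k in range(2, 11))
best_bic, best_k, best_gmm = min(results, key=lambda r: r[0])
```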
It is important that the output of CONCOCT is predictable, in the sense that running with the same parameters on the same data gives the same results. From testing, that does not seem to be the case. I am a bit confused, because looking at the code I think it should work: we are setting the same numpy random seed (11) every time we run it, and sklearn should be calling the numpy random number generators. Perhaps we could ask about this on a sklearn forum?
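One thing worth checking: some scikit-learn estimators draw from their own RandomState rather than the global numpy one, so passing random_state explicitly is the robust fix. A small sketch against the current GaussianMixture API:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.RandomState(0).rand(200, 4)

# An explicit random_state makes each run reproducible regardless of any
# global numpy seed set (or not set) elsewhere in the program.
a = GaussianMixture(n_components=3, random_state=11).fit(X).predict(X)
b = GaussianMixture(n_components=3, random_state=11).fit(X).predict(X)
assert (a == b).all()
```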
The tests using reference result files will become outdated when we switch the clustering implementation. It is a lot more flexible to use some clustering statistic like recall/precision instead. This should be complemented with other tests, like making sure all contigs are in the result files, and so on.
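For the statistics part, a label-invariant score such as scikit-learn's adjusted Rand index would survive a change of implementation; the metric choice here is only a suggestion:

```python
from sklearn.metrics import adjusted_rand_score

reference = [0, 0, 1, 1, 2, 2]   # known clustering of the test contigs
predicted = [1, 1, 0, 0, 2, 2]   # same partition, different label names

# 1.0 means the partitions are identical regardless of label numbering,
# so the test does not depend on one specific clustering implementation.
assert adjusted_rand_score(reference, predicted) == 1.0
```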
Was this feature tested? My understanding is that as we increase the value beyond the 90% default, more dimensions should remain after the transformation. However, that is not what I see; rather, adjusting this value always results in just one dimension remaining after PCA.
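For reference, with current scikit-learn a float n_components in (0, 1) keeps the smallest number of components explaining at least that fraction of the variance, so a quick sanity check of the expected behaviour looks like this:

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.RandomState(11).rand(200, 20)

# Raising the retained-variance fraction should never reduce the number of
# components kept; if it does, something upstream is overriding the value.
d90 = PCA(n_components=0.90).fit(X).n_components_
d99 = PCA(n_components=0.99).fit(X).n_components_
assert d99 >= d90
```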
Check that all dependencies are listed in setup.py so that they are installed if they are not already present.
Currently, convergence is printed to stderr. Should be logged as well.
Currently the linkage generation does not take ambiguous links into account. From the test cases it would seem that bowtie2 does not generate these alignments by default unless the -a parameter is specified. Depending on the alignment parameters and sequence aligner used, this could be an issue.
Scaling has to be done on the data.
How should the covariances be reverse-transformed? The background is that I've compared the output from CONCOCT run on my laptop and on the UPPMAX cluster. There seems to be a problem with making the PCA deterministic, which seems to be a known problem in sklearn:
"Due to implementation subtleties of the Singular Value Decomposition (SVD), which is used in this implementation, running fit twice on the same matrix can lead to principal components with signs flipped (change in direction). For this reason, it is important to always use the same estimator object to transform data in a consistent fashion." / http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
This is happening for me: I get different PCA-transformed data on the different setups. The result does not change much, the clusterings and the BIC scores are the same, and it's only a sign flip on some of the columns in the PCA-transformed data. My suggestion is just to accept this.
However, while the reverse-transformed cluster means are the same on both platforms, the reverse-transformed covariances are very different. This leads me to suspect we're not doing the reverse transformation correctly. Would it make sense to take the square root of the covariances before reverse-transforming them, since it could be an issue with wrong dimensionality?
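For what it's worth, since the inverse transform is linear (x = z · W + mean, with W = pca.components_), a covariance should come back via the congruence W.T · C · W rather than any element-wise square root. A sketch on stand-in data:

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.RandomState(11).rand(300, 8)
pca = PCA(n_components=3).fit(X)
W = pca.components_                      # shape (3, 8)

C_pca = np.cov(pca.transform(X).T)       # a covariance estimated in PCA space

# cov(z.dot(W) + mean) = W.T.dot(cov(z)).dot(W); the back-transformed
# covariance is rank-deficient, since the discarded components carry none.
C_orig = W.T.dot(C_pca).dot(W)
```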
I am just testing this, but is the CONCOCT clustering deterministic, in the sense that running with the same input data and parameters always produces the same output? If not, we probably need to seed sklearn with a default random seed. A user-defined seed should then be added as a command line option.
There seems to be some confusion about which format to use for the coverage. If a BAM file exists, the script that @inodb created should be used to generate a proper matrix that the CONCOCT script can read. Which format will this be? Log coverage? Coverage? Read counts? At the moment, mean coverage is assumed by CONCOCT.
Not sure why pysam is so slow; I have to investigate further. Either reimplement it in awk or find some way to speed up the parsing of BAM files in Python.
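Before rewriting in awk, it may be worth checking whether the counts can come from the BAM index instead of an alignment scan; a sketch against pysam's current API (the file name is a placeholder, and this gives mapped read counts rather than per-base coverage):

```python
import pysam

# get_index_statistics() reads per-contig mapped read counts straight from
# the .bai index, avoiding a pass over every alignment.
with pysam.AlignmentFile("sample.bam", "rb") as bam:
    for stat in bam.get_index_statistics():
        print(stat.contig, stat.mapped)
```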
Some of our CSV output is actually space-separated, e.g. the cluster means.
The dimension of the variance matrix can sometimes be large and thus produce plenty of files. It would be a good idea to clean this up by putting all variance files in a subdirectory.
Let the user define seeds for the GMM.
Scaling prior to the PCA greatly reduces the optimal number of components based on BIC, to less than the number of species. This is probably not desirable. Intuitively, I believe this must be because scaling weights all dimensions equally; since there are more composition dimensions than coverage dimensions, we effectively bias towards composition. Until this is resolved, I suggest we only scale after PCA.
We need to extract reads from the mock data and likely put them somewhere other than GitHub (GitHub and git are not optimized for large files) and link to them from the example doc.
I found a strange bug. If I run:
CONCOCT tests/test_data/coverage tests/test_data/composition.fa -c 3,5,1 -i 100 -b test_out1/ -m 1
I get
3,35245.9938983
4,36842.5739125
5,38241.0545765
while if I run
CONCOCT tests/test_data/coverage tests/test_data/composition.fa -c 3,5,1 -i 100 -b test_out2/ -m 2
I get
3,35245.9938983
4,36858.7784806
5,38348.0270248
The result is not random, and it is the same on both my laptop and the Uppmax server.
Only create output for successful clusterings. Warn that some failed and suggest increasing iterations for those.
This, to me, is the biggest issue right now. I think I have figured out the best clustering strategy, but to test on large data sets we need to parallelise the code in a trivial way: just running each GMM fit on a different thread. What is the best way to do this in sklearn?
I did find this:
https://github.com/ogrisel/parallel_ml_tutorial
Chris
With the new normalisation there is a bug appearing when I run only a subset of the mock samples (see the error message below). I thought this might be because some contigs end up with zero total coverage, which gives a NaN when logged, but we apply a pseudo-count first, so in that case there should still be a non-zero total coverage.
Up and running. Check /home/chrisq/CONCOCT_PUSH2/Mock/SampleRV/Sample8_bk/SampleA_log.txt for progress
/usr/lib64/python2.6/site-packages/pandas/core/frame.py:1943: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
"DataFrame index.", UserWarning)
Traceback (most recent call last):
  File "/usr/bin/concoct", line 5, in <module>
    pkg_resources.run_script('concoct==0.1', 'concoct')
  File "/usr/lib64/python2.6/site-packages/distribute-0.6.34-py2.6.egg/pkg_resources.py", line 505, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/lib64/python2.6/site-packages/distribute-0.6.34-py2.6.egg/pkg_resources.py", line 1245, in run_script
    execfile(script_filename, namespace, namespace)
  File "/usr/lib/python2.6/site-packages/concoct-0.1-py2.6.egg/EGG-INFO/scripts/concoct", line 174, in <module>
    args)
  File "/usr/lib/python2.6/site-packages/concoct-0.1-py2.6.egg/EGG-INFO/scripts/concoct", line 63, in main
    transform_filter, pca = perform_pca(joined[threshold_filter], pca_components)
  File "/usr/lib/python2.6/site-packages/concoct-0.1-py2.6.egg/concoct/transform.py", line 5, in perform_pca
    pca_object = PCA(n_components=nc).fit(d)
  File "/usr/lib64/python2.6/site-packages/sklearn/decomposition/pca.py", line 197, in fit
    self._fit(X, **params)
  File "/usr/lib64/python2.6/site-packages/sklearn/decomposition/pca.py", line 227, in _fit
    X = array2d(X)
  File "/usr/lib64/python2.6/site-packages/sklearn/utils/validation.py", line 81, in array2d
    _assert_all_finite(X_2d)
  File "/usr/lib64/python2.6/site-packages/sklearn/utils/validation.py", line 18, in _assert_all_finite
    raise ValueError("Array contains NaN or infinity.")
ValueError: Array contains NaN or infinity.
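A cheap way to narrow this down would be to validate the joined table right before perform_pca and name the offending contigs; perform_pca and the joined table come from the traceback above, while the helper below is only a sketch:

```python
import numpy as np

def report_nonfinite(df):
    # Flag rows of the joined coverage/composition DataFrame that contain
    # NaN or infinity, so the offending contigs can be inspected directly.
    bad = ~np.isfinite(df.values).all(axis=1)
    if bad.any():
        raise ValueError("Non-finite values for contigs: %s"
                         % list(df.index[bad]))
```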
The --executions option is reported to not work as expected. I guess it makes sense that, if a deterministic seed is used, the number of executions does not make a difference to the result. But if CONCOCT is run in a non-deterministic fashion (which should be enabled), more executions should improve the results.
This is quite major... The description doesn't look too good at the moment.
I'm not sure CONCOCT will run in parallel over more than one node. Could someone investigate this?
Whilst just running the 'full' model should perhaps remain the default, we should allow additional models to be specified too. This would add quite a lot of extra complexity, as the correct approach is to run all models over all cluster numbers and keep the best combination of both.
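Sketched against the current scikit-learn GaussianMixture API (the random data is a stand-in), the combined search could look like this:

```python
import itertools
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.RandomState(11).rand(300, 5)   # stand-in for the input data

# Run every covariance model over every cluster number and keep the best
# (minimal BIC) combination of both.
grid = itertools.product(range(2, 8), ["full", "tied", "diag", "spherical"])
best = min((GaussianMixture(n_components=k, covariance_type=ct,
                            random_state=11).fit(X) for k, ct in grid),
           key=lambda g: g.bic(X))
```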