binpro / concoct
Clustering cONtigs with COverage and ComposiTion
License: Other
Responsibilities are equally important for the short contigs that are not used for building the cluster model.
Change it from 100 to 1000; that is more suitable for big datasets.
Output the progress of the clustering, either to a file or to stdout, and either by default or as an option.
The length column needs to be removed with an additional call to cut; otherwise the concoct run uses 'length' as an additional sample, which corrupts the results.
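If pandas is available, the same cleanup can be scripted; a minimal sketch, assuming the column is literally named 'length', the table is tab-separated, and the file names shown here are placeholders:

```python
import pandas as pd

# Assumed file names and tab separator; adjust to the actual coverage table.
cov = pd.read_csv("coverage_table.tsv", sep="\t", index_col=0)
cov = cov.drop(columns=["length"])  # keep only the per-sample coverage columns
cov.to_csv("coverage_table_nolength.tsv", sep="\t")
```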
Something in one of the recent revisions has stopped the GMM clustering from working correctly. It might have been the introduction of scaling. I have reverted to using the last version from July. In the future, any revisions should be tested on an example data set for which we know the correct output.
Feature that might be more efficient.
@chrisquince needs the PCA transformation matrix in order to back-transform the result. We need to look into the PCA options scikit-learn offers.
I need the PCA-transformed variances output in addition to the PCA-transformed means, to plot the cluster ellipses. We should probably put all the variances and means in separate output directories to prevent cluttering the working directory.
For big datasets the output file 'original_data_gt.csv' can become very large, especially if a large k is used for the compositional data. I think it would be useful to have a parameter to disable this output.
The baseline is that CONCOCT should not output a clustering where the algorithm did not converge.
We need CONCOCT to output the PCA transform so that we can transform cluster means etc. back out of the PCA space. Alternatively, we could output PCA-transformed means and variances.
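A minimal sketch of what this could look like with scikit-learn, where the transform is fully described by pca.components_ and pca.mean_ (the random data is just a stand-in for the joined profile matrix):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(11)
data = rng.rand(100, 20)              # stand-in for the joined profile matrix

pca = PCA(n_components=3).fit(data)
Z = pca.transform(data)               # data in PCA space

# components_ (shape: n_components x n_features) and mean_ are exactly what
# CONCOCT would need to write out to make back-transformation possible.
W, mu = pca.components_, pca.mean_
back = pca.inverse_transform(Z)       # same as Z.dot(W) + mu
assert np.allclose(back, Z.dot(W) + mu)
```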
When running with MPI, only the master thread logs to the log file. There are some ways to fix this, though I don't think all processes can log to the same file, since they can be on different nodes. Some ideas:
https://groups.google.com/forum/#!topic/mpi4py/SaNzc8bdj6U
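One possible pattern, sketched with mpi4py (assuming that is the MPI binding in use): give each rank its own log file, which sidesteps concurrent writes across nodes entirely, at the cost of collating the files afterwards:

```python
import logging
from mpi4py import MPI

rank = MPI.COMM_WORLD.Get_rank()

# One file per rank: safe even when ranks run on different nodes.
logging.basicConfig(
    filename="concoct_rank%d.log" % rank,
    level=logging.INFO,
    format="%(asctime)s rank-" + str(rank) + " %(message)s",
)
logging.info("clustering started")
```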
Where should we put it? This does not include implementing the C-Python binding, just moving the actual repo to a suitable location.
So I think I have found another minor bug. I am running CONCOCT for different sample numbers on the mock, mostly with default parameters. Usually the number of contigs with more than 1000 kmers is 11,002, but on some data sets it finds 11,004. I have looked at the two added contigs and they have a length of 867 bp, so they should not pass the filtering step. Can you check that the size filtering is being done correctly?
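To narrow it down, the filtering can be reproduced outside CONCOCT; a sketch using Biopython, where the threshold and the >= convention are assumptions to compare against:

```python
from Bio import SeqIO

MIN_LENGTH = 1000  # assumed filter threshold
kept = [rec.id for rec in SeqIO.parse("composition.fa", "fasta")
        if len(rec.seq) >= MIN_LENGTH]
print("%d contigs pass the length filter" % len(kept))
```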
Error: Unsatisfiable package specifications
Hint: the following combinations of packages create a conflict with the
remaining packages:
I suggest we keep the scripts used for evaluation separate from the rest of the code, since this could otherwise become very confusing. These scripts could go either into the tests folder or into a subdirectory of the scripts dir.
The help message and the requirements need a touch up.
The call to the C module never finishes if only one contig is left after length filtering. A solution is to not allow clustering when only one contig remains after filtering.
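A sketch of such a guard (the function name and message are illustrative):

```python
def require_enough_contigs(contigs, min_contigs=2):
    # Refuse to cluster when fewer than two contigs survive the length
    # filtering, instead of handing a degenerate input to the C module.
    if len(contigs) < min_contigs:
        raise SystemExit(
            "Only %d contig(s) left after length filtering; at least %d "
            "are required for clustering." % (len(contigs), min_contigs))
```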
The CONCOCT script should live in the scripts folder and be named concoct
To me it feels more linux-y to call the program so:
concoct coverage composition
rather than
CONCOCT coverage composition
Also, the CONCOCT file is technically not part of the concoct Python module (the module is what lives in the folder CONCOCT/concoct) but a script that executes functions from the concoct module. Therefore I feel it should be moved to the scripts folder.
Please comment on this if you have objections. Otherwise I will make these changes tomorrow.
The calculation of coverage is incorrect because the contig_lengths vector used is wrong.
Currently, all cluster numbers are tried out in parallel and the best one is redone. Instead, the clustering/GMM that corresponds to the minimal BIC value should be kept through the parallel step.
This issue is worst for large datasets, where the number of cluster numbers to fit is about the same as the number of processors used.
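A minimal sketch of keeping the fitted model from the parallel step rather than refitting the winner; joblib and the current scikit-learn GaussianMixture API are assumptions about the implementation, and the random data is a stand-in:

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.mixture import GaussianMixture

X = np.random.RandomState(11).rand(500, 5)   # stand-in for the transformed data

def fit_one(k):
    gmm = GaussianMixture(n_components=k, random_state=11).fit(X)
    return gmm.bic(X), k, gmm                # return the model, not just the score

# Fit every cluster number in parallel and keep the model with minimal BIC,
# so the winning clustering never has to be redone.
results = Parallel(n_jobs=-1)(delayed(fit_one)(k) for k in range(2, 11))
best_bic, best_k, best_gmm = min(results, key=lambda r: r[0])
```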
It is important that the output of CONCOCT is predictable, in the sense that running with the same parameters on the same data gives the same results. From testing, that does not seem to be the case. I am a bit confused, because looking at the code I think it should work: we are setting the same numpy random seed (11) every time we run it, and sklearn should be calling the numpy random number generators. Perhaps we could ask about this on a sklearn forum?
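One thing worth checking: some scikit-learn estimators draw from their own RandomState rather than the global numpy one, so passing random_state explicitly is the robust fix. A small sketch against the current GaussianMixture API:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.RandomState(0).rand(200, 4)

# An explicit random_state makes each run reproducible regardless of any
# global numpy seed set (or not set) elsewhere in the program.
a = GaussianMixture(n_components=3, random_state=11).fit(X).predict(X)
b = GaussianMixture(n_components=3, random_state=11).fit(X).predict(X)
assert (a == b).all()
```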
The tests using reference result files will become outdated when we switch the clustering implementation. It is a lot more flexible to use some clustering statistic like recall/precision instead. This should be complemented with other tests, like making sure all contigs are in the result files, and so on.
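For the statistics part, a label-invariant score such as scikit-learn's adjusted Rand index would survive a change of implementation; the metric choice here is only a suggestion:

```python
from sklearn.metrics import adjusted_rand_score

reference = [0, 0, 1, 1, 2, 2]   # known clustering of the test contigs
predicted = [1, 1, 0, 0, 2, 2]   # same partition, different label names

# 1.0 means the partitions are identical regardless of label numbering,
# so the test does not depend on one specific clustering implementation.
assert adjusted_rand_score(reference, predicted) == 1.0
```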
Was this feature tested? My understanding is that as we increase the value beyond the 90% default, more dimensions should remain after the transformation. However, that is not what I see; rather, adjusting this value always results in just one dimension remaining after PCA.
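For reference, with current scikit-learn a float n_components in (0, 1) keeps the smallest number of components explaining at least that fraction of the variance, so a quick sanity check of the expected behaviour looks like this:

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.RandomState(11).rand(200, 20)

# Raising the retained-variance fraction should never reduce the number of
# components kept; if it does, something upstream is overriding the value.
d90 = PCA(n_components=0.90).fit(X).n_components_
d99 = PCA(n_components=0.99).fit(X).n_components_
assert d99 >= d90
```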
Check that all dependencies are listed in setup.py so that they are installed if they are not already present.
Currently, convergence is printed to stderr. Should be logged as well.
Currently the linkage generation does not take ambiguous links into account. From the test cases it would seem that bowtie2 does not generate these alignments by default unless the -a parameter is specified. Depending on the alignment parameters and sequence aligner used, this could be an issue.
Scaling has to be done on the data.
How should the covariances be reverse-transformed? The background is that I've compared the output from CONCOCT run on my laptop and on the UPPMAX cluster. There seems to be a problem with making the PCA deterministic, which seems to be a known problem in sklearn:
"Due to implementation subtleties of the Singular Value Decomposition (SVD), which is used in this implementation, running fit twice on the same matrix can lead to principal components with signs flipped (change in direction). For this reason, it is important to always use the same estimator object to transform data in a consistent fashion." / http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
This is happening for me: I get different PCA-transformed data on the different setups. The result does not change much, the clusterings and the BIC scores are the same, and it's only a sign flip on some of the columns in the PCA-transformed data. My suggestion is just to accept this.
However, while the reverse-transformed cluster means are the same on both platforms, the reverse-transformed covariances are very different. This leads me to suspect we're not doing the reverse transformation correctly. Would it make sense to take the square root of the covariances before reverse-transforming them, since it could be an issue with wrong dimensionality?
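For what it's worth, since the inverse transform is linear (x = z · W + mean, with W = pca.components_), a covariance should come back via the congruence W.T · C · W rather than any element-wise square root. A sketch on stand-in data:

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.RandomState(11).rand(300, 8)
pca = PCA(n_components=3).fit(X)
W = pca.components_                      # shape (3, 8)

C_pca = np.cov(pca.transform(X).T)       # a covariance estimated in PCA space

# cov(z.dot(W) + mean) = W.T.dot(cov(z)).dot(W); the back-transformed
# covariance is rank-deficient, since the discarded components carry none.
C_orig = W.T.dot(C_pca).dot(W)
```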
I am just testing this, but is the CONCOCT clustering deterministic, in the sense that running with the same input data and parameters always produces the same output? If not, we probably need to seed sklearn with a default random seed. A user-defined seed should then be added as a command line option.
There seems to be some confusion about which format to use for the coverage. If a BAM file exists, the script that @inodb created should be used to generate a proper matrix that the CONCOCT script can read. Which format will this be? Log coverage? Coverage? Read counts? At the moment, mean coverage is assumed by CONCOCT.
Not sure why pysam is so slow; I have to investigate further. Either reimplement it in awk or find some way to speed up the parsing of BAM files in Python.
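Before rewriting in awk, it may be worth checking whether the counts can come from the BAM index instead of an alignment scan; a sketch against pysam's current API (the file name is a placeholder, and this gives mapped read counts rather than per-base coverage):

```python
import pysam

# get_index_statistics() reads per-contig mapped read counts straight from
# the .bai index, avoiding a pass over every alignment.
with pysam.AlignmentFile("sample.bam", "rb") as bam:
    for stat in bam.get_index_statistics():
        print(stat.contig, stat.mapped)
```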
Some of our CSV output is actually space-separated, e.g. the cluster means.
The dimension of the variance matrix can sometimes be large and thus produce plenty of files. It would be a good idea to clean this up by putting all variance files in a subdirectory.
Let the user define seeds for the GMM.
Scaling prior to the PCA greatly reduces the optimal number of components based on BIC, to less than the number of species. This is probably not desirable. Intuitively, I believe this must be because scaling weights all dimensions equally; since there are more composition dimensions than coverage dimensions, we effectively bias towards composition. Until this is resolved, I suggest we only scale after PCA.
We need to extract reads from the mock data and likely put them somewhere other than GitHub (GitHub and git are not optimized for large files) and link to them from the example doc.
I found a strange bug. If I run:
CONCOCT tests/test_data/coverage tests/test_data/composition.fa -c 3,5,1 -i 100 -b test_out1/ -m 1
I get
3,35245.9938983
4,36842.5739125
5,38241.0545765
while if I run
CONCOCT tests/test_data/coverage tests/test_data/composition.fa -c 3,5,1 -i 100 -b test_out2/ -m 2
I get
3,35245.9938983
4,36858.7784806
5,38348.0270248
The result is not random, and it is the same on both my laptop and the Uppmax server.
Only create output for successful clusterings. Warn that some failed and suggest increasing iterations for those.
This, to me, is the biggest issue right now. I think I have figured out the best clustering strategy, but to test on large data sets we need to parallelise the code in a trivial way: just running each GMM fit on a different thread. What is the best way to do this in sklearn?
I did find this:
https://github.com/ogrisel/parallel_ml_tutorial
Chris
With the new normalisation there is a bug appearing when I run only a subset of the mock samples (see the error message below). I thought this might be because some contigs end up with zero total coverage, which gives a NaN when logged, but we apply a pseudo-count first, so in that case there should still be a non-zero total coverage.
Up and running. Check /home/chrisq/CONCOCT_PUSH2/Mock/SampleRV/Sample8_bk/SampleA_log.txt for progress
/usr/lib64/python2.6/site-packages/pandas/core/frame.py:1943: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
"DataFrame index.", UserWarning)
Traceback (most recent call last):
  File "/usr/bin/concoct", line 5, in <module>
    pkg_resources.run_script('concoct==0.1', 'concoct')
  File "/usr/lib64/python2.6/site-packages/distribute-0.6.34-py2.6.egg/pkg_resources.py", line 505, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/lib64/python2.6/site-packages/distribute-0.6.34-py2.6.egg/pkg_resources.py", line 1245, in run_script
    execfile(script_filename, namespace, namespace)
  File "/usr/lib/python2.6/site-packages/concoct-0.1-py2.6.egg/EGG-INFO/scripts/concoct", line 174, in <module>
    args)
  File "/usr/lib/python2.6/site-packages/concoct-0.1-py2.6.egg/EGG-INFO/scripts/concoct", line 63, in main
    transform_filter, pca = perform_pca(joined[threshold_filter], pca_components)
  File "/usr/lib/python2.6/site-packages/concoct-0.1-py2.6.egg/concoct/transform.py", line 5, in perform_pca
    pca_object = PCA(n_components=nc).fit(d)
  File "/usr/lib64/python2.6/site-packages/sklearn/decomposition/pca.py", line 197, in fit
    self._fit(X, **params)
  File "/usr/lib64/python2.6/site-packages/sklearn/decomposition/pca.py", line 227, in _fit
    X = array2d(X)
  File "/usr/lib64/python2.6/site-packages/sklearn/utils/validation.py", line 81, in array2d
    _assert_all_finite(X_2d)
  File "/usr/lib64/python2.6/site-packages/sklearn/utils/validation.py", line 18, in _assert_all_finite
    raise ValueError("Array contains NaN or infinity.")
ValueError: Array contains NaN or infinity.
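A cheap way to narrow this down would be to validate the joined table right before perform_pca and name the offending contigs; perform_pca and the joined table come from the traceback above, while the helper below is only a sketch:

```python
import numpy as np

def report_nonfinite(df):
    # Flag rows of the joined coverage/composition DataFrame that contain
    # NaN or infinity, so the offending contigs can be inspected directly.
    bad = ~np.isfinite(df.values).all(axis=1)
    if bad.any():
        raise ValueError("Non-finite values for contigs: %s"
                         % list(df.index[bad]))
```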
The --executions option is reported to not work as expected. I guess it makes sense that, if a deterministic seed is used, the number of executions does not make a difference to the result. But if CONCOCT is run in a non-deterministic fashion (which should be enabled), more executions should improve the results.
This is quite major... The description doesn't look too good at the moment.
I'm not sure CONCOCT will run in parallel over more than one node. Could someone investigate this?
Whilst just running the 'full' model should perhaps remain the default, we should allow additional models to be specified too. This would add quite a lot of extra complexity, as the correct approach is to run all models over all cluster numbers and keep the best combination of both.
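Sketched against the current scikit-learn GaussianMixture API (the random data is a stand-in), the combined search could look like this:

```python
import itertools
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.RandomState(11).rand(300, 5)   # stand-in for the input data

# Run every covariance model over every cluster number and keep the best
# (minimal BIC) combination of both.
grid = itertools.product(range(2, 8), ["full", "tied", "diag", "spherical"])
best = min((GaussianMixture(n_components=k, covariance_type=ct,
                            random_state=11).fit(X) for k, ct in grid),
           key=lambda g: g.bic(X))
```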