concoct's People

Contributors

alneberg, andand, binnisb, chrisquince, frederic-mahe, halflings, inodb, jungbluth, linsalrob, linuszj, nickloman, sjaenick

concoct's Issues

Output progress.

Output the progress of the clustering, either to a file or to stdout, and either by default or as an option.

Not correctly clustering

Something in one of the recent revisions has stopped the GMM clustering from working correctly. It might have been the introduction of scaling. I have reverted to the last version from July. In the future, any revisions should be tested on an example data set for which we know the correct output.

Split pca

Splitting the PCA is a feature that might be more efficient.

Output PCA transformed variances

I need the PCA transformed variances output in addition to the PCA transformed means in order to plot the cluster ellipses. We should probably put all the variances and means in separate output directories to avoid cluttering the working directory.

Do not output original data (optionally?)

For big datasets the output file 'original_data_gt.csv' can become very large, especially if a large k is used for the compositional data. I think it would be useful to have a parameter to disable this output.

Output PCA transformation

We need CONCOCT to output the PCA transform so that we can transform cluster means etc. back into the PCA space. Alternatively we could output PCA transformed means and variances.
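A minimal sketch of what outputting the transform could look like, assuming a modern sklearn-style PCA object (the file names and layout here are illustrative, not CONCOCT's actual output format). Saving `components_` and `mean_` is enough to redo or invert the transform later, and `inverse_transform` maps a cluster mean back to the original space:

```python
import os
import tempfile
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(11)
data = rng.rand(50, 10)                      # contigs x features (toy data)

pca = PCA(n_components=3).fit(data)

# Persist the transform itself; paths are illustrative only.
outdir = tempfile.mkdtemp()
np.savetxt(os.path.join(outdir, "pca_components.csv"), pca.components_, delimiter=",")
np.savetxt(os.path.join(outdir, "pca_means.csv"), pca.mean_, delimiter=",")

# A cluster mean in PCA space maps back with inverse_transform:
cluster_mean_pca = pca.transform(data).mean(axis=0, keepdims=True)
cluster_mean_orig = pca.inverse_transform(cluster_mean_pca)
```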

Minor bug on contig filtering

So I think I have found another minor bug. I am running CONCOCT for different sample numbers on the mock, mostly with default parameters. Usually the number of contigs with more than 1000 kmers is 11,002, but on some data sets it finds 11,004. I have looked at the two added contigs and they have a length of 867 bp, so they should not pass the filtering step. Can you check that the size filtering is being done correctly?
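A direct test is the easiest way to pin down a bug like this. The sketch below assumes a simple length threshold (the function name and the exact comparison are illustrative, not CONCOCT's real filtering code); the point is that an 867 bp contig must never pass a 1000 bp cutoff:

```python
# Hypothetical length filter; the ">=" vs ">" choice is an assumption
# about the intended behaviour, which is exactly what a test should pin down.
def passes_length_filter(contig_seq, threshold=1000):
    return len(contig_seq) >= threshold

assert not passes_length_filter("A" * 867)    # the 867 bp contig must be dropped
assert passes_length_filter("A" * 1000)       # a contig at the threshold passes
```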

conda install error - numpy 1.7

Is there a workaround for this error?

Error: Unsatisfiable package specifications
Hint: the following combinations of packages create a conflict with the
remaining packages:

  • numpy 1.7*

Move and rename CONCOCT/concoct/CONCOCT

The CONCOCT script should live in the scripts folder and be named concoct

To me it feels more linux-y to call the program like this:

concoct coverage composition

rather than

CONCOCT coverage composition

Also, the CONCOCT file is technically not part of the concoct python module (the module is what lives in the folder CONCOCT/concoct), but a script that executes functions from the concoct module. Therefore I feel that it should be moved to the scripts folder.

Please comment here if you have objections. Otherwise I will make these changes tomorrow.

Calculation of coverage is incorrect

The calculation of coverage is incorrect because the contig_lengths vector used is wrong.

  • Write a test that catches this error, using a contig of length 100 for example
  • Fix the length calculation and check that this fixes the error.
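The suggested test could look roughly like this. The coverage formula (mapped bases divided by contig length) and the function name are assumptions about the intended behaviour, not CONCOCT's actual code:

```python
# Hypothetical coverage helper; assumed formula: mapped bases / contig length.
def mean_coverage(total_mapped_bases, contig_length):
    return total_mapped_bases / float(contig_length)

def test_mean_coverage():
    # A contig of length 100 covered by 10 fully aligned 50 bp reads
    # should have a mean coverage of 5.0; a wrong length vector would
    # show up immediately here.
    assert mean_coverage(10 * 50, 100) == 5.0

test_mean_coverage()
```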

The final call to GMM can double the execution time.

Currently, all cluster numbers are tried out in parallel and the best one is then refitted. Instead, the clustering/GMM corresponding to the minimal BIC value should be kept through the parallel step.

This issue is worst for large datasets, where the number of cluster numbers to fit is about the same as the number of processors used.
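A sketch of keeping the best fitted model through the parallel step instead of refitting it afterwards (shown with the modern sklearn `GaussianMixture` API; CONCOCT's original code used an older GMM interface):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(11)
data = rng.rand(200, 4)                      # toy contigs x features

def fit_one(n_clusters):
    # Return the fitted model together with its BIC so nothing is refitted.
    gmm = GaussianMixture(n_components=n_clusters, random_state=11).fit(data)
    return gmm.bic(data), gmm

# This loop could be distributed over processors (e.g. multiprocessing.Pool);
# the key point is that the fitted models travel back with their scores.
results = [fit_one(c) for c in range(3, 8)]
best_bic, best_gmm = min(results, key=lambda r: r[0])
# best_gmm is already fitted; no second, duplicate fit is needed.
```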

Output still not deterministic

It is important that the output of CONCOCT is predictable, in the sense that running with the same parameters on the same data gives the same results. From testing, that does not seem to be the case. I am a bit confused, because looking at the code I think it should work: we are setting the same numpy random seed (11) every time we run it, and sklearn should be calling the numpy random number generators. Perhaps we could ask about this on a sklearn forum?
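One possible explanation, sketched below: seeding the global numpy RNG is not always enough, because some sklearn estimators take their own `random_state`. Passing the seed explicitly to the estimator makes the dependency visible (this assumes the estimator in use accepts `random_state`, as `GaussianMixture` does; CONCOCT's actual estimator may differ):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

data = np.random.RandomState(11).rand(100, 3)

def cluster(seed=11):
    # Seed the estimator directly instead of relying on the global RNG state.
    return GaussianMixture(n_components=2, random_state=seed).fit(data).means_

# Same seed, same data -> identical cluster means on every run.
assert np.allclose(cluster(), cluster())
```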

Replace reference result file tests with more flexible tests.

The tests using reference result files will become outdated when we switch the clustering implementation. It is much more flexible to use clustering statistics like recall/precision instead. This should be complemented with other tests, such as making sure all contigs are present in the result files, and so on.
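A sketch of a clustering test that survives a change of implementation: compare the predicted labels to a reference partition with a permutation-invariant score instead of byte-comparing result files (shown here with sklearn's adjusted Rand index; recall/precision statistics would work the same way):

```python
from sklearn.metrics import adjusted_rand_score

reference = [0, 0, 0, 1, 1, 1]
predicted = [1, 1, 1, 0, 0, 0]   # same partition, different label names

# The adjusted Rand index is 1.0 for identical partitions regardless of
# how the cluster labels are numbered, so a reimplementation that merely
# renames clusters still passes.
assert adjusted_rand_score(reference, predicted) == 1.0
```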

--total_percentage_pca working incorrectly?

Was this feature tested? My understanding is that as we increase the value beyond the 90% default, more dimensions should remain after the transformation. However, that is not what I see; instead, adjusting this value always results in just one dimension remaining after PCA.
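The expected behaviour can be checked directly: raising the retained-variance threshold should keep at least as many dimensions. A sketch using sklearn's PCA, which accepts a float `n_components` in (0, 1) meaning "keep enough components to explain this fraction of the variance":

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(11)
data = rng.rand(100, 20)                     # toy contigs x features

dims_90 = PCA(n_components=0.90).fit(data).n_components_
dims_99 = PCA(n_components=0.99).fit(data).n_components_

# More variance retained -> at least as many dimensions survive; a result
# of exactly one dimension for every threshold would signal a bug.
assert dims_99 >= dims_90
```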

Linkage generation does not consider ambiguous links

Currently the linkage generation does not take ambiguous links into account. From the test cases, it seems bowtie2 does not generate these alignments by default unless the -a parameter is specified. Depending on the alignment parameters and the aligner used, this could be an issue.
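A pure-Python, aligner-agnostic sketch of the detection step: with multi-mapping output enabled (e.g. bowtie2 -a), the same read id can appear against several contigs, and such reads should be flagged as ambiguous rather than counted as links (the data layout here is an illustrative assumption, not CONCOCT's parsing code):

```python
from collections import defaultdict

# (read id, contig) pairs as they might come out of an alignment parser.
alignments = [("read1", "contigA"), ("read1", "contigB"), ("read2", "contigA")]

hits = defaultdict(set)
for read, contig in alignments:
    hits[read].add(contig)

# Reads hitting more than one contig are ambiguous and should not
# contribute an unqualified link between contigs.
ambiguous = {read for read, contigs in hits.items() if len(contigs) > 1}
```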

Scaling

Scaling has to be done on the data.

Reverse transformed covariances.

How should the covariances be reverse transformed? The background is that I've compared the output from CONCOCT run on my laptop and on the UPPMAX cluster. There seems to be a problem with making the PCA deterministic, which appears to be a known issue in sklearn:

"Due to implementation subtleties of the Singular Value Decomposition (SVD), which is used in this implementation, running fit twice on the same matrix can lead to principal components with signs flipped (change in direction). For this reason, it is important to always use the same estimator object to transform data in a consistent fashion." / http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html

This is happening for me: I get different PCA transformed data on the two setups. The result does not change much; the clusterings and the BIC scores are the same, and it is only a sign flip on some of the columns of the PCA transformed data. My suggestion is simply to accept this.

However, while the reverse transformed cluster means are the same on both platforms, the reverse transformed covariances are very different. This leads me to suspect we are not doing the reverse transformation correctly. Would it make sense to take the square root of the covariances before reverse transforming them, since it could be an issue of wrong dimensionality?
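A sketch of the linear-algebra behind the reverse transformation, under the assumption that the transform is sklearn's `z = (x - mu) @ W.T` with `W = pca.components_`. Then `Cov(x) = W.T @ Cov(z) @ W`; it is the covariance itself, not its square root, that gets conjugated by `W`. Notably, an SVD sign flip (`W -> S @ W` with `S` diagonal ±1) cancels out of this product, so if the back-transformed covariances differ between platforms, the transformation code is the likelier culprit:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(11)
data = rng.rand(100, 5)
pca = PCA(n_components=3).fit(data)

# Back-transform of the covariance: conjugate by the components matrix.
cov_pca = np.cov(pca.transform(data), rowvar=False)
cov_orig = pca.components_.T @ cov_pca @ pca.components_

# Simulate an SVD sign flip of the second component; S enters the product
# twice, so it cancels and the back-transformed covariance is unchanged.
S = np.diag([1.0, -1.0, 1.0])
flipped = (S @ pca.components_).T @ (S @ cov_pca @ S) @ (S @ pca.components_)
assert np.allclose(cov_orig, flipped)
```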

Is CONCOCT deterministic?

I am just testing this, but is the CONCOCT clustering deterministic, in the sense that running with the same input data and parameters always produces the same output? If not, we probably need to seed sklearn with a default random seed. A user-defined seed should then be added as a command-line option.

Input format for coverage?

There seems to be a bit of confusion about which format to use for the coverage. If a BAM file exists, the script that @inodb created should be used to generate a proper matrix that the CONCOCT script can read. Which format will this be? Log coverage? Coverage? Read counts? At the moment, mean coverage is assumed by CONCOCT.
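A sketch of deriving mean coverage per contig from a per-base depth table (the three-column shape produced by e.g. `samtools depth`); the column names and toy values are illustrative assumptions, not the actual script's format:

```python
import pandas as pd

# Per-base depth table: one row per covered position on a contig.
depth = pd.DataFrame(
    {"contig": ["c1"] * 4 + ["c2"] * 2,
     "pos":    [1, 2, 3, 4, 1, 2],
     "depth":  [2, 4, 4, 2, 1, 3]})

contig_lengths = {"c1": 4, "c2": 2}

# Mean coverage = total covered bases / contig length (indexes align on contig).
mean_cov = depth.groupby("contig")["depth"].sum() / pd.Series(contig_lengths)
```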

CSV output

Some of our CSV output is actually space-separated, e.g. the cluster means.

Scaling prior to PCA changes optimal cluster number

Scaling prior to the PCA greatly reduces the optimal number of components based on BIC, to fewer than the number of species. This is probably not desirable. Intuitively, I believe this must be because scaling weights all dimensions equally: since there are more composition dimensions than coverage dimensions, we effectively bias towards composition. Until this is resolved, I suggest we only scale after the PCA.

Complete example - Data

We need to extract reads from the mock data and probably host it somewhere other than GitHub (git and GitHub are not optimized for large files), and link to it from the example doc.

Different bic outputs depending on the max_number_processors argument.

I found a strange bug: If I run:

CONCOCT tests/test_data/coverage tests/test_data/composition.fa -c 3,5,1 -i 100 -b test_out1/ -m 1

I get

3,35245.9938983
4,36842.5739125
5,38241.0545765

while if I run

 CONCOCT tests/test_data/coverage tests/test_data/composition.fa -c 3,5,1 -i 100 -b test_out2/ -m 2

I get

3,35245.9938983
4,36858.7784806
5,38348.0270248

The result is not random, and it is the same on both my laptop and the Uppmax server.

Error when only subset of samples used

With the new normalisation, there is a bug appearing when I run only a subset of the mock samples (see the error message below). I thought this might be because some contigs end up with zero total coverage, which gives a NaN when logged; but we apply a pseudo-count first, so in that case the total coverage should still be non-zero.


Up and running. Check /home/chrisq/CONCOCT_PUSH2/Mock/SampleRV/Sample8_bk/SampleA_log.txt for progress
/usr/lib64/python2.6/site-packages/pandas/core/frame.py:1943: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
  "DataFrame index.", UserWarning)
Traceback (most recent call last):
  File "/usr/bin/concoct", line 5, in
    pkg_resources.run_script('concoct==0.1', 'concoct')
  File "/usr/lib64/python2.6/site-packages/distribute-0.6.34-py2.6.egg/pkg_resources.py", line 505, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/lib64/python2.6/site-packages/distribute-0.6.34-py2.6.egg/pkg_resources.py", line 1245, in run_script
    execfile(script_filename, namespace, namespace)
  File "/usr/lib/python2.6/site-packages/concoct-0.1-py2.6.egg/EGG-INFO/scripts/concoct", line 174, in
    args)
  File "/usr/lib/python2.6/site-packages/concoct-0.1-py2.6.egg/EGG-INFO/scripts/concoct", line 63, in main
    transform_filter, pca = perform_pca(joined[threshold_filter], pca_components)
  File "/usr/lib/python2.6/site-packages/concoct-0.1-py2.6.egg/concoct/transform.py", line 5, in perform_pca
    pca_object = PCA(n_components=nc).fit(d)
  File "/usr/lib64/python2.6/site-packages/sklearn/decomposition/pca.py", line 197, in fit
    self._fit(X, **params)
  File "/usr/lib64/python2.6/site-packages/sklearn/decomposition/pca.py", line 227, in _fit
    X = array2d(X)
  File "/usr/lib64/python2.6/site-packages/sklearn/utils/validation.py", line 81, in array2d
    _assert_all_finite(X_2d)
  File "/usr/lib64/python2.6/site-packages/sklearn/utils/validation.py", line 18, in _assert_all_finite
    raise ValueError("Array contains NaN or infinity.")
ValueError: Array contains NaN or infinity.
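A sketch of the pseudo-count guard discussed above: applying it before the log keeps zero-coverage contigs finite, so PCA's finiteness check passes (the pseudo-count value of 1.0 here is illustrative, not necessarily what CONCOCT uses):

```python
import numpy as np

coverage = np.array([[0.0, 5.0],
                     [2.0, 0.0]])            # contigs with zero coverage in a sample

with np.errstate(divide="ignore"):
    bad = np.log(coverage)                   # -inf wherever coverage is zero

good = np.log(coverage + 1.0)                # pseudo-count first -> all finite

assert not np.isfinite(bad).all()
assert np.isfinite(good).all()
```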

Investigate --executions option.

The --executions option is reported not to work as expected. I guess it makes sense that, if a deterministic seed is used, the number of executions makes no difference to the result. But if CONCOCT is run in a non-deterministic fashion (which should be possible to enable), more executions should improve the results.

Additional models

While running just the 'full' model should perhaps remain the default, we should allow additional models to be specified too. This would add quite a lot of extra complexity, since the correct approach is to run all models over all cluster numbers and keep the best combination of both.
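The suggested grid could be sketched like this, using sklearn's `covariance_type` options as the set of models (the candidate ranges are illustrative): every model is fitted for every cluster number, and the single best (model, k) pair by BIC is kept.

```python
import numpy as np
from itertools import product
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(11)
data = rng.rand(150, 4)                      # toy contigs x features

candidates = []
for cov_type, k in product(["full", "tied", "diag", "spherical"], range(2, 6)):
    gmm = GaussianMixture(n_components=k, covariance_type=cov_type,
                          random_state=11).fit(data)
    candidates.append((gmm.bic(data), cov_type, k, gmm))

# Keep the best combination of model and cluster number in one pass.
best = min(candidates, key=lambda c: c[0])
```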
