PopPUNK (POPulation Partitioning Using Nucleotide Kmers)
Home Page: https://www.bacpop.org/poppunk
License: Apache License 2.0
Hi John,
We're getting some dependency errors after installing with pip (below). Using gcc-4.8.1 seemed to fix it for us.
Hope this helps!
Victoria
Pathogen Informatics
poppunk -h
Traceback (most recent call last):
File "/software/pathogen/external/apps/usr/local/Python-3.6.0/lib/python3.6/site-packages/llvmlite/binding/ffi.py", line 119, in
lib = ctypes.CDLL(os.path.join(_lib_dir, _lib_name))
File "/software/pathogen/external/apps/usr/local/Python-3.6.0/lib/python3.6/ctypes/__init__.py", line 344, in __init__
self._handle = _dlopen(self._name, mode)
OSError: /usr/lib/x86_64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.17' not found (required by /software/pathogen/external/apps/usr/local/Python-3.6.0/lib/python3.6/site-packages/llvmlite/binding/libllvmlite.so)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/software/pathogen/external/apps/usr/local/Python-3.6.0/bin/poppunk", line 11, in
load_entry_point('poppunk==1.1.3', 'console_scripts', 'poppunk')()
File "/software/pathogen/external/apps/usr/local/Python-3.6.0/lib/python3.6/site-packages/pkg_resources/__init__.py", line 487, in load_entry_point
return get_distribution(dist).load_entry_point(group, name)
File "/software/pathogen/external/apps/usr/local/Python-3.6.0/lib/python3.6/site-packages/pkg_resources/__init__.py", line 2728, in load_entry_point
return ep.load()
File "/software/pathogen/external/apps/usr/local/Python-3.6.0/lib/python3.6/site-packages/pkg_resources/__init__.py", line 2346, in load
return self.resolve()
File "/software/pathogen/external/apps/usr/local/Python-3.6.0/lib/python3.6/site-packages/pkg_resources/__init__.py", line 2352, in resolve
module = __import__(self.module_name, fromlist=['__name__'], level=0)
File "/software/pathogen/external/apps/usr/local/Python-3.6.0/lib/python3.6/site-packages/PopPUNK/__main__.py", line 23, in
from .models import *
File "/software/pathogen/external/apps/usr/local/Python-3.6.0/lib/python3.6/site-packages/PopPUNK/models.py", line 35, in
from .refine import refineFit
File "/software/pathogen/external/apps/usr/local/Python-3.6.0/lib/python3.6/site-packages/PopPUNK/refine.py", line 11, in
from numba import jit
File "/software/pathogen/external/apps/usr/local/Python-3.6.0/lib/python3.6/site-packages/numba/__init__.py", line 10, in
from . import config, errors, runtests, types
File "/software/pathogen/external/apps/usr/local/Python-3.6.0/lib/python3.6/site-packages/numba/config.py", line 11, in
import llvmlite.binding as ll
File "/software/pathogen/external/apps/usr/local/Python-3.6.0/lib/python3.6/site-packages/llvmlite/binding/__init__.py", line 6, in
from .dylib import *
File "/software/pathogen/external/apps/usr/local/Python-3.6.0/lib/python3.6/site-packages/llvmlite/binding/dylib.py", line 4, in
from . import ffi
File "/software/pathogen/external/apps/usr/local/Python-3.6.0/lib/python3.6/site-packages/llvmlite/binding/ffi.py", line 124, in
lib = ctypes.CDLL(_lib_name)
File "/software/pathogen/external/apps/usr/local/Python-3.6.0/lib/python3.6/ctypes/__init__.py", line 344, in __init__
self._handle = _dlopen(self._name, mode)
OSError: libllvmlite.so: cannot open shared object file: No such file or directory
I think mash dist
commands, when run with many threads, produce output faster than we can parse it, which increases memory use. Perhaps writing the parsing section in Cython would improve speed and memory use here.
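As a sketch of the streaming side of this idea (pure Python rather than Cython, and with a hypothetical stream_tsv helper name), consuming the pipe row by row keeps memory flat even when the producer outpaces the parser:

```python
import subprocess

def stream_tsv(cmd, handle_row):
    # launch the command and consume its TSV stdout line by line, so rows are
    # processed as they arrive instead of accumulating in memory
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True)
    for line in proc.stdout:
        handle_row(line.rstrip("\n").split("\t"))
    return proc.wait()
```

Usage would be something like stream_tsv([mash_exec, "dist", ref_msh, query_msh], store_row), where store_row writes each parsed row into a preallocated array.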
Isolate 11657_5#30 is correctly assigned to strain 6_14 in the reference-based (https://microreact.org/project/BkXBGJuFm) and complete (https://microreact.org/project/rJGAHaPtm) query-based clustering. However, it is in the wrong place in the tree and t-SNE projection in the reference-based clustering only. It seems there is a misalignment of names and distances that only applies to the core and accessory distance matrices when querying a reference-based database.
Struggling to get poppunk installed here.
Wondering if you have plans to write a bioconda package for it?
For within-strain use, and e.g. for gonococcus:
Start from the 2D fit, as currently doing, but also fit vertical and horizontal lines (thereby using core only/accessory only respectively).
Output three clusterings to microreact
Command was:
for f in maela mass nijmegen soton; do bsub -n 4 -M 32000 -R 'select[mem>32000] rusage[mem=32000] span[hosts=1]' -o bgmm.refine.ref.${f}.o -e bgmm.refine.ref.${f}.e "source /nfs/pathogen005/jl11/large_installations/miniconda3/bin/activate /lustre/scratch118/infgen/team81/nc3/nc3; python3 ./PopPUNK/poppunk-runner.py --refine-model --distances ${f}_db/${f}_db.dists --output ${f}_refined_bgmm_ref --ref-db ${f}_bgmm_full --threads 4"; done
Refinement with BGMM is complaining:
PopPUNK (POPulation Partitioning Using Nucleotide Kmers)
Mode: Refining model fit using network properties
Loading BGMM 2D Gaussian model
Initial model-based network construction based on Gaussian fit
Traceback (most recent call last):
File "./PopPUNK/poppunk-runner.py", line 9, in
main()
File "/lustre/scratch118/infgen/team81/nc3/GPS/ST_core/PopPUNK/PopPUNK/__main__.py", line 243, in main
args.no_local, args.threads)
File "/lustre/scratch118/infgen/team81/nc3/GPS/ST_core/PopPUNK/PopPUNK/models.py", line 516, in fit
args = (model, self.mean0, self.mean1, model.within_label, model.between_label))
File "/lustre/scratch118/infgen/team81/nc3/nc3/lib/python3.6/site-packages/scipy/optimize/zeros.py", line 510, in brentq
r = _zeros._brentq(f,a,b,xtol,rtol,maxiter,args,full_output,disp)
ValueError: f(a) and f(b) must have different signs
I've attached the image of the fit here; the files are all in:
With very large DBs we've observed a small number of samples not being present in the final network but being in the sketches and distances, we think due to I/O error on particular systems.
Add in a warning message (at least) when this is the case.
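A minimal sketch of such a warning (hypothetical function name; assumes the sketch names and network node names are directly comparable strings):

```python
import sys

def check_network_complete(sketch_names, network_nodes):
    # warn when sketched samples are absent from the final network,
    # rather than silently dropping them
    missing = set(sketch_names) - set(network_nodes)
    if missing:
        print("WARNING: %d samples in sketches but not in network: %s" %
              (len(missing), ", ".join(sorted(missing))), file=sys.stderr)
    return missing
```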
Docstrings for functions
Usage and tutorial on readthedocs
ubuntu@vibrio:~$ poppunk -h
/home/ubuntu/.linuxbrew/opt/python/lib/python3.7/site-packages/sklearn/externals/joblib/externals/cloudpickle/cloudpickle.py:47: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
import imp
usage: PopPUNK [-h]
(--easy-run | --create-db | --fit-model | --refine-model | --assign-query)
[--ref-db REF_DB] [--r-files R_FILES] [--q-files Q_FILES]
[--distances DISTANCES]
[--external-clustering EXTERNAL_CLUSTERING] --output OUTPUT
[--plot-fit PLOT_FIT] [--full-db] [--update-db] [--overwrite]
[--min-k MIN_K] [--max-k MAX_K] [--k-step K_STEP]
[--sketch-size SKETCH_SIZE] [--K K] [--dbscan] [--D D]
[--min-cluster-prop MIN_CLUSTER_PROP] [--pos-shift POS_SHIFT]
[--neg-shift NEG_SHIFT] [--manual-start MANUAL_START]
[--indiv-refine] [--no-local] [--model-dir MODEL_DIR]
[--previous-clustering PREVIOUS_CLUSTERING] [--core-only]
[--accessory-only] [--microreact] [--cytoscape] [--phandango]
[--grapetree] [--rapidnj RAPIDNJ] [--perplexity PERPLEXITY]
[--info-csv INFO_CSV] [--mash MASH] [--threads THREADS]
[--no-stream] [--version]
How can I install an old version of a library for PopPUNK using pip?
Thank you
Given a list of bad/unwanted sequences, remove these from distMat (read from pickle) and rewrite the .pkl and .npy files. Probably best as another program
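A possible sketch of that program, assuming (the layout is a guess) the .pkl holds the ordered sample name list and the .npy holds one row per unordered pair, in itertools.combinations order:

```python
import itertools
import pickle

import numpy as np

def prune_distances(pkl_in, npy_in, bad, pkl_out, npy_out):
    # load sample names and the per-pair distance rows
    with open(pkl_in, "rb") as f:
        names = pickle.load(f)
    dist = np.load(npy_in)
    # keep only pairs where neither member is on the bad list
    keep = [i for i, (a, b) in enumerate(itertools.combinations(names, 2))
            if a not in bad and b not in bad]
    with open(pkl_out, "wb") as f:
        pickle.dump([n for n in names if n not in bad], f)
    np.save(npy_out, dist[keep, :])
```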
Hi, I have done some testing to see if we can use PopPUNK at my workplace. I have some concerns that might be bugs, and some suggestions for improvement.
Updating reference database to our_samples_query2
Traceback (most recent call last):
File "/home/evezeyl/anaconda3/envs/poppunk/bin/poppunk", line 10, in <module>
sys.exit(main())
File "/home/evezeyl/anaconda3/envs/poppunk/lib/python3.7/site-packages/PopPUNK/__main__.py", line 381, in main
args.rapidnj, args.perplexity)
File "/home/evezeyl/anaconda3/envs/poppunk/lib/python3.7/site-packages/PopPUNK/__main__.py", line 488, in assign_query
threads, mash, True) # overwrite old db
File "/home/evezeyl/anaconda3/envs/poppunk/lib/python3.7/site-packages/PopPUNK/mash.py", line 282, in constructDatabase
assert(genome_length > 0)
UnboundLocalError: local variable 'genome_length' referenced before assignment
The command I used: poppunk --assign-query --ref-db all_in_one3 --distances all_in_one3/all_in_one3.dist --q-files query_list.txt --output our_samples_query2 --update-db
I will have to look at how to use the GitHub version if I want to go further and reuse your functions.
Generally, it would be nice to replot those graphs after refitting (good for people who are not yet totally proficient with all bioinformatic tools...).
Best regards
Eve
When running --update-db after --assign-query, the sketches produced are inconsistent with the distances and the network.
I think this is because the reference is still randomly sampled from cliques, rather than ensuring the previous reference is used when available. It's definitely desirable to avoid re-sketching the original references when updating, both computationally and because otherwise the assemblies would need to be distributed with the database.
Need to update extractReferences() in network.py to use existing references from cliques where already defined.
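A sketch of the selection logic (clique lists as plain Python, e.g. as produced by networkx.find_cliques; choose_references is a hypothetical name, not the current extractReferences()):

```python
def choose_references(cliques, existing_refs):
    # for each clique, keep an already-chosen reference when one is present,
    # otherwise fall back to the (deterministic) smallest-named member
    existing = set(existing_refs)
    refs = set()
    for clique in cliques:
        overlap = existing & set(clique)
        refs.add(min(overlap) if overlap else min(clique))
    return refs
```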
The mash command fails to run when a full path is given in the -o option; a single directory name works.
When ready to be made public
The script at scripts/generate_microreact.py should be integrated as an exported command in the main package to avoid duplicating functions.
Additionally, the following should be fixed:
--output
isn't clear whether it is a prefix/directory/file, and fails if a folder with that name does not exist.
I want to make the following changes:
Running with --use-model, followed by --assign-query, gave the following errors:
Network loaded: 344 samples
WARNING: Old cluster 9 split across multiple new clusters
WARNING: Old cluster 28 split across multiple new clusters
etc etc
Looking at cluster 9, the definition in the pruned network and the original cluster file are not consistent:
set(G.nodes()).intersection(oldClusters['9'])
{'assemblies_r4/5386_1_12.fa', 'assemblies_r4/9262_5_53.fa', 'assemblies_r4/19944_7_53.fa', 'assemblies_r4/SP2LAU.fa', 'assemblies_r4/19944_7_117.fa', 'assemblies_r4/19944_6_170.fa'}
list(nx.connected_components(G))
[{'assemblies_r4/11791_7_71.fa'}, {'assemblies_r4/11791_7_74.fa'}, {'assemblies_r4/11791_7_78.fa'}, {'assemblies_r4/11791_7_79.fa'}, {'assemblies_r4/8898_3_73.fa', 'assemblies_r4/11791_7_81.fa'}, {'assemblies_r4/11791_7_83.fa'}, {'assemblies_r4/11791_7_86.fa'}, {'assemblies_r4/11791_7_89.fa'}, {'assemblies_r4/11791_7_90.fa'}, {'assemblies_r4/11791_7_93.fa'}, {'assemblies_r4/11791_7_94.fa'}, {'assemblies_r4/11791_7_95.fa'}, {'assemblies_r4/11822_8_16.fa'}, {'assemblies_r4/11822_8_25.fa'}, {'assemblies_r4/11822_8_27.fa'}, {'assemblies_r4/11822_8_4.fa'}, {'assemblies_r4/11822_8_9.fa'}, {'assemblies_r4/5731_7_7.fa', 'assemblies_r4/11826_4_63.fa'}, {'assemblies_r4/11826_4_72.fa'}, {'assemblies_r4/11826_4_73.fa'}, {'assemblies_r4/11826_4_85.fa'}, {'assemblies_r4/11826_4_87.fa'}, {'assemblies_r4/19944_6_139.fa'}, {'assemblies_r4/19944_6_148.fa'}, {'assemblies_r4/19944_6_150.fa'}, {'assemblies_r4/19944_6_161.fa'}, {'assemblies_r4/19944_6_162.fa'}, {'assemblies_r4/19944_6_164.fa'}, {'assemblies_r4/19944_6_169.fa'}, {'assemblies_r4/5386_1_12.fa', 'assemblies_r4/19944_6_170.fa', 'assemblies_r4/SP2LAU.fa'}, {'assemblies_r4/19944_6_174.fa'}, {'assemblies_r4/19944_7_116.fa'}, {'assemblies_r4/19944_7_117.fa', 'assemblies_r4/19944_7_53.fa'}, {'assemblies_r4/19944_7_11.fa'}, {'assemblies_r4/19944_7_122.fa'}, {'assemblies_r4/ERR1733243.fa', 'assemblies_r4/19944_7_123.fa'}, {'assemblies_r4/19944_7_130.fa'}, {'assemblies_r4/SRS2376552.fa', 'assemblies_r4/SRS2372224.fa', 'assemblies_r4/19944_7_134.fa', 'assemblies_r4/SRS2376192.fa'}, {'assemblies_r4/19944_7_15.fa'}, {'assemblies_r4/19944_7_22.fa'}, {'assemblies_r4/19944_7_23.fa'}, {'assemblies_r4/SRS2372700.fa', 'assemblies_r4/19944_7_24.fa'}, {'assemblies_r4/19944_7_27.fa'}, {'assemblies_r4/19944_7_28.fa'}, ...
I thought this was fixed in #28 / #29, but I seem to still get this kind of error with the current code. Using --full-db
everything is fine.
Look into:
Need setup.py and dependencies.txt
When processing all the way to microreact, any file name with more than one '.' will fail, because different amounts of the filename will be treated as a suffix and lost.
When using rapidnj, underscores are converted to spaces.
Should we convert these characters to dashes automatically, or warn the user to change the names themselves?
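If automatic conversion is chosen, a sketch might look like this (hypothetical helper; assumes only the final extension should be treated as a suffix, with everything else kept in the label):

```python
import os
import re

def sanitise_label(path):
    # strip the directory and the final extension only, then replace the
    # characters that break downstream tools ('.' and '_') with dashes
    stem, _ = os.path.splitext(os.path.basename(path))
    return re.sub(r"[._]", "-", stem)
```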
Something strange seems to happen when trying to build a combined microreact output. When running the command:
python3 ./PopPUNK/poppunk-runner.py --assign-query --ref-db mass_ref --model-dir mass_refined_gmm --q-files pmen14.list --output mass_query_pmen14 --full-db --threads 4 --cytoscape --microreact --rapidnj ./rapidNJ/bin/ --update-db
I get the error:
Traceback (most recent call last):
File "./PopPUNK/poppunk-runner.py", line 9, in
main()
File "/lustre/scratch118/infgen/team81/nc3/strain_structure/querying_examples/PopPUNK/PopPUNK/__main__.py", line 383, in main
outputsForCytoscape(genomeNetwork, isolateClustering, args.output, args.info_csv, ordered_queryList)
File "/lustre/scratch118/infgen/team81/nc3/strain_structure/querying_examples/PopPUNK/PopPUNK/plot.py", line 51, in outputsForCytoscape
queryList)
File "/lustre/scratch118/infgen/team81/nc3/strain_structure/querying_examples/PopPUNK/PopPUNK/plot.py", line 94, in writeClusterCsv
if name in clustering['combined']:
KeyError: 'combined'
The reference database contains these files:
-rw-r--r-- 1 nc3 team81 49363944 Jun 14 14:00 mass_ref.17.msh
-rw-r--r-- 1 nc3 team81 24723936 Jun 14 14:00 mass_ref.13.msh
-rw-r--r-- 1 nc3 team81 49363944 Jun 14 14:00 mass_ref.21.msh
-rw-r--r-- 1 nc3 team81 49363944 Jun 14 14:00 mass_ref.25.msh
-rw-r--r-- 1 nc3 team81 49363944 Jun 14 14:04 mass_ref.29.msh
-rw-r--r-- 1 nc3 team81 44556 Jun 14 14:12 mass_ref.dists.pkl
-rw-r--r-- 1 nc3 team81 3030848 Jun 14 14:12 mass_ref.dists.npy
The model directory contains these files:
-rw-r--r-- 1 nc3 team81 269906 Jun 14 17:44 mass_refined_gmm_refined_fit.png
-rw-r--r-- 1 nc3 team81 22 Jun 14 17:44 mass_refined_gmm_fit.pkl
-rw-r--r-- 1 nc3 team81 772 Jun 14 17:44 mass_refined_gmm_fit.npz
-rw-r--r-- 1 nc3 team81 1755 Jun 14 17:44 mass_refined_gmm.refs
-rw-r--r-- 1 nc3 team81 18569 Jun 14 17:44 mass_refined_gmm_clusters.csv
-rw-r--r-- 1 nc3 team81 5047704 Jun 14 17:44 mass_refined_gmm.25.msh
-rw-r--r-- 1 nc3 team81 5047704 Jun 14 17:44 mass_refined_gmm.21.msh
-rw-r--r-- 1 nc3 team81 5047704 Jun 14 17:44 mass_refined_gmm.17.msh
-rw-r--r-- 1 nc3 team81 2527704 Jun 14 17:44 mass_refined_gmm.13.msh
-rw-r--r-- 1 nc3 team81 2541 Jun 14 17:45 mass_refined_gmm_graph.gpickle
-rw-r--r-- 1 nc3 team81 5047704 Jun 14 17:45 mass_refined_gmm.29.msh
Am I missing something obvious here?
This may affect priors on covariances of components - I got good fits when using MinMaxScaler previously
Hi!
This might be a problem at my end, but I seem to be having a memory issue when --easy-run or --create-db get to the "calculate core and accessory distances" stage. I've successfully run --easy-run on ~70 C. difficile genomes, but increasing to many more (200+) gives an error (end of this message).
When I'm running --easy-run I'm using:
poppunk --easy-run --r-files reference_list.txt --output lm_example --full-db --min-k 15
And I've tried this command for createdb:
poppunk --create-db --r-files reference_list.txt --output poppunk_db --k-step 2 --min-k 20 --plot-fit 5
I've also tried threaded runs, and both commands with --no-stream.
I've also provided as much as 300 GB of memory for the run.
Thanks in advance for any help with this!
Calculating core and accessory distances
Traceback (most recent call last):
File "/well/bag/moorem/anaconda/lib/python3.6/site-packages/sharedmem/sharedmem.py", line 415, in get
return Q.get(timeout=1)
File "/well/bag/moorem/anaconda/lib/python3.6/multiprocessing/queues.py", line 105, in get
raise Empty
queue.Empty
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/well/bag/moorem/anaconda/lib/python3.6/site-packages/sharedmem/sharedmem.py", line 423, in get
return Q.get(timeout=0)
File "/well/bag/moorem/anaconda/lib/python3.6/multiprocessing/queues.py", line 105, in get
raise Empty
queue.Empty
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/well/bag/moorem/anaconda/lib/python3.6/site-packages/sharedmem/sharedmem.py", line 757, in map
capsule = pg.get(R)
File "/well/bag/moorem/anaconda/lib/python3.6/site-packages/sharedmem/sharedmem.py", line 425, in get
raise StopProcessGroup
sharedmem.sharedmem.StopProcessGroup: StopProcessGroup
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/well/bag/moorem/anaconda/bin/poppunk", line 11, in
sys.exit(main())
File "/well/bag/moorem/anaconda/lib/python3.6/site-packages/PopPUNK/__main__.py", line 210, in main
args.plot_fit, args.no_stream, args.mash, args.threads)
File "/well/bag/moorem/anaconda/lib/python3.6/site-packages/PopPUNK/mash.py", line 520, in queryDatabase
pool.map(partial(fitKmerBlock, distMat=distMat, raw = raw, klist=klist, jacobian=jacobian), mat_chunks)
File "/well/bag/moorem/anaconda/lib/python3.6/site-packages/sharedmem/sharedmem.py", line 761, in map
raise pg.get_exception()
sharedmem.sharedmem.SlaveException: Residuals are not finite in the initial point.
Traceback (most recent call last):
File "/well/bag/moorem/anaconda/lib/python3.6/site-packages/sharedmem/sharedmem.py", line 294, in _slaveMain
self.main(self, *self.args)
File "/well/bag/moorem/anaconda/lib/python3.6/site-packages/sharedmem/sharedmem.py", line 628, in _main
r = realfunc(work)
File "/well/bag/moorem/anaconda/lib/python3.6/site-packages/sharedmem/sharedmem.py", line 703, in realfunc
else: return func(i)
File "/well/bag/moorem/anaconda/lib/python3.6/site-packages/PopPUNK/mash.py", line 543, in fitKmerBlock
distMat[start:end, :] = np.apply_along_axis(fitKmerCurve, 1, raw[start:end, :], klist, jacobian)
File "/well/bag/moorem/anaconda/lib/python3.6/site-packages/numpy/lib/shape_base.py", line 380, in apply_along_axis
buff[ind] = asanyarray(func1d(inarr_view[ind], *args, **kwargs))
File "/well/bag/moorem/anaconda/lib/python3.6/site-packages/PopPUNK/mash.py", line 570, in fitKmerCurve
bounds=([-np.inf, -np.inf], [0, 0]))
File "/well/bag/moorem/anaconda/lib/python3.6/site-packages/scipy/optimize/_lsq/least_squares.py", line 804, in least_squares
raise ValueError("Residuals are not finite in the initial point.")
ValueError: Residuals are not finite in the initial point.
With GPS-sized datasets, the mash info command fills the subprocess buffer, resulting in a deadlock due to this issue: https://thraxil.org/users/anders/posts/2008/03/13/Subprocess-Hanging-PIPE-is-your-enemy/.
I've hacked a fix for GPS using head:
mash_info = subprocess.Popen(mash_exec + " info -t " + dbname + " | head -n 100", shell=True, stdout=subprocess.PIPE)
Not very pythonic. I've updated the git with altered buffer size but haven't validated this on a large dataset yet:
mash_info = subprocess.Popen(mash_exec + " info -t " + dbname, bufsize = 0, shell=True, stdout=subprocess.PIPE)
Or we could just write to a tmp file and read that?
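A sketch of the tmp-file option (run_capture is a hypothetical helper name), which sidesteps the pipe buffer entirely:

```python
import subprocess
import tempfile

def run_capture(cmd):
    # direct stdout to an unnamed temporary file rather than a PIPE, so a
    # child writing more than the ~64 kB pipe buffer can never deadlock
    with tempfile.TemporaryFile(mode="w+") as tmp:
        subprocess.run(cmd, stdout=tmp, check=True)
        tmp.seek(0)
        return tmp.read()
```

The mash call would then be something like mash_info = run_capture([mash_exec, "info", "-t", dbname]); the file is deleted automatically when the with-block exits.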
This may be informative of the amount of recombination, if the mixture fit is good
Can be reported along with network stats
Due to the way we name output files, if a path to another directory is specified:
poppunk --assign-query --ref-db /lustre/scratch118/infgen/team81/nc3/GPS/ST_core/GPS --distances /lustre/scratch118/infgen/team81/nc3/GPS/ST_core/GPS/GPS.dists --model-dir /lustre/scratch118/infgen/team81/nc3/GPS/ST_core/GPS_dbscan_refine --q-files query.list --output GPSC_assignment --threads 16
We get errors:
ERROR: could not open ".//lustre/scratch118/infgen/team81/nc3/GPS/ST_core/GPS//lustre/scratch118/infgen/team81/nc3/GPS/ST_core/GPS.9.msh" for reading.
Need to add a general util function for writing output and parsing these paths correctly (using os.path).
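A sketch of such a helper (hypothetical name; assumes the database files live inside a directory named after the final path component of the prefix):

```python
import os

def db_file(db_prefix, suffix):
    # build "<prefix>/<name><suffix>", where the prefix may be a bare database
    # name or a full path to another directory; never prepend "./" blindly
    name = os.path.basename(db_prefix.rstrip("/"))
    return os.path.join(db_prefix, name + suffix)
```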
Run with defaults in single step
Use pi and accessory, and label axes
The levels are wrong. This can be fixed by removing the contour at zero. Line 299 of plot.py
becomes:
plt.contour(xx*scale[0], yy*scale[1], z, levels=levels[1:], cmap='plasma')
However I am not sure this is a wise global fix, as I've only seen it on this dataset. I'll keep my eye out though
In "--assign-query" mode, running a query with "--full-db" and "--update-db" doesn't seem to return the full clustering CSV - nor it seems the full network. Command used was:
poppunk-runner.py --assign-query --ref-db GPS_query --distances GPS_query/GPS_query.dists --model-dir GPS_query --q-files maela_transmission.rfiles --output maela_transmission_GPSC --threads 4 --no-stream --full-db --update-db --external-clustering gpsc_definitive.csv --overwrite
All the queries are in the CSV, just not all the original references. It would be useful to keep everything for iterative querying.
Command was:
bsub -n 8 -M 64000 -R 'select[mem>64000] rusage[mem=64000] span[hosts=1]' -o newer.gps.refine.o -e newer.gps.refine.e "source /nfs/pathogen005/jl11/large_installations/miniconda3/bin/activate /lustre/scratch118/infgen/team81/nc3/nc3; python3 ./PopPUNK/poppunk-runner.py --refine-model --distances GPS/GPS.dists --output GPS_refine --full-db --ref-db GPS --threads 8"
Output received was:
TERM_MEMLIMIT: job killed after reaching LSF memory usage limit.
Exited with exit code 1.
Resource usage summary:
CPU time : 26982.73 sec.
Max Memory : 64298 MB
Average Memory : 13995.88 MB
Total Requested Memory : 64000.00 MB
Delta Memory : -298.00 MB
Max Swap : 124570 MB
Max Processes : 12
Max Threads : 46
The output (if any) is above this job summary.
The stage reached was:
Mode: Refining model fit using network properties
Initial model-based network construction
Network summary:
Components 1
Density 0.0318
Transitivity 0.4419
Score 0.4278
Initial boundary based network construction
Decision boundary starts at (0.44,0.53)
Network summary:
Components 44
Density 0.0279
Transitivity 0.8644
Score 0.8403
Trying to optimise score globally
Traceback (most recent call last):
File "./PopPUNK/poppunk-runner.py", line 9, in
main()
File "/lustre/scratch118/infgen/team81/nc3/GPS/ST_core/PopPUNK/PopPUNK/__main__.py", line 246, in main
model, args.pos_shift, args.neg_shift, args.manual_start, args.no_local, args.threads)
File "/lustre/scratch118/infgen/team81/nc3/GPS/ST_core/PopPUNK/PopPUNK/refine.py", line 97, in refineFit
s_range)
File "/lustre/scratch118/infgen/team81/nc3/nc3/lib/python3.6/site-packages/sharedmem/sharedmem.py", line 757, in map
capsule = pg.get(R)
File "/lustre/scratch118/infgen/team81/nc3/nc3/lib/python3.6/site-packages/sharedmem/sharedmem.py", line 415, in get
return Q.get(timeout=1)
File "/lustre/scratch118/infgen/team81/nc3/nc3/lib/python3.6/multiprocessing/queues.py", line 104, in get
if not self._poll(timeout):
File "/lustre/scratch118/infgen/team81/nc3/nc3/lib/python3.6/multiprocessing/connection.py", line 257, in poll
return self._poll(timeout)
File "/lustre/scratch118/infgen/team81/nc3/nc3/lib/python3.6/multiprocessing/connection.py", line 414, in _poll
r = wait([self], timeout)
File "/lustre/scratch118/infgen/team81/nc3/nc3/lib/python3.6/multiprocessing/connection.py", line 911, in wait
ready = selector.select(timeout)
File "/lustre/scratch118/infgen/team81/nc3/nc3/lib/python3.6/selectors.py", line 376, in select
fd_event_list = self._poll.poll(timeout)
KeyboardInterrupt
As noted in #41, it would probably be useful to do some basic QC using stats that are already calculated, or can easily be. This will avoid potentially costly or error-prone database creation/assignment attempts.
Suggestions:
mash dist and regression
We need to make sure --create-query-db and --assign-query work, after all the new changes (probably after doing #14).
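One cheap QC check along these lines (a sketch only, not anything currently in PopPUNK) would flag genome-length outliers before sketching:

```python
import numpy as np

def length_qc(lengths, n_mads=3.0):
    # flag assemblies whose genome length deviates from the median by more
    # than n_mads median-absolute-deviations; a pre-sketching sanity check
    lengths = np.asarray(lengths, dtype=float)
    med = np.median(lengths)
    mad = np.median(np.abs(lengths - med)) or 1.0  # guard against mad == 0
    return np.abs(lengths - med) / mad > n_mads
```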
Just recording this because I've seen it a couple of times, will try to have a look asap.
Traceback (most recent call last):
File "./PopPUNK/poppunk-runner.py", line 9, in
main()
File "/lustre/scratch118/infgen/team81/nc3/GPS/ST_core/PopPUNK/PopPUNK/__main__.py", line 246, in main
model, args.pos_shift, args.neg_shift, args.manual_start, args.no_local, args.threads)
File "/lustre/scratch118/infgen/team81/nc3/GPS/ST_core/PopPUNK/PopPUNK/refine.py", line 132, in refineFit
"Refined fit boundary", outPrefix + "/" + outPrefix + "_refined_fit")
File "/lustre/scratch118/infgen/team81/nc3/GPS/ST_core/PopPUNK/PopPUNK/plot.py", line 428, in plot_refined_results
fig=plt.figure(figsize=(22, 16), dpi= 160, facecolor='w', edgecolor='k')
File "/lustre/scratch118/infgen/team81/nc3/nc3/lib/python3.6/site-packages/matplotlib/pyplot.py", line 539, in figure
**kwargs)
File "/lustre/scratch118/infgen/team81/nc3/nc3/lib/python3.6/site-packages/matplotlib/backend_bases.py", line 171, in new_figure_manager
return cls.new_figure_manager_given_figure(num, fig)
File "/lustre/scratch118/infgen/team81/nc3/nc3/lib/python3.6/site-packages/matplotlib/backends/backend_tkagg.py", line 1049, in new_figure_manager_given_figure
window = Tk.Tk(className="matplotlib")
File "/lustre/scratch118/infgen/team81/nc3/nc3/lib/python3.6/tkinter/__init__.py", line 2017, in __init__
self.tk = _tkinter.create(screenName, baseName, className, interactive, wantobjects, useTk, sync, use)
_tkinter.TclError: couldn't connect to display "farm3-head5:15.0"
Apply a provided core/accessory/combined distance cutoff (like a single refine fit boundary, provided by manual start, with no optimisation) to form clusters. May be useful for comparisons when ruling out outbreaks
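A sketch of that idea (hypothetical function; assumes distances arrive as pair-index tuples alongside a parallel distance list, and uses a small union-find rather than any of the existing PopPUNK network code):

```python
def clusters_from_cutoff(names, pairs, dists, cutoff):
    # single-linkage clustering: join any two samples whose distance is below
    # the cutoff, then report the connected components
    parent = {n: n for n in names}

    def find(x):
        # path-halving union-find lookup
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (i, j), d in zip(pairs, dists):
        if d < cutoff:
            parent[find(names[i])] = find(names[j])
    comps = {}
    for n in names:
        comps.setdefault(find(n), set()).add(n)
    return sorted(comps.values(), key=len, reverse=True)
```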
At least basic tests, and use travis-ci
It would apparently be preferred to have cluster assignments start from one rather than zero.
See print_clusters() in network.py.
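A sketch of 1-based numbering (hypothetical helper, not the current print_clusters()):

```python
def name_clusters(components):
    # number clusters from 1 upwards, largest component first; ties broken by
    # smallest member name so the numbering is stable between runs
    ordered = sorted(components, key=lambda c: (-len(c), min(c)))
    return {node: idx for idx, comp in enumerate(ordered, start=1)
            for node in comp}
```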
Needed to speed up db constructions for large datasets
Either in python, or using the mash interface (depending on the resolution to #1)
At the moment (due to the mash interface) samples are referred to by the filename of their assembly, but it would be convenient to be able to give each sample an arbitrary name in the --r-files or --q-files input:
sample1 assemblies/sample1.contigs.fa
Easy enough to update the PopPUNK side, but the mash sketch names are always the file name. Two possible ways of doing this I think:
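Whatever the mash-side resolution, the PopPUNK-side parsing could look something like this sketch (assumed two-column format; a bare path on its own line keeps the path as the name):

```python
def read_rfile(path):
    # each line is "name<whitespace>assembly_path"; single-field lines use the
    # path itself as the sample name
    names, files = [], []
    with open(path) as f:
        for line in f:
            fields = line.split()
            if not fields:
                continue
            names.append(fields[0])
            files.append(fields[-1])
    return names, files
```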
Dear PopPUNK team,
I ran the commands:
1$ poppunk --create-db --r-files lista2.txt --output strain_db --min-k 15 --k-step 2 --sketch-size 1000000 --threads 7 --plot-fit 5
2$ poppunk --fit-model --distances strain_db/strain_db.dists --output strain_db --full-db --ref-db strain_db --dbscan --microreact --rapidnj ../../../Programs/rapidNJ/bin/
3$ poppunk_references --network strain_db/strain_db_graph.gpickle --ref-db strain_db --distances strain_db/strain_db.dists --model strain_db --output strain_references --threads 7
4$ poppunk --refine-model --distances strain_references/strain_references.dists --output strain_references --full-db --ref-db strain_references --threads 7 --indiv-refine
5$ poppunk --assign-query --ref-db strain_references --q-files lista_willy.txt --output strain_query --threads 7 --update-db (Error)
ubuntu@enteropatogenos:~/Results/willy/poppunk$ poppunk --assign-query --ref-db strain_references --q-files lista_willy.txt --output strain_query --threads 7 --update-db
/home/ubuntu/.linuxbrew/opt/python/lib/python3.6/importlib/_bootstrap.py:219: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
return f(*args, **kwds)
PopPUNK (POPulation Partitioning Using Nucleotide Kmers)
Mode: Assigning clusters of query sequences
Creating mash database for k = 15
Random 15-mer probability: 0.00
Creating mash database for k = 17
Random 17-mer probability: 0.00
Creating mash database for k = 19
Random 19-mer probability: 0.00
Creating mash database for k = 21
Random 21-mer probability: 0.00
Creating mash database for k = 23
Random 23-mer probability: 0.00
Creating mash database for k = 25
Random 25-mer probability: 0.00
Creating mash database for k = 27
Random 27-mer probability: 0.00
Creating mash database for k = 29
Random 29-mer probability: 0.00
mash dist -p 7 strain_references/strain_references.15.msh strain_query/strain_query.15.msh 2> strain_references.err.log
mash dist -p 7 strain_references/strain_references.17.msh strain_query/strain_query.17.msh 2> strain_references.err.log
mash dist -p 7 strain_references/strain_references.19.msh strain_query/strain_query.19.msh 2> strain_references.err.log
mash dist -p 7 strain_references/strain_references.21.msh strain_query/strain_query.21.msh 2> strain_references.err.log
mash dist -p 7 strain_references/strain_references.23.msh strain_query/strain_query.23.msh 2> strain_references.err.log
mash dist -p 7 strain_references/strain_references.25.msh strain_query/strain_query.25.msh 2> strain_references.err.log
mash dist -p 7 strain_references/strain_references.27.msh strain_query/strain_query.27.msh 2> strain_references.err.log
mash dist -p 7 strain_references/strain_references.29.msh strain_query/strain_query.29.msh 2> strain_references.err.log
Calculating core and accessory distances
Loading previously refined model
Network loaded: 7 samples
Calculating all query-query distances
mash dist -p 7 strain_query/strain_query.15.msh strain_query/strain_query.15.msh 2> strain_query.err.log
mash dist -p 7 strain_query/strain_query.17.msh strain_query/strain_query.17.msh 2> strain_query.err.log
mash dist -p 7 strain_query/strain_query.19.msh strain_query/strain_query.19.msh 2> strain_query.err.log
mash dist -p 7 strain_query/strain_query.21.msh strain_query/strain_query.21.msh 2> strain_query.err.log
mash dist -p 7 strain_query/strain_query.23.msh strain_query/strain_query.23.msh 2> strain_query.err.log
mash dist -p 7 strain_query/strain_query.25.msh strain_query/strain_query.25.msh 2> strain_query.err.log
mash dist -p 7 strain_query/strain_query.27.msh strain_query/strain_query.27.msh 2> strain_query.err.log
mash dist -p 7 strain_query/strain_query.29.msh strain_query/strain_query.29.msh 2> strain_query.err.log
Calculating core and accessory distances
Updating reference database to strain_query
Creating mash database for k = 15
Random 15-mer probability: 0.00
Overwriting db: strain_query/strain_query.15.msh
Creating mash database for k = 17
Random 17-mer probability: 0.00
Overwriting db: strain_query/strain_query.17.msh
Creating mash database for k = 19
Random 19-mer probability: 0.00
Overwriting db: strain_query/strain_query.19.msh
Creating mash database for k = 21
Random 21-mer probability: 0.00
Overwriting db: strain_query/strain_query.21.msh
Creating mash database for k = 23
Random 23-mer probability: 0.00
Overwriting db: strain_query/strain_query.23.msh
Creating mash database for k = 25
Random 25-mer probability: 0.00
Overwriting db: strain_query/strain_query.25.msh
Creating mash database for k = 27
Random 27-mer probability: 0.00
Overwriting db: strain_query/strain_query.27.msh
Creating mash database for k = 29
Random 29-mer probability: 0.00
Overwriting db: strain_query/strain_query.29.msh
Writing strain_query/strain_query.15.joined.msh...
Writing strain_query/strain_query.17.joined.msh...
Writing strain_query/strain_query.19.joined.msh...
Writing strain_query/strain_query.21.joined.msh...
Writing strain_query/strain_query.23.joined.msh...
Writing strain_query/strain_query.25.joined.msh...
Writing strain_query/strain_query.27.joined.msh...
Writing strain_query/strain_query.29.joined.msh...
Traceback (most recent call last):
File "/home/ubuntu/.linuxbrew/bin/poppunk", line 11, in
sys.exit(main())
File "/home/ubuntu/.linuxbrew/opt/python/lib/python3.6/site-packages/PopPUNK/__main__.py", line 422, in main
refList, refList_copy, self, ref_distMat = readPickle(args.distances)
File "/home/ubuntu/.linuxbrew/opt/python/lib/python3.6/site-packages/PopPUNK/utils.py", line 53, in readPickle
with open(pklName + ".pkl", 'rb') as pickle_file:
TypeError: unsupported operand type(s) for +: 'NoneType' and 'str'
I hope that you can help me.
Clusters.csv just contains the reference database. The command was:
for r in nijmegen soton mass; do for q in maela nijmegen soton mass; do if [[ $r != $q ]]; then bsub -n 4 -M 2000 -R 'select[mem>2000] rusage[mem=2000] span[hosts=1]' -o ref.${r}.query.${q}.o -e ref.${r}.query.${q}.e "source /nfs/pathogen005/jl11/large_installations/miniconda3/bin/activate /lustre/scratch118/infgen/team81/nc3/nc3; python3 ./PopPUNK/poppunk-runner.py --assign-query --ref-db ${r}_db --distances ${r}_db/${r}_db.dists --model-dir ${r}_refined_bgmm_full --q-files ${q}.list --output ${r}_query_${q} --full-db --threads 4 --update-db"; fi; done; done
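The TypeError above is consistent with args.distances being None when readPickle concatenates it with ".pkl". A sketch of a friendlier guard (hypothetical names, not the project's actual code):

```python
import pickle

def read_pickle(pkl_prefix):
    """Sketch of a guarded readPickle (illustrative, not PopPUNK's code).

    The traceback's "'NoneType' and 'str'" comes from pkl_prefix being
    None (no distances prefix supplied), so None + ".pkl" fails.
    Failing fast with a clear message beats a TypeError deep in utils.py.
    """
    if pkl_prefix is None:
        raise ValueError("a distances prefix is required (e.g. --distances)")
    with open(pkl_prefix + ".pkl", "rb") as pickle_file:
        return pickle.load(pickle_file)
```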
e.g. run with --no-stream --ref-db ../query_db
Will fail with:
File "PopPUNK/mash.py", line 462, in queryDatabase
suffix=".tmp", dir="./" + os.path.basename(dbPrefix))
File "tempfile.py", line 340, in mkstemp
return _mkstemp_inner(dir, prefix, suffix, flags, output_type)
File "tempfile.py", line 258, in _mkstemp_inner
fd = _os.open(file, flags, 0o600)
FileNotFoundError: [Errno 2] No such file or directory: './query_db/query_dby6aic3ql.tmp'
Need to fix lines:
if no_stream:
tmpHandle, tmpName = mkstemp(prefix=os.path.basename(dbPrefix),
suffix=".tmp", dir="./" + os.path.basename(dbPrefix))
mash_cmd += " > " + tmpName
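A plausible fix (an assumption, not a confirmed patch): the temporary file is created under "./" + basename(dbPrefix), which only exists when the database directory sits directly inside the current working directory. With --ref-db ../query_db, the basename is query_db but ./query_db does not exist, hence the FileNotFoundError. Using the database directory itself as dir would be robust to relative prefixes:

```python
import os
from tempfile import mkstemp

def open_mash_tmp(db_prefix):
    """Create the temporary mash output inside the database directory
    itself rather than "./" + basename(db_prefix), so relative prefixes
    like ../query_db still resolve. Sketch only; assumes db_prefix is
    the path to the database directory."""
    tmp_handle, tmp_name = mkstemp(prefix=os.path.basename(db_prefix),
                                   suffix=".tmp", dir=db_prefix)
    return tmp_handle, tmp_name
```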
Hi John,
I'm getting the following issue running on ~3000 E. coli assemblies (which all look reasonable from a QC perspective). I previously managed to run PopPUNK with no issues, but I had to reinstall conda (and hence PopPUNK), and since then I can't get it to work. Do you have any idea what the issue is?
poppunk --create-db --r-files refence_list.txt --output strain_db --threads 8 --plot-fit 5 --min-k 14
Traceback (most recent call last):
File "/home/sam/miniconda3/lib/python3.6/site-packages/sharedmem/sharedmem.py", line 757, in map
capsule = pg.get(R)
File "/home/sam/miniconda3/lib/python3.6/site-packages/sharedmem/sharedmem.py", line 429, in get
raise StopProcessGroup
sharedmem.sharedmem.StopProcessGroup: StopProcessGroup
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/sam/miniconda3/bin/poppunk", line 11, in <module>
sys.exit(main())
File "/home/sam/miniconda3/lib/python3.6/site-packages/PopPUNK/main.py", line 210, in main
args.plot_fit, args.no_stream, args.mash, args.threads)
File "/home/sam/miniconda3/lib/python3.6/site-packages/PopPUNK/mash.py", line 520, in queryDatabase
pool.map(partial(fitKmerBlock, distMat=distMat, raw = raw, klist=klist, jacobian=jacobian), mat_chunks)
File "/home/sam/miniconda3/lib/python3.6/site-packages/sharedmem/sharedmem.py", line 761, in map
raise pg.get_exception()
sharedmem.sharedmem.SlaveException: Residuals are not finite in the initial point.
Traceback (most recent call last):
File "/home/sam/miniconda3/lib/python3.6/site-packages/sharedmem/sharedmem.py", line 294, in _slaveMain
self.main(self, *self.args)
File "/home/sam/miniconda3/lib/python3.6/site-packages/sharedmem/sharedmem.py", line 628, in _main
r = realfunc(work)
File "/home/sam/miniconda3/lib/python3.6/site-packages/sharedmem/sharedmem.py", line 703, in realfunc
else: return func(i)
File "/home/sam/miniconda3/lib/python3.6/site-packages/PopPUNK/mash.py", line 543, in fitKmerBlock
distMat[start:end, :] = np.apply_along_axis(fitKmerCurve, 1, raw[start:end, :], klist, jacobian)
File "/home/sam/.local/lib/python3.6/site-packages/numpy/lib/shape_base.py", line 380, in apply_along_axis
buff[ind] = asanyarray(func1d(inarr_view[ind], *args, **kwargs))
File "/home/sam/miniconda3/lib/python3.6/site-packages/PopPUNK/mash.py", line 570, in fitKmerCurve
bounds=([-np.inf, -np.inf], [0, 0]))
File "/home/sam/.local/lib/python3.6/site-packages/scipy/optimize/_lsq/least_squares.py", line 805, in least_squares
raise ValueError("Residuals are not finite in the initial point.")
ValueError: Residuals are not finite in the initial point.
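The ValueError at the bottom usually means at least one pair produced values the curve fit cannot take a logarithm of. One plausible root cause (an assumption, not confirmed by the traceback alone): fitKmerCurve fits a line to log(proportion of matching k-mers) against k, so a pair sharing no k-mers at some k (e.g. a truncated or contaminated assembly, or a too-small --min-k) yields log(0) = -inf, and scipy's least_squares rejects the non-finite initial residuals. A hypothetical pre-check under that assumption (names are illustrative, not PopPUNK's API):

```python
def nonfinite_ks(klist, match_props):
    """Return the k values whose match proportion would break a log fit.

    A proportion of 0 gives log(0) = -inf, which surfaces downstream as
    "Residuals are not finite in the initial point". This is a sketch of
    a QC screen, not the project's actual code.
    """
    return [k for k, p in zip(klist, match_props) if p <= 0.0]
```

Running this over the raw per-pair match proportions before fitting would identify which samples (or which low k values) to exclude.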
Add an option to use previous cluster names that were not necessarily generated by PopPUNK. Choose the closest sequence label (by core/combined distance?) and report that cluster's ID in an additional output column/file.
A wrapper script to run iteratively on smaller subsets of samples until all samples have been assigned, producing a smaller initial database to conserve resources on huge datasets.
Still sorted by taxon rather than frequency
Add a max-memory argument for large datasets. All pairs never need to be held in main memory simultaneously, but this may require the mash databases to be refactored.
For the mixture model, reservoir sampling can be used when reading the distances in.
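Reservoir sampling (Algorithm R) keeps a uniformly random sample of fixed size from a stream without ever holding the whole stream in memory, which is what subsampling distances on read-in would need. A generic sketch, not PopPUNK code:

```python
import random

def reservoir_sample(stream, k, rng=random):
    """Algorithm R: a uniform sample of k items from a stream of
    unknown length, using O(k) memory. Each item from the stream
    replaces a reservoir slot with probability k / (i + 1)."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)        # fill the reservoir first
        else:
            j = rng.randint(0, i)      # uniform index in [0, i]
            if j < k:
                sample[j] = item       # replace with decreasing probability
    return sample
```

This would let the mixture model see a bounded subsample while each distance row is read exactly once.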
A more general default for --easy-run would be to use HDBSCAN followed by network refinement, rather than the 2D GMM.
There seems to be an issue with DBSCAN fitting that only arises when running multithreaded; it looks to be a problem at the interface between joblib and the multiprocessing package:
PopPUNK (POPulation Partitioning Using Nucleotide Kmers)
Mode: Fitting model to reference database
Traceback (most recent call last):
File "./PopPUNK/poppunk-runner.py", line 9, in <module>
main()
File "/lustre/scratch118/infgen/team81/nc3/GPS/ST_core/PopPUNK/PopPUNK/main.py", line 248, in main
assignments = model.fit(distMat, args.D, args.min_cluster_prop, args.threads)
File "/lustre/scratch118/infgen/team81/nc3/GPS/ST_core/PopPUNK/PopPUNK/models.py", line 316, in fit
self.hdb, self.labels, self.n_clusters = fitDbScan(self.subsampled_X, self.outPrefix, min_samples, min_cluster_size, cache_out, threads)
File "/lustre/scratch118/infgen/team81/nc3/GPS/ST_core/PopPUNK/PopPUNK/dbscan.py", line 50, in fitDbScan
).fit(X)
File "/lustre/scratch118/infgen/team81/nc3/nc3/lib/python3.6/site-packages/hdbscan/hdbscan_.py", line 851, in fit
self.min_spanning_tree) = hdbscan(X, **kwargs)
File "/lustre/scratch118/infgen/team81/nc3/nc3/lib/python3.6/site-packages/hdbscan/hdbscan.py", line 546, in hdbscan
core_dist_n_jobs, **kwargs)
File "/lustre/scratch118/infgen/team81/nc3/nc3/lib/python3.6/site-packages/sklearn/externals/joblib/memory.py", line 362, in __call__
return self.func(*args, **kwargs)
File "/lustre/scratch118/infgen/team81/nc3/nc3/lib/python3.6/site-packages/hdbscan/hdbscan_.py", line 285, in _hdbscan_boruvka_balltree
n_jobs=core_dist_n_jobs, **kwargs)
File "hdbscan/_hdbscan_boruvka.pyx", line 984, in hdbscan._hdbscan_boruvka.BallTreeBoruvkaAlgorithm.__init__
File "hdbscan/_hdbscan_boruvka.pyx", line 1015, in hdbscan._hdbscan_boruvka.BallTreeBoruvkaAlgorithm._compute_bounds
File "/lustre/scratch118/infgen/team81/nc3/nc3/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py", line 789, in __call__
self.retrieve()
File "/lustre/scratch118/infgen/team81/nc3/nc3/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py", line 699, in retrieve
self._output.extend(job.get(timeout=self.timeout))
File "/lustre/scratch118/infgen/team81/nc3/nc3/lib/python3.6/multiprocessing/pool.py", line 644, in get
raise self._value
multiprocessing.pool.MaybeEncodingError: Error sending result: '[(array([[0.00000000e+00, 4.03246668e-04, 4.05466814e-04, ...,
3.45438025e-02, 3.45439840e-02, 3.45456626e-02],
[0.00000000e+00, 1.74505436e-04, 2.04864122e-04, ...,
2.66212775e-02, 2.66223376e-02, 2.66228534e-02],
[0.00000000e+00, 2.33849996e-04, 2.44379060e-04, ...,
3.66543389e-02, 3.66559647e-02, 3.66573385e-02],
...,
[0.00000000e+00, 7.88384237e-05, 1.32052839e-04, ...,
4.68294336e-02, 4.68303438e-02, 4.68309294e-02],
[0.00000000e+00, 1.04485943e-04, 2.06512190e-04, ...,
2.64343423e-02, 2.64372834e-02, 2.64386719e-02],
[0.00000000e+00, 1.87643709e-04, 2.02630717e-04, ...,
2.65259452e-02, 2.65293704e-02, 2.65309182e-02]]), array([[ 0, 21411, 61521, ..., 74665, 33889, 25600],
[ 1, 89127, 69051, ..., 21044, 27497, 84593],
[ 2, 41269, 85304, ..., 4793, 61086, 11021],
...,
[24997, 7094, 57682, ..., 26199, 13061, 51331],
[24998, 42754, 77802, ..., 96494, 7710, 7146],
[24999, 77949, 18152, ..., 14254, 39465, 95775]]))]'. Reason: 'error("'i' format requires -2147483648 <= number <= 2147483647",)'
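The Reason at the end of the MaybeEncodingError is the real clue: on the Python 3.6 used here, multiprocessing frames data sent between processes using the signed 32-bit 'i' struct format, so a worker result whose pickled payload (or an index) exceeds 2**31 - 1 cannot be serialized back to the parent. That would explain why the failure only appears with multiple threads and a large subsample, since single-threaded results never cross a process boundary. A minimal demonstration of the limit:

```python
import struct

# multiprocessing packs lengths/indices with the signed 32-bit 'i'
# format; anything past 2**31 - 1 raises the struct.error quoted in
# the MaybeEncodingError above.
struct.pack("!i", 2**31 - 1)  # largest value that fits in 'i'

try:
    struct.pack("!i", 2**31)  # one past the limit
except struct.error as err:
    print("overflow:", err)
```

Reducing the subsample size (or the number of threads used for the core-distance step) keeps each worker's result under the 2 GiB framing limit.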