PopPUNK (POPulation Partitioning Using Nucleotide Kmers)
Home Page: https://www.bacpop.org/poppunk
License: Apache License 2.0
Hi John,
We're getting some dependency errors after installing with pip (below). Using gcc-4.8.1 seemed to fix it for us.
Hope this helps!
Victoria
Pathogen Informatics
poppunk -h
Traceback (most recent call last):
File "/software/pathogen/external/apps/usr/local/Python-3.6.0/lib/python3.6/site-packages/llvmlite/binding/ffi.py", line 119, in
lib = ctypes.CDLL(os.path.join(_lib_dir, _lib_name))
File "/software/pathogen/external/apps/usr/local/Python-3.6.0/lib/python3.6/ctypes/__init__.py", line 344, in __init__
self._handle = _dlopen(self._name, mode)
OSError: /usr/lib/x86_64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.17' not found (required by /software/pathogen/external/apps/usr/local/Python-3.6.0/lib/python3.6/site-packages/llvmlite/binding/libllvmlite.so)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/software/pathogen/external/apps/usr/local/Python-3.6.0/bin/poppunk", line 11, in
load_entry_point('poppunk==1.1.3', 'console_scripts', 'poppunk')()
File "/software/pathogen/external/apps/usr/local/Python-3.6.0/lib/python3.6/site-packages/pkg_resources/__init__.py", line 487, in load_entry_point
return get_distribution(dist).load_entry_point(group, name)
File "/software/pathogen/external/apps/usr/local/Python-3.6.0/lib/python3.6/site-packages/pkg_resources/__init__.py", line 2728, in load_entry_point
return ep.load()
File "/software/pathogen/external/apps/usr/local/Python-3.6.0/lib/python3.6/site-packages/pkg_resources/__init__.py", line 2346, in load
return self.resolve()
File "/software/pathogen/external/apps/usr/local/Python-3.6.0/lib/python3.6/site-packages/pkg_resources/__init__.py", line 2352, in resolve
module = __import__(self.module_name, fromlist=['__name__'], level=0)
File "/software/pathogen/external/apps/usr/local/Python-3.6.0/lib/python3.6/site-packages/PopPUNK/__main__.py", line 23, in
from .models import *
File "/software/pathogen/external/apps/usr/local/Python-3.6.0/lib/python3.6/site-packages/PopPUNK/models.py", line 35, in
from .refine import refineFit
File "/software/pathogen/external/apps/usr/local/Python-3.6.0/lib/python3.6/site-packages/PopPUNK/refine.py", line 11, in
from numba import jit
File "/software/pathogen/external/apps/usr/local/Python-3.6.0/lib/python3.6/site-packages/numba/__init__.py", line 10, in
from . import config, errors, runtests, types
File "/software/pathogen/external/apps/usr/local/Python-3.6.0/lib/python3.6/site-packages/numba/config.py", line 11, in
import llvmlite.binding as ll
File "/software/pathogen/external/apps/usr/local/Python-3.6.0/lib/python3.6/site-packages/llvmlite/binding/__init__.py", line 6, in
from .dylib import *
File "/software/pathogen/external/apps/usr/local/Python-3.6.0/lib/python3.6/site-packages/llvmlite/binding/dylib.py", line 4, in
from . import ffi
File "/software/pathogen/external/apps/usr/local/Python-3.6.0/lib/python3.6/site-packages/llvmlite/binding/ffi.py", line 124, in
lib = ctypes.CDLL(_lib_name)
File "/software/pathogen/external/apps/usr/local/Python-3.6.0/lib/python3.6/ctypes/__init__.py", line 344, in __init__
self._handle = _dlopen(self._name, mode)
OSError: libllvmlite.so: cannot open shared object file: No such file or directory
I think mash dist
commands, when run with many threads, produce output faster than we can parse it, which increases memory use. Perhaps writing the parsing section in Cython would improve speed and memory use here.
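As a sketch of the streaming side of this idea (pure Python rather than Cython, and with a hypothetical stream_tsv helper name), consuming the pipe row by row keeps memory flat even when the producer outpaces the parser:

```python
import subprocess

def stream_tsv(cmd, handle_row):
    # launch the command and consume its TSV stdout line by line, so rows are
    # processed as they arrive instead of accumulating in memory
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True)
    for line in proc.stdout:
        handle_row(line.rstrip("\n").split("\t"))
    return proc.wait()
```

Usage would be something like stream_tsv([mash_exec, "dist", ref_msh, query_msh], store_row), where store_row writes each parsed row into a preallocated array.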
Isolate 11657_5#30 is correctly assigned to strain 6_14 in the reference-based (https://microreact.org/project/BkXBGJuFm) and complete (https://microreact.org/project/rJGAHaPtm) query-based clustering. However, it is in the wrong place in the tree and t-SNE projection in the reference-based clustering only. It seems there is a misalignment of names and distances that only applies to the core and accessory distance matrices when querying a reference-based database.
Struggling to get poppunk installed here.
Wondering if you have plans to write a bioconda package for it?
For within-strain use, and e.g. for gonococcus:
Start from the 2D fit, as currently doing, but also fit vertical and horizontal lines (thereby using core only/accessory only respectively).
Output three clusterings to microreact
Command was:
for f in maela mass nijmegen soton; do bsub -n 4 -M 32000 -R 'select[mem>32000] rusage[mem=32000] span[hosts=1]' -o bgmm.refine.ref.${f}.o -e bgmm.refine.ref.${f}.e "source /nfs/pathogen005/jl11/large_installations/miniconda3/bin/activate /lustre/scratch118/infgen/team81/nc3/nc3; python3 ./PopPUNK/poppunk-runner.py --refine-model --distances ${f}_db/${f}_db.dists --output ${f}_refined_bgmm_ref --ref-db ${f}_bgmm_full --threads 4"; done
Refinement with BGMM is complaining:
PopPUNK (POPulation Partitioning Using Nucleotide Kmers)
Mode: Refining model fit using network properties
Loading BGMM 2D Gaussian model
Initial model-based network construction based on Gaussian fit
Traceback (most recent call last):
File "./PopPUNK/poppunk-runner.py", line 9, in
main()
File "/lustre/scratch118/infgen/team81/nc3/GPS/ST_core/PopPUNK/PopPUNK/__main__.py", line 243, in main
args.no_local, args.threads)
File "/lustre/scratch118/infgen/team81/nc3/GPS/ST_core/PopPUNK/PopPUNK/models.py", line 516, in fit
args = (model, self.mean0, self.mean1, model.within_label, model.between_label))
File "/lustre/scratch118/infgen/team81/nc3/nc3/lib/python3.6/site-packages/scipy/optimize/zeros.py", line 510, in brentq
r = _zeros._brentq(f,a,b,xtol,rtol,maxiter,args,full_output,disp)
ValueError: f(a) and f(b) must have different signs
I've attached the image of the fit here; the files are all in:
With very large DBs we've observed a small number of samples not being present in the final network but being in the sketches and distances, we think due to I/O error on particular systems.
Add in a warning message (at least) when this is the case.
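A minimal sketch of such a warning (hypothetical function name; assumes the sketch names and network node names are directly comparable strings):

```python
import sys

def check_network_complete(sketch_names, network_nodes):
    # warn when sketched samples are absent from the final network,
    # rather than silently dropping them
    missing = set(sketch_names) - set(network_nodes)
    if missing:
        print("WARNING: %d samples in sketches but not in network: %s" %
              (len(missing), ", ".join(sorted(missing))), file=sys.stderr)
    return missing
```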
Docstrings for functions
Usage and tutorial on readthedocs
ubuntu@vibrio:~$ poppunk -h
/home/ubuntu/.linuxbrew/opt/python/lib/python3.7/site-packages/sklearn/externals/joblib/externals/cloudpickle/cloudpickle.py:47: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
import imp
usage: PopPUNK [-h]
(--easy-run | --create-db | --fit-model | --refine-model | --assign-query)
[--ref-db REF_DB] [--r-files R_FILES] [--q-files Q_FILES]
[--distances DISTANCES]
[--external-clustering EXTERNAL_CLUSTERING] --output OUTPUT
[--plot-fit PLOT_FIT] [--full-db] [--update-db] [--overwrite]
[--min-k MIN_K] [--max-k MAX_K] [--k-step K_STEP]
[--sketch-size SKETCH_SIZE] [--K K] [--dbscan] [--D D]
[--min-cluster-prop MIN_CLUSTER_PROP] [--pos-shift POS_SHIFT]
[--neg-shift NEG_SHIFT] [--manual-start MANUAL_START]
[--indiv-refine] [--no-local] [--model-dir MODEL_DIR]
[--previous-clustering PREVIOUS_CLUSTERING] [--core-only]
[--accessory-only] [--microreact] [--cytoscape] [--phandango]
[--grapetree] [--rapidnj RAPIDNJ] [--perplexity PERPLEXITY]
[--info-csv INFO_CSV] [--mash MASH] [--threads THREADS]
[--no-stream] [--version]
How can I install an old version of a library for PopPUNK using pip?
Thank you
Given a list of bad/unwanted sequences, remove these from distMat (read from pickle) and rewrite the .pkl and .npy files. Probably best as another program
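A possible sketch of that program, assuming (the layout is a guess) the .pkl holds the ordered sample name list and the .npy holds one row per unordered pair, in itertools.combinations order:

```python
import itertools
import pickle

import numpy as np

def prune_distances(pkl_in, npy_in, bad, pkl_out, npy_out):
    # load sample names and the per-pair distance rows
    with open(pkl_in, "rb") as f:
        names = pickle.load(f)
    dist = np.load(npy_in)
    # keep only pairs where neither member is on the bad list
    keep = [i for i, (a, b) in enumerate(itertools.combinations(names, 2))
            if a not in bad and b not in bad]
    with open(pkl_out, "wb") as f:
        pickle.dump([n for n in names if n not in bad], f)
    np.save(npy_out, dist[keep, :])
```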
Hi, I have done some testing to see if we can use PopPUNK at my workplace. I have some concerns that might be bugs, and some suggestions for improvement.
Updating reference database to our_samples_query2
Traceback (most recent call last):
File "/home/evezeyl/anaconda3/envs/poppunk/bin/poppunk", line 10, in <module>
sys.exit(main())
File "/home/evezeyl/anaconda3/envs/poppunk/lib/python3.7/site-packages/PopPUNK/__main__.py", line 381, in main
args.rapidnj, args.perplexity)
File "/home/evezeyl/anaconda3/envs/poppunk/lib/python3.7/site-packages/PopPUNK/__main__.py", line 488, in assign_query
threads, mash, True) # overwrite old db
File "/home/evezeyl/anaconda3/envs/poppunk/lib/python3.7/site-packages/PopPUNK/mash.py", line 282, in constructDatabase
assert(genome_length > 0)
UnboundLocalError: local variable 'genome_length' referenced before assignment
The command I used: poppunk --assign-query --ref-db all_in_one3 --distances all_in_one3/all_in_one3.dist --q-files query_list.txt --output our_samples_query2 --update-db
I will have to look at how to use the GitHub version if I want to go further and reuse your functions.
Generally, it would be nice to replot those graphs after refitting (good for people who are not yet totally proficient with all bioinformatic tools...).
Best regards
Eve
When running --update-db after --assign-query, the sketches produced are inconsistent with the distances and the network.
I think this is because the reference is still randomly sampled from cliques, rather than ensuring the previous reference is used when available. It's definitely desirable to avoid re-sketching the original references when updating, both computationally and because otherwise the assemblies would need to be distributed with the database.
Need to update extractReferences() in network.py to use existing references from cliques where already defined.
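A sketch of the selection logic (clique lists as plain Python, e.g. as produced by networkx.find_cliques; choose_references is a hypothetical name, not the current extractReferences()):

```python
def choose_references(cliques, existing_refs):
    # for each clique, keep an already-chosen reference when one is present,
    # otherwise fall back to the (deterministic) smallest-named member
    existing = set(existing_refs)
    refs = set()
    for clique in cliques:
        overlap = existing & set(clique)
        refs.add(min(overlap) if overlap else min(clique))
    return refs
```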
The mash command fails to run when a full path is given in the -o option; a single directory name works.
When ready to be made public
The script at scripts/generate_microreact.py should be integrated as an exported command in the main package to avoid duplicating functions.
Additionally, the following should be fixed:
--output
isn't clear whether it is a prefix/directory/file, and fails if a folder with that name does not exist.
I want to make the following changes:
Running with --use-model, followed by --assign-query, gave the following errors:
Network loaded: 344 samples
WARNING: Old cluster 9 split across multiple new clusters
WARNING: Old cluster 28 split across multiple new clusters
etc etc
Looking at cluster 9, the definition in the pruned network and the original cluster file are not consistent:
set(G.nodes()).intersection(oldClusters['9'])
{'assemblies_r4/5386_1_12.fa', 'assemblies_r4/9262_5_53.fa', 'assemblies_r4/19944_7_53.fa', 'assemblies_r4/SP2LAU.fa', 'assemblies_r4/19944_7_117.fa', 'assemblies_r4/19944_6_170.fa'}
list(nx.connected_components(G))
[{'assemblies_r4/11791_7_71.fa'}, {'assemblies_r4/11791_7_74.fa'}, {'assemblies_r4/11791_7_78.fa'}, {'assemblies_r4/11791_7_79.fa'}, {'assemblies_r4/8898_3_73.fa', 'assemblies_r4/11791_7_81.fa'}, {'assemblies_r4/11791_7_83.fa'}, {'assemblies_r4/11791_7_86.fa'}, {'assemblies_r4/11791_7_89.fa'}, {'assemblies_r4/11791_7_90.fa'}, {'assemblies_r4/11791_7_93.fa'}, {'assemblies_r4/11791_7_94.fa'}, {'assemblies_r4/11791_7_95.fa'}, {'assemblies_r4/11822_8_16.fa'}, {'assemblies_r4/11822_8_25.fa'}, {'assemblies_r4/11822_8_27.fa'}, {'assemblies_r4/11822_8_4.fa'}, {'assemblies_r4/11822_8_9.fa'}, {'assemblies_r4/5731_7_7.fa', 'assemblies_r4/11826_4_63.fa'}, {'assemblies_r4/11826_4_72.fa'}, {'assemblies_r4/11826_4_73.fa'}, {'assemblies_r4/11826_4_85.fa'}, {'assemblies_r4/11826_4_87.fa'}, {'assemblies_r4/19944_6_139.fa'}, {'assemblies_r4/19944_6_148.fa'}, {'assemblies_r4/19944_6_150.fa'}, {'assemblies_r4/19944_6_161.fa'}, {'assemblies_r4/19944_6_162.fa'}, {'assemblies_r4/19944_6_164.fa'}, {'assemblies_r4/19944_6_169.fa'}, {'assemblies_r4/5386_1_12.fa', 'assemblies_r4/19944_6_170.fa', 'assemblies_r4/SP2LAU.fa'}, {'assemblies_r4/19944_6_174.fa'}, {'assemblies_r4/19944_7_116.fa'}, {'assemblies_r4/19944_7_117.fa', 'assemblies_r4/19944_7_53.fa'}, {'assemblies_r4/19944_7_11.fa'}, {'assemblies_r4/19944_7_122.fa'}, {'assemblies_r4/ERR1733243.fa', 'assemblies_r4/19944_7_123.fa'}, {'assemblies_r4/19944_7_130.fa'}, {'assemblies_r4/SRS2376552.fa', 'assemblies_r4/SRS2372224.fa', 'assemblies_r4/19944_7_134.fa', 'assemblies_r4/SRS2376192.fa'}, {'assemblies_r4/19944_7_15.fa'}, {'assemblies_r4/19944_7_22.fa'}, {'assemblies_r4/19944_7_23.fa'}, {'assemblies_r4/SRS2372700.fa', 'assemblies_r4/19944_7_24.fa'}, {'assemblies_r4/19944_7_27.fa'}, {'assemblies_r4/19944_7_28.fa'}, ...
I thought this was fixed in #28 / #29, but I seem to still get this kind of error with the current code. Using --full-db
everything is fine.
Look into:
Need setup.py and dependencies.txt
When processing all the way to microreact, any file name with more than one '.' will fail, because different amounts of the filename will be treated as a suffix and lost.
When using rapidnj, underscores are converted to spaces.
Should we convert these characters to dashes automatically, or warn the user to change the names themselves?
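If automatic conversion is chosen, a sketch might look like this (hypothetical helper; assumes only the final extension should be treated as a suffix, with everything else kept in the label):

```python
import os
import re

def sanitise_label(path):
    # strip the directory and the final extension only, then replace the
    # characters that break downstream tools ('.' and '_') with dashes
    stem, _ = os.path.splitext(os.path.basename(path))
    return re.sub(r"[._]", "-", stem)
```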
Something strange seems to happen when trying to build a combined microreact output. When running the command:
python3 ./PopPUNK/poppunk-runner.py --assign-query --ref-db mass_ref --model-dir mass_refined_gmm --q-files pmen14.list --output mass_query_pmen14 --full-db --threads 4 --cytoscape --microreact --rapidnj ./rapidNJ/bin/ --update-db
I get the error:
Traceback (most recent call last):
File "./PopPUNK/poppunk-runner.py", line 9, in
main()
File "/lustre/scratch118/infgen/team81/nc3/strain_structure/querying_examples/PopPUNK/PopPUNK/__main__.py", line 383, in main
outputsForCytoscape(genomeNetwork, isolateClustering, args.output, args.info_csv, ordered_queryList)
File "/lustre/scratch118/infgen/team81/nc3/strain_structure/querying_examples/PopPUNK/PopPUNK/plot.py", line 51, in outputsForCytoscape
queryList)
File "/lustre/scratch118/infgen/team81/nc3/strain_structure/querying_examples/PopPUNK/PopPUNK/plot.py", line 94, in writeClusterCsv
if name in clustering['combined']:
KeyError: 'combined'
The reference database contains these files:
-rw-r--r-- 1 nc3 team81 49363944 Jun 14 14:00 mass_ref.17.msh
-rw-r--r-- 1 nc3 team81 24723936 Jun 14 14:00 mass_ref.13.msh
-rw-r--r-- 1 nc3 team81 49363944 Jun 14 14:00 mass_ref.21.msh
-rw-r--r-- 1 nc3 team81 49363944 Jun 14 14:00 mass_ref.25.msh
-rw-r--r-- 1 nc3 team81 49363944 Jun 14 14:04 mass_ref.29.msh
-rw-r--r-- 1 nc3 team81 44556 Jun 14 14:12 mass_ref.dists.pkl
-rw-r--r-- 1 nc3 team81 3030848 Jun 14 14:12 mass_ref.dists.npy
The model directory contains these files:
-rw-r--r-- 1 nc3 team81 269906 Jun 14 17:44 mass_refined_gmm_refined_fit.png
-rw-r--r-- 1 nc3 team81 22 Jun 14 17:44 mass_refined_gmm_fit.pkl
-rw-r--r-- 1 nc3 team81 772 Jun 14 17:44 mass_refined_gmm_fit.npz
-rw-r--r-- 1 nc3 team81 1755 Jun 14 17:44 mass_refined_gmm.refs
-rw-r--r-- 1 nc3 team81 18569 Jun 14 17:44 mass_refined_gmm_clusters.csv
-rw-r--r-- 1 nc3 team81 5047704 Jun 14 17:44 mass_refined_gmm.25.msh
-rw-r--r-- 1 nc3 team81 5047704 Jun 14 17:44 mass_refined_gmm.21.msh
-rw-r--r-- 1 nc3 team81 5047704 Jun 14 17:44 mass_refined_gmm.17.msh
-rw-r--r-- 1 nc3 team81 2527704 Jun 14 17:44 mass_refined_gmm.13.msh
-rw-r--r-- 1 nc3 team81 2541 Jun 14 17:45 mass_refined_gmm_graph.gpickle
-rw-r--r-- 1 nc3 team81 5047704 Jun 14 17:45 mass_refined_gmm.29.msh
Am I missing something obvious here?
This may affect priors on covariances of components - I got good fits when using MinMaxScaler previously
Hi!
This might be a problem at my end, but I seem to be having a memory issue when --easy-run or --create-db get to the "calculate core and accessory distances" stage. I've successfully run --easy-run on ~70 C. difficile genomes, but increasing to many more (200+) gives an error (end of this message).
When I'm running --easy-run I'm using:
poppunk --easy-run --r-files reference_list.txt --output lm_example --full-db --min-k 15
And I've tried this command for createdb:
poppunk --create-db --r-files reference_list.txt --output poppunk_db --k-step 2 --min-k 20 --plot-fit 5
I've also tried threaded runs, and both commands with --no-stream.
I've also provided as much as 300 GB of memory for the run.
Thanks in advance for any help with this!
Calculating core and accessory distances
Traceback (most recent call last):
File "/well/bag/moorem/anaconda/lib/python3.6/site-packages/sharedmem/sharedmem.py", line 415, in get
return Q.get(timeout=1)
File "/well/bag/moorem/anaconda/lib/python3.6/multiprocessing/queues.py", line 105, in get
raise Empty
queue.Empty
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/well/bag/moorem/anaconda/lib/python3.6/site-packages/sharedmem/sharedmem.py", line 423, in get
return Q.get(timeout=0)
File "/well/bag/moorem/anaconda/lib/python3.6/multiprocessing/queues.py", line 105, in get
raise Empty
queue.Empty
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/well/bag/moorem/anaconda/lib/python3.6/site-packages/sharedmem/sharedmem.py", line 757, in map
capsule = pg.get(R)
File "/well/bag/moorem/anaconda/lib/python3.6/site-packages/sharedmem/sharedmem.py", line 425, in get
raise StopProcessGroup
sharedmem.sharedmem.StopProcessGroup: StopProcessGroup
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/well/bag/moorem/anaconda/bin/poppunk", line 11, in
sys.exit(main())
File "/well/bag/moorem/anaconda/lib/python3.6/site-packages/PopPUNK/__main__.py", line 210, in main
args.plot_fit, args.no_stream, args.mash, args.threads)
File "/well/bag/moorem/anaconda/lib/python3.6/site-packages/PopPUNK/mash.py", line 520, in queryDatabase
pool.map(partial(fitKmerBlock, distMat=distMat, raw = raw, klist=klist, jacobian=jacobian), mat_chunks)
File "/well/bag/moorem/anaconda/lib/python3.6/site-packages/sharedmem/sharedmem.py", line 761, in map
raise pg.get_exception()
sharedmem.sharedmem.SlaveException: Residuals are not finite in the initial point.
Traceback (most recent call last):
File "/well/bag/moorem/anaconda/lib/python3.6/site-packages/sharedmem/sharedmem.py", line 294, in _slaveMain
self.main(self, *self.args)
File "/well/bag/moorem/anaconda/lib/python3.6/site-packages/sharedmem/sharedmem.py", line 628, in _main
r = realfunc(work)
File "/well/bag/moorem/anaconda/lib/python3.6/site-packages/sharedmem/sharedmem.py", line 703, in realfunc
else: return func(i)
File "/well/bag/moorem/anaconda/lib/python3.6/site-packages/PopPUNK/mash.py", line 543, in fitKmerBlock
distMat[start:end, :] = np.apply_along_axis(fitKmerCurve, 1, raw[start:end, :], klist, jacobian)
File "/well/bag/moorem/anaconda/lib/python3.6/site-packages/numpy/lib/shape_base.py", line 380, in apply_along_axis
buff[ind] = asanyarray(func1d(inarr_view[ind], *args, **kwargs))
File "/well/bag/moorem/anaconda/lib/python3.6/site-packages/PopPUNK/mash.py", line 570, in fitKmerCurve
bounds=([-np.inf, -np.inf], [0, 0]))
File "/well/bag/moorem/anaconda/lib/python3.6/site-packages/scipy/optimize/_lsq/least_squares.py", line 804, in least_squares
raise ValueError("Residuals are not finite in the initial point.")
ValueError: Residuals are not finite in the initial point.
With GPS-sized datasets, the mash info command fills the subprocess buffer, resulting in a deadlock due to this issue: https://thraxil.org/users/anders/posts/2008/03/13/Subprocess-Hanging-PIPE-is-your-enemy/.
I've hacked a fix for GPS using head:
mash_info = subprocess.Popen(mash_exec + " info -t " + dbname + " | head -n 100", shell=True, stdout=subprocess.PIPE)
Not very pythonic. I've updated the git with altered buffer size but haven't validated this on a large dataset yet:
mash_info = subprocess.Popen(mash_exec + " info -t " + dbname, bufsize = 0, shell=True, stdout=subprocess.PIPE)
Or we could just write to a tmp file and read that?
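A sketch of the tmp-file option (run_capture is a hypothetical helper name), which sidesteps the pipe buffer entirely:

```python
import subprocess
import tempfile

def run_capture(cmd):
    # direct stdout to an unnamed temporary file rather than a PIPE, so a
    # child writing more than the ~64 kB pipe buffer can never deadlock
    with tempfile.TemporaryFile(mode="w+") as tmp:
        subprocess.run(cmd, stdout=tmp, check=True)
        tmp.seek(0)
        return tmp.read()
```

The mash call would then be something like mash_info = run_capture([mash_exec, "info", "-t", dbname]); the file is deleted automatically when the with-block exits.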
This may be informative of the amount of recombination, if the mixture fit is good
Can be reported along with network stats
Due to the way we name output files, if a path to another directory is specified:
poppunk --assign-query --ref-db /lustre/scratch118/infgen/team81/nc3/GPS/ST_core/GPS --distances /lustre/scratch118/infgen/team81/nc3/GPS/ST_core/GPS/GPS.dists --model-dir /lustre/scratch118/infgen/team81/nc3/GPS/ST_core/GPS_dbscan_refine --q-files query.list --output GPSC_assignment --threads 16
We get errors:
ERROR: could not open ".//lustre/scratch118/infgen/team81/nc3/GPS/ST_core/GPS//lustre/scratch118/infgen/team81/nc3/GPS/ST_core/GPS.9.msh" for reading.
Need to add a general util function for writing output and parsing these paths correctly (using os.path).
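A sketch of such a helper (hypothetical name; assumes the database files live inside a directory named after the final path component of the prefix):

```python
import os

def db_file(db_prefix, suffix):
    # build "<prefix>/<name><suffix>", where the prefix may be a bare database
    # name or a full path to another directory; never prepend "./" blindly
    name = os.path.basename(db_prefix.rstrip("/"))
    return os.path.join(db_prefix, name + suffix)
```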
Run with defaults in single step
Use pi and accessory, and label axes
The levels are wrong. This can be fixed by removing the contour at zero. Line 299 of plot.py
becomes:
plt.contour(xx*scale[0], yy*scale[1], z, levels=levels[1:], cmap='plasma')
However I am not sure this is a wise global fix, as I've only seen it on this dataset. I'll keep my eye out though
In "--assign-query" mode, running a query with "--full-db" and "--update-db" doesn't seem to return the full clustering CSV - nor it seems the full network. Command used was:
poppunk-runner.py --assign-query --ref-db GPS_query --distances GPS_query/GPS_query.dists --model-dir GPS_query --q-files maela_transmission.rfiles --output maela_transmission_GPSC --threads 4 --no-stream --full-db --update-db --external-clustering gpsc_definitive.csv --overwrite
All the queries are in the CSV, just not all the original references. It would be useful to keep everything for iterative querying.
Command was:
bsub -n 8 -M 64000 -R 'select[mem>64000] rusage[mem=64000] span[hosts=1]' -o newer.gps.refine.o -e newer.gps.refine.e "source /nfs/pathogen005/jl11/large_installations/miniconda3/bin/activate /lustre/scratch118/infgen/team81/nc3/nc3; python3 ./PopPUNK/poppunk-runner.py --refine-model --distances GPS/GPS.dists --output GPS_refine --full-db --ref-db GPS --threads 8"
Output received was:
TERM_MEMLIMIT: job killed after reaching LSF memory usage limit.
Exited with exit code 1.
Resource usage summary:
CPU time : 26982.73 sec.
Max Memory : 64298 MB
Average Memory : 13995.88 MB
Total Requested Memory : 64000.00 MB
Delta Memory : -298.00 MB
Max Swap : 124570 MB
Max Processes : 12
Max Threads : 46
The output (if any) is above this job summary.
The stage reached was:
Mode: Refining model fit using network properties
Initial model-based network construction
Network summary:
Components 1
Density 0.0318
Transitivity 0.4419
Score 0.4278
Initial boundary based network construction
Decision boundary starts at (0.44,0.53)
Network summary:
Components 44
Density 0.0279
Transitivity 0.8644
Score 0.8403
Trying to optimise score globally
Traceback (most recent call last):
File "./PopPUNK/poppunk-runner.py", line 9, in
main()
File "/lustre/scratch118/infgen/team81/nc3/GPS/ST_core/PopPUNK/PopPUNK/__main__.py", line 246, in main
model, args.pos_shift, args.neg_shift, args.manual_start, args.no_local, args.threads)
File "/lustre/scratch118/infgen/team81/nc3/GPS/ST_core/PopPUNK/PopPUNK/refine.py", line 97, in refineFit
s_range)
File "/lustre/scratch118/infgen/team81/nc3/nc3/lib/python3.6/site-packages/sharedmem/sharedmem.py", line 757, in map
capsule = pg.get(R)
File "/lustre/scratch118/infgen/team81/nc3/nc3/lib/python3.6/site-packages/sharedmem/sharedmem.py", line 415, in get
return Q.get(timeout=1)
File "/lustre/scratch118/infgen/team81/nc3/nc3/lib/python3.6/multiprocessing/queues.py", line 104, in get
if not self._poll(timeout):
File "/lustre/scratch118/infgen/team81/nc3/nc3/lib/python3.6/multiprocessing/connection.py", line 257, in poll
return self._poll(timeout)
File "/lustre/scratch118/infgen/team81/nc3/nc3/lib/python3.6/multiprocessing/connection.py", line 414, in _poll
r = wait([self], timeout)
File "/lustre/scratch118/infgen/team81/nc3/nc3/lib/python3.6/multiprocessing/connection.py", line 911, in wait
ready = selector.select(timeout)
File "/lustre/scratch118/infgen/team81/nc3/nc3/lib/python3.6/selectors.py", line 376, in select
fd_event_list = self._poll.poll(timeout)
KeyboardInterrupt
As noted in #41, it would probably be useful to do some basic QC using stats that are already calculated, or can easily be. This will avoid potentially costly or error-prone database creation/assignment attempts.
Suggestions:
mash dist and regression
We need to make sure --create-query-db and --assign-query work, after all the new changes (probably after doing #14).
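One cheap QC check along these lines (a sketch only, not anything currently in PopPUNK) would flag genome-length outliers before sketching:

```python
import numpy as np

def length_qc(lengths, n_mads=3.0):
    # flag assemblies whose genome length deviates from the median by more
    # than n_mads median-absolute-deviations; a pre-sketching sanity check
    lengths = np.asarray(lengths, dtype=float)
    med = np.median(lengths)
    mad = np.median(np.abs(lengths - med)) or 1.0  # guard against mad == 0
    return np.abs(lengths - med) / mad > n_mads
```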
Just recording this because I've seen it a couple of times, will try to have a look asap.
Traceback (most recent call last):
File "./PopPUNK/poppunk-runner.py", line 9, in
main()
File "/lustre/scratch118/infgen/team81/nc3/GPS/ST_core/PopPUNK/PopPUNK/__main__.py", line 246, in main
model, args.pos_shift, args.neg_shift, args.manual_start, args.no_local, args.threads)
File "/lustre/scratch118/infgen/team81/nc3/GPS/ST_core/PopPUNK/PopPUNK/refine.py", line 132, in refineFit
"Refined fit boundary", outPrefix + "/" + outPrefix + "_refined_fit")
File "/lustre/scratch118/infgen/team81/nc3/GPS/ST_core/PopPUNK/PopPUNK/plot.py", line 428, in plot_refined_results
fig=plt.figure(figsize=(22, 16), dpi= 160, facecolor='w', edgecolor='k')
File "/lustre/scratch118/infgen/team81/nc3/nc3/lib/python3.6/site-packages/matplotlib/pyplot.py", line 539, in figure
**kwargs)
File "/lustre/scratch118/infgen/team81/nc3/nc3/lib/python3.6/site-packages/matplotlib/backend_bases.py", line 171, in new_figure_manager
return cls.new_figure_manager_given_figure(num, fig)
File "/lustre/scratch118/infgen/team81/nc3/nc3/lib/python3.6/site-packages/matplotlib/backends/backend_tkagg.py", line 1049, in new_figure_manager_given_figure
window = Tk.Tk(className="matplotlib")
File "/lustre/scratch118/infgen/team81/nc3/nc3/lib/python3.6/tkinter/__init__.py", line 2017, in __init__
self.tk = _tkinter.create(screenName, baseName, className, interactive, wantobjects, useTk, sync, use)
_tkinter.TclError: couldn't connect to display "farm3-head5:15.0"
Apply a provided core/accessory/combined distance cutoff (like a single refine fit boundary, provided by manual start, with no optimisation) to form clusters. May be useful for comparisons when ruling out outbreaks
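A sketch of that idea (hypothetical function; assumes distances arrive as pair-index tuples alongside a parallel distance list, and uses a small union-find rather than any of the existing PopPUNK network code):

```python
def clusters_from_cutoff(names, pairs, dists, cutoff):
    # single-linkage clustering: join any two samples whose distance is below
    # the cutoff, then report the connected components
    parent = {n: n for n in names}

    def find(x):
        # path-halving union-find lookup
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (i, j), d in zip(pairs, dists):
        if d < cutoff:
            parent[find(names[i])] = find(names[j])
    comps = {}
    for n in names:
        comps.setdefault(find(n), set()).add(n)
    return sorted(comps.values(), key=len, reverse=True)
```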
At least basic tests, and use travis-ci
It would apparently be preferred to have cluster assignments start from one rather than zero.
See print_clusters() in network.py.
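A sketch of 1-based numbering (hypothetical helper, not the current print_clusters()):

```python
def name_clusters(components):
    # number clusters from 1 upwards, largest component first; ties broken by
    # smallest member name so the numbering is stable between runs
    ordered = sorted(components, key=lambda c: (-len(c), min(c)))
    return {node: idx for idx, comp in enumerate(ordered, start=1)
            for node in comp}
```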
Needed to speed up db constructions for large datasets
Either in python, or using the mash interface (depending on the resolution to #1)
At the moment (due to the mash interface) samples are referred to by the filename of their assembly, but it would be convenient to be able to give each sample an arbitrary name in the --r-files or --q-files input:
sample1 assemblies/sample1.contigs.fa
Easy enough to update the PopPUNK side, but the mash sketch names are always the file name. Two possible ways of doing this I think:
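Whatever the mash-side resolution, the PopPUNK-side parsing could look something like this sketch (assumed two-column format; a bare path on its own line keeps the path as the name):

```python
def read_rfile(path):
    # each line is "name<whitespace>assembly_path"; single-field lines use the
    # path itself as the sample name
    names, files = [], []
    with open(path) as f:
        for line in f:
            fields = line.split()
            if not fields:
                continue
            names.append(fields[0])
            files.append(fields[-1])
    return names, files
```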
Dear PopPUNK team,
I ran the commands:
1$ poppunk --create-db --r-files lista2.txt --output strain_db --min-k 15 --k-step 2 --sketch-size 1000000 --threads 7 --plot-fit 5
2$ poppunk --fit-model --distances strain_db/strain_db.dists --output strain_db --full-db --ref-db strain_db --dbscan --microreact --rapidnj ../../../Programs/rapidNJ/bin/
3$ poppunk_references --network strain_db/strain_db_graph.gpickle --ref-db strain_db --distances strain_db/strain_db.dists --model strain_db --output strain_references --threads 7
4$ poppunk --refine-model --distances strain_references/strain_references.dists --output strain_references --full-db --ref-db strain_references --threads 7 --indiv-refine
5$ poppunk --assign-query --ref-db strain_references --q-files lista_willy.txt --output strain_query --threads 7 --update-db (Error)
ubuntu@enteropatogenos:~/Results/willy/poppunk$ poppunk --assign-query --ref-db strain_references --q-files lista_willy.txt --output strain_query --threads 7 --update-db
/home/ubuntu/.linuxbrew/opt/python/lib/python3.6/importlib/_bootstrap.py:219: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
return f(*args, **kwds)
PopPUNK (POPulation Partitioning Using Nucleotide Kmers)
Mode: Assigning clusters of query sequences
Creating mash database for k = 15
Random 15-mer probability: 0.00
Creating mash database for k = 17
Random 17-mer probability: 0.00
Creating mash database for k = 19
Random 19-mer probability: 0.00
Creating mash database for k = 21
Random 21-mer probability: 0.00
Creating mash database for k = 23
Random 23-mer probability: 0.00
Creating mash database for k = 25
Random 25-mer probability: 0.00
Creating mash database for k = 27
Random 27-mer probability: 0.00
Creating mash database for k = 29
Random 29-mer probability: 0.00
mash dist -p 7 strain_references/strain_references.15.msh strain_query/strain_query.15.msh 2> strain_references.err.log
mash dist -p 7 strain_references/strain_references.17.msh strain_query/strain_query.17.msh 2> strain_references.err.log
mash dist -p 7 strain_references/strain_references.19.msh strain_query/strain_query.19.msh 2> strain_references.err.log
mash dist -p 7 strain_references/strain_references.21.msh strain_query/strain_query.21.msh 2> strain_references.err.log
mash dist -p 7 strain_references/strain_references.23.msh strain_query/strain_query.23.msh 2> strain_references.err.log
mash dist -p 7 strain_references/strain_references.25.msh strain_query/strain_query.25.msh 2> strain_references.err.log
mash dist -p 7 strain_references/strain_references.27.msh strain_query/strain_query.27.msh 2> strain_references.err.log
mash dist -p 7 strain_references/strain_references.29.msh strain_query/strain_query.29.msh 2> strain_references.err.log
Calculating core and accessory distances
Loading previously refined model
Network loaded: 7 samples
Calculating all query-query distances
mash dist -p 7 strain_query/strain_query.15.msh strain_query/strain_query.15.msh 2> strain_query.err.log
mash dist -p 7 strain_query/strain_query.17.msh strain_query/strain_query.17.msh 2> strain_query.err.log
mash dist -p 7 strain_query/strain_query.19.msh strain_query/strain_query.19.msh 2> strain_query.err.log
mash dist -p 7 strain_query/strain_query.21.msh strain_query/strain_query.21.msh 2> strain_query.err.log
mash dist -p 7 strain_query/strain_query.23.msh strain_query/strain_query.23.msh 2> strain_query.err.log
mash dist -p 7 strain_query/strain_query.25.msh strain_query/strain_query.25.msh 2> strain_query.err.log
mash dist -p 7 strain_query/strain_query.27.msh strain_query/strain_query.27.msh 2> strain_query.err.log
mash dist -p 7 strain_query/strain_query.29.msh strain_query/strain_query.29.msh 2> strain_query.err.log
Calculating core and accessory distances
Updating reference database to strain_query
Creating mash database for k = 15
Random 15-mer probability: 0.00
Overwriting db: strain_query/strain_query.15.msh
Creating mash database for k = 17
Random 17-mer probability: 0.00
Overwriting db: strain_query/strain_query.17.msh
Creating mash database for k = 19
Random 19-mer probability: 0.00
Overwriting db: strain_query/strain_query.19.msh
Creating mash database for k = 21
Random 21-mer probability: 0.00
Overwriting db: strain_query/strain_query.21.msh
Creating mash database for k = 23
Random 23-mer probability: 0.00
Overwriting db: strain_query/strain_query.23.msh
Creating mash database for k = 25
Random 25-mer probability: 0.00
Overwriting db: strain_query/strain_query.25.msh
Creating mash database for k = 27
Random 27-mer probability: 0.00
Overwriting db: strain_query/strain_query.27.msh
Creating mash database for k = 29
Random 29-mer probability: 0.00
Overwriting db: strain_query/strain_query.29.msh
Writing strain_query/strain_query.15.joined.msh...
Writing strain_query/strain_query.17.joined.msh...
Writing strain_query/strain_query.19.joined.msh...
Writing strain_query/strain_query.21.joined.msh...
Writing strain_query/strain_query.23.joined.msh...
Writing strain_query/strain_query.25.joined.msh...
Writing strain_query/strain_query.27.joined.msh...
Writing strain_query/strain_query.29.joined.msh...
Traceback (most recent call last):
File "/home/ubuntu/.linuxbrew/bin/poppunk", line 11, in
sys.exit(main())
File "/home/ubuntu/.linuxbrew/opt/python/lib/python3.6/site-packages/PopPUNK/__main__.py", line 422, in main
refList, refList_copy, self, ref_distMat = readPickle(args.distances)
File "/home/ubuntu/.linuxbrew/opt/python/lib/python3.6/site-packages/PopPUNK/utils.py", line 53, in readPickle
with open(pklName + ".pkl", 'rb') as pickle_file:
TypeError: unsupported operand type(s) for +: 'NoneType' and 'str'
I hope that you can help me.
Clusters.csv just contains the reference database. The command was:
for r in nijmegen soton mass; do for q in maela nijmegen soton mass; do if [[ $r != $q ]]; then bsub -n 4 -M 2000 -R 'select[mem>2000] rusage[mem=2000] span[hosts=1]' -o ref.${r}.query.${q}.o -e ref.${r}.query.${q}.e "source /nfs/pathogen005/jl11/large_installations/miniconda3/bin/activate /lustre/scratch118/infgen/team81/nc3/nc3; python3 ./PopPUNK/poppunk-runner.py --assign-query --ref-db ${r}_db --distances ${r}_db/${r}_db.dists --model-dir ${r}_refined_bgmm_full --q-files ${q}.list --output ${r}_query_${q} --full-db --threads 4 --update-db"; fi; done; done
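The TypeError above is consistent with args.distances being None when readPickle concatenates it with ".pkl". A sketch of a friendlier guard (hypothetical names, not the project's actual code):

```python
import pickle

def read_pickle(pkl_prefix):
    """Sketch of a guarded readPickle (illustrative, not PopPUNK's code).

    The traceback's "'NoneType' and 'str'" comes from pkl_prefix being
    None (no distances prefix supplied), so None + ".pkl" fails.
    Failing fast with a clear message beats a TypeError deep in utils.py.
    """
    if pkl_prefix is None:
        raise ValueError("a distances prefix is required (e.g. --distances)")
    with open(pkl_prefix + ".pkl", "rb") as pickle_file:
        return pickle.load(pickle_file)
```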
e.g. run with --no-stream --ref-db ../query_db
Will fail with:
File "PopPUNK/mash.py", line 462, in queryDatabase
suffix=".tmp", dir="./" + os.path.basename(dbPrefix))
File "tempfile.py", line 340, in mkstemp
return _mkstemp_inner(dir, prefix, suffix, flags, output_type)
File "tempfile.py", line 258, in _mkstemp_inner
fd = _os.open(file, flags, 0o600)
FileNotFoundError: [Errno 2] No such file or directory: './query_db/query_dby6aic3ql.tmp'
Need to fix lines:
if no_stream:
tmpHandle, tmpName = mkstemp(prefix=os.path.basename(dbPrefix),
suffix=".tmp", dir="./" + os.path.basename(dbPrefix))
mash_cmd += " > " + tmpName
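A plausible fix (an assumption, not a confirmed patch): the temporary file is created under "./" + basename(dbPrefix), which only exists when the database directory sits directly inside the current working directory. With --ref-db ../query_db, the basename is query_db but ./query_db does not exist, hence the FileNotFoundError. Using the database directory itself as dir would be robust to relative prefixes:

```python
import os
from tempfile import mkstemp

def open_mash_tmp(db_prefix):
    """Create the temporary mash output inside the database directory
    itself rather than "./" + basename(db_prefix), so relative prefixes
    like ../query_db still resolve. Sketch only; assumes db_prefix is
    the path to the database directory."""
    tmp_handle, tmp_name = mkstemp(prefix=os.path.basename(db_prefix),
                                   suffix=".tmp", dir=db_prefix)
    return tmp_handle, tmp_name
```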
Hi John,
I'm getting the following issue running on ~3000 E. coli assemblies (which all look reasonable from a QC perspective). I previously managed to run PopPUNK with no issues, but I had to reinstall conda (and hence PopPUNK), and since then I can't get it to work. Do you have any idea what the issue is?
poppunk --create-db --r-files refence_list.txt --output strain_db --threads 8 --plot-fit 5 --min-k 14
Traceback (most recent call last):
File "/home/sam/miniconda3/lib/python3.6/site-packages/sharedmem/sharedmem.py", line 757, in map
capsule = pg.get(R)
File "/home/sam/miniconda3/lib/python3.6/site-packages/sharedmem/sharedmem.py", line 429, in get
raise StopProcessGroup
sharedmem.sharedmem.StopProcessGroup: StopProcessGroup
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/sam/miniconda3/bin/poppunk", line 11, in <module>
sys.exit(main())
File "/home/sam/miniconda3/lib/python3.6/site-packages/PopPUNK/main.py", line 210, in main
args.plot_fit, args.no_stream, args.mash, args.threads)
File "/home/sam/miniconda3/lib/python3.6/site-packages/PopPUNK/mash.py", line 520, in queryDatabase
pool.map(partial(fitKmerBlock, distMat=distMat, raw = raw, klist=klist, jacobian=jacobian), mat_chunks)
File "/home/sam/miniconda3/lib/python3.6/site-packages/sharedmem/sharedmem.py", line 761, in map
raise pg.get_exception()
sharedmem.sharedmem.SlaveException: Residuals are not finite in the initial point.
Traceback (most recent call last):
File "/home/sam/miniconda3/lib/python3.6/site-packages/sharedmem/sharedmem.py", line 294, in _slaveMain
self.main(self, *self.args)
File "/home/sam/miniconda3/lib/python3.6/site-packages/sharedmem/sharedmem.py", line 628, in _main
r = realfunc(work)
File "/home/sam/miniconda3/lib/python3.6/site-packages/sharedmem/sharedmem.py", line 703, in realfunc
else: return func(i)
File "/home/sam/miniconda3/lib/python3.6/site-packages/PopPUNK/mash.py", line 543, in fitKmerBlock
distMat[start:end, :] = np.apply_along_axis(fitKmerCurve, 1, raw[start:end, :], klist, jacobian)
File "/home/sam/.local/lib/python3.6/site-packages/numpy/lib/shape_base.py", line 380, in apply_along_axis
buff[ind] = asanyarray(func1d(inarr_view[ind], *args, **kwargs))
File "/home/sam/miniconda3/lib/python3.6/site-packages/PopPUNK/mash.py", line 570, in fitKmerCurve
bounds=([-np.inf, -np.inf], [0, 0]))
File "/home/sam/.local/lib/python3.6/site-packages/scipy/optimize/_lsq/least_squares.py", line 805, in least_squares
raise ValueError("Residuals are not finite in the initial point.")
ValueError: Residuals are not finite in the initial point.
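The ValueError at the bottom usually means at least one pair produced values the curve fit cannot take a logarithm of. One plausible root cause (an assumption, not confirmed by the traceback alone): fitKmerCurve fits a line to log(proportion of matching k-mers) against k, so a pair sharing no k-mers at some k (e.g. a truncated or contaminated assembly, or a too-small --min-k) yields log(0) = -inf, and scipy's least_squares rejects the non-finite initial residuals. A hypothetical pre-check under that assumption (names are illustrative, not PopPUNK's API):

```python
def nonfinite_ks(klist, match_props):
    """Return the k values whose match proportion would break a log fit.

    A proportion of 0 gives log(0) = -inf, which surfaces downstream as
    "Residuals are not finite in the initial point". This is a sketch of
    a QC screen, not the project's actual code.
    """
    return [k for k, p in zip(klist, match_props) if p <= 0.0]
```

Running this over the raw per-pair match proportions before fitting would identify which samples (or which low k values) to exclude.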
Add an option to use previous cluster names that were not necessarily generated by PopPUNK. Choose the closest sequence label (by core/combined distance?) and report that cluster's ID in an additional output column/file.
A wrapper script to run iteratively on smaller subsets of samples until all samples have been assigned, producing a smaller initial database to conserve resources on huge datasets.
Still sorted by taxon rather than frequency
Add a max-memory argument for large datasets. All pairs never need to be held in main memory simultaneously, but this may require the mash databases to be refactored.
For the mixture model, reservoir sampling can be used when reading the distances in.
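Reservoir sampling (Algorithm R) keeps a uniformly random sample of fixed size from a stream without ever holding the whole stream in memory, which is what subsampling distances on read-in would need. A generic sketch, not PopPUNK code:

```python
import random

def reservoir_sample(stream, k, rng=random):
    """Algorithm R: a uniform sample of k items from a stream of
    unknown length, using O(k) memory. Each item from the stream
    replaces a reservoir slot with probability k / (i + 1)."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)        # fill the reservoir first
        else:
            j = rng.randint(0, i)      # uniform index in [0, i]
            if j < k:
                sample[j] = item       # replace with decreasing probability
    return sample
```

This would let the mixture model see a bounded subsample while each distance row is read exactly once.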
A more general default for --easy-run would be to use HDBSCAN followed by network refinement, rather than the 2D GMM.
There seems to be an issue with DBSCAN fitting that only arises when running multithreaded; it looks to be a problem at the interface between joblib and the multiprocessing package:
PopPUNK (POPulation Partitioning Using Nucleotide Kmers)
Mode: Fitting model to reference database
Traceback (most recent call last):
File "./PopPUNK/poppunk-runner.py", line 9, in <module>
main()
File "/lustre/scratch118/infgen/team81/nc3/GPS/ST_core/PopPUNK/PopPUNK/main.py", line 248, in main
assignments = model.fit(distMat, args.D, args.min_cluster_prop, args.threads)
File "/lustre/scratch118/infgen/team81/nc3/GPS/ST_core/PopPUNK/PopPUNK/models.py", line 316, in fit
self.hdb, self.labels, self.n_clusters = fitDbScan(self.subsampled_X, self.outPrefix, min_samples, min_cluster_size, cache_out, threads)
File "/lustre/scratch118/infgen/team81/nc3/GPS/ST_core/PopPUNK/PopPUNK/dbscan.py", line 50, in fitDbScan
).fit(X)
File "/lustre/scratch118/infgen/team81/nc3/nc3/lib/python3.6/site-packages/hdbscan/hdbscan_.py", line 851, in fit
self.min_spanning_tree) = hdbscan(X, **kwargs)
File "/lustre/scratch118/infgen/team81/nc3/nc3/lib/python3.6/site-packages/hdbscan/hdbscan.py", line 546, in hdbscan
core_dist_n_jobs, **kwargs)
File "/lustre/scratch118/infgen/team81/nc3/nc3/lib/python3.6/site-packages/sklearn/externals/joblib/memory.py", line 362, in __call__
return self.func(*args, **kwargs)
File "/lustre/scratch118/infgen/team81/nc3/nc3/lib/python3.6/site-packages/hdbscan/hdbscan_.py", line 285, in _hdbscan_boruvka_balltree
n_jobs=core_dist_n_jobs, **kwargs)
File "hdbscan/_hdbscan_boruvka.pyx", line 984, in hdbscan._hdbscan_boruvka.BallTreeBoruvkaAlgorithm.__init__
File "hdbscan/_hdbscan_boruvka.pyx", line 1015, in hdbscan._hdbscan_boruvka.BallTreeBoruvkaAlgorithm._compute_bounds
File "/lustre/scratch118/infgen/team81/nc3/nc3/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py", line 789, in __call__
self.retrieve()
File "/lustre/scratch118/infgen/team81/nc3/nc3/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py", line 699, in retrieve
self._output.extend(job.get(timeout=self.timeout))
File "/lustre/scratch118/infgen/team81/nc3/nc3/lib/python3.6/multiprocessing/pool.py", line 644, in get
raise self._value
multiprocessing.pool.MaybeEncodingError: Error sending result: '[(array([[0.00000000e+00, 4.03246668e-04, 4.05466814e-04, ...,
3.45438025e-02, 3.45439840e-02, 3.45456626e-02],
[0.00000000e+00, 1.74505436e-04, 2.04864122e-04, ...,
2.66212775e-02, 2.66223376e-02, 2.66228534e-02],
[0.00000000e+00, 2.33849996e-04, 2.44379060e-04, ...,
3.66543389e-02, 3.66559647e-02, 3.66573385e-02],
...,
[0.00000000e+00, 7.88384237e-05, 1.32052839e-04, ...,
4.68294336e-02, 4.68303438e-02, 4.68309294e-02],
[0.00000000e+00, 1.04485943e-04, 2.06512190e-04, ...,
2.64343423e-02, 2.64372834e-02, 2.64386719e-02],
[0.00000000e+00, 1.87643709e-04, 2.02630717e-04, ...,
2.65259452e-02, 2.65293704e-02, 2.65309182e-02]]), array([[ 0, 21411, 61521, ..., 74665, 33889, 25600],
[ 1, 89127, 69051, ..., 21044, 27497, 84593],
[ 2, 41269, 85304, ..., 4793, 61086, 11021],
...,
[24997, 7094, 57682, ..., 26199, 13061, 51331],
[24998, 42754, 77802, ..., 96494, 7710, 7146],
[24999, 77949, 18152, ..., 14254, 39465, 95775]]))]'. Reason: 'error("'i' format requires -2147483648 <= number <= 2147483647",)'
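The Reason at the end of the MaybeEncodingError is the real clue: on the Python 3.6 used here, multiprocessing frames data sent between processes using the signed 32-bit 'i' struct format, so a worker result whose pickled payload (or an index) exceeds 2**31 - 1 cannot be serialized back to the parent. That would explain why the failure only appears with multiple threads and a large subsample, since single-threaded results never cross a process boundary. A minimal demonstration of the limit:

```python
import struct

# multiprocessing packs lengths/indices with the signed 32-bit 'i'
# format; anything past 2**31 - 1 raises the struct.error quoted in
# the MaybeEncodingError above.
struct.pack("!i", 2**31 - 1)  # largest value that fits in 'i'

try:
    struct.pack("!i", 2**31)  # one past the limit
except struct.error as err:
    print("overflow:", err)
```

Reducing the subsample size (or the number of threads used for the core-distance step) keeps each worker's result under the 2 GiB framing limit.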