artpoon / pangolin
This project forked from cov-lineages/pangolin
Software package for assigning SARS-CoV-2 genome sequences to global lineages.
License: GNU General Public License v3.0
Generated a small test file of 10 records from Northern Ireland, but only one record was processed by pangolearn.py:
(pangolin) art@orolo:~/git/covizu/data$ grep -A1 "Northern" gisaid-filtered.fa | head -n20 > northern.fa
(pangolin) art@orolo:~/git/covizu/data$ head -n1 northern.fa
>hCoV-19/Northern Ireland/NIRE-FADA8/2020|EPI_ISL_448918|2020-03-26
(pangolin) art@orolo:~/git/covizu/data$ grep -c ">" northern.fa
10
(pangolin) art@orolo:~/git/covizu/data$ pangolin --outfile northern.pangolin.csv northern.fa
Found the snakefile
The query file is /home/art/git/covizu/data/northern.fa
Number of threads is 1
Looking in /home/art/miniconda3/envs/pangolin/lib/python3.6/site-packages/pangoLEARN/data for data files...
Data files found
Trained model: /home/art/miniconda3/envs/pangolin/lib/python3.6/site-packages/pangoLEARN/data/multinomialLogReg_v1.joblib
Header file: /home/art/miniconda3/envs/pangolin/lib/python3.6/site-packages/pangoLEARN/data/multinomialLogRegHeaders_v1.joblib
Lineages csv: /home/art/miniconda3/envs/pangolin/lib/python3.6/site-packages/pangoLEARN/data/lineages.metadata.csv
Job counts:
count jobs
1 add_failed_seqs
1 all
1 datafunk_trim_and_pad
1 minimap2_check_distance
1 minimap2_to_reference
1 pangolearn
1 parse_paf
7
minimap2 -x asm5 /home/art/miniconda3/envs/pangolin/lib/python3.6/site-packages/pangolin-2.0.4-py3.6.egg/pangolin/data/reference.fasta /tmp/tmpan14e88z/query.post_qc.fasta -o /tmp/tmpan14e88z/reference_mapped.paf &> /tmp/tmpan14e88z/logs/minimap2_check.log
Job counts:
count jobs
1 parse_paf
1
minimap2 -a -x asm5 /home/art/miniconda3/envs/pangolin/lib/python3.6/site-packages/pangolin-2.0.4-py3.6.egg/pangolin/data/reference.fasta /tmp/tmpan14e88z/mappable.fasta -o /tmp/tmpan14e88z/reference_mapped.sam &> /tmp/tmpan14e88z/logs/minimap2_sam.log
datafunk sam_2_fasta -s /tmp/tmpan14e88z/reference_mapped.sam -r /home/art/miniconda3/envs/pangolin/lib/python3.6/site-packages/pangolin-2.0.4-py3.6.egg/pangolin/data/reference.fasta -o /tmp/tmpan14e88z/post_qc_query.aligned.fasta -t [265:29674] --pad --log-inserts
/home/art/miniconda3/envs/pangolin/lib/python3.6/site-packages/datafunk/sam_2_fasta.py:158: UserWarning: ambiguous overlapping alignment
warnings.warn('ambiguous overlapping alignment')
pangolearn.py --header-file /home/art/miniconda3/envs/pangolin/lib/python3.6/site-packages/pangoLEARN/data/multinomialLogRegHeaders_v1.joblib --model-file /home/art/miniconda3/envs/pangolin/lib/python3.6/site-packages/pangoLEARN/data/multinomialLogReg_v1.joblib --fasta /tmp/tmpan14e88z/post_qc_query.aligned.fasta -o /tmp/tmpan14e88z/lineage_report.pass_qc.csv
loading model 07/28/2020, 12:30:58
generating predictions 07/28/2020, 12:30:59
processing block of 1 sequences 07/28/2020, 12:30:59
complete 07/28/2020, 12:31:13
These lines:
    # for each entry in dataList, remove the irrelevant columns
    while len(dataList) > 0:
        line = dataList.pop(0)
        finalLine = []
        for index in range(len(line)):
            if index in indiciesToKeep:
                finalLine.extend(line[index].vector)
        finalList.append(finalLine)
unnecessarily iterate over every position of each genome. It should be faster to iterate over indiciesToKeep only:
    for index in indiciesToKeep:
        if index < len(line):
            finalLine.extend(line[index].vector)
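The two loops can be compared side by side. The following is a minimal sketch, not the pangolearn.py code itself: the Site class, the function names, and the toy data are hypothetical stand-ins for whatever objects dataList actually holds. It assumes indiciesToKeep is sorted in ascending order and has no duplicates; otherwise the proposed loop would emit columns in a different order than the original.

```python
class Site:
    """Hypothetical stand-in for a per-site object exposing a `.vector` list."""
    def __init__(self, vector):
        self.vector = vector


def filter_all_positions(line, indices_to_keep):
    """Original approach: visit every position and test membership."""
    final_line = []
    for index in range(len(line)):
        if index in indices_to_keep:
            final_line.extend(line[index].vector)
    return final_line


def filter_kept_positions(line, indices_to_keep):
    """Proposed approach: visit only the positions that are kept."""
    final_line = []
    for index in indices_to_keep:
        if index < len(line):
            final_line.extend(line[index].vector)
    return final_line


# Toy genome of 10 sites, each one-hot encoded over 4 states (made-up data).
genome = [Site([int(i % 4 == j) for j in range(4)]) for i in range(10)]
keep = [1, 4, 7, 12]  # sorted ascending; 12 is deliberately out of range

old_result = filter_all_positions(genome, keep)
new_result = filter_kept_positions(genome, keep)
```

With a genome of roughly 30,000 positions and a much shorter list of kept indices, the second function does far fewer iterations per sequence, and it also avoids a membership test on a list at every position.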
Encountered the following exception while attempting to run a recent dump of the GISAID CoV database:
(pangolin) art@orolo:~/git/covizu/data$ datafunk sam_2_fasta -s /home/art/git/covizu/data/reference_mapped.sam -r /home/art/miniconda3/envs/pangolin/lib/python3.6/site-packages/pangolin-2.0.4-py3.6.egg/pangolin/data/reference.fasta -o /home/art/git/covizu/data/post_qc_query.aligned.fasta -t [265:29674] --pad --log-inserts
Traceback (most recent call last):
File "/home/art/miniconda3/envs/pangolin/bin/datafunk", line 8, in <module>
sys.exit(main())
File "/home/art/miniconda3/envs/pangolin/lib/python3.6/site-packages/datafunk/__main__.py", line 1010, in main
args.func(args)
File "/home/art/miniconda3/envs/pangolin/lib/python3.6/site-packages/datafunk/subcommands/sam_2_fasta.py", line 87, in run
trimend = trimend)
File "/home/art/miniconda3/envs/pangolin/lib/python3.6/site-packages/datafunk/sam_2_fasta.py", line 269, in sam_2_fasta
seq = get_seq_from_block(sam_block = one_querys_alignment_lines, rlen = RLEN, log_inserts = log, pad = pad)
File "/home/art/miniconda3/envs/pangolin/lib/python3.6/site-packages/datafunk/sam_2_fasta.py", line 201, in get_seq_from_block
seq_flat_no_internal_gaps = swap_in_gaps_Ns(block_lines_sites_list[0], pad = pad)
File "/home/art/miniconda3/envs/pangolin/lib/python3.6/site-packages/datafunk/sam_2_fasta.py", line 172, in swap_in_gaps_Ns
for x in re.findall(r_internal, seq):
File "/home/art/miniconda3/envs/pangolin/lib/python3.6/re.py", line 222, in findall
return _compile(pattern, flags).findall(string)
TypeError: expected string or bytes-like object
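The last frame shows re.findall being handed something that is neither a str nor bytes. A minimal sketch of how that error arises follows, assuming (the traceback does not confirm this) that the value reaching the regex is None. The pattern and both function names are hypothetical; they are not datafunk's code.

```python
import re

# Hypothetical pattern; a stand-in for datafunk's `r_internal`, which is not shown above.
INTERNAL_GAPS = re.compile(r"[ACGT](-+)[ACGT]")


def find_internal_gaps(seq):
    """Return runs of '-' flanked by bases; raises TypeError if seq is not a str."""
    return INTERNAL_GAPS.findall(seq)


def find_internal_gaps_guarded(seq):
    """Same search, but report a non-string input explicitly instead of failing inside re."""
    if not isinstance(seq, str):
        raise ValueError(f"expected a sequence string, got {type(seq).__name__}")
    return INTERNAL_GAPS.findall(seq)


# Passing None reproduces the class of error seen in the traceback.
try:
    find_internal_gaps(None)
    error_name = None
except TypeError as exc:
    error_name = type(exc).__name__

gaps = find_internal_gaps_guarded("AC--GT-A")
```

A guard of this kind would not fix the underlying problem, but it would make it easier to see which record produced the non-string value.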
I think my package managers are all fouled up on my home Mac. I can install and run my modified code on my remote workstation running Ubuntu, but the executable is not being updated on my Mac.
(pangolin) art@Wernstrom pangolin % grep -r xz . | head -n3
./pangolin/command.py: compression.add_argument("--xz", action="store_true", help="Query files are xz-compressed.")
./pangolin/command.py: if args.xz:
Binary file ./docs/logo.png matches
(pangolin) art@Wernstrom pangolin % python setup.py install
running install
running bdist_egg
running egg_info
writing pangolin.egg-info/PKG-INFO
writing dependency_links to pangolin.egg-info/dependency_links.txt
writing entry points to pangolin.egg-info/entry_points.txt
writing requirements to pangolin.egg-info/requires.txt
writing top-level names to pangolin.egg-info/top_level.txt
reading manifest file 'pangolin.egg-info/SOURCES.txt'
writing manifest file 'pangolin.egg-info/SOURCES.txt'
installing library code to build/bdist.macosx-10.9-x86_64/egg
running install_lib
running build_py
creating build/bdist.macosx-10.9-x86_64/egg
creating build/bdist.macosx-10.9-x86_64/egg/pangolin
copying build/lib/pangolin/command.py -> build/bdist.macosx-10.9-x86_64/egg/pangolin
copying build/lib/pangolin/__init__.py -> build/bdist.macosx-10.9-x86_64/egg/pangolin
creating build/bdist.macosx-10.9-x86_64/egg/pangolin/scripts
copying build/lib/pangolin/scripts/type_variants.py -> build/bdist.macosx-10.9-x86_64/egg/pangolin/scripts
copying build/lib/pangolin/scripts/report_classes.py -> build/bdist.macosx-10.9-x86_64/egg/pangolin/scripts
copying build/lib/pangolin/scripts/pangofunks.py -> build/bdist.macosx-10.9-x86_64/egg/pangolin/scripts
copying build/lib/pangolin/scripts/__init__.py -> build/bdist.macosx-10.9-x86_64/egg/pangolin/scripts
copying build/lib/pangolin/scripts/custom_logger.py -> build/bdist.macosx-10.9-x86_64/egg/pangolin/scripts
copying build/lib/pangolin/scripts/pangolearn.py -> build/bdist.macosx-10.9-x86_64/egg/pangolin/scripts
copying build/lib/pangolin/scripts/utils.py -> build/bdist.macosx-10.9-x86_64/egg/pangolin/scripts
copying build/lib/pangolin/scripts/report_results.py -> build/bdist.macosx-10.9-x86_64/egg/pangolin/scripts
copying build/lib/pangolin/scripts/pangolearn.smk -> build/bdist.macosx-10.9-x86_64/egg/pangolin/scripts
copying build/lib/pangolin/scripts/log_handler_handle.py -> build/bdist.macosx-10.9-x86_64/egg/pangolin/scripts
creating build/bdist.macosx-10.9-x86_64/egg/pangolin/data
copying build/lib/pangolin/data/config_p.3.csv -> build/bdist.macosx-10.9-x86_64/egg/pangolin/data
copying build/lib/pangolin/data/config_p.2.csv -> build/bdist.macosx-10.9-x86_64/egg/pangolin/data
copying build/lib/pangolin/data/config_p.1.csv -> build/bdist.macosx-10.9-x86_64/egg/pangolin/data
copying build/lib/pangolin/data/config_b.1.1.7.csv -> build/bdist.macosx-10.9-x86_64/egg/pangolin/data
copying build/lib/pangolin/data/config_b.1.351.csv -> build/bdist.macosx-10.9-x86_64/egg/pangolin/data
copying build/lib/pangolin/data/reference.fasta -> build/bdist.macosx-10.9-x86_64/egg/pangolin/data
copying build/lib/pangolin/data/config_b.1.214.2.csv -> build/bdist.macosx-10.9-x86_64/egg/pangolin/data
byte-compiling build/bdist.macosx-10.9-x86_64/egg/pangolin/command.py to command.cpython-37.pyc
byte-compiling build/bdist.macosx-10.9-x86_64/egg/pangolin/__init__.py to __init__.cpython-37.pyc
byte-compiling build/bdist.macosx-10.9-x86_64/egg/pangolin/scripts/type_variants.py to type_variants.cpython-37.pyc
byte-compiling build/bdist.macosx-10.9-x86_64/egg/pangolin/scripts/report_classes.py to report_classes.cpython-37.pyc
byte-compiling build/bdist.macosx-10.9-x86_64/egg/pangolin/scripts/pangofunks.py to pangofunks.cpython-37.pyc
byte-compiling build/bdist.macosx-10.9-x86_64/egg/pangolin/scripts/__init__.py to __init__.cpython-37.pyc
byte-compiling build/bdist.macosx-10.9-x86_64/egg/pangolin/scripts/custom_logger.py to custom_logger.cpython-37.pyc
byte-compiling build/bdist.macosx-10.9-x86_64/egg/pangolin/scripts/pangolearn.py to pangolearn.cpython-37.pyc
byte-compiling build/bdist.macosx-10.9-x86_64/egg/pangolin/scripts/utils.py to utils.cpython-37.pyc
byte-compiling build/bdist.macosx-10.9-x86_64/egg/pangolin/scripts/report_results.py to report_results.cpython-37.pyc
byte-compiling build/bdist.macosx-10.9-x86_64/egg/pangolin/scripts/log_handler_handle.py to log_handler_handle.cpython-37.pyc
creating build/bdist.macosx-10.9-x86_64/egg/EGG-INFO
installing scripts to build/bdist.macosx-10.9-x86_64/egg/EGG-INFO/scripts
running install_scripts
running build_scripts
creating build/bdist.macosx-10.9-x86_64/egg/EGG-INFO/scripts
copying build/scripts-3.7/type_variants.py -> build/bdist.macosx-10.9-x86_64/egg/EGG-INFO/scripts
copying build/scripts-3.7/report_classes.py -> build/bdist.macosx-10.9-x86_64/egg/EGG-INFO/scripts
copying build/scripts-3.7/pangofunks.py -> build/bdist.macosx-10.9-x86_64/egg/EGG-INFO/scripts
copying build/scripts-3.7/custom_logger.py -> build/bdist.macosx-10.9-x86_64/egg/EGG-INFO/scripts
copying build/scripts-3.7/pangolearn.py -> build/bdist.macosx-10.9-x86_64/egg/EGG-INFO/scripts
copying build/scripts-3.7/utils.py -> build/bdist.macosx-10.9-x86_64/egg/EGG-INFO/scripts
copying build/scripts-3.7/report_results.py -> build/bdist.macosx-10.9-x86_64/egg/EGG-INFO/scripts
copying build/scripts-3.7/pangolearn.smk -> build/bdist.macosx-10.9-x86_64/egg/EGG-INFO/scripts
copying build/scripts-3.7/log_handler_handle.py -> build/bdist.macosx-10.9-x86_64/egg/EGG-INFO/scripts
changing mode of build/bdist.macosx-10.9-x86_64/egg/EGG-INFO/scripts/type_variants.py to 755
changing mode of build/bdist.macosx-10.9-x86_64/egg/EGG-INFO/scripts/report_classes.py to 755
changing mode of build/bdist.macosx-10.9-x86_64/egg/EGG-INFO/scripts/pangofunks.py to 755
changing mode of build/bdist.macosx-10.9-x86_64/egg/EGG-INFO/scripts/custom_logger.py to 755
changing mode of build/bdist.macosx-10.9-x86_64/egg/EGG-INFO/scripts/pangolearn.py to 755
changing mode of build/bdist.macosx-10.9-x86_64/egg/EGG-INFO/scripts/utils.py to 755
changing mode of build/bdist.macosx-10.9-x86_64/egg/EGG-INFO/scripts/report_results.py to 755
changing mode of build/bdist.macosx-10.9-x86_64/egg/EGG-INFO/scripts/pangolearn.smk to 755
changing mode of build/bdist.macosx-10.9-x86_64/egg/EGG-INFO/scripts/log_handler_handle.py to 755
copying pangolin.egg-info/PKG-INFO -> build/bdist.macosx-10.9-x86_64/egg/EGG-INFO
copying pangolin.egg-info/SOURCES.txt -> build/bdist.macosx-10.9-x86_64/egg/EGG-INFO
copying pangolin.egg-info/dependency_links.txt -> build/bdist.macosx-10.9-x86_64/egg/EGG-INFO
copying pangolin.egg-info/entry_points.txt -> build/bdist.macosx-10.9-x86_64/egg/EGG-INFO
copying pangolin.egg-info/not-zip-safe -> build/bdist.macosx-10.9-x86_64/egg/EGG-INFO
copying pangolin.egg-info/requires.txt -> build/bdist.macosx-10.9-x86_64/egg/EGG-INFO
copying pangolin.egg-info/top_level.txt -> build/bdist.macosx-10.9-x86_64/egg/EGG-INFO
creating 'dist/pangolin-2.4.2-py3.7.egg' and adding 'build/bdist.macosx-10.9-x86_64/egg' to it
removing 'build/bdist.macosx-10.9-x86_64/egg' (and everything under it)
Processing pangolin-2.4.2-py3.7.egg
removing '/usr/local/Caskroom/miniconda/base/envs/pangolin/lib/python3.7/site-packages/pangolin-2.4.2-py3.7.egg' (and everything under it)
creating /usr/local/Caskroom/miniconda/base/envs/pangolin/lib/python3.7/site-packages/pangolin-2.4.2-py3.7.egg
Extracting pangolin-2.4.2-py3.7.egg to /usr/local/Caskroom/miniconda/base/envs/pangolin/lib/python3.7/site-packages
pangolin 2.4.2 is already the active version in easy-install.pth
Installing type_variants.py script to /usr/local/Caskroom/miniconda/base/envs/pangolin/bin
Installing report_classes.py script to /usr/local/Caskroom/miniconda/base/envs/pangolin/bin
Installing pangofunks.py script to /usr/local/Caskroom/miniconda/base/envs/pangolin/bin
Installing custom_logger.py script to /usr/local/Caskroom/miniconda/base/envs/pangolin/bin
Installing pangolearn.py script to /usr/local/Caskroom/miniconda/base/envs/pangolin/bin
Installing utils.py script to /usr/local/Caskroom/miniconda/base/envs/pangolin/bin
Installing report_results.py script to /usr/local/Caskroom/miniconda/base/envs/pangolin/bin
Installing pangolearn.smk script to /usr/local/Caskroom/miniconda/base/envs/pangolin/bin
Installing log_handler_handle.py script to /usr/local/Caskroom/miniconda/base/envs/pangolin/bin
Installing pangolin script to /usr/local/Caskroom/miniconda/base/envs/pangolin/bin
Installed /usr/local/Caskroom/miniconda/base/envs/pangolin/lib/python3.7/site-packages/pangolin-2.4.2-py3.7.egg
Processing dependencies for pangolin==2.4.2
...
Using /usr/local/Caskroom/miniconda/base/envs/pangolin/lib/python3.7/site-packages
Finished processing dependencies for pangolin==2.4.2
(pangolin) art@Wernstrom pangolin % pangolin
usage: pangolin <query> [options]
pangolin: Phylogenetic Assignment of Named Global Outbreak LINeages
positional arguments:
query Query fasta file of sequences to analyse.
optional arguments:
-h, --help show this help message and exit
-o OUTDIR, --outdir OUTDIR
Output directory. Default: current working directory
--outfile OUTFILE Optional output file name. Default: lineage_report.csv
--alignment Optional alignment output.
-d DATADIR, --datadir DATADIR
Data directory minimally containing a fasta alignment
and guide tree
--tempdir TEMPDIR Specify where you want the temp stuff to go. Default:
$TMPDIR
--no-temp Output all intermediate files, for dev purposes.
--decompress-model Permanently decompress the model file to save time
running pangolin.
--max-ambig MAXAMBIG Maximum proportion of Ns allowed for pangolin to
attempt assignment. Default: 0.5
--min-length MINLEN Minimum query length allowed for pangolin to attempt
assignment. Default: 25000
--panGUIlin Run web-app version of pangolin
--verbose Print lots of stuff to screen
-t THREADS, --threads THREADS
Number of threads
-v, --version show program's version number and exit
-pv, --pangoLEARN-version
show pangoLEARN's version number and exit
--update Automatically updates to latest release of pangolin
and pangoLEARN, then exits
--gzip Query files are gzip-compressed.
(pangolin) art@Wernstrom pangolin %
Note that the --xz option is missing from the usage message, even though it is defined in ./pangolin/command.py.
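One way to check which copy of the code is actually being run is to ask Python where it would import the package from and which executable comes first on PATH. This is only a diagnostic sketch with a hypothetical helper name (locate_install); it does not identify the cause. It uses importlib.util.find_spec and shutil.which from the standard library.

```python
import importlib.util
import shutil


def locate_install(module_name, executable_name):
    """Report where Python would import `module_name` from and which
    `executable_name` is first on PATH. Either value may be None."""
    spec = importlib.util.find_spec(module_name)
    return {
        "module_file": spec.origin if spec is not None else None,
        "executable": shutil.which(executable_name),
    }


# Demonstrated on a standard-library module so the sketch runs anywhere;
# on the affected machine one would pass "pangolin" for both arguments.
report = locate_install("json", "python3")
for key, value in report.items():
    print(f"{key}: {value}")
```

If the reported module path points somewhere other than the freshly installed egg, or if the file at that path lacks the --xz argument, an older copy is probably being picked up. The install log shows the egg being assembled from build/lib, so a stale build directory is one possibility worth ruling out.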
Originally I was not able to process 20K+ sequences because my workstation ran out of memory while processing pangolearn.py. There seem to be two memory-intensive steps in this script:
I am attempting to reduce the RAM consumed by this script by (1) filtering sequences to the required sites (indices) on load, and (2) processing subsets of the filtered data as pandas data frames of a fixed maximum size.
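The two-part approach can be sketched roughly as follows. This is an illustration under stated assumptions, not the modified pangolearn.py: the function names, the toy FASTA input, and the batch size are all made up, and the sketch uses only the standard library so that it runs without pandas installed.

```python
from itertools import islice


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filtered_rows(records, indices_to_keep):
    """Step 1: keep only the required sites of each sequence as it is read."""
    for header, seq in records:
        yield header, [seq[i] for i in indices_to_keep if i < len(seq)]


def batches(rows, max_rows):
    """Step 2: group the filtered rows into lists of at most max_rows."""
    rows = iter(rows)
    while True:
        batch = list(islice(rows, max_rows))
        if not batch:
            return
        yield batch


# Made-up input: five short sequences, three sites kept, batches of two.
fasta = [">s1", "ACGTAC", ">s2", "ACGTTC", ">s3", "AAGTAC",
         ">s4", "ACG", "TAC", ">s5", "TCGTAA"]
keep = [0, 2, 5]
result = list(batches(filtered_rows(read_fasta(fasta), keep), max_rows=2))
```

Because every stage is a generator, only one batch of filtered rows needs to be held in memory at a time. Each batch could then be wrapped in a pandas DataFrame and passed to the model before the next batch is read.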