artpoon / pangolin

This project forked from cov-lineages/pangolin

Software package for assigning SARS-CoV-2 genome sequences to global lineages.

License: GNU General Public License v3.0

Python 98.54% Dockerfile 1.46%

pangolin's Issues

Something funny happening for records with spaces in header

I generated a small test file of 10 records from Northern Ireland, but only one record was processed by pangolearn.py:

(pangolin) art@orolo:~/git/covizu/data$ grep -A1 "Northern" gisaid-filtered.fa | head -n20 > northern.fa
(pangolin) art@orolo:~/git/covizu/data$ head -n1 northern.fa
>hCoV-19/Northern Ireland/NIRE-FADA8/2020|EPI_ISL_448918|2020-03-26
(pangolin) art@orolo:~/git/covizu/data$ grep -c ">" northern.fa
10
(pangolin) art@orolo:~/git/covizu/data$ pangolin --outfile northern.pangolin.csv northern.fa 
Found the snakefile
The query file is /home/art/git/covizu/data/northern.fa
Number of threads is 1
Looking in /home/art/miniconda3/envs/pangolin/lib/python3.6/site-packages/pangoLEARN/data for data files...

Data files found
Trained model:	/home/art/miniconda3/envs/pangolin/lib/python3.6/site-packages/pangoLEARN/data/multinomialLogReg_v1.joblib
Header file:	/home/art/miniconda3/envs/pangolin/lib/python3.6/site-packages/pangoLEARN/data/multinomialLogRegHeaders_v1.joblib
Lineages csv:	/home/art/miniconda3/envs/pangolin/lib/python3.6/site-packages/pangoLEARN/data/lineages.metadata.csv
Job counts:
	count	jobs
	1	add_failed_seqs
	1	all
	1	datafunk_trim_and_pad
	1	minimap2_check_distance
	1	minimap2_to_reference
	1	pangolearn
	1	parse_paf
	7

        minimap2 -x asm5 /home/art/miniconda3/envs/pangolin/lib/python3.6/site-packages/pangolin-2.0.4-py3.6.egg/pangolin/data/reference.fasta /tmp/tmpan14e88z/query.post_qc.fasta -o /tmp/tmpan14e88z/reference_mapped.paf &> /tmp/tmpan14e88z/logs/minimap2_check.log
        
Job counts:
	count	jobs
	1	parse_paf
	1

        minimap2 -a -x asm5 /home/art/miniconda3/envs/pangolin/lib/python3.6/site-packages/pangolin-2.0.4-py3.6.egg/pangolin/data/reference.fasta /tmp/tmpan14e88z/mappable.fasta -o /tmp/tmpan14e88z/reference_mapped.sam &> /tmp/tmpan14e88z/logs/minimap2_sam.log
        

        datafunk sam_2_fasta           -s /tmp/tmpan14e88z/reference_mapped.sam           -r /home/art/miniconda3/envs/pangolin/lib/python3.6/site-packages/pangolin-2.0.4-py3.6.egg/pangolin/data/reference.fasta           -o /tmp/tmpan14e88z/post_qc_query.aligned.fasta           -t [265:29674]           --pad           --log-inserts 
        
/home/art/miniconda3/envs/pangolin/lib/python3.6/site-packages/datafunk/sam_2_fasta.py:158: UserWarning: ambiguous overlapping alignment
  warnings.warn('ambiguous overlapping alignment')

        pangolearn.py --header-file /home/art/miniconda3/envs/pangolin/lib/python3.6/site-packages/pangoLEARN/data/multinomialLogRegHeaders_v1.joblib --model-file /home/art/miniconda3/envs/pangolin/lib/python3.6/site-packages/pangoLEARN/data/multinomialLogReg_v1.joblib --fasta /tmp/tmpan14e88z/post_qc_query.aligned.fasta -o /tmp/tmpan14e88z/lineage_report.pass_qc.csv
        
loading model 07/28/2020, 12:30:58
generating predictions 07/28/2020, 12:30:59
processing block of 1 sequences 07/28/2020, 12:30:59
complete 07/28/2020, 12:31:13
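
One possible explanation, not confirmed in this report, is that a step in the pipeline keeps only the part of each FASTA header before the first whitespace (minimap2 names queries this way in its PAF and SAM output). If so, every record here would collapse to the same name, "hCoV-19/Northern". The sketch below illustrates that collapse and one workaround, replacing spaces with underscores before running pangolin. The function names `first_token` and `sanitize` are hypothetical, and the second header is made up for illustration.

```python
def first_token(header):
    """Return the part of a FASTA header before the first whitespace."""
    return header.lstrip(">").split()[0]


def sanitize(header):
    """Replace spaces with underscores so the whole header is one token."""
    return header.replace(" ", "_")


headers = [
    # First header is taken from the report above.
    ">hCoV-19/Northern Ireland/NIRE-FADA8/2020|EPI_ISL_448918|2020-03-26",
    # Second header is invented for this example.
    ">hCoV-19/Northern Ireland/EXAMPLE-0001/2020|EPI_ISL_000000|2020-03-27",
]

# Without sanitizing, both records share one name.
print(len({first_token(h) for h in headers}))

# After sanitizing, each record keeps a distinct name.
print(len({first_token(sanitize(h)) for h in headers}))
```

If the headers are rewritten this way, the original names need to be mapped back when reading the lineage report.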

Optimize removeIndices

These lines:

    # for each entry in dataList, remove the irrelevant columns
    while len(dataList) > 0:
        line = dataList.pop(0)

        finalLine = []

        for index in range(len(line)):
            if index in indiciesToKeep:
                finalLine.extend(line[index].vector)

        finalList.append(finalLine)

iterate unnecessarily over every position of each genome. It should be faster to iterate over indiciesToKeep only:

        for index in indiciesToKeep:
            if index < len(line):
                finalLine.extend(line[index].vector)
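
A self-contained sketch of the proposed loop follows. It is an illustration under stated assumptions, not the code in pangolearn.py: the function name `keep_columns` is hypothetical, each position holds a plain list rather than an object with a `.vector` attribute, and the indices are sorted once so that the output columns keep the same left-to-right order as the original scan.

```python
def keep_columns(rows, indices_to_keep):
    """Flatten only the wanted positions of each row.

    `rows` is a list of sequences; each position holds a one-hot vector
    (a plain list here; the original code reads a `.vector` attribute).
    """
    # Sort once so the column order matches a left-to-right scan of the row.
    wanted = sorted(set(indices_to_keep))
    result = []
    for row in rows:
        flat = []
        for index in wanted:
            # Skip indices that fall beyond the end of a short row.
            if index < len(row):
                flat.extend(row[index])
        result.append(flat)
    return result


rows = [
    [[1, 0], [0, 1], [1, 0], [0, 1]],
    [[0, 1], [0, 1], [1, 0]],  # shorter row: index 3 is skipped
]
print(keep_columns(rows, [3, 0]))  # prints [[1, 0, 0, 1], [0, 1]]
```

Unlike the original loop, this sketch does not pop rows off the input list as it goes, so the input stays in memory until the caller releases it.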

TypeError in datafunk

Encountered the following exception while attempting to process a recent dump of the GISAID CoV database:

(pangolin) art@orolo:~/git/covizu/data$ datafunk sam_2_fasta           -s /home/art/git/covizu/data/reference_mapped.sam           -r /home/art/miniconda3/envs/pangolin/lib/python3.6/site-packages/pangolin-2.0.4-py3.6.egg/pangolin/data/reference.fasta           -o /home/art/git/covizu/data/post_qc_query.aligned.fasta           -t [265:29674]           --pad           --log-inserts 
Traceback (most recent call last):
  File "/home/art/miniconda3/envs/pangolin/bin/datafunk", line 8, in <module>
    sys.exit(main())
  File "/home/art/miniconda3/envs/pangolin/lib/python3.6/site-packages/datafunk/__main__.py", line 1010, in main
    args.func(args)
  File "/home/art/miniconda3/envs/pangolin/lib/python3.6/site-packages/datafunk/subcommands/sam_2_fasta.py", line 87, in run
    trimend = trimend)
  File "/home/art/miniconda3/envs/pangolin/lib/python3.6/site-packages/datafunk/sam_2_fasta.py", line 269, in sam_2_fasta
    seq = get_seq_from_block(sam_block = one_querys_alignment_lines, rlen = RLEN, log_inserts = log, pad = pad)
  File "/home/art/miniconda3/envs/pangolin/lib/python3.6/site-packages/datafunk/sam_2_fasta.py", line 201, in get_seq_from_block
    seq_flat_no_internal_gaps = swap_in_gaps_Ns(block_lines_sites_list[0], pad = pad)
  File "/home/art/miniconda3/envs/pangolin/lib/python3.6/site-packages/datafunk/sam_2_fasta.py", line 172, in swap_in_gaps_Ns
    for x in re.findall(r_internal, seq):
  File "/home/art/miniconda3/envs/pangolin/lib/python3.6/re.py", line 222, in findall
    return _compile(pattern, flags).findall(string)
TypeError: expected string or bytes-like object
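
The last frames of the traceback show re.findall being handed something that is neither a string nor bytes. The report does not say what the value was, so the sketch below only reproduces the same class of error with `None` and shows a guard that reports the offending type. The pattern and the helper name `safe_findall` are hypothetical; the real `r_internal` pattern in datafunk is not shown here.

```python
import re

PATTERN = r"\*+"  # hypothetical pattern; the real r_internal is not shown


def safe_findall(pattern, seq):
    """Call re.findall, but name the offending type if seq is not a string."""
    if not isinstance(seq, str):
        raise TypeError(f"expected str, got {type(seq).__name__}: {seq!r}")
    return re.findall(pattern, seq)


try:
    re.findall(PATTERN, None)  # raises TypeError, as in the traceback
except TypeError as err:
    print("re.findall failed:", err)

print(safe_findall(PATTERN, "AC**GT*"))  # prints ['**', '*']
```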

Problems with local install on macOS

I think my package managers are all fouled up on my home Mac. I can install and run my modified code on my remote workstation running Ubuntu, but the executable is not being updated on my Mac.

(pangolin) art@Wernstrom pangolin % grep -r xz . | head -n3
./pangolin/command.py:    compression.add_argument("--xz", action="store_true", help="Query files are xz-compressed.")
./pangolin/command.py:    if args.xz:
Binary file ./docs/logo.png matches
(pangolin) art@Wernstrom pangolin % python setup.py install
running install
running bdist_egg
running egg_info
writing pangolin.egg-info/PKG-INFO
writing dependency_links to pangolin.egg-info/dependency_links.txt
writing entry points to pangolin.egg-info/entry_points.txt
writing requirements to pangolin.egg-info/requires.txt
writing top-level names to pangolin.egg-info/top_level.txt
reading manifest file 'pangolin.egg-info/SOURCES.txt'
writing manifest file 'pangolin.egg-info/SOURCES.txt'
installing library code to build/bdist.macosx-10.9-x86_64/egg
running install_lib
running build_py
creating build/bdist.macosx-10.9-x86_64/egg
creating build/bdist.macosx-10.9-x86_64/egg/pangolin
copying build/lib/pangolin/command.py -> build/bdist.macosx-10.9-x86_64/egg/pangolin
copying build/lib/pangolin/__init__.py -> build/bdist.macosx-10.9-x86_64/egg/pangolin
creating build/bdist.macosx-10.9-x86_64/egg/pangolin/scripts
copying build/lib/pangolin/scripts/type_variants.py -> build/bdist.macosx-10.9-x86_64/egg/pangolin/scripts
copying build/lib/pangolin/scripts/report_classes.py -> build/bdist.macosx-10.9-x86_64/egg/pangolin/scripts
copying build/lib/pangolin/scripts/pangofunks.py -> build/bdist.macosx-10.9-x86_64/egg/pangolin/scripts
copying build/lib/pangolin/scripts/__init__.py -> build/bdist.macosx-10.9-x86_64/egg/pangolin/scripts
copying build/lib/pangolin/scripts/custom_logger.py -> build/bdist.macosx-10.9-x86_64/egg/pangolin/scripts
copying build/lib/pangolin/scripts/pangolearn.py -> build/bdist.macosx-10.9-x86_64/egg/pangolin/scripts
copying build/lib/pangolin/scripts/utils.py -> build/bdist.macosx-10.9-x86_64/egg/pangolin/scripts
copying build/lib/pangolin/scripts/report_results.py -> build/bdist.macosx-10.9-x86_64/egg/pangolin/scripts
copying build/lib/pangolin/scripts/pangolearn.smk -> build/bdist.macosx-10.9-x86_64/egg/pangolin/scripts
copying build/lib/pangolin/scripts/log_handler_handle.py -> build/bdist.macosx-10.9-x86_64/egg/pangolin/scripts
creating build/bdist.macosx-10.9-x86_64/egg/pangolin/data
copying build/lib/pangolin/data/config_p.3.csv -> build/bdist.macosx-10.9-x86_64/egg/pangolin/data
copying build/lib/pangolin/data/config_p.2.csv -> build/bdist.macosx-10.9-x86_64/egg/pangolin/data
copying build/lib/pangolin/data/config_p.1.csv -> build/bdist.macosx-10.9-x86_64/egg/pangolin/data
copying build/lib/pangolin/data/config_b.1.1.7.csv -> build/bdist.macosx-10.9-x86_64/egg/pangolin/data
copying build/lib/pangolin/data/config_b.1.351.csv -> build/bdist.macosx-10.9-x86_64/egg/pangolin/data
copying build/lib/pangolin/data/reference.fasta -> build/bdist.macosx-10.9-x86_64/egg/pangolin/data
copying build/lib/pangolin/data/config_b.1.214.2.csv -> build/bdist.macosx-10.9-x86_64/egg/pangolin/data
byte-compiling build/bdist.macosx-10.9-x86_64/egg/pangolin/command.py to command.cpython-37.pyc
byte-compiling build/bdist.macosx-10.9-x86_64/egg/pangolin/__init__.py to __init__.cpython-37.pyc
byte-compiling build/bdist.macosx-10.9-x86_64/egg/pangolin/scripts/type_variants.py to type_variants.cpython-37.pyc
byte-compiling build/bdist.macosx-10.9-x86_64/egg/pangolin/scripts/report_classes.py to report_classes.cpython-37.pyc
byte-compiling build/bdist.macosx-10.9-x86_64/egg/pangolin/scripts/pangofunks.py to pangofunks.cpython-37.pyc
byte-compiling build/bdist.macosx-10.9-x86_64/egg/pangolin/scripts/__init__.py to __init__.cpython-37.pyc
byte-compiling build/bdist.macosx-10.9-x86_64/egg/pangolin/scripts/custom_logger.py to custom_logger.cpython-37.pyc
byte-compiling build/bdist.macosx-10.9-x86_64/egg/pangolin/scripts/pangolearn.py to pangolearn.cpython-37.pyc
byte-compiling build/bdist.macosx-10.9-x86_64/egg/pangolin/scripts/utils.py to utils.cpython-37.pyc
byte-compiling build/bdist.macosx-10.9-x86_64/egg/pangolin/scripts/report_results.py to report_results.cpython-37.pyc
byte-compiling build/bdist.macosx-10.9-x86_64/egg/pangolin/scripts/log_handler_handle.py to log_handler_handle.cpython-37.pyc
creating build/bdist.macosx-10.9-x86_64/egg/EGG-INFO
installing scripts to build/bdist.macosx-10.9-x86_64/egg/EGG-INFO/scripts
running install_scripts
running build_scripts
creating build/bdist.macosx-10.9-x86_64/egg/EGG-INFO/scripts
copying build/scripts-3.7/type_variants.py -> build/bdist.macosx-10.9-x86_64/egg/EGG-INFO/scripts
copying build/scripts-3.7/report_classes.py -> build/bdist.macosx-10.9-x86_64/egg/EGG-INFO/scripts
copying build/scripts-3.7/pangofunks.py -> build/bdist.macosx-10.9-x86_64/egg/EGG-INFO/scripts
copying build/scripts-3.7/custom_logger.py -> build/bdist.macosx-10.9-x86_64/egg/EGG-INFO/scripts
copying build/scripts-3.7/pangolearn.py -> build/bdist.macosx-10.9-x86_64/egg/EGG-INFO/scripts
copying build/scripts-3.7/utils.py -> build/bdist.macosx-10.9-x86_64/egg/EGG-INFO/scripts
copying build/scripts-3.7/report_results.py -> build/bdist.macosx-10.9-x86_64/egg/EGG-INFO/scripts
copying build/scripts-3.7/pangolearn.smk -> build/bdist.macosx-10.9-x86_64/egg/EGG-INFO/scripts
copying build/scripts-3.7/log_handler_handle.py -> build/bdist.macosx-10.9-x86_64/egg/EGG-INFO/scripts
changing mode of build/bdist.macosx-10.9-x86_64/egg/EGG-INFO/scripts/type_variants.py to 755
changing mode of build/bdist.macosx-10.9-x86_64/egg/EGG-INFO/scripts/report_classes.py to 755
changing mode of build/bdist.macosx-10.9-x86_64/egg/EGG-INFO/scripts/pangofunks.py to 755
changing mode of build/bdist.macosx-10.9-x86_64/egg/EGG-INFO/scripts/custom_logger.py to 755
changing mode of build/bdist.macosx-10.9-x86_64/egg/EGG-INFO/scripts/pangolearn.py to 755
changing mode of build/bdist.macosx-10.9-x86_64/egg/EGG-INFO/scripts/utils.py to 755
changing mode of build/bdist.macosx-10.9-x86_64/egg/EGG-INFO/scripts/report_results.py to 755
changing mode of build/bdist.macosx-10.9-x86_64/egg/EGG-INFO/scripts/pangolearn.smk to 755
changing mode of build/bdist.macosx-10.9-x86_64/egg/EGG-INFO/scripts/log_handler_handle.py to 755
copying pangolin.egg-info/PKG-INFO -> build/bdist.macosx-10.9-x86_64/egg/EGG-INFO
copying pangolin.egg-info/SOURCES.txt -> build/bdist.macosx-10.9-x86_64/egg/EGG-INFO
copying pangolin.egg-info/dependency_links.txt -> build/bdist.macosx-10.9-x86_64/egg/EGG-INFO
copying pangolin.egg-info/entry_points.txt -> build/bdist.macosx-10.9-x86_64/egg/EGG-INFO
copying pangolin.egg-info/not-zip-safe -> build/bdist.macosx-10.9-x86_64/egg/EGG-INFO
copying pangolin.egg-info/requires.txt -> build/bdist.macosx-10.9-x86_64/egg/EGG-INFO
copying pangolin.egg-info/top_level.txt -> build/bdist.macosx-10.9-x86_64/egg/EGG-INFO
creating 'dist/pangolin-2.4.2-py3.7.egg' and adding 'build/bdist.macosx-10.9-x86_64/egg' to it
removing 'build/bdist.macosx-10.9-x86_64/egg' (and everything under it)
Processing pangolin-2.4.2-py3.7.egg
removing '/usr/local/Caskroom/miniconda/base/envs/pangolin/lib/python3.7/site-packages/pangolin-2.4.2-py3.7.egg' (and everything under it)
creating /usr/local/Caskroom/miniconda/base/envs/pangolin/lib/python3.7/site-packages/pangolin-2.4.2-py3.7.egg
Extracting pangolin-2.4.2-py3.7.egg to /usr/local/Caskroom/miniconda/base/envs/pangolin/lib/python3.7/site-packages
pangolin 2.4.2 is already the active version in easy-install.pth
Installing type_variants.py script to /usr/local/Caskroom/miniconda/base/envs/pangolin/bin
Installing report_classes.py script to /usr/local/Caskroom/miniconda/base/envs/pangolin/bin
Installing pangofunks.py script to /usr/local/Caskroom/miniconda/base/envs/pangolin/bin
Installing custom_logger.py script to /usr/local/Caskroom/miniconda/base/envs/pangolin/bin
Installing pangolearn.py script to /usr/local/Caskroom/miniconda/base/envs/pangolin/bin
Installing utils.py script to /usr/local/Caskroom/miniconda/base/envs/pangolin/bin
Installing report_results.py script to /usr/local/Caskroom/miniconda/base/envs/pangolin/bin
Installing pangolearn.smk script to /usr/local/Caskroom/miniconda/base/envs/pangolin/bin
Installing log_handler_handle.py script to /usr/local/Caskroom/miniconda/base/envs/pangolin/bin
Installing pangolin script to /usr/local/Caskroom/miniconda/base/envs/pangolin/bin

Installed /usr/local/Caskroom/miniconda/base/envs/pangolin/lib/python3.7/site-packages/pangolin-2.4.2-py3.7.egg
Processing dependencies for pangolin==2.4.2

...

Using /usr/local/Caskroom/miniconda/base/envs/pangolin/lib/python3.7/site-packages
Finished processing dependencies for pangolin==2.4.2
(pangolin) art@Wernstrom pangolin % pangolin
usage: pangolin <query> [options]

pangolin: Phylogenetic Assignment of Named Global Outbreak LINeages

positional arguments:
  query                 Query fasta file of sequences to analyse.

optional arguments:
  -h, --help            show this help message and exit
  -o OUTDIR, --outdir OUTDIR
                        Output directory. Default: current working directory
  --outfile OUTFILE     Optional output file name. Default: lineage_report.csv
  --alignment           Optional alignment output.
  -d DATADIR, --datadir DATADIR
                        Data directory minimally containing a fasta alignment
                        and guide tree
  --tempdir TEMPDIR     Specify where you want the temp stuff to go. Default:
                        $TMPDIR
  --no-temp             Output all intermediate files, for dev purposes.
  --decompress-model    Permanently decompress the model file to save time
                        running pangolin.
  --max-ambig MAXAMBIG  Maximum proportion of Ns allowed for pangolin to
                        attempt assignment. Default: 0.5
  --min-length MINLEN   Minimum query length allowed for pangolin to attempt
                        assignment. Default: 25000
  --panGUIlin           Run web-app version of pangolin
  --verbose             Print lots of stuff to screen
  -t THREADS, --threads THREADS
                        Number of threads
  -v, --version         show program's version number and exit
  -pv, --pangoLEARN-version
                        show pangoLEARN's version number and exit
  --update              Automatically updates to latest release of pangolin
                        and pangoLEARN, then exits
  --gzip                Query files are gzip-compressed.
(pangolin) art@Wernstrom pangolin % 

Note that the --xz option is missing from the help output, even though it is defined in pangolin/command.py.
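
One way to narrow this down is to check which copy of the package Python imports and which executable the shell resolves, since a mismatch between the two would explain stale help output. The sketch below is a generic diagnostic using only the standard library; the helper name `locate` is hypothetical, and it makes no claim about what is wrong in this particular environment.

```python
import importlib.util
import shutil


def locate(module_name, script_name):
    """Return (path of the importable module, path of the script on PATH)."""
    spec = importlib.util.find_spec(module_name)
    module_path = spec.origin if spec is not None else None
    script_path = shutil.which(script_name)
    return module_path, script_path


module_path, script_path = locate("pangolin", "pangolin")
print("module imported from:", module_path)
print("executable on PATH:  ", script_path)
```

Run it inside the activated environment; either value is None when the module or the executable cannot be found.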

Reduce memory footprint

Originally I was not able to process 20K+ sequences because my workstation ran out of memory while running pangolearn.py. There seem to be two memory-intensive steps in this script:

  1. loading and encoding the sequence data as "one-hot" vectors
  2. generating a pandas data frame from these vectors

I am attempting to reduce the RAM consumed by this script by (1) filtering sequences to the required sites (indices) on load, and (2) processing subsets of the filtered data as pandas data frames of a fixed maximum size.
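
The two ideas can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual change to pangolearn.py: the names `one_hot`, `encode_filtered`, and `blocks` are hypothetical, the alphabet is reduced to four bases, and plain lists stand in for the pandas data frames that each block would become.

```python
from itertools import islice

ALPHABET = "ACGT"  # simplified alphabet, assumed for this example


def one_hot(base):
    """Encode one base as a one-hot vector; unknown bases become all zeros."""
    return [1 if base == b else 0 for b in ALPHABET]


def encode_filtered(seq, indices):
    """Encode only the required sites, so unused positions are never stored."""
    flat = []
    for i in indices:
        if i < len(seq):
            flat.extend(one_hot(seq[i]))
    return flat


def blocks(records, indices, max_rows):
    """Yield lists of at most max_rows encoded records at a time."""
    it = iter(records)
    while True:
        chunk = list(islice(it, max_rows))
        if not chunk:
            return
        yield [(name, encode_filtered(seq, indices)) for name, seq in chunk]


records = [("a", "ACGT"), ("b", "AGGT"), ("c", "TTTT")]
for block in blocks(records, indices=[1, 3], max_rows=2):
    # Each block could be turned into a data frame and classified here.
    print(len(block))
```

Because `blocks` is a generator, only one block of encoded rows exists at a time, so peak memory depends on `max_rows` rather than on the total number of sequences.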
