nbisweden / contigtax

Taxonomic classification of metagenomic contigs

License: MIT License

Python 99.32% Dockerfile 0.68%

metagenomics taxonomy-assignment sequence-analysis metatranscriptomics

contigtax's People

Contributors

Stargazers

Watchers

Forkers

vikash84 linxingchen dagahren katjako ayixon

contigtax's Issues

Assignment fails due to duplicate taxonomic rank entries

Entries that have multiple definitions at a certain rank cause contigtax assign to fail with the error:

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

For instance, this can be the result of a taxonomic id having:

rank   superkingdom phylum order genus species  class  family    class
182248         2759   7711  9443  9499   30589  40674  378855  1338369

This will be fixed in the next release by only selecting unique columns when setting up the lineage dataframe.
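The failure mode and the proposed fix can be reproduced in a few lines of pandas. This is a minimal sketch, not the contigtax code itself: the dataframe layout is an assumption, and `columns.duplicated()` is one way to "select unique columns" (it keeps the first occurrence of each rank). The taxids are taken from the example above.

```python
import pandas as pd

# Hypothetical one-row lineage table in which the rank "class" appears twice.
ranks = ["superkingdom", "phylum", "order", "genus",
         "species", "class", "family", "class"]
values = [[2759, 7711, 9443, 9499, 30589, 40674, 378855, 1338369]]
lineage = pd.DataFrame(values, index=[182248], columns=ranks)

# With a duplicated column label, a single-cell lookup returns a Series
# holding both values instead of a scalar.
dup = lineage.loc[182248, "class"]
is_series = isinstance(dup, pd.Series)

# Using that Series in a boolean context raises the reported ValueError.
try:
    if dup < 0:
        pass
    raised = False
except ValueError:
    raised = True

# Keep only the first occurrence of each column label.
unique_lineage = lineage.loc[:, ~lineage.columns.duplicated()]
class_taxid = int(unique_lineage.loc[182248, "class"])
```

Keeping the first occurrence is an arbitrary choice here; which of the two "class" entries should win is a design decision the issue does not settle.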

Make it possible to extract proteins for certain taxa before building database

After the reformatting step (or during it, if working with the uniref databases) it should be possible to extract only proteins matching a certain taxon, for example only Bacteria. This could be done on the command line with, for instance, --taxlimit superkingdom:Bacteria or --taxlimit superkingdom Bacteria Archaea.
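A rough sketch of how the two proposed --taxlimit forms could be parsed and applied is given below. Everything here is hypothetical: `parse_taxlimit`, `filter_proteins`, and the protein-to-lineage dictionary are illustration only and do not exist in contigtax.

```python
def parse_taxlimit(tokens):
    """Parse ['superkingdom:Bacteria'] or ['superkingdom', 'Bacteria', 'Archaea']
    into a (rank, set_of_names) pair."""
    if len(tokens) == 1 and ":" in tokens[0]:
        rank, name = tokens[0].split(":", 1)
        return rank, {name}
    if len(tokens) < 2:
        raise ValueError("expected RANK:NAME or RANK NAME [NAME ...]")
    return tokens[0], set(tokens[1:])


def filter_proteins(protein_lineages, rank, names):
    """Return the protein ids whose lineage has one of `names` at `rank`."""
    return [pid for pid, lineage in protein_lineages.items()
            if lineage.get(rank) in names]


# Hypothetical protein id -> lineage mapping, for illustration only.
proteins = {
    "protA": {"superkingdom": "Bacteria"},
    "protB": {"superkingdom": "Eukaryota"},
    "protC": {"superkingdom": "Archaea"},
}

rank, names = parse_taxlimit(["superkingdom", "Bacteria", "Archaea"])
kept = filter_proteins(proteins, rank, names)
```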

Unclear instructions

My student is following the instructions as given in the Readme and got stuck on step 5:

$ contigtax search -p 4 53.fa uniref100/diamond.dmnd assembly.tsv.gz
ERROR: This diamond version requires you to supply a taxonmap file with --taxonmap at this stage

I managed to figure out she needed to add --taxonmap uniref100/prot.accession2taxid.gz, but it's probably good to be explicit about this!

Add option to supply protein id to taxonomy id mapping for custom database

If a user has already performed the diamond search step it should be possible to supply a protein id to taxonomy id file to create the lineage dataframe from.

Feature request: Utilize Diamond's contig features

Thanks for the awesome software! If I understand the code correctly, Diamond is executed using largely default parameters. I'd suggest adding --range-culling --top 10 -F 15, but this will likely require rewrites of other areas of contigtax. These parameters make Diamond perform local alignment and retain the top hits (within 10% of the best score) in each region of the query contig. contigtax would then have to stop filtering by bitscore, and should rely on Diamond's evalue parameter instead of filtering on evalue itself. Just wanted to get the discussion started - there are probably a bunch of other design decisions that I'm missing.
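To make the suggestion concrete, here is a sketch of how such a Diamond call could be assembled in Python. The helper `build_diamond_command` is hypothetical and is not how contigtax currently builds its command; the file names and the evalue default are placeholders. The code only constructs the argument list and does not run Diamond.

```python
import shlex


def build_diamond_command(query, db, out, threads=4, evalue=0.001,
                          top=10, frameshift=15):
    """Assemble a 'diamond blastx' argument list with range culling enabled."""
    return [
        "diamond", "blastx",
        "--query", query,
        "--db", db,
        "--out", out,
        "--threads", str(threads),
        "--evalue", str(evalue),      # let Diamond apply the evalue cutoff
        "--range-culling",            # keep hits per region of the contig
        "--top", str(top),            # hits within this % of the best score
        "-F", str(frameshift),        # frameshift-aware alignment
    ]


cmd = build_diamond_command("53.fa", "uniref100/diamond.dmnd", "assembly.tsv")
command_line = " ".join(shlex.quote(arg) for arg in cmd)
```

The resulting list could be passed to `subprocess.run(cmd, check=True)`; any downstream bitscore filtering would need to be revisited separately, as the issue notes.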

getting crashed

Making lineages: 7%|██▍ | 20724/277641 [00:14<02:51, 1497.39 taxids/s]
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/home/vsingh/.conda/envs/tango/lib/python3.5/multiprocessing/pool.py", line 119, in worker
result = (True, func(*args, **kwds))
File "/home/vsingh/.conda/envs/tango/lib/python3.5/site-packages/tango/assign.py", line 490, in process_lineages
x = add_names(x, taxid, ncbi_taxa)
File "/home/vsingh/.conda/envs/tango/lib/python3.5/site-packages/tango/assign.py", line 114, in add_names
if t < 0:
File "/home/vsingh/.conda/envs/tango/lib/python3.5/site-packages/pandas/core/generic.py", line 1576, in __nonzero__
.format(self.__class__.__name__))
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/home/vsingh/.conda/envs/tango/bin/tango", line 12, in <module>
sys.exit(main())
File "/home/vsingh/.conda/envs/tango/lib/python3.5/site-packages/tango/__main__.py", line 283, in main
args.func(args)
File "/home/vsingh/.conda/envs/tango/lib/python3.5/site-packages/tango/__main__.py", line 77, in assign_taxonomy
args.rank_thresholds, args.taxdir, args.sqlitedb, args.chunksize, args.cpus)
File "/home/vsingh/.conda/envs/tango/lib/python3.5/site-packages/tango/assign.py", line 776, in parse_hits
lineage_df, name_dict = make_lineage_df(taxids, taxdir, sqlitedb, reportranks, cpus)
File "/home/vsingh/.conda/envs/tango/lib/python3.5/site-packages/tango/assign.py", line 561, in make_lineage_df
unit=" taxids", ncols=100))
File "/home/vsingh/.conda/envs/tango/lib/python3.5/site-packages/tqdm/std.py", line 1093, in __iter__
for obj in iterable:
File "/home/vsingh/.conda/envs/tango/lib/python3.5/multiprocessing/pool.py", line 731, in next
raise value
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

contigtax download fails

I just installed the program through conda on a compute server. When downloading database files, the command fails with the following:

contigtax download taxonomy
Downloading NCBI taxdump.tar.gz
0 bytes [00:00, ? bytes/s]
Traceback (most recent call last):
File "/home/m.sevi/miniconda3/envs/contitax_env/bin/contigtax", line 10, in <module>
sys.exit(main())
File "/home/m.sevi/miniconda3/envs/contitax_env/lib/python3.6/site-packages/contigtax/__main__.py", line 387, in main
args.func(args)
File "/home/m.sevi/miniconda3/envs/contitax_env/lib/python3.6/site-packages/contigtax/__main__.py", line 24, in download
prepare.download_ncbi_taxonomy(args.taxdir, args.force)
File "/home/m.sevi/miniconda3/envs/contitax_env/lib/python3.6/site-packages/contigtax/prepare.py", line 200, in download_ncbi_taxonomy
urllib.request.urlretrieve(url, local, reporthook=reporthook)
File "/home/m.sevi/miniconda3/envs/contitax_env/lib/python3.6/urllib/request.py", line 274, in urlretrieve
reporthook(blocknum, bs, size)
File "/home/m.sevi/miniconda3/envs/contitax_env/lib/python3.6/site-packages/contigtax/prepare.py", line 37, in update_to
t.update((b - last_b[0]) * bsize)
File "/home/m.sevi/miniconda3/envs/contitax_env/lib/python3.6/site-packages/tqdm-4.7.2-py3.6.egg/tqdm/_tqdm.py", line 689, in update
ZeroDivisionError: float division by zero

Similar behaviour is observed with contigtax download uniref100.
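The traceback shows the error being raised inside tqdm's `update`, called from the download report hook, while the progress line still reads "0 bytes". One possible workaround, assuming the failure is triggered by a zero-sized update on the first callback (an assumption, not something the traceback proves), is a hook that only forwards positive increments. The sketch below is hypothetical and not the contigtax implementation; `RecordingProgress` is a stand-in for a progress bar so the logic can be run without tqdm or a network connection.

```python
def make_reporthook(progress):
    """Return a urlretrieve-style hook that skips zero-byte updates."""
    last_block = [0]

    def hook(blocknum=1, bsize=1, tsize=None):
        # urlretrieve passes a non-positive size when the total is unknown.
        if tsize is not None and tsize > 0:
            progress.total = tsize
        delta = (blocknum - last_block[0]) * bsize
        last_block[0] = blocknum
        if delta > 0:  # never forward a zero increment to the progress bar
            progress.update(delta)

    return hook


class RecordingProgress:
    """Minimal progress-bar stand-in that records every update call."""

    def __init__(self):
        self.total = None
        self.updates = []

    def update(self, n):
        self.updates.append(n)


progress = RecordingProgress()
hook = make_reporthook(progress)
hook(0, 8192, -1)   # first callback: nothing transferred yet, skipped
hook(1, 8192, -1)   # one block transferred
hook(3, 8192, -1)   # two more blocks transferred
```

The hook has the `(blocknum, bsize, tsize)` signature that `urllib.request.urlretrieve` expects for its `reporthook` argument, so a real progress bar object with `total` and `update` could be passed in place of `RecordingProgress`.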
