Comments (6)
Hi
I've finally found a fix to prevent multiple tax IDs from appearing on the same line, and it is already pushed to the master branch. I hope it works, since I couldn't reproduce the multiple tax ID classification error with our samples.
The classification for the problematic cluster had the same score for two database entries that correspond to the exact same sequence, so BLAST included both tax IDs on the same line. We encourage you to explore the classification results for each cluster in the consensus_classification.csv file (as you did) to better interpret NanoCLUST results.
Thank you for trying the application. I hope it is useful for you.
from nanoclust.
Hi Devin, thank you again for your feedback,
It seems that your consensus_classification.csv has some errors (maybe due to past executions with errors?). Our recommendation is to delete all output files for that sample and rerun the pipeline to see if you get the same file issues.
consensus_classification.csv has the following structure:
id;reads_in_cluster;used_for_consensus;reads_after_corr;draft_id;sciname;taxid;length;per_ident
- id: HDBSCAN cluster ID number
- reads_in_cluster: Total reads assigned to the cluster
- used_for_consensus: Total reads taken from the cluster for polishing (equivalent to polishing_reads parameter)
- reads_after_corr: Total reads remaining after the first correction
- draft_id: Draft read ID from the FASTQ header
- sciname: Tax name
- taxid: NCBI tax ID
- length: Consensus read length (between 1400-1600bp)
- per_ident: Percentage identity with BLAST
Example line:
13;1314;100;41;2c3fc50f-da48-44e1-a06b-74bec98aaf93 id=96;Escherichia coli str. K-12 substr. MG1655;511145;1474;99.932
It doesn't seem like an API problem, but if some tax_name entries included special characters (especially ";", which we use as the field separator), that would be problematic for the get_abundance process. In any case, get_abundance and plot_abundances just run some calculations on the "main" output file, consensus_classification.csv, so that file will be useful for analysis.
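As a rough illustration of the structure described above, the following Python sketch flags rows whose taxid, length, or per_ident fields do not parse as numbers, which is one way a stray ";" inside a name would show up. The function name `find_suspect_rows` and the second sample row are hypothetical and exist only for this example; this is not part of NanoCLUST.

```python
import csv
import io

# Column order as documented for consensus_classification.csv.
COLUMNS = ["id", "reads_in_cluster", "used_for_consensus", "reads_after_corr",
           "draft_id", "sciname", "taxid", "length", "per_ident"]


def is_float(text):
    """Return True when text parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def find_suspect_rows(handle):
    """Return 1-based line numbers of rows that do not match the layout.

    A row is suspect when the field count is wrong, or when taxid/length
    are not integers, or per_ident is not a number. An extra ';' inside a
    scientific name shifts the columns and trips one of these checks.
    """
    suspects = []
    reader = csv.reader(handle, delimiter=";")
    for line_number, fields in enumerate(reader, start=1):
        if not fields or fields[0] == "id":
            continue  # skip blank lines and an optional header row
        ok = (
            len(fields) == len(COLUMNS)
            and fields[6].isdigit()
            and fields[7].isdigit()
            and is_float(fields[8])
        )
        if not ok:
            suspects.append(line_number)
    return suspects


sample = io.StringIO(
    # Row 1: the example line from the description above.
    "13;1314;100;41;2c3fc50f-da48-44e1-a06b-74bec98aaf93 id=96;"
    "Escherichia coli str. K-12 substr. MG1655;511145;1474;99.932\n"
    # Row 2: hypothetical row with two names separated by ';'.
    "1;10;10;5;draft id=1;Genus one;Genus two;123;0.0\n"
)
suspect_lines = find_suspect_rows(sample)
print(suspect_lines)
```

Note that a shifted row can still have nine fields, so checking the field count alone would not be enough; the type checks on the last three columns are what catch it here.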
Thank you for the support, we hope you can get your data analyzed. Feel free to answer again with more information or open new issues.
from nanoclust.
Thank you for your assistance. I've started a clean run with just a single input sample. The run continues to terminate with an error.
I looked in the reported working dir /data/NanoCLUST/work/2b/1a079249e3a73f9c15717a38a58844
There are two files in that directory
PERM16S_20201028.barcode01.qcreads_nanoclust_out.txt
PERM16S_20201028.barcode01.qcreads.nanoclust_out.txt
Looking at those files, I see that most lines look the same and follow the specification you mentioned, but there is a single offending line that includes two scientific names. For example, in PERM16S_20201028.barcode01.qcreads.nanoclust_out.txt:
58;1012;100;43;960ccf21-7860-45dd-969b-42a2de87e2b5 id=91;Bradyrhizobium viridifuturi;Bradyrhizobium mercantei;1904807;0.0
Looking at the information in cluster 58
$more cluster58/consensus_classification.csv
Bradyrhizobium viridifuturi;Bradyrhizobium mercantei;1654716;1904807;0.0;1411;99.150
Bradyrhizobium embrapense;630921;0.0;1411;99.150
Bradyrhizobium valentinum;1518501;0.0;1411;99.079
Bradyrhizobium jicamae;280332;0.0;1411;99.008
Bradyrhizobium elkanii;29448;0.0;1411;98.866
So it looks like the tax_name for this cluster has two names with an offending ; in between. Any suggestions on how to fix this?
More complete error information below
Run Name: loving_mendel
####################################################
## nf-core/nanoclust execution completed unsuccessfully! ##
####################################################
The exit status of the task that caused the workflow execution to fail was: 1.
The full error message was:
Error executing process > 'get_abundances (1)'
Caused by:
Process `get_abundances (1)` terminated with an error exit status (1)
Command executed [/data/NanoCLUST/templates/get_abundance.py]:
Command error:
WARNING: Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap.
Traceback (most recent call last):
File ".command.sh", line 55, in <module>
get_abundance(names,paths, "C")
File ".command.sh", line 49, in get_abundance
df_final_grp = merge_abundance(dfs, tax_level)
File ".command.sh", line 39, in merge_abundance
df_final["taxid"] = [get_taxname(row["taxid"], tax_level) for index, row in df_final.iterrows()]
File ".command.sh", line 39, in <listcomp>
df_final["taxid"] = [get_taxname(row["taxid"], tax_level) for index, row in df_final.iterrows()]
File ".command.sh", line 18, in get_taxname
return json.loads(complete_tax)[0][tax_level_tag]
IndexError: list index out of range
Pipeline Configuration:
-----------------------
- Run Name: loving_mendel
- Reads: /data/PERM/PERM16S_20201028/sample_fasta/PERM16S_20201028.barcode01.qcreads.fastq
- Max Resources: 128 GB memory, 16 cpus, 10d time per job
- Container: docker - [:]
- Output dir: /data/PERM/PERM16S_20201028/NanoCLUST.BC01
- Launch dir: /data/NanoCLUST
- Working dir: /data/NanoCLUST/work
- Script dir: /data/NanoCLUST
- User: dmdrown
- Config Profile: docker
- Date Started: 2020-12-04T09:07:08.609779-09:00
- Date Completed: 2020-12-04T09:48:39.519139-09:00
- Pipeline script file path: /data/NanoCLUST/main.nf
- Pipeline script hash ID: 41a36e29b6db0c14a411b4f911c51f5e
- Nextflow Version: 20.10.0
- Nextflow Build: 5430
- Nextflow Compile Timestamp: 01-11-2020 15:14 UTC
from nanoclust.
Thank you. The latest update with the altered blastn query output does fix the multiple taxid issue.
A remaining issue is that some taxids appear to be breaking the get_abundance.py function:
Traceback (most recent call last):
File ".command.sh", line 55, in <module>
get_abundance(names,paths, "C")
File ".command.sh", line 49, in get_abundance
df_final_grp = merge_abundance(dfs, tax_level)
File ".command.sh", line 39, in merge_abundance
df_final["taxid"] = [get_taxname(row["taxid"], tax_level) for index, row in df_final.iterrows()]
File ".command.sh", line 39, in <listcomp>
df_final["taxid"] = [get_taxname(row["taxid"], tax_level) for index, row in df_final.iterrows()]
File ".command.sh", line 18, in get_taxname
return json.loads(complete_tax)[0][tax_level_tag]
IndexError: list index out of range
Digging into this, I see that some taxon IDs get no output from http://api.unipept.ugent.be/api/v1/taxonomy.json?input. For example, http://api.unipept.ugent.be/api/v1/taxonomy.json?input[]=2715402&extra=true&names=true returns []. However, this taxon ID matches NCBI's Caballeronia ginsengisoli.
I'm wondering how the code might be altered to make it robust to these missing values.
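One possible way to guard the lookup, sketched in Python under stated assumptions: the function name `get_taxname_safe` and the `default` label are hypothetical, and the function takes the raw JSON text that the traceback shows being passed to `json.loads`. This is not the fix that was eventually merged, only an illustration of handling an empty `[]` response.

```python
import json


def get_taxname_safe(complete_tax, tax_level_tag, default="unclassified"):
    """Return the requested name from the API response, or a default.

    complete_tax is the raw JSON text returned by the taxonomy endpoint.
    An empty list, invalid JSON, or a missing/empty name all fall back to
    default instead of raising IndexError or KeyError.
    """
    try:
        records = json.loads(complete_tax)
    except (TypeError, ValueError):
        return default
    if not isinstance(records, list) or not records:
        return default
    first = records[0]
    if not isinstance(first, dict):
        return default
    # 'or default' also covers names returned as "" or null.
    return first.get(tax_level_tag) or default


empty_result = get_taxname_safe("[]", "class_name")
found_result = get_taxname_safe(
    '[{"taxon_id": 988946, "order_name": "Nostocales", "class_name": ""}]',
    "order_name",
)
print(empty_result, found_result)
```

Whether such taxids should be reported as "unclassified" or dropped from the abundance table is a design decision left open here.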
Other taxon IDs don't have all of the required taxonomic information. For example, "taxon_id":988946 has the following output:
[{"taxon_id":988946,"taxon_name":"Loriellopsis cavernicola","taxon_rank":"species","superkingdom_id":2,"superkingdom_name":"Bacteria","kingdom_id":null,"kingdom_name":"","subkingdom_id":null,"subkingdom_name":"","superphylum_id":null,"superphylum_name":"","phylum_id":1117,"phylum_name":"Cyanobacteria","subphylum_id":null,"subphylum_name":"","superclass_id":null,"superclass_name":"","class_id":null,"class_name":"","subclass_id":null,"subclass_name":"","infraclass_id":null,"infraclass_name":"","superorder_id":null,"superorder_name":"","order_id":1161,"order_name":"Nostocales","suborder_id":null,"suborder_name":"","infraorder_id":null,"infraorder_name":"","parvorder_id":null,"parvorder_name":"","superfamily_id":null,"superfamily_name":"","family_id":1892258,"family_name":"Symphyonemataceae","subfamily_id":null,"subfamily_name":"","tribe_id":null,"tribe_name":"","subtribe_id":null,"subtribe_name":"","genus_id":988945,"genus_name":"Loriellopsis","subgenus_id":null,"subgenus_name":"","species_group_id":null,"species_group_name":"","species_subgroup_id":null,"species_subgroup_name":"","species_id":988946,"species_name":"Loriellopsis cavernicola","subspecies_id":null,"subspecies_name":"","varietas_id":null,"varietas_name":"","forma_id":null,"forma_name":""}]
Notice that the class name is: "class_name":""
Looking at a list of taxids from my sample, I can see that some are missing names at the Class and others are missing at different taxonomic levels. I realize that these may be special cases and perhaps being handled OK.
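For records where the requested rank is empty, one option is to fall back to the nearest populated higher rank. The Python sketch below assumes the `<rank>_name` keys seen in the API output above; the `RANKS` list, the function name `name_at_or_above`, and the "unclassified ..." label are hypothetical choices for this illustration only.

```python
# Main ranks, ordered from broadest to narrowest. Intermediate ranks
# (subphylum, superclass, ...) are omitted to keep the sketch short.
RANKS = ["superkingdom", "phylum", "class", "order", "family", "genus", "species"]


def name_at_or_above(record, rank):
    """Return the name at rank, or a label built from the closest parent.

    record is one dict from the taxonomy response. If the requested rank
    has an empty name, walk up toward superkingdom and return
    'unclassified <parent name>' for the first populated rank found.
    """
    position = RANKS.index(rank)
    for candidate in reversed(RANKS[: position + 1]):
        name = record.get(candidate + "_name")
        if name:
            return name if candidate == rank else "unclassified " + name
    return "unclassified"


# Subset of the record for taxon_id 988946 shown above.
record = {
    "superkingdom_name": "Bacteria",
    "phylum_name": "Cyanobacteria",
    "class_name": "",
    "order_name": "Nostocales",
}
class_label = name_at_or_above(record, "class")
order_label = name_at_or_above(record, "order")
print(class_label, order_label)
```

Grouping such reads under their parent rank keeps them in the abundance totals rather than discarding them, at the cost of an extra category per parent.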
from nanoclust.
Hi,
We apologize for the late response. We have integrated some minor changes into the NanoCLUST main branch. The issues with get_abundance.py and the API have been solved. Thank you very much for your time, and especially for the error descriptions, which helped us solve this issue.
from nanoclust.
The latest commit seems to have erased the update you made previously (on December 11) to prevent multiple tax IDs from appearing on the same line. There you corrected a couple of lines in the blastn calls:
Line 439 in 6fd4a65
Line 450 in 6fd4a65
This was modified such that the sscinames staxids values became ssciname staxid, as below:
blastn -query $consensus -db nr -remote -entrez_query "Bacteria [Organism]" -task blastn -dust no -outfmt "10 staxid ssciname evalue length score pident" -evalue 11 -max_hsps 50 -max_target_seqs 5 > consensus_classification.csv
blastn -query $consensus -db $db -task blastn -dust no -outfmt "10 ssciname staxid evalue length pident" -evalue 11 -max_hsps 50 -max_target_seqs 5 | sed 's/,/;/g' > consensus_classification.csv
from nanoclust.