Comments (6)

genomicsITER commented on August 11, 2024

Hi

I've finally found a fix to prevent multiple tax IDs from appearing on the same line, and it is already pushed to the master branch. I hope it works, since I couldn't reproduce the multiple tax ID classification error with our samples.

The classification for the problematic cluster had the same score for two database entries that correspond to the exact same sequence, so BLAST included both tax IDs on the same line. We encourage you to explore the classification results for each cluster in the consensus_classification.csv file (as you did) to better interpret NanoCLUST results.

Thank you for trying the application. I hope it is useful for you.

from nanoclust.

genomicsITER commented on August 11, 2024

Hi Devin, thank you again for your feedback,

It seems that your consensus_classification.csv has some errors (maybe due to past executions that failed?). Our recommendation is to delete all output files for that sample and rerun the pipeline to see whether you get the same file issues.

consensus_classification.csv has the following structure:

id;reads_in_cluster;used_for_consensus;reads_after_corr;draft_id;sciname;taxid;length;per_ident

  • id: HDBSCAN cluster ID number
  • reads_in_cluster: Total reads assigned to the cluster
  • used_for_consensus: Total reads taken from the cluster for polishing (equivalent to polishing_reads parameter)
  • reads_after_corr: Total reads remaining after the first correction
  • draft_id: Draft read ID from the FASTQ header
  • sciname: Tax name
  • taxid: NCBI tax ID
  • length: Consensus read length (between 1400-1600bp)
  • per_ident: Percentage identity with BLAST

Example line:

13;1314;100;41;2c3fc50f-da48-44e1-a06b-74bec98aaf93 id=96;Escherichia coli str. K-12 substr. MG1655;511145;1474;99.932
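
To illustrate the structure above, here is a minimal Python sketch that splits each line on ";" and converts the numeric columns. It is not part of NanoCLUST: the function name parse_consensus_rows and the FIELDS list are hypothetical, and it assumes the file has no header line. A row whose taxid column is not an integer (for instance because a name contains an extra ";") raises ValueError instead of being passed on silently.

```python
# Column names taken from the structure described above.
FIELDS = ["id", "reads_in_cluster", "used_for_consensus", "reads_after_corr",
          "draft_id", "sciname", "taxid", "length", "per_ident"]
INT_FIELDS = ("id", "reads_in_cluster", "used_for_consensus",
              "reads_after_corr", "taxid", "length")


def parse_consensus_rows(text):
    """Parse ';'-separated rows (assumed: no header line) into typed dicts."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        values = line.split(";")
        if len(values) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}: {line!r}")
        row = dict(zip(FIELDS, values))
        for key in INT_FIELDS:
            row[key] = int(row[key])  # raises ValueError on non-numeric text
        row["per_ident"] = float(row["per_ident"])
        rows.append(row)
    return rows


example = ("13;1314;100;41;2c3fc50f-da48-44e1-a06b-74bec98aaf93 id=96;"
           "Escherichia coli str. K-12 substr. MG1655;511145;1474;99.932")
parsed = parse_consensus_rows(example)
print(parsed[0]["sciname"], parsed[0]["taxid"])
```

Checking the column types, not only the column count, matters here because a shifted row can still have nine fields.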

It doesn't seem like an API problem, but if some tax_name entries included special characters (especially ";", which we use as the file separator), it would be problematic for the get_abundance process. In any case, get_abundance and plot_abundances only make some calculations from the "main" output file, consensus_classification.csv, so that file will be useful for analysis.

Thank you for the support, we hope you can get your data analyzed. Feel free to reply with more information or open new issues.

devindrown commented on August 11, 2024

Thank you for your assistance. I've started a clean run with just a single input sample. The run continues to terminate with an error.

I looked in the reported working dir /data/NanoCLUST/work/2b/1a079249e3a73f9c15717a38a58844

There are two files in that directory

PERM16S_20201028.barcode01.qcreads_nanoclust_out.txt 
PERM16S_20201028.barcode01.qcreads.nanoclust_out.txt

Looking at those files, I see that most lines look the same and follow the specification you mentioned, but there is a single offending line that includes two scientific names. For example, in PERM16S_20201028.barcode01.qcreads.nanoclust_out.txt:

58;1012;100;43;960ccf21-7860-45dd-969b-42a2de87e2b5 id=91;Bradyrhizobium viridifuturi;Bradyrhizobium mercantei;1904807;0.0

Looking at the information in cluster 58

$more cluster58/consensus_classification.csv

Bradyrhizobium viridifuturi;Bradyrhizobium mercantei;1654716;1904807;0.0;1411;99.150
Bradyrhizobium embrapense;630921;0.0;1411;99.150
Bradyrhizobium valentinum;1518501;0.0;1411;99.079
Bradyrhizobium jicamae;280332;0.0;1411;99.008
Bradyrhizobium elkanii;29448;0.0;1411;98.866

So it looks like the tax_name for this cluster contains two names with an offending ";" in between. Any suggestions on how to fix this?
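
One possible workaround for rows like the first one in cluster 58 can be sketched in Python. This is only an illustration, not the NanoCLUST fix. It assumes each hit line is laid out as one or more names, then the same number of taxids, then evalue, length, and percent identity, and that names and IDs are listed in matching order, which nothing in the output confirms. The function name collapse_multi_hit is hypothetical.

```python
def collapse_multi_hit(line, n_trailing=3):
    """Keep only the first name/taxid pair of a ';'-separated BLAST hit line.

    Assumed layout: name(s);taxid(s);evalue;length;pident, with as many
    names as taxids. Lines with a single name and taxid come back unchanged.
    """
    fields = line.strip().split(";")
    head, tail = fields[:-n_trailing], fields[-n_trailing:]
    if len(head) < 2 or len(head) % 2 != 0:
        raise ValueError(f"cannot pair names with taxids: {line!r}")
    half = len(head) // 2
    names, taxids = head[:half], head[half:]
    if not all(t.isdigit() for t in taxids):
        raise ValueError(f"non-numeric taxid in: {line!r}")
    return ";".join([names[0], taxids[0]] + tail)


multi = ("Bradyrhizobium viridifuturi;Bradyrhizobium mercantei;"
         "1654716;1904807;0.0;1411;99.150")
single = "Bradyrhizobium embrapense;630921;0.0;1411;99.150"
print(collapse_multi_hit(multi))
print(collapse_multi_hit(single))
```

Dropping the second name and ID loses information about the tie, so the per-cluster file is still worth inspecting by hand.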

More complete error information below

Run Name: loving_mendel

####################################################
## nf-core/nanoclust execution completed unsuccessfully! ##
####################################################
The exit status of the task that caused the workflow execution to fail was: 1.
The full error message was:

Error executing process > 'get_abundances (1)'

Caused by:
  Process `get_abundances (1)` terminated with an error exit status (1)

Command executed [/data/NanoCLUST/templates/get_abundance.py]:

Command error:
  WARNING: Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap.
  Traceback (most recent call last):
    File ".command.sh", line 55, in <module>
      get_abundance(names,paths, "C")
    File ".command.sh", line 49, in get_abundance
      df_final_grp = merge_abundance(dfs, tax_level)
    File ".command.sh", line 39, in merge_abundance
      df_final["taxid"] = [get_taxname(row["taxid"], tax_level) for index, row in df_final.iterrows()]
    File ".command.sh", line 39, in <listcomp>
      df_final["taxid"] = [get_taxname(row["taxid"], tax_level) for index, row in df_final.iterrows()]
    File ".command.sh", line 18, in get_taxname
      return json.loads(complete_tax)[0][tax_level_tag]
  IndexError: list index out of range


Pipeline Configuration:
-----------------------
 - Run Name: loving_mendel
 - Reads: /data/PERM/PERM16S_20201028/sample_fasta/PERM16S_20201028.barcode01.qcreads.fastq
 - Max Resources: 128 GB memory, 16 cpus, 10d time per job
 - Container: docker - [:]
 - Output dir: /data/PERM/PERM16S_20201028/NanoCLUST.BC01
 - Launch dir: /data/NanoCLUST
 - Working dir: /data/NanoCLUST/work
 - Script dir: /data/NanoCLUST
 - User: dmdrown
 - Config Profile: docker
 - Date Started: 2020-12-04T09:07:08.609779-09:00
 - Date Completed: 2020-12-04T09:48:39.519139-09:00
 - Pipeline script file path: /data/NanoCLUST/main.nf
 - Pipeline script hash ID: 41a36e29b6db0c14a411b4f911c51f5e
 - Nextflow Version: 20.10.0
 - Nextflow Build: 5430
 - Nextflow Compile Timestamp: 01-11-2020 15:14 UTC

devindrown commented on August 11, 2024

Thank you. The latest update with the altered blastn query output does fix the multiple taxid issue.

A remaining issue is that some taxids appear to be breaking the get_abundance.py script:

  Traceback (most recent call last):
    File ".command.sh", line 55, in <module>
      get_abundance(names,paths, "C")
    File ".command.sh", line 49, in get_abundance
      df_final_grp = merge_abundance(dfs, tax_level)
    File ".command.sh", line 39, in merge_abundance
      df_final["taxid"] = [get_taxname(row["taxid"], tax_level) for index, row in df_final.iterrows()]
    File ".command.sh", line 39, in <listcomp>
      df_final["taxid"] = [get_taxname(row["taxid"], tax_level) for index, row in df_final.iterrows()]
    File ".command.sh", line 18, in get_taxname
      return json.loads(complete_tax)[0][tax_level_tag]
  IndexError: list index out of range

Digging into this, I see that some taxon IDs return no output from http://api.unipept.ugent.be/api/v1/taxonomy.json?input. For example, http://api.unipept.ugent.be/api/v1/taxonomy.json?input[]=2715402&extra=true&names=true returns []. However, this taxon ID matches NCBI's Caballeronia ginsengisoli.

I'm wondering how the code might be altered to make it robust to these missing values.

Other taxon IDs don't have all of the required taxonomic information. For example, "taxon_id":988946 has the following output:

[{"taxon_id":988946,"taxon_name":"Loriellopsis cavernicola","taxon_rank":"species","superkingdom_id":2,"superkingdom_name":"Bacteria","kingdom_id":null,"kingdom_name":"","subkingdom_id":null,"subkingdom_name":"","superphylum_id":null,"superphylum_name":"","phylum_id":1117,"phylum_name":"Cyanobacteria","subphylum_id":null,"subphylum_name":"","superclass_id":null,"superclass_name":"","class_id":null,"class_name":"","subclass_id":null,"subclass_name":"","infraclass_id":null,"infraclass_name":"","superorder_id":null,"superorder_name":"","order_id":1161,"order_name":"Nostocales","suborder_id":null,"suborder_name":"","infraorder_id":null,"infraorder_name":"","parvorder_id":null,"parvorder_name":"","superfamily_id":null,"superfamily_name":"","family_id":1892258,"family_name":"Symphyonemataceae","subfamily_id":null,"subfamily_name":"","tribe_id":null,"tribe_name":"","subtribe_id":null,"subtribe_name":"","genus_id":988945,"genus_name":"Loriellopsis","subgenus_id":null,"subgenus_name":"","species_group_id":null,"species_group_name":"","species_subgroup_id":null,"species_subgroup_name":"","species_id":988946,"species_name":"Loriellopsis cavernicola","subspecies_id":null,"subspecies_name":"","varietas_id":null,"varietas_name":"","forma_id":null,"forma_name":""}]

Notice that the class name is empty: "class_name":"". Looking at a list of taxids from my sample, I can see that some are missing names at the class level and others at different taxonomic levels. I realize that these may be special cases and are perhaps being handled OK.
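
One way the lookup could be made robust to both cases (an empty [] response and an empty rank name) is sketched below in Python. This is an illustration, not necessarily how get_abundance.py handles it. The names safe_taxname and the "Unclassified" fallback label are hypothetical, while complete_tax and tax_level_tag are the names visible in the traceback.

```python
import json


def safe_taxname(complete_tax, tax_level_tag, fallback="Unclassified"):
    """Return the name stored under tax_level_tag, or fallback if unavailable.

    complete_tax is the raw JSON text returned by the taxonomy API.
    """
    records = json.loads(complete_tax)
    if not records:
        # The API answered [] for this taxon ID.
        return fallback
    name = records[0].get(tax_level_tag) or ""
    # An empty or whitespace-only rank name also counts as missing.
    return name if name.strip() else fallback


empty_response = "[]"
partial_response = ('[{"taxon_id":988946,"phylum_name":"Cyanobacteria",'
                    '"class_name":"","order_name":"Nostocales"}]')
print(safe_taxname(empty_response, "class_name"))
print(safe_taxname(partial_response, "class_name"))
print(safe_taxname(partial_response, "phylum_name"))
```

Whether missing ranks should be grouped under one label or dropped from the abundance table is a design choice left open here.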

genomicsITER commented on August 11, 2024

Hi,

We apologize for the late response. We have integrated some minor changes into the NanoCLUST main branch. The issues with get_abundance.py and the API have been solved. Thank you very much for your time, and especially for the error descriptions, which helped us solve this issue.

devindrown commented on August 11, 2024

The latest commit seems to have erased the update you made previously (on December 11) to prevent multiple tax IDs on the same line. There you corrected a couple of lines in the blastn calls:

blastn -query $consensus -db nr -remote -entrez_query "Bacteria [Organism]" -task blastn -dust no -outfmt "10 staxids sscinames evalue length score pident" -evalue 11 -max_hsps 50 -max_target_seqs 5 > consensus_classification.csv

blastn -query $consensus -db $db -task blastn -dust no -outfmt "10 sscinames staxids evalue length pident" -evalue 11 -max_hsps 50 -max_target_seqs 5 | sed 's/,/;/g' > consensus_classification.csv

These were modified so that the sscinames and staxids fields became ssciname and staxid, as below:

blastn -query $consensus -db nr -remote -entrez_query "Bacteria [Organism]" -task blastn -dust no -outfmt "10 staxid ssciname evalue length score pident" -evalue 11 -max_hsps 50 -max_target_seqs 5 > consensus_classification.csv

blastn -query $consensus -db $db -task blastn -dust no -outfmt "10 ssciname staxid evalue length pident" -evalue 11 -max_hsps 50 -max_target_seqs 5 | sed 's/,/;/g' > consensus_classification.csv
