foi-bioinformatics / flextaxd

FlexTaxD (Flexible Taxonomy Databases) - Create, add, merge different taxonomy sources (QIIME, GTDB, NCBI and more) and create metagenomic databases (kraken2, ganon and more)

License: GNU General Public License v3.0

Python 100.00%

ncbi-taxonomy gtdb ganon kraken2 custom-taxonomy metagenomic-classification custom-database taxonomic-classification taxonomic-classifications centrifuge

flextaxd's Issues

Cannot import name "BadGzipFile"

Hi FOI-bioinformatics team,

I get an error when I run:

flextaxd-create -db gtdb_r95.sqlite3 -o taxonomy/ --genomes_path gtdb/release95/fastani/database/ -p 20 --verbose --logs build_kraken2.db.log --create --db_name gtdb_k2_db --dbprogram kraken2

2020-11-13 10:42:07,494 create_databases [INFO ]  FlexTaxD-create logging initiated!
2020-11-13 10:42:07,531 create_databases [INFO ]  Processing files; create kraken seq.map
2020-11-13 10:42:07,532 DatabaseConnection [INFO ]  gtdb_r95.sqlite3 opened successfully.
2020-11-13 10:42:07,652 ProcessDirectory [INFO ]  Number of genomes annotated in database 31910
2020-11-13 10:42:07,652 ProcessDirectory [INFO ]  Process genome path (gtdb/release95/fastani/database/)
2020-11-13 10:42:08,306 ProcessDirectory [INFO ]  Processed 31910 genomes
2020-11-13 10:42:08,341 create_databases [INFO ]  Loading module: CreateKrakenDatabase
Traceback (most recent call last):
File "/ldfssz1/ST_META/share/User/zhujie/.conda/envs/bioenv/bin/flextaxd-create", line 10, in <module>
sys.exit(main())
File "/ldfssz1/ST_META/share/User/zhujie/.conda/envs/bioenv3.7/lib/python3.7/site-packages/flextaxd/create_databases.py", line 195, in main
classifier = dynamic_import("modules", "CreateKrakenDatabase")
File "/ldfssz1/ST_META/share/User/zhujie/.conda/envs/bioenv3.7/lib/python3.7/site-packages/flextaxd/create_databases.py", line 42, in dynamic_import
module = import_module(".".join([abs_module_path,class_name]))
File "/ldfssz1/ST_META/share/User/zhujie/.conda/envs/bioenv3.7/lib/python3.7/importlib/__init__.py", line 127, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
File "<frozen importlib._bootstrap>", line 983, in _find_and_load
File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 728, in exec_module
File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
File "/ldfssz1/ST_META/share/User/zhujie/.conda/envs/bioenv3.7/lib/python3.7/site-packages/flextaxd/modules/CreateKrakenDatabase.py", line 17, in <module>
from gzip import BadGzipFile
ImportError: cannot import name 'BadGzipFile' from 'gzip' (/ldfssz1/ST_META/share/User/zhujie/.conda/envs/bioenv3.7/lib/python3.7/gzip.py)
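
The traceback shows Python 3.7, and `gzip.BadGzipFile` only exists from Python 3.8 onwards, so the import fails on older interpreters. Upgrading the environment to Python 3.8 or later avoids the error. For illustration, a minimal sketch of a compatibility shim, assuming the name is only needed in `except` clauses (the helper `is_gzip` is hypothetical and not part of flextaxd):

```python
import gzip
import io

try:
    from gzip import BadGzipFile  # available from Python 3.8
except ImportError:
    # Before 3.8, gzip reported a bad header with a plain OSError.
    BadGzipFile = OSError


def is_gzip(data: bytes) -> bool:
    """Hypothetical helper: True if `data` reads as gzip, False on a bad header."""
    try:
        gzip.GzipFile(fileobj=io.BytesIO(data)).read()
        return True
    except BadGzipFile:
        return False
```

Since `BadGzipFile` is a subclass of `OSError`, code that catches the alias behaves the same way on both old and new interpreters.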

Not all superkingdoms appear in nodes.dmp file

Hello,

I noticed that if I have Bacteria, Archaea, and Eukaryota as domains in my QIIME-formatted file, only the Bacteria node appears in the nodes.dmp file, and not Archaea or Eukaryota. I have attached the sample files below.
These are the commands I used:

flextaxd --taxonomy_file Sample_taxonomy_2.tsv --taxonomy_type QIIME --database .ftd
flextaxd --dump

In the nodes file, I do not see nodes 4 and 13 which should be Eukaryota and Archaea respectively. However, the nodes are present in the names.dmp file. I have attached the .txt versions of my .tsv files below.
Sample_taxonomy_2.txt
names.txt
nodes.txt
Please let me know if I am doing something incorrectly here.

Thank you.

flextaxd error: "ValueError: not enough values to unpack (expected 2, got 1)"

I'm attempting to follow along with this part of the tutorial/wiki, to get a better understanding of how to create my own custom DB. Things are okay until I get to the database creation step:

# flextaxd -db 16S_database.db -tf GTDB_arc_bact_taxo_tree_unique.txt -tt CanSNPer --genomeid2taxid g2id.txt --dump --dbprogram kraken2 -o taxonomy --verbose --logs logs/zenodo
2024-02-07 18:08:45,291 custom_taxonomy_databases [INFO ]  FlexTaxD logging initiated!
Warning: 16S_database.db already exists, overwrite? (y/n): y
2024-02-07 18:08:49,303 custom_taxonomy_databases [INFO ]  Loading module: ReadTaxonomyCanSNPer
2024-02-07 18:08:49,352 DatabaseConnection [INFO ]  16S_database.db opened successfully.
2024-02-07 18:08:49,353 ReadTaxonomyCanSNPer [INFO ]  GTDB_arc_bact_taxo_tree_unique.txt
2024-02-07 18:08:49,353 ReadTaxonomyCanSNPer [INFO ]  Fetching root name from file
2024-02-07 18:08:49,353 ReadTaxonomyCanSNPer [INFO ]  Adding, cellular organism node
2024-02-07 18:08:49,354 ReadTaxonomyCanSNPer [INFO ]  Adding root node root!
2024-02-07 18:08:49,355 custom_taxonomy_databases [INFO ]  Parse taxonomy
2024-02-07 18:08:49,355 ReadTaxonomyCanSNPer [INFO ]  Parse CanSNP tree file
2024-02-07 18:08:49,902 ReadTaxonomyCanSNPer [INFO ]  New taxonomy ids assigned 12929
Traceback (most recent call last):
  File "/home/nnnnnn/mambaforge/lib/python3.9/site-packages/flextaxd/modules/ReadTaxonomy.py", line 153, in parse_genomeid2taxid
    genomeid,taxid = row.strip().split("\t")
ValueError: not enough values to unpack (expected 2, got 1)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/nnnnnn/mambaforge/bin/flextaxd", line 8, in <module>
    sys.exit(main())
  File "/home/nnnnnn/mambaforge/lib/python3.9/site-packages/flextaxd/custom_taxonomy_databases.py", line 330, in main
    read_obj.parse_genomeid2taxid(args.genomeid2taxid)
  File "/home/nnnnnn/mambaforge/lib/python3.9/site-packages/flextaxd/modules/ReadTaxonomy.py", line 156, in parse_genomeid2taxid
    genomeid,taxid,reference = row.strip().split("\t")
ValueError: not enough values to unpack (expected 3, got 1)

Here's the first few lines of my two input files:

# head g2id.txt 
GB_GCA_000010565.1      Pelotomaculum thermopropionicum
GB_GCA_000018565.1      Herpetosiphon aurantiacus
GB_GCA_000024525.1      Spirosoma linguale
GB_GCA_000091165.1      Methylomirabilis oxyfera_B
GB_GCA_000146855.1      Peptoanaerobacter margaretiae
GB_GCA_000147015.1      Zinderia insecticola
GB_GCA_000163995.1      Campylobacter_D jejuni_A
GB_GCA_000165065.1      Longicatena sp000165065
GB_GCA_000166295.1      Marinobacter adhaerens
GB_GCA_000168735.1      Endoriftia persephone

 # head GTDB_arc_bact_taxo_tree_unique.txt 
root;Archaea;Aenigmatarchaeota;Aenigmatarchaeia;Aenigmatarchaeales;Aenigmatarchaeaceae;Aenigmatarchaeum;Aenigmatarchaeum_subterraneum
root;Archaea;Aenigmatarchaeota;Aenigmatarchaeia;CG10238-14;CG10238-14;CG10238-14;CG10238-14_sp002789635
root;Archaea;Aenigmatarchaeota;Aenigmatarchaeia;CG10238-14;CG10238-14;RBG-16-49-10;RBG-16-49-10_sp001784635
root;Archaea;Aenigmatarchaeota;Aenigmatarchaeia;CG10238-14;EX4484-224;EX4484-224;EX4484-224_sp002254545
root;Archaea;Aenigmatarchaeota;Aenigmatarchaeia;CG10238-14;SCSR01;SCSR01;SCSR01_sp004297575
root;Archaea;Aenigmatarchaeota;Aenigmatarchaeia;GW2011-AR5;GCA-2688965;GCA-2688965;GCA-2688965_sp002688965
root;Archaea;Aenigmatarchaeota;Aenigmatarchaeia;GW2011-AR5;GW2011-AR5;GW2011-AR5;GW2011-AR5_sp000806115
root;Archaea;Aenigmatarchaeota;Aenigmatarchaeia;GW2011-AR5;GW2011-AR5;GW2011-AR5;GW2011-AR5_sp10154u
root;Archaea;Aenigmatarchaeota;Aenigmatarchaeia;QMZP01;QMZP01;QMZP01;QMZP01_sp003663225
root;Archaea;Aenigmatarchaeota;Aenigmatarchaeia;QMZP01;QMZP01;QMZY01;QMZY01_sp003663415

I'd like to use this tool so any help is greatly appreciated
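
Going by the traceback, the parser splits each row on a tab character and expects two or three fields, so "got 1" suggests that at least one row has no tab in it (for example a space-separated or empty line). A minimal sketch for locating such rows, assuming only what the traceback shows about the expected format; `find_bad_rows` is a hypothetical helper, not a flextaxd function:

```python
from typing import Iterable, List, Tuple


def find_bad_rows(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line_number, text) for rows without 2 or 3 tab-separated fields."""
    bad = []
    for number, row in enumerate(lines, start=1):
        # Same split as in the traceback: strip, then split on tabs.
        fields = row.strip().split("\t")
        if len(fields) not in (2, 3):
            bad.append((number, row.rstrip("\n")))
    return bad
```

Calling it as `find_bad_rows(open("g2id.txt"))` would list the offending line numbers; note that a blank line at the end of the file is also reported, since it yields a single empty field.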

Use GTDB genomes (representatives or all) instead of NCBI

GTDB supplies several genome sets for its databases on its download page.

Implement a function that can alternatively use these genome sets instead of downloading genomes from NCBI:

https://data.ace.uq.edu.au/public/gtdb/data/releases/latest/genomic_files_reps/
https://data.ace.uq.edu.au/public/gtdb/data/releases/latest/genomic_files_all/

Self linked parents

Some parts of the database contain identical node names at several levels in the tree:

GB_GCA_003697105.1      d__Bacteria;p__4572-55;c__4572-55;o__J002;f__J002;g__J002;s__J002 sp003697105
GB_GCA_002084765.1      d__Bacteria;p__4572-55;c__4572-55;o__4572-55;f__4572-55;g__4572-55;s__4572-55

What is the best way of solving this issue?
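
To get a feel for how often this happens, the lineages can be scanned for names that recur at more than one rank. A minimal sketch, assuming the `rank__name` field format shown in the two examples; `repeated_names` is a hypothetical helper:

```python
from typing import Dict, List


def repeated_names(lineage: str) -> Dict[str, List[str]]:
    """Map each name that occurs at several ranks to the rank prefixes it occurs at."""
    seen: Dict[str, List[str]] = {}
    for field in lineage.split(";"):
        # "p__4572-55" -> rank "p", name "4572-55"
        rank, _, name = field.partition("__")
        seen.setdefault(name, []).append(rank)
    return {name: ranks for name, ranks in seen.items() if len(ranks) > 1}
```

One possible way around self-linked parents would be to key nodes on the rank prefix together with the name rather than on the name alone, though whether that fits the database design is a separate question.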

[Question] Does the taxonomy exported from flextaxd only contain species with genomes available in the refseq folder? Can I make use of library.fna downloaded via kraken2-build?

Hi Flextaxd team,

Great tool! I am attempting to build a kraken2 database with archaea and bacteria from GTDB and viral, plasmid, UniVec_Core from NCBI.
I have downloaded library.fna via the kraken2 command: kraken2-build --download-library viral --db $DBNAME, and I am wondering if there is a way to make use of these genomes instead of downloading them again with ncbi-genome-download?

Is it possible to merge the taxonomy file only without building a flextaxd database?

I am following the WGS walkthrough and encountered this line:

As the nucl_gb accession file only contain chr/cont/scaffold - id to taxid, the script must match annotations to a local file (hence the nessesity to download genomes first)

It was not very clear to me what this indicates.
If I follow the example (only download human, 4 bacterial genera, archaea), and export taxonomy by:
flextaxd -db databases/NCBI_GTDB_merge.db -o taxonomy --dbprogram kraken2 --dump

Do I receive a complete taxonomy, including everything other than archaea/bacteria/human? And can I use this taxonomy directly in combination with the downloaded library.fna with the kraken2 interface to build a database?

Thank you very much! Please let me know if my question description is unclear.

genomeid2taxid of local files does not allow names to be without .fa/.fasta

flextaxd/flextaxd/modules/ProcessDirectory.py, line 127 in bb9fc16:

genome_name = fname.split(".fna")[0] ## strip .fna from name
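
The referenced line only strips `.fna`. A minimal sketch of a more tolerant variant, assuming a small set of common FASTA suffixes with optional gzip compression; both the suffix list and the function name are assumptions for illustration, not the project's code:

```python
# Assumed list of FASTA suffixes; extend as needed.
GENOME_SUFFIXES = (".fna", ".fa", ".fasta", ".fas")


def strip_genome_extension(fname: str) -> str:
    """Drop an optional .gz and one known FASTA suffix from a file name."""
    name = fname[:-3] if fname.endswith(".gz") else fname
    for suffix in GENOME_SUFFIXES:
        if name.endswith(suffix):
            return name[: -len(suffix)]
    return name  # no known suffix: use the name unchanged
```

Unlike `split(".fna")[0]`, this only removes a suffix at the end of the name, so a file without any extension is returned as-is.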

Additional advice/walkthrough

Thanks for this superb tool. I can see many uses.

I would love a walkthrough for producing an NCBI taxonomy with bacteria and archaea replaced with the GTDB hierarchy. In my case I only need a kraken/ganon database for human, bacteria, viruses, fungi and

I did the following:

1. Download NCBI taxdump files, seq2taxid and viral/fungal RefSeq files (with https://github.com/kblin/ncbi-genome-download/)
2. Built database with flextaxd pointing to relevant files (successful)
3. Built separate GTDB archaeal and bacterial databases
4. Ran command flextaxd --database NCBI_taxonomy_with_genomes.db --mod_database GTDB_archaea.db --parent "Archaea"

I then planned to run the same command for bacteria, and produce the required databases (pooling the GTDB sequence data and RefSeq data).

However, step 4 failed:

2020-10-09 17:26:24,244 ModifyTree [INFO ]  Validate modified database!
2020-10-09 17:26:24,246 DatabaseConnection [INFO ]  Get all database nodes
2020-10-09 17:26:37,848 DatabaseConnection [INFO ]  Get all database edges
2020-10-09 17:26:49,064 DatabaseConnection [INFO ]  Get all children from root node

2020-10-09 17:49:58,667 DatabaseConnection [INFO ]  Get tree edges from children
2020-10-09 17:50:51,665 DatabaseConnection [INFO ]  Get nodes from tree edges
2020-10-09 17:50:56,048 DatabaseConnection [INFO ]  Validate parents
Traceback (most recent call last):
  File "/rds/general/user/ajm3018/home/anaconda3/envs/flextaxd/bin/flextaxd", line 10, in <module>
    sys.exit(main())
  File "/rds/general/user/ajm3018/home/anaconda3/envs/flextaxd/lib/python3.8/site-packages/flextaxd/custom_taxonomy_databases.py", line 262, in main
    modify_obj.update_database()
  File "/rds/general/user/ajm3018/home/anaconda3/envs/flextaxd/lib/python3.8/site-packages/flextaxd/modules/ModifyTree.py", line 434, in update_database
    self.taxonomydb.validate_tree()
  File "/rds/general/user/ajm3018/home/anaconda3/envs/flextaxd/lib/python3.8/site-packages/flextaxd/modules/database/DatabaseConnection.py", line 219, in validate_tree
    self.check_parent(n)
  File "/rds/general/user/ajm3018/home/anaconda3/envs/flextaxd/lib/python3.8/site-packages/flextaxd/modules/database/DatabaseConnection.py", line 271, in check_parent
    raise TreeError("Node: {node} has more than one parent!".format(node=name))
modules.database.DatabaseConnection.TreeError: 'Node: 2266 has more than one parent!'

Checking out the database, 2266 is Thermoproteales, and the two parents are 183924 "Thermoprotei" and 3000005 "Thermoproteia." 183924 is the NCBI taxonomy and 3000005 the one that's been added.

Any thoughts as to what might have gone wrong? And am I on the right lines?

Thanks.

Update: in reading through the logs I see that the database update program is expecting the taxid for the merged node to match. The taxid for archaea is identified in the NCBI database, and then the same taxid sought in the GTDB database. These of course do not align! This is a clue, at least, that I'm doing something wrong, though it doesn't tell me exactly what!

I've tried recreating the archaeal database with an offset of 3000000. I still hit the same problem.

Using flextaxd to modify the taxonomy without genomes

Hello!

I had rather hoped to use flextaxd to modify the NCBI taxonomy without providing it with any genomes. However, if I understand it correctly, this is not possible.

What I would like to do is

load the NCBI taxonomy only (no genome data)
add some additional species and strains (no genome data)
dump out the database in NCBI format

Is this at all possible?
Will it ever be possible?

Thanks
Mick

flextaxd-create doesn't find custom genomes.

Thank you for your cool tool.

I would like to create a custom kraken2 database based on the GTDB database.
I successfully read in the taxonomy of my 14 genomes in QIIME format.


GUT_GENOME000180        d__Bacteria;p__Firmicutes_A;c__Clostridia;o__Lachnospirales;f__Lachnospiraceae;g__Lachnospira;s__Lachnospira rogosae
GUT_GENOME140299        d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Bacteroidales;f__Bacteroidaceae;g__Prevotella;s__Prevotella copri
GUT_GENOME096175        d__Bacteria;p__Firmicutes_C;c__Negativicutes;o__Acidaminococcales;f__Acidaminococcaceae;g__Phascolarctobacterium_A;s__Phascolarctobacterium_A succinatutens
GUT_GENOME096166        d__Bacteria;p__Desulfobacterota_A;c__Desulfovibrionia;o__Desulfovibrionales;f__Desulfovibrionaceae;g__Bilophila;s__Bilophila wadsworthia

I successfully created the dmp files.

However when I want to produce a Kraken database my genomes are not found:

2020-10-09 13:58:54,619 CreateKrakenDatabase [INFO ]  Number of genomes annotated in database 14
2020-10-09 13:58:54,619 CreateKrakenDatabase [INFO ]  Process genome path (genomes_test)
2020-10-09 13:58:54,623 CreateKrakenDatabase [INFO ]  Processing files; create kraken seq.map
2020-10-09 13:58:55,468 CreateKrakenDatabase [INFO ]  #Warning 14 genomes in the input folder could not be matched to any database entry!
2020-10-09 13:58:55,469 CreateKrakenDatabase [INFO ]  Number of genomes succesfully added to the kraken2 database: 0
2020-10-09 13:58:55,470 CreateKrakenDatabase [INFO ]  Downloaded genomes 0
2020-10-09 13:58:55,471 create_databases [INFO ]  Genome folder preprocessing completed!

Commands executed:



flextaxd --taxonomy_file taxonomy/Subspecies_taxonomy.tsv --taxonomy_type QIIME --force --verbose --database taxonomy/flextaxd/flextaxd.ftd --logs log/flextaxd/  


 flextaxd --dump --outdir taxonomy/flextaxd/ --database taxonomy/flextaxd/flextaxd.ftd --verbose  --logs log/flextaxd/ 


flextaxd-create --db_name /abspath/kraken_db --skip_download --database taxonomy/flextaxd/flextaxd.ftd -o taxonomy/flextaxd  --genomes_path genomes_test -p 3  --create_db --dbprogram kraken2 --log log/kraken --verbose &> log/kraken/build_krken_db.log

My genomes are in genomes_test/:

GUT_GENOME000180.fasta.gz  GUT_GENOME019375.fasta.gz  GUT_GENOME096175.fasta.gz

I tried to add a `--genomeid2taxid` file during the database creation but it didn't change anything.

GUT_GENOME000180 Lachnospira rogosae
GUT_GENOME140299 Prevotella copri

Bug on replace

A bug when replacing a node in an annotated database.

Causes problems if a node with a different path between the new and old database is annotated in the "source" database before merge.

DatabaseConnection [INFO ]  Validate parents
Traceback (most recent call last):
  File "miniconda3/envs/flextaxd/bin/flextaxd", line 33, in <module>
    sys.exit(load_entry_point('flextaxd==0.3.5', 'console_scripts', 'flextaxd')())
  File "miniconda3/envs/flextaxd/lib/python3.8/site-packages/flextaxd-0.3.5-py3.8.egg/flextaxd/custom_taxonomy_databases.py", line 275, in main
    modify_obj.update_database()
  File "miniconda3/envs/flextaxd/lib/python3.8/site-packages/flextaxd-0.3.5-py3.8.egg/flextaxd/modules/ModifyTree.py", line 451, in update_database
    self.taxonomydb.validate_tree()
  File "miniconda3/envs/flextaxd/lib/python3.8/site-packages/flextaxd-0.3.5-py3.8.egg/flextaxd/modules/database/DatabaseConnection.py", line 221, in validate_tree
    failed_nodes = [self.nodes[x[0]] for x in list(res)]
  File "miniconda3/envs/flextaxd/lib/python3.8/site-packages/flextaxd-0.3.5-py3.8.egg/flextaxd/modules/database/DatabaseConnection.py", line 221, in <listcomp>
    failed_nodes = [self.nodes[x[0]] for x in list(res)]
KeyError: 183924

flextaxd strain rank?

Hello,

I have strain names in addition to species-to-domain level names, and I want to create NCBI-formatted nodes and names dmp files.

Is there a way to include strain names in the taxonomy database? I see that the ranks go from species to domain, but is there a way to allow for strain level too, or is this not something that NCBI uses in its files?

Thank you.

Flextaxd creates empty kraken db

I am trying to build a custom kraken db, but flextaxd creates an empty kraken db and does not even fail.

I have the latest flextaxd 0.4.2 and kraken 2.1.2.

Could you help me to solve this issue?

2021-11-24 08:24:45,172 create_databases [INFO ]  Loading module: CreateKrakenDatabase
2021-11-24 08:24:45,698 create_databases [INFO ]  Get genomes from input directory!
2021-11-24 08:24:45,700 DatabaseConnection [DEBUG]  Connecting to uhgg/flextaxd/flextaxd.ftd
2021-11-24 08:24:45,708 DatabaseConnection [INFO ]  uhgg/flextaxd/flextaxd.ftd opened successfully.
2021-11-24 08:24:45,709 DatabaseConnection [DEBUG]  cursor created.
2021-11-24 08:24:45,710 DatabaseConnection [DEBUG]  Load DatabaseFunctions
2021-11-24 08:24:45,726 DatabaseConnection [DEBUG]  SELECT id,name FROM nodes
2021-11-24 08:24:45,739 CreateKrakenDatabase [INFO ]  /srv/beegfs/scratch/users/k/kiesers/Kraken/uhgg
2021-11-24 08:24:45,740 create_databases [INFO ]  --- process finished in 0 minutes 0.8386135101318359 seconds---

2021-11-24 08:24:45,742 CreateKrakenDatabase [INFO ]  Create library directory
2021-11-24 08:24:45,744 CreateKrakenDatabase [INFO ]  Processing files; create kraken seq.map
2021-11-24 08:27:27,430 CreateKrakenDatabase [INFO ]  Number of genomes succesfully added to the kraken2 database: 4616
2021-11-24 08:27:27,438 create_databases [INFO ]  Genome folder preprocessing completed!
2021-11-24 08:27:27,439 create_databases [INFO ]  --- process finished in 2 minutes 42.5374813079834 seconds---

2021-11-24 08:27:27,440 create_databases [INFO ]  Create database
2021-11-24 08:27:27,442 CreateKrakenDatabase [INFO ]  mkdir -p /srv/beegfs/scratch/users/k/kiesers/Kraken/uhgg/taxonomy
2021-11-24 08:27:27,456 CreateKrakenDatabase [INFO ]  cp uhgg/flextaxd/*.dmp /srv/beegfs/scratch/users/k/kiesers/Kraken/uhgg/taxonomy
2021-11-24 08:27:27,727 CreateKrakenDatabase [INFO ]  cp uhgg/flextaxd/*.map /srv/beegfs/scratch/users/k/kiesers/Kraken/uhgg
2021-11-24 08:27:27,728 CreateKrakenDatabase [INFO ]  kraken2-build --build --db /srv/beegfs/scratch/users/k/kiesers/Kraken/uhgg  --threads 8
Creating sequence ID to taxonomy ID map (step 1)...
Sequence ID to taxonomy ID map complete. [0.489s]
Estimating required capacity (step 2)...
Estimated hash table requirement: 16802807220 bytes
Capacity estimation complete. [55.311s]
Building database files (step 3)...
Taxonomy parsed and converted.
CHT created with 13 bits reserved for taxid.
Completed processing of 0 sequences, 0 bp
Writing data to disk...  complete.
Database files completed. [1m6.379s]
Database construction complete. [Total: 2m2.357s]
2021-11-24 08:29:33,807 CreateKrakenDatabase [INFO ]  Create inspect file!
2021-11-24 08:29:51,538 CreateKrakenDatabase [INFO ]  kraken2 database created
2021-11-24 08:29:51,539 create_databases [INFO ]  --- Time summary  5 minutes 6.63733434677124 seconds---
2021-11-24 08:29:51,539 create_databases [DEBUG]  1637738991.53929

Adding a custom seqid2taxid file to the NCBI accession2taxid causes error

Dear flextaxd team,

Thanks for an awesome tool! I use it to combine NCBI with GTDB taxonomy for kraken2 pathogen classification. I have recently been trying to incorporate EuPathDB (http://ccb.jhu.edu/data/eupathDB/) to get clean eukaryotic genomes into the database. I tried just adding the genomes to my NCBI genomes path, hoping that all would be annotated in the database. However, it seems that there have been some changes to the fasta headers that cause problems with this approach: not all of them are recognized, and the unrecognized ones are printed to the .flextaxdNotAdded file. To circumvent this I modified the seqid2taxid file (downloaded with EuPathDB) to look like the accession2taxid files from NCBI (modified file attached: reduced_seqid2taxid_duplicate_no_univec.txt.gz) and concatenated this file to the other NCBI accession2taxid files, but somehow flextaxd recognizes that this is not the original one and throws an error message:

Traceback (most recent call last):
  File "/space/sharedbin_ubuntu_14_04/software/flextaxd/0.4.2-foss-2020b-Python-3.8.6/bin/flextaxd", line 8, in <module>
    sys.exit(main())
  File "/space/sharedbin_ubuntu_14_04/software/flextaxd/0.4.2-foss-2020b-Python-3.8.6/lib/python3.8/site-packages/flextaxd/custom_taxonomy_databases.py", line 279, in main
    read_obj.parse_genomeid2taxid(args.genomes_path,args.genomeid2taxid)
  File "/space/sharedbin_ubuntu_14_04/software/flextaxd/0.4.2-foss-2020b-Python-3.8.6/lib/python3.8/site-packages/flextaxd/modules/ReadTaxonomyNCBI.py", line 100, in parse_genomeid2taxid
    raise TypeError("The supplied annotation file does not seem to be the ncbi nucl_gb.accession2taxid.gz")
TypeError: The supplied annotation file does not seem to be the ncbi nucl_gb.accession2taxid.gz

Code to build the NCBI database:

flextaxd -db databases/NCBI_GTDB_merge.db -tf source/ncbi/nodes.dmp -tt NCBI --genomeid2taxid source/ncbi/complete.accession2taxid_w_eupath_univec.gz --verbose --logs NCBI_GTDB_merge_log --genomes_path genomes/refseq/

Do you have suggestions on how to solve this?

Kind regards,
Morten

Issues with cleaning and merging NCBI+GTDB database

Hello,
I was updating my GTDB database to release 202 and am hitting a few snags. The errors start at section 2.3, Clean the database. They do not show up in the logs, which I have attached to this post (FlexTaxD-Jul-14-2021.log). I have also attached a separate file with all the errors that do crop up (flextaxD_Error.txt).
The taxonomy file generated when exporting to NCBI format (section 4) is just an empty file.

Any suggestion on working through this is extremely appreciated!

conda installation fails

In Ubuntu 16.04 and conda 4.8.0 the installation fails with the following error:

host:~$ conda install flextaxd
Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Collecting package metadata (repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.

PackagesNotFoundError: The following packages are not available from current channels:

  - flextaxd

Current channels:

  - https://repo.anaconda.com/pkgs/main/linux-64
  - https://repo.anaconda.com/pkgs/main/noarch
  - https://repo.anaconda.com/pkgs/r/linux-64
  - https://repo.anaconda.com/pkgs/r/noarch

To search for alternate channels that may provide the conda package you're
looking for, navigate to

    https://anaconda.org

and use the search bar at the top of the page.

The installation with conda install -c bioconda flextaxd worked so far. Maybe, it is just a documentation error. Or is it intentional to get it installed using conda without bioconda?

Error message when no genomes are found

This error message should say that flextaxd didn't find any genomes and cannot create the DB.

2021-02-17 18:02:37,639 create_databases [INFO ]  Genome annotations with no matching source: 1699
2021-02-17 18:02:37,674 create_databases [INFO ]  Loading module: CreateKrakenDatabase
2021-02-17 18:02:38,896 DatabaseConnection [DEBUG]  Connecting to human_species/flextaxd/flextaxd.ftd
2021-02-17 18:02:38,901 DatabaseConnection [INFO ]  human_species/flextaxd/flextaxd.ftd opened successfully.
2021-02-17 18:02:38,902 DatabaseConnection [DEBUG]  cursor created.
2021-02-17 18:02:38,902 DatabaseConnection [DEBUG]  Load DatabaseFunctions
2021-02-17 18:02:38,908 DatabaseConnection [DEBUG]  SELECT id,name FROM nodes
2021-02-17 18:02:38,915 CreateKrakenDatabase [INFO ]  /srv/beegfs/scratch/users/k/kiesers/Subspecies/Poppunk/human_species
2021-02-17 18:02:38,916 create_databases [INFO ]  --- process finished in 2 minutes 31.115368843078613 seconds---

Traceback (most recent call last):
  File "/home/kiesers/scratch/Atlas/databases/conda_envs/d80bdd19/bin/flextaxd-create", line 10, in <module>
    sys.exit(main())
  File "/home/kiesers/scratch/Atlas/databases/conda_envs/d80bdd19/lib/python3.8/site-packages/flextaxd/create_databases.py", line 225, in main
    classifierDB.create_library_from_files()
  File "/home/kiesers/scratch/Atlas/databases/conda_envs/d80bdd19/lib/python3.8/site-packages/flextaxd/modules/CreateKrakenDatabase.py", line 203, in create_library_from_files
    self.genome_names_split = self._split(self.genome_names,self.processes)
AttributeError: 'CreateKrakenDatabase' object has no attribute 'genome_names'

MP style output

Implement an MPI style output format, planned for the next release, v0.4.3

Importing trees with unspecified rank levels

How should FlexTaxD deal with input trees that have no clear rank order or given levels? QIIME includes p__, s__ etc., but a standard CanSNPer tree (for example) does not. Should FlexTaxD assume that the top X levels follow a certain rank structure, or should it require "--skip_rank" or a given file with "translated" ranks? Or should the default behaviour be to create a number series for the levels?

A;B;C;D
vs
d__A,c__B,o__C,f__D
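
As an illustration of the last option (a number series as the fallback), here is a minimal sketch. It assumes that rank-prefixed fields look like `x__Name`, that both `;` and `,` can act as separators as in the two examples, and that unprefixed fields simply get `level_N`; the prefix table and the function name are assumptions, not FlexTaxD behaviour:

```python
import re
from typing import List, Tuple

# Assumed mapping of QIIME-style prefixes to rank names.
PREFIX_TO_RANK = {
    "d": "domain", "k": "kingdom", "p": "phylum", "c": "class",
    "o": "order", "f": "family", "g": "genus", "s": "species",
}


def assign_ranks(lineage: str) -> List[Tuple[str, str]]:
    """Return (name, rank) pairs; unprefixed fields get a numbered level."""
    fields = [f for f in re.split(r"[;,]", lineage) if f]
    result = []
    for level, field in enumerate(fields, start=1):
        match = re.match(r"^([a-z])__(.*)$", field)
        if match and match.group(1) in PREFIX_TO_RANK:
            result.append((match.group(2), PREFIX_TO_RANK[match.group(1)]))
        else:
            # No recognisable prefix: fall back to the position in the path.
            result.append((field, f"level_{level}"))
    return result
```

With this fallback, `A;B;C;D` becomes levels 1 to 4, while the prefixed form keeps its named ranks even when intermediate ranks are missing.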

Duplicated parents?

Actually, I'm still struggling. Archaea worked fine! However, when I added bacteria I ended up with validation errors:

flextaxd --database NCBI_taxonomy_with_genomes_and_GTDB.db --mod_database ../GTDB_bacteria.db --parent "Bacteria" --replace --verbose --debug
...
2020-10-12 18:45:18,286 DatabaseConnection [INFO ]  Validate parents
Traceback (most recent call last):
...
  File ".../flextaxd/lib/python3.8/site-packages/flextaxd/modules/database/DatabaseConnection.py", line 271, in check_parent
    raise TreeError("Node: {node} has more than one parent!".format(node=name))
modules.database.DatabaseConnection.TreeError: 'Node: 40929 has more than one parent!'

In fact there are many with double parents:

select name, child, count(parent) from tree inner join nodes on tree.child = nodes.id group by child having count(parent) > 1;

name          child    count(parent)
------------  -------  -------------
Planococcus   40929    2
Buchnera      46073    2
Bacillus      55087    2
Gordonia      79255    2
Centipeda     82283    2
Morganella    108061   2
Edwardsiella  132406   2
Schwartzia    164984   2
Bosea         169215   2
Leptonema     177878   2
Proteus       210425   2
Paracoccus    249411   2
Eremococcus   308526   2
Lamprocystis  424207   2
Yersinia      444888   2
Rothia        508215   2
Spirulina     551299   2
Rhodobium     869314   2
Lawsonia      1091138  2
Coxiella      1260513  2
Thermosipho   1445919  2
Rhodococcus   1661425  2
Syntrophus    1671858  2
Nodularia     1830174  2
Leptothrix    1907117  2
Bogoriella    1929901  2
Rivularia     2023769  2
Halofilum     2045120  2
Labrys        2066135  2
Sulcia        2716471  2
Longispora    2759766  2
Xenococcus    2774086  2

Any ideas what I might be doing wrong?

Originally posted by @andrewjmc in #17 (comment)
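
To see which parents are in conflict, the query above can be extended to list the parent ids for every affected child. A minimal sketch using Python's `sqlite3`, assuming only the `tree(parent, child)` and `nodes(id, name)` columns that appear in that query; the function name is hypothetical:

```python
import sqlite3
from typing import List, Tuple

QUERY = """
SELECT nodes.name, tree.child, GROUP_CONCAT(tree.parent) AS parents
FROM tree
INNER JOIN nodes ON tree.child = nodes.id
GROUP BY tree.child
HAVING COUNT(tree.parent) > 1
ORDER BY tree.child
"""


def multi_parent_nodes(connection: sqlite3.Connection) -> List[Tuple[str, int, List[int]]]:
    """Return (name, child_id, sorted parent ids) for nodes with several parents."""
    rows = connection.execute(QUERY).fetchall()
    return [
        (name, child, sorted(int(p) for p in parents.split(",")))
        for name, child, parents in rows
    ]
```

Opening the merged database with `sqlite3.connect("NCBI_taxonomy_with_genomes_and_GTDB.db")` and passing the connection in would show, for each genus in the table above, which two parents it is linked to.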

Building Kraken2 compatible GTDB only database

Hello,
Thank you so much for this tool! I have been using it most of today to merge the taxonomies of archaea and bacteria from GTDB and protozoa, viruses, fungi and chlorophyta genomes from NCBI into one database. However, I seem to have issues:

Merging these files, since I have no archaeal or bacterial sequences from NCBI:
Command:
flextaxd -db db_file_FTD/NCBI_GTDB_merge.db -md db_file_FTD/ar122_gtdb.db --parent Archaea --replace --verbose --logs NCBI_GTDB_merge_log/ar122_rep.log
Error: only the final line
modules.database.DatabaseConnection.NameError: 'Name not found in the database! Archaea'

Does this mean that I have to download all the genomes for archaea and bacteria from NCBI too before I merge it with GTDB?

I also have another issue when creating a kraken2-compatible 16S database from GTDB sequences.
I only have 2 files for this data:
a) A single fasta file with 16S sequences from archaeal and bacterial genomes, as shown below:
>Archaea;Crenarchaeota;Korarchaeia;Korarchaeales;Korarchaeaceae;Korarchaeum;Korarchaeum_cryptofilum(RS_GCF_000019605.1)
GGTTGATCCTGCCGGAGGGAACCCCTATCGGGCTCGCA ……..

>Archaea;Crenarchaeota;Thermoprotei;Desulfurococcales;Desulfurococcaceae;Staphylothermus;Staphylothermus_hellenicus(RS_GCF_000092465.1)
AGGTGATCCAGCCGCAGGTTCCCCTACGGCTACCTTGTT …….

b) Taxonomy file, format as shown below:
Archaea;Crenarchaeota;Korarchaeia;Korarchaeales;Korarchaeaceae;Korarchaeum;Korarchaeum_cryptofilum(RS_GCF_000019605.1)
Archaea;Crenarchaeota;Thermoprotei;Desulfurococcales;Desulfurococcaceae;Staphylothermus;Staphylothermus_hellenicus(RS_GCF_000092465.1)
Archaea;Halobacterota;Methanosarcinia;Methanosarcinales;Methanosarcinaceae;Methanosarcina;Methanosarcina_mazei(RS_GCF_000970205.1)
Archaea;Euryarchaeota;Methanobacteria;Methanobacteriales;Methanobacteriaceae;Methanobacterium;Methanobacterium_formicicum_A(RS_GCF_000302455.1)
Archaea;Crenarchaeota;Thermoprotei;Thermofilales;Thermofilaceae;Thermofilum_A;Thermofilum_A_uzonense(RS_GCF_000993805.1)
Archaea;Halobacterota;Halobacteria;Halobacteriales;Haloferacaceae;Halonotius;(GB_GCA_000416025.1)
Archaea;Crenarchaeota;Nitrososphaeria;Nitrososphaerales;Nitrosopumilaceae;Nitrosopumilus;(GB_GCA_000484975.1)
Archaea;Crenarchaeota;Thermoprotei;Desulfurococcales;Ignicoccaceae;Ignicoccus_A;Ignicoccus_A_hospitalis(UBA8868)
Archaea;Halobacterota;Methanomicrobia;Methanomicrobiales;Methanocullaceae;Methanoculleus;Methanoculleus_sp2(UBA8928)
I was wondering if it was possible to use these two files to generate a kraken2 compatible 16s database?
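
One possible first step is to split each header into a genome identifier and a lineage, which could then be written out as a tab-separated annotation. A minimal sketch, assuming every header ends with the genome identifier in parentheses as in the examples above; `parse_header` is a hypothetical helper, and whether the resulting table is accepted by a given FlexTaxD taxonomy type has not been checked here:

```python
import re
from typing import List, Tuple

# Lineage, then the genome identifier in trailing parentheses.
HEADER = re.compile(r"^>?(?P<lineage>.*?)\((?P<genome>[^()]+)\)\s*$")


def parse_header(header: str) -> Tuple[str, List[str]]:
    """Split a 16S header into (genome_id, list of lineage names)."""
    match = HEADER.match(header.strip())
    if match is None:
        raise ValueError(f"Unexpected header format: {header!r}")
    # Empty fields (e.g. a missing species name) are dropped.
    names = [n for n in match.group("lineage").split(";") if n]
    return match.group("genome"), names
```

For entries such as `Halonotius;(GB_GCA_000416025.1)` the species field is empty, so the lineage stops at genus; how such entries should be named in the final taxonomy is a choice the sketch leaves open.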

pip install fails in Ubuntu 16.04

In Ubuntu 16.04 and python 3.5.2 the installation using pip fails with the following error:

host:~$ pip3 install -U flextaxd
ERROR: Could not find a version that satisfies the requirement flextaxd (from versions: none)
ERROR: No matching distribution found for flextaxd

Empty folder are downloaded

When downloading genomes that do not exist in the genome reference folder, empty directories for previous versions of the genome are downloaded.

Example:

Annotation:
    id         genome
    3702    GCF_000001735.4
Result:
    GCF_000001735.1_TAIR8/<empty>
    GCF_000001735.2_TAIR9/<empty>
    GCF_000001735.3_TAIR10/<empty>
    GCF_000001735.4_TAIR10.1/GCF_000001735.4_TAIR10.1_genomic.fna.gz
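
Until the download step is fixed, the empty directories can be cleaned up afterwards. A minimal sketch using only the standard library; `remove_empty_dirs` is a hypothetical helper and the genome folder path is whatever was passed as the download target:

```python
import os
from typing import List


def remove_empty_dirs(root: str) -> List[str]:
    """Delete empty directories below `root` (never `root` itself); return their paths."""
    removed = []
    # Walk bottom-up so that directories emptied by earlier removals are caught too.
    for dirpath, _dirnames, _filenames in os.walk(root, topdown=False):
        if dirpath != root and not os.listdir(dirpath):
            os.rmdir(dirpath)
            removed.append(dirpath)
    return removed
```

Directories that still contain a genome file, such as `GCF_000001735.4_TAIR10.1/` in the example, are left untouched.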

Citation

Hi,

If I were to use this tool, do you have a publication for it that I should cite?

Use of custom genomes retrieved from another databases apart from NCBI and GTDB in FlexTaxD

Hello,

First of all, I would like to congratulate you on the release of FlexTaxD, which will facilitate taxonomy exchange.

I have a quick question to find out whether your tool could help me in my workflow.

I performed a metagenomic study and classified the taxonomy with Kraken2 + Bracken, Metaphlan3 and Kaiju.

For Metaphlan3, a Newick-format tree is available from the Metaphlan developers.

However, in Kraken2 I classified the taxonomy based on UHGG v2.0, and in Kaiju based on the proGenomes database. Neither database's website provides a tree.

Therefore, based on your publication in Bioinformatics and on https://doi.org/10.1371/journal.pcbi.1009947, it seems to me that I could give FlexTaxD custom sets of genomes retrieved from UHGG v2.0 and proGenomes to construct a taxonomy in Newick format, and then find some other software to build a Newick tree that I can use with the R phyloseq package to study UniFrac distances as a beta-diversity analysis. Would that be possible?

Sorry if it's not an interesting question; I'm a fledgling in bioinformatics.

Thanks in advance for your comments,

Magi.

Create completely custom Kraken2 database

Many thanks for creating flextaxd. I was looking in the documentation for how to build a kraken2 database from a completely custom taxonomy, and I couldn't figure it out.

Let's say, for example, that I have two custom genomes, genome1.fasta and genome2.fasta.

My taxonomy.tsv looks like this:

genome1.fasta   d__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales;f__Streptococcaceae;g__Streptococcus;s__Streptococcus pneumoniae
genome2.fasta   d__Bacteria;p__Firmicutes;c__Bacilli;o__Staphylococcales;f__Staphylococcaceae;g__Staphylococcus;s__Staphylococcus aureus

My annotation.tsv looks like this:

genome1.fasta   Streptococcus pneumoniae
genome2.fasta   Staphylococcus aureus

I can build the flextaxd DB and dump out kraken2 format taxonomy like this:

flextaxd --database test.db --taxonomy_file taxonomy.tsv --taxonomy_type QIIME --genomeid2taxid annotation.tsv --genomes_path genomes

flextaxd --database test.db --dbprogram kraken2 -o taxonomy --dump --genomes_path genomes

BUT this just dumps out the kraken2 taxonomy. As far as I can tell, I still need to manually edit the genome files to have the correct taxonomy ID in their headers, correct?

Am I missing something? Is this functionality already in flextaxd?

Thanks
Mick
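
For reference, Kraken2's manual describes sequence headers that carry an explicit taxonomy ID in the form kraken:taxid|XXX. If headers did have to be edited by hand, a minimal sketch could look like the one below. The function name add_kraken_taxid and the taxid 1001 are hypothetical; real IDs would come from the taxonomy dumped by FlexTaxD, and this thread does not establish whether flextaxd-create already performs this step.

```python
def add_kraken_taxid(fasta_lines, taxid):
    """Yield FASTA lines with '|kraken:taxid|<taxid>' appended to each sequence ID."""
    for line in fasta_lines:
        if line.startswith(">"):
            # Split the header into the sequence ID and the optional description.
            seq_id, _, description = line[1:].rstrip("\n").partition(" ")
            header = f">{seq_id}|kraken:taxid|{taxid}"
            if description:
                header += f" {description}"
            yield header + "\n"
        else:
            yield line  # sequence lines pass through unchanged


# Hypothetical example: 1001 stands in for a taxid taken from the dumped taxonomy.
example = list(add_kraken_taxid([">contig_1 genome1\n", "ACGT\n"], 1001))
```

Here example[0] is the rewritten header, with the taxid tag placed between the sequence ID and its description.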

Flextaxd - test database inspect file has no genome names

Dear Flextaxd team,
Thanks for a great tool!
I am running into issues again building the merged NCBI and GTDB databases and hope that you can help me.

I followed the 'Walkthrough merge NCBI and GTDB' for the most part, but used krakenuniq to download the NCBI genomes instead. What I wish to create is a database with the NCBI RefSeq human genome, NCBI RefSeq complete viruses, NCBI RefSeq (all) fungi and GTDB bacteria. To do this, I downloaded GTDB genomes from the GTDB repository and wrote a script to extract only the bacterial genomes (archaeal and bacterial genomes are mixed together in a confusing subfolder system). All NCBI genomes were downloaded with krakenuniq-download and come as .fna files, which I have gzipped.

krakenuniq-download --db $WD/krakenuniq_31db208/ --threads 30 refseq/fungi/Any refseq/viral refseq/vertebrate_mammalian/Chromosome/species_taxid=9606

Below I have attached the inspect file of the test database, the merged NCBI_GTDB database file and the log for the build of the test database.
inspect.txt
NCBI_GTDB_merge.zip
build_kraken_logsFlexTaxD-create-Dec-02-2021.log

I don't understand why the inspect file does not have genome names associated (as is the case in your example).

Kind regards,
Morten

Non-unique contig names for GTDB

I encountered errors like "[contig_179238] skipped - duplicated sequence identifier" during ganon database building.

Many GTDB reference sequences (https://data.ace.uq.edu.au/public/gtdb/data/releases/latest/genomic_files_reps/) contain contig names like this, and clearly the numbers sometimes clash.

Given that this issue will likely persist in GTDB (due to the use of MAGs, I assume), is it worth detecting such names during kraken/ganon database building and prepending GCA/GCF names when producing the .fna.gz files for the kraken/ganon build steps?
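
The prepending idea can be sketched roughly as follows. This is not FlexTaxD code: the helper names are hypothetical, and the sketch assumes that each genome file name contains its GCA/GCF accession (for example GCA_000416025.1_genomic.fna.gz).

```python
import gzip
import re
from pathlib import Path

# Matches assembly accessions such as GCA_000416025.1 or GCF_000019605.1.
ACCESSION_RE = re.compile(r"GC[AF]_\d+\.\d+")


def accession_from_filename(path):
    """Extract the GCA/GCF accession from a genome file name."""
    match = ACCESSION_RE.search(Path(path).name)
    if match is None:
        raise ValueError(f"no GCA/GCF accession in file name: {path}")
    return match.group(0)


def prefix_headers(lines, accession):
    """Yield FASTA lines with the accession prepended to each sequence ID."""
    for line in lines:
        # Skip headers that already carry the prefix, so reruns are harmless.
        if line.startswith(">") and not line.startswith(f">{accession}"):
            yield f">{accession}_{line[1:]}"
        else:
            yield line


def rewrite_genome(in_path, out_path):
    """Copy a gzipped FASTA file, making its contig names unique per assembly."""
    accession = accession_from_filename(in_path)
    with gzip.open(in_path, "rt") as src, gzip.open(out_path, "wt") as dst:
        dst.writelines(prefix_headers(src, accession))
```

With this scheme a header such as >contig_179238 in the file for GCA_000416025.1 becomes >GCA_000416025.1_contig_179238, so identical contig numbers in different assemblies no longer collide.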

Krakenuniq "Unknown option: skip-maps"

Dear David,

I encountered an error while trying to build a krakenuniq database. Is it an error that you have encountered before?

flextaxd-create -db databases/NCBI_GTDB_merge.db -o taxonomy_krakenuniq --genomes_path "/shared-nfs/MEN/silico_reads/gtdb_202/krakenuniq_gtdb_202_no_dust/library/" -p 30 --verbose --logs build_kraken_logs --create_db --db_name krakenuniqdb_test --dbprogram krakenuniq --test
2021-12-03 09:51:24,361 create_databases [INFO ]  FlexTaxD-create logging initiated!
2021-12-03 09:51:24,366 create_databases [INFO ]  Processing files; create kraken seq.map
2021-12-03 09:51:24,367 DatabaseConnection [INFO ]  databases/NCBI_GTDB_merge.db opened successfully.
2021-12-03 09:51:24,693 ProcessDirectory [INFO ]  Number of genomes annotated in database 265625
2021-12-03 09:51:24,693 ProcessDirectory [INFO ]  Process genome path (/shared-nfs/MEN/silico_reads/gtdb_202/krakenuniq_gtdb_202_no_dust/library/)
2021-12-03 09:51:25,310 ProcessDirectory [INFO ]  Processed 57311 genomes
2021-12-03 09:51:25,539 create_databases [INFO ]  Genome annotations with no matching source: 220070
2021-12-03 09:51:25,798 create_databases [INFO ]  Loading module: CreateKrakenDatabase
2021-12-03 09:51:25,812 create_databases [INFO ]  Get genomes from input directory!
2021-12-03 09:51:25,812 DatabaseConnection [INFO ]  databases/NCBI_GTDB_merge.db opened successfully.
2021-12-03 09:51:26,133 CreateKrakenDatabase [INFO ]  krakenuniqdb_test
2021-12-03 09:51:26,192 create_databases [INFO ]  --- process finished in 0 minutes 1.8346500396728516 seconds---

2021-12-03 09:51:26,192 CreateKrakenDatabase [INFO ]  Test use only 10 genomes
2021-12-03 09:51:26,195 CreateKrakenDatabase [INFO ]  Create library directory
2021-12-03 09:51:26,199 CreateKrakenDatabase [INFO ]  Processing files; create kraken seq.map
2021-12-03 09:51:27,360 CreateKrakenDatabase [INFO ]  Number of genomes succesfully added to the krakenuniq database: 10
2021-12-03 09:51:27,360 create_databases [INFO ]  Genome folder preprocessing completed!
2021-12-03 09:51:27,360 create_databases [INFO ]  --- process finished in 0 minutes 3.002981424331665 seconds---

2021-12-03 09:51:27,360 create_databases [INFO ]  Create database
2021-12-03 09:51:27,360 CreateKrakenDatabase [INFO ]  mkdir -p krakenuniqdb_test/taxonomy
2021-12-03 09:51:27,397 CreateKrakenDatabase [INFO ]  cp taxonomy_krakenuniq/*.dmp krakenuniqdb_test/taxonomy
2021-12-03 09:51:27,796 CreateKrakenDatabase [INFO ]  cp taxonomy_krakenuniq/*.map krakenuniqdb_test
2021-12-03 09:51:27,796 CreateKrakenDatabase [INFO ]  krakenuniq-build --build --db krakenuniqdb_test  --threads 30
Unknown option: skip-maps
Usage: krakenuniq-build [task option] [options]

Task options (exactly one can be selected -- default is build):
  --download-taxonomy        Download NCBI taxonomic information
  --download-library TYPE    Download partial library (TYPE = one of "refseq/bacteria", "refseq/archaea", "refseq/viral").
                             Use krakenuniq-download for more options.
  --add-to-library FILE      Add FILE to library
  --build                    Create DB from library (requires taxonomy d/l'ed and at
                             least one file in library)
  --rebuild                  Create DB from library like --build, but remove
                             existing non-library/taxonomy files before build
  --clean                    Remove unneeded files from a built database
  --shrink NEW_CT            Shrink an existing DB to have only NEW_CT k-mers
  --standard                 Download and create default database, which contains complete genomes
                             for archaea, bacteria and viruses from RefSeq, as well as viral strains
                             from NCBI. Specify --taxids-for-genomes and --taxids-for-sequences
                             separately, if desired.

  --help                     Print this message
  --version                  Print version information

Options:
  --db DBDIR                 Kraken DB directory (mandatory except for --help/--version)
  --threads #                Number of threads (def: 1)
  --new-db NAME              New Kraken DB name (shrink task only; mandatory
                             for shrink task)
  --kmer-len NUM             K-mer length in bp (build/shrink tasks only;
                             def: 31)
  --minimizer-len NUM        Minimizer length in bp (build/shrink tasks only;
                             def: 15)
  --jellyfish-hash-size STR  Pass a specific hash size argument to jellyfish
                             when building database (build task only)
  --jellyfish-bin STR        Use STR as Jellyfish 1 binary.
  --max-db-size SIZE         Shrink the DB before full build, making sure
                             database and index together use <= SIZE gigabytes
                             (build task only)
  --shrink-block-offset NUM  When shrinking, select the k-mer that is NUM
                             positions from the end of a block of k-mers
                             (default: 1)
  --work-on-disk             Perform most operations on disk rather than in
                             RAM (will slow down build in most cases)
  --taxids-for-genomes       Add taxonomy IDs (starting with 1 billion) for genomes.
                             Only works with 3-column seqid2taxid map with third
                             column being the name
  --taxids-for-sequences     Add taxonomy IDs for sequences, starting with 1 billion.
                             Can be useful to resolve classifications with multiple genomes
                             for one taxonomy ID.
  --min-contig-size NUM      Minimum contig size for inclusion in database.
                             Use with draft genomes to reduce contamination, e.g. with values between 1000 and 10000.
  --library-dir DIR          Use DIR for reference sequences instead of DBDIR/library.
  --taxonomy-dir DIR         Use DIR for taxonomy instead of DBDIR/taxonomy.

Experimental:
  --uid-database             Build a UID database (default no)
  --lca-database             Build a LCA database (default yes)
  --no-lca-database          Do not build a LCA database
  --lca-order DIR1           Impose a hierarchical order for setting LCAs.
  --lca-order DIR2           The directories must be specified relative to the libary directory
  ...                        (DBDIR/library). When setting the LCAs, k-mers from sequences in
                             DIR1 will be set first, and only unset k-mers will be set from
                             DIR2, etc, and final from the whole library.
                             Use this option when including low-confidence draft genomes,
                             e.g use --lca-order Complete_Genome --lca-order Chromosome to
                             prioritize more complete assemblies.
                             Keep in mind that this option takes considerably longer.
Incomplete database, clean aborted.
2021-12-03 09:51:28,228 CreateKrakenDatabase [INFO ]  krakenuniq database created
2021-12-03 09:51:28,228 create_databases [INFO ]  --- Time summary  0 minutes 3.871030569076538 seconds---

Referenced genomes remain even if node is updated (on replace)

When a node is replaced in an existing database, genomes that referenced the deleted nodes remain, as do genomes that may have been placed in the wrong node (in case they do not exist in the new database). Add a function to remove all annotations that no longer have an attached node in the database.
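
The requested cleanup amounts to deleting annotation rows whose node is gone. The sketch below illustrates that idea on an in-memory SQLite database. The table and column names (nodes.id, genomes.genome, genomes.id) are assumptions made for the example and may not match FlexTaxD's actual schema; remove_orphan_annotations is likewise a hypothetical name.

```python
import sqlite3


def remove_orphan_annotations(conn):
    """Delete genome annotations whose node no longer exists; return the count."""
    cursor = conn.execute(
        "DELETE FROM genomes WHERE id NOT IN (SELECT id FROM nodes)"
    )
    conn.commit()
    return cursor.rowcount


# Assumed toy schema: one annotation points at node 99, which does not exist.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE nodes (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE genomes (genome TEXT PRIMARY KEY, id INTEGER);
    INSERT INTO nodes VALUES (1, 'root'), (2, 'kept_node');
    INSERT INTO genomes VALUES ('GCF_000001735.4', 2), ('GCF_000019605.1', 99);
    """
)
removed = remove_orphan_annotations(conn)
remaining = conn.execute("SELECT genome, id FROM genomes").fetchall()
```

In this toy database only the annotation that points at the missing node 99 is deleted; the annotation attached to node 2 stays.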

Custom genome annotations with "-" sometimes get lost

A few genomes containing "-" in their names are lost during the kraken2 database build.

TypeError: 'int' object is not iterable

Hi.

Thank you for creating this excellent tool.

I have run the workflow according to the wiki to get GTDB database for ganon.
However, I am facing an error in the database clean step.

(flextaxd) [ide@tn2 data]$ flextaxd -db databases/NCBI_GTDB_merge.db --clean_database --verbose --log NCBI_GTDB_merge_log
2021-09-25 22:47:28,013 custom_taxonomy_databases [INFO ]  FlexTaxD logging initiated!
2021-09-25 22:47:28,019 ModifyTree [INFO ]  Modify Tree
2021-09-25 22:47:28,019 DatabaseConnection [INFO ]  databases/NCBI_GTDB_merge.db opened successfully.
2021-09-25 22:47:28,056 ModifyTree [INFO ]  Fetch annotated nodes
2021-09-25 22:47:28,097 ModifyTree [INFO ]  Annotated nodes: 10357
2021-09-25 22:47:28,097 ModifyTree [INFO ]  Get all links in database
2021-09-25 22:47:28,129 ModifyTree [INFO ]  Get all nodes in database
2021-09-25 22:47:28,167 ModifyTree [INFO ]  Retrieve all parents of annotated nodes
2021-09-25 22:47:28,644 ModifyTree [INFO ]  Parents added: 4321
2021-09-25 22:47:28,692 ModifyTree [INFO ]  Links to remove 3380
2021-09-25 22:47:28,692 ModifyTree [INFO ]  Nodes to remove 3380
2021-09-25 22:47:28,692 ModifyTree [INFO ]  Clean annotations related to removed nodes
2021-09-25 22:47:28,692 ModifyTree [INFO ]  Cleaning 3380 links
2021-09-25 22:47:28,692 DatabaseConnection [INFO ]  Fast clean
2021-09-25 22:47:29,763 DatabaseConnection [INFO ]  Deleting 3380 annotations!
2021-09-25 22:47:29,806 DatabaseConnection [INFO ]  Get all database nodes
2021-09-25 22:47:29,828 DatabaseConnection [INFO ]  Get all database edges
2021-09-25 22:47:29,848 DatabaseConnection [INFO ]  Get all children from root node
2021-09-25 22:47:31,573 DatabaseConnection [INFO ]  Get tree edges from children
2021-09-25 22:47:31,621 DatabaseConnection [INFO ]  Get nodes from tree edges
2021-09-25 22:47:31,631 DatabaseConnection [INFO ]  Validate parents
2021-09-25 22:47:31,639 DatabaseConnection [INFO ]  Tree statistics
					Nodes: 14677
					Links: 14677
					Tree: n(14375), l(14375)
					LinkNodes: 14696
					Parent_ok: True

Traceback (most recent call last):
  File "/home/ide/miniconda3/envs/flextaxd/bin/flextaxd", line 10, in <module>
    sys.exit(main())
  File "/home/ide/miniconda3/envs/flextaxd/lib/python3.6/site-packages/flextaxd/custom_taxonomy_databases.py", line 262, in main
    modify_obj.clean_database(ncbi=ncbi)
  File "/home/ide/miniconda3/envs/flextaxd/lib/python3.6/site-packages/flextaxd/modules/ModifyTree.py", line 430, in clean_database
    if self.taxonomydb.validate_tree():
  File "/home/ide/miniconda3/envs/flextaxd/lib/python3.6/site-packages/flextaxd/modules/database/DatabaseConnection.py", line 275, in validate_tree
    logger.debug([self.taxonomy[x] for x in lset])
TypeError: 'int' object is not iterable

The Python version is 3.6.0, and the installation was done using miniconda.
Even if I skip this step, I face the same error in the merge database step.

Thanks in advance for any advice.

Keigo
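
The last traceback frame loops over lset, so the error indicates that lset held a single int rather than a collection at that point. A generic way to guard such a loop is sketched below. The helper as_iterable is hypothetical and not part of FlexTaxD, and the taxonomy dict is a made-up stand-in for the real lookup.

```python
def as_iterable(value):
    """Return `value` unchanged if it is a collection, else wrap it in a tuple."""
    if isinstance(value, (list, tuple, set, frozenset)):
        return value
    return (value,)


taxonomy = {1: "root", 2: "Bacteria"}  # made-up stand-in for a taxonomy lookup

# The same comprehension now works for a set of ids and for a single int id.
names_from_set = sorted(taxonomy[x] for x in as_iterable({1, 2}))
names_from_int = [taxonomy[x] for x in as_iterable(2)]
```

Whether wrapping the value is the right fix, or whether the int signals a problem earlier in the tree validation, cannot be determined from the traceback alone.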

16S GTDB database: Kingdom taxonomy issue

Hello again!
I was working with 16S datasets and I think I found a bug with the kingdom-level taxonomy in the kraken2 16S GTDB database. It looks like the kingdom taxonomy is empty for me. There is no description of the bacterial and archaeal kingdoms; instead, the taxonomies start at the phylum level.

For example:
kraken2-inspect --db GTDB_orig_krakendb/ --use-mpa-style --threads 5 | head -50
# Database options: nucleotide db, k = 35, l = 31
# Spaced mask = 11111111111111111111111111111111110011001100110011001100110011
# Toggle mask = 1110001101111110001010001100010000100111000110110101101000101101
# Total taxonomy nodes: 31893
# Table size: 1789033
# Table capacity: 2603885
# Min clear hash value = 0
p__Proteobacteria	317702
p__Proteobacteria|c__Gammaproteobacteria	196884
p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales	56049
p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae	24124
p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Buchnera	4361
p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Buchnera|s__Buchnera aphidicola_V	180
p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Buchnera|s__Buchnera aphidicola_T	151
p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Buchnera|s__Buchnera aphidicola_AB	145

This was more evident when I imported the biom files into R, but you can still see it here. Do you see a similar situation? It's not too difficult to correct the taxonomy in R, though.

Issues with replacing NCBI bacterial and archaeal taxonomies with GTDB taxonomies

Hi,

Thanks a lot for an excellent tool!

I want to create kraken2 and krakenuniq databases with the human genome, viral RefSeq complete genomes, fungal RefSeq genomes (including incomplete ones!), and GTDB bacterial & archaeal genomes. I followed the walk-through of your example, which is nearly the same, and tried to implement what I needed along the way.

I already have all of the NCBI genomes as .fna files located in the folder called "genomes" below. They are concatenated into fasta files for each of the domains (e.g. bacteria.fna).

I merged the two NCBI accession2taxid files into one and used that to generate the database:
zcat nucl_gb.accession2taxid.gz nucl_wgs.accession2taxid.gz | gzip > complete.accession2taxid.gz

However, the following command does not run to completion:

flextaxd -db databases/NCBI_GTDB_merge.db -tf source/ncbi/nodes.dmp -tt NCBI --genomeid2taxid source/ncbi/complete.accession2taxid.gz --verbose --logs NCBI_GTDB_merge_log --genomes_path genomes

2021-11-17 12:41:29,380 custom_taxonomy_databases [INFO ]  FlexTaxD logging initiated!
Creating a new FlexTaxD database databases/NCBI_GTDB_merge.db using source/ncbi/nodes.dmp, press any key to continue...
2021-11-17 12:41:32,371 custom_taxonomy_databases [INFO ]  Loading module: ReadTaxonomyNCBI
2021-11-17 12:41:57,251 DatabaseConnection [INFO ]  databases/NCBI_GTDB_merge.db opened successfully.
2021-11-17 12:41:57,254 custom_taxonomy_databases [INFO ]  Parse taxonomy
2021-11-17 12:41:57,255 ReadTaxonomyNCBI [INFO ]  Parse names file source/ncbi/names.dmp
2021-11-17 12:42:43,363 ReadTaxonomyNCBI [INFO ]  Parse nodes file source/ncbi/nodes.dmp
2021-11-17 12:43:51,524 ReadTaxonomyNCBI [INFO ]  Parsing ncbi accession2taxid, genome_path: genomes

Traceback (most recent call last):

  File "/space/sharedbin_ubuntu_14_04/software/flextaxd/0.4.2-foss-2020b-Python-3.8.6/bin/flextaxd", line 8, in <module>
    sys.exit(main())

  File "/space/sharedbin_ubuntu_14_04/software/flextaxd/0.4.2-foss-2020b-Python-3.8.6/lib/python3.8/site-packages/flextaxd/custom_taxonomy_databases.py", line 279, in main
    read_obj.parse_genomeid2taxid(args.genomes_path,args.genomeid2taxid)

  File "/space/sharedbin_ubuntu_14_04/software/flextaxd/0.4.2-foss-2020b-Python-3.8.6/lib/python3.8/site-packages/flextaxd/modules/ReadTaxonomyNCBI.py", line 97, in parse_genomeid2taxid
    self.parse_genebank_file(filepath,filename)

  File "/space/sharedbin_ubuntu_14_04/software/flextaxd/0.4.2-foss-2020b-Python-3.8.6/lib/python3.8/site-packages/flextaxd/modules/ReadTaxonomyNCBI.py", line 81, in parse_genebank_file
    refseqid = f.readline().split(b" ")[0].lstrip(b">")

  File "/space/sharedbin_ubuntu_14_04/software/Python/3.8.6-GCCcore-10.2.0/lib/python3.8/gzip.py", line 390, in readline
    return self._buffer.readline(size)

  File "/space/sharedbin_ubuntu_14_04/software/Python/3.8.6-GCCcore-10.2.0/lib/python3.8/_compression.py", line 68, in readinto
    data = self.read(len(byte_view))

  File "/space/sharedbin_ubuntu_14_04/software/Python/3.8.6-GCCcore-10.2.0/lib/python3.8/gzip.py", line 479, in read
    if not self._read_gzip_header():

  File "/space/sharedbin_ubuntu_14_04/software/Python/3.8.6-GCCcore-10.2.0/lib/python3.8/gzip.py", line 427, in _read_gzip_header
    raise BadGzipFile('Not a gzipped file (%r)' % magic)

gzip.BadGzipFile: Not a gzipped file (b'>N')

Hope you can help me

Kind regards,
Morten
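
The bytes b'>N' reported in the error are the start of a FASTA header rather than the gzip magic number (1f 8b), which is consistent with the uncompressed .fna files described above. A reader that accepts both compressed and plain files could be sketched as follows. The helper names are hypothetical; only the header parsing in first_sequence_id mirrors the line shown in the traceback.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def open_maybe_gzip(path):
    """Open `path` for binary reading, decompressing only if it is gzipped."""
    with open(path, "rb") as handle:
        magic = handle.read(2)
    if magic == GZIP_MAGIC:
        return gzip.open(path, "rb")
    return open(path, "rb")


def first_sequence_id(path):
    """Return the ID from the first FASTA header, as in the traceback's parsing."""
    with open_maybe_gzip(path) as handle:
        return handle.readline().split(b" ")[0].lstrip(b">").strip()
```

Checking the magic bytes instead of the file extension also covers files that are named .gz but are not actually compressed, and the reverse.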

Add - to fit bracken names and nodes format

To add on to this as well, here are the two lines indicating the format for the nodes.dmp & names.dmp files:
https://github.com/DerrickWood/kraken2/blob/master/scripts/build_rdp_taxonomy.pl
Lines 55 & 56
This is the format that Kraken & Bracken specifically work with.

Originally posted by @punnettsun in #16 (comment)
