
Taxonomic classification of DNA sequences

License: GNU General Public License v3.0

Python 99.30% Dockerfile 0.70%

bertax's Introduction

BERTax: Taxonomic Classification of DNA sequences

This is the repository for the preprint "BERTax: taxonomic classification of DNA sequences with Deep Neural Networks" and the published paper "Taxonomic classification of DNA sequences beyond sequence similarity using deep neural networks".

The data used can be found under DOI 10.17605/OSF.IO/QG6MV or at https://osf.io/qg6mv/

Run BERTax

We provide a Docker container to run BERTax. Pull and run with:

docker run -t --rm -v /path/to/input/files:/in fkre/bertax:latest /in/sequences.fa

The Docker container can also be run with GPU support, which likely results in much faster predictions. For this, the nvidia-container-toolkit has to be installed and the bertax image has to be run with the flag --gpus all.

The image can be built locally (after cloning -- see below) with

docker build -t bertax bertax

Alternative local installation

Prepare conda

Install in new conda environment

conda create -n bertax -c fkretschmer bertax

Activate environment and install necessary pip-dependencies

conda activate bertax
pip install keras-bert==0.86.0

Local pip-only installation

First clone this repository

git clone https://github.com/rnajena/bertax.git

Then download the (big) model file into the resources subfolder.

Finally, install with pip

pip install -e bertax

Usage

The script takes a (multi)fasta as input and outputs a list of predicted classes to the console:

bertax sequences.fasta

Options:

parameter	explanation
`-o` `--output_file`	write output to specified file (tab-separated format) instead of to the output stream (console)
`--conf_matrix_file`	output confidences for all classes of all ranks to JSON file
`--sequence_split`	how to handle sequences longer than the maximum (window) size: split into equal chunks (`equal_chunks`, default) or use a random sequence window (`window`)
`-C` `--maximum_sequence_chunks`	maximum number of chunks to use per (long) sequence
`--running_window`	if enabled, a running window approach is chosen to go over each sequence to make predictions
`--running_window_stride`	stride for running window (default: 1)
`--custom_window_size`	allows specifying a custom, smaller window size
`--chunk_predictions`	output predictions per chunk, otherwise (by default) chunk predictions are averaged
`--output_ranks`	specify which ranks to include in output (default: superkingdom phylum genus)
`--no_confidence`	if set, do not include confidence scores in output
`--batch_size`	batch size (i.e., how many sequence chunks to predict at the same time); can be lowered to decrease memory usage and increased for better performance (default: 32)
`-t` `--nr_threads`	set the number of threads used (default: determine automatically)

Note that "unknown" is a special placeholder class for each prediction rank, meaning the sequence's taxonomy is predicted to be unlike any possible output class.

Examples

Default mode, sequences longer than 1500 nt are split into equal chunks, one prediction (average) per sequence

bertax sequences.fa

Only use one random chunk per sequence (for sequences longer than 1500 nt)

bertax --sequence_split window sequences.fa

Only output the superkingdom

bertax sequences.fa --output_ranks superkingdom

Predict with a running window in 300 nt steps and output predictions for all chunks (no threshold for the number of chunks per sequence)

bertax -C -1 --running_window --running_window_stride 300 --chunk_predictions sequences.fa
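
The two splitting strategies used in the examples above can be illustrated with a small sketch. This is not BERTax's actual implementation: the function names and the exact chunk-boundary logic are assumptions made for illustration, and only the 1500 nt window size and the 300 nt stride are taken from the examples.

```python
import math

WINDOW = 1500  # maximum window size in nt, as stated in the examples above


def equal_chunks(seq, window=WINDOW):
    """Split seq into the smallest number of near-equal chunks that each fit the window."""
    if len(seq) <= window:
        return [seq]
    n = math.ceil(len(seq) / window)   # number of chunks needed
    size = math.ceil(len(seq) / n)     # chunk length, at most `window`
    return [seq[i:i + size] for i in range(0, len(seq), size)]


def running_windows(seq, window=WINDOW, stride=300):
    """Slide a fixed-size window over seq, advancing by `stride` nt each step.

    A tail that does not fill a complete window is not covered in this sketch.
    """
    if len(seq) <= window:
        return [seq]
    return [seq[i:i + window] for i in range(0, len(seq) - window + 1, stride)]


seq = "ACGT" * 1523  # 6092 nt toy sequence
print([len(chunk) for chunk in equal_chunks(seq)])
print(len(running_windows(seq)))
```

With equal chunks, every nucleotide ends up in exactly one chunk; with a running window, neighbouring windows overlap, which is why the number of predictions per sequence grows quickly for small strides.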

Confusion Matrices

In the directory confusion_matrices you can find confusion matrices from the publication's results, which indicate the classification quality. These matrices could not be included directly in the paper because of their number and size.

Visualization

It is possible to visualize the underlying BERT model for a specific DNA sequence, based on bertviz. For this, additional dependencies have to be installed:

torch
transformers
bertviz==1.0.0

An HTML file with interactive visualization can be created with:

bertax-visualize sequence.fa

As visualization is quite performance-intensive for big sequences, parameters can be set to only visualize a specific part (-a $start -n $size). Both an attention-head view and model-view are available, set with the parameter --mode {head|model}.

Training BERTax models

The repository with the code used in the development of BERTax is located at https://github.com/rnajena/bertax_training. Custom models trained with these scripts can be used in BERTax with the parameter --custom_model_file.

Compatible phyla and genera

Because only a limited number of samples was available for training, the model could not be trained on all known phyla and genera. The compatible phyla and genera are therefore listed here. Note: If the taxon of your sample is not included in this list, there is a high probability that its phylum/genus will be predicted as "unknown". If you want, you can train your own model that includes the taxa of interest to you.

Note: We recommend using BERTax only for superkingdom and phylum prediction, although genus prediction is possible. For more details, see our paper at pnas.org.

phylum

'Actinobacteria', 'Apicomplexa', 'Aquificae',
'Arthropoda', 'Artverviricota', 'Ascomycota', 'Bacillariophyta', 'Bacteroidetes',
'Basidiomycota', 'Candidatus Thermoplasmatota', 'Chlamydiae', 'Chlorobi',
'Chloroflexi', 'Chlorophyta', 'Chordata', 'Crenarchaeota', 'Cyanobacteria',
'Deinococcus-Thermus', 'Euglenozoa', 'Euryarchaeota', 'Evosea', 'Firmicutes',
'Fusobacteria', 'Gemmatimonadetes', 'Kitrinoviricota', 'Lentisphaerae', 'Mollusca',
'Negarnaviricota', 'Nematoda', 'Nitrospirae', 'Peploviricota', 'Pisuviricota',
'Planctomycetes', 'Platyhelminthes', 'Proteobacteria', 'Rhodophyta', 'Spirochaetes',
'Streptophyta', 'Tenericutes', 'Thaumarchaeota', 'Thermotogae', 'Uroviricota',
'Verrucomicrobia'

genus

'Acidilobus', 'Acidithiobacillus',
'Actinomyces', 'Actinopolyspora', 'Acyrthosiphon', 'Aeromonas', 'Akkermansia', 'Anas',
'Apis', 'Aquila', 'Archaeoglobus', 'Asparagus', 'Aspergillus', 'Astyanax', 'Aythya',
'Bdellovibrio', 'Beta', 'Betta', 'Bifidobacterium', 'Botrytis', 'Brachyspira',
'Bradymonas', 'Brassica', 'Caenorhabditis', 'Calypte', 'Candidatus Kuenenia',
'Candidatus Nitrosocaldus', 'Candidatus Promineofilum', 'Carassius', 'Cercospora',
'Chanos', 'Chlamydia', 'Chrysemys', 'Ciona', 'Citrus', 'Clupea', 'Coffea',
'Colletotrichum', 'Cottoperca', 'Crassostrea', 'Cryptococcus', 'Cucumis', 'Cucurbita',
'Cyanidioschyzon', 'Cynara', 'Cynoglossus', 'Daucus', 'Deinococcus', 'Denticeps',
'Desulfovibrio', 'Dictyostelium', 'Drosophila', 'Echeneis', 'Egibacter', 'Egicoccus',
'Elaeis', 'Equus', 'Erpetoichthys', 'Esox', 'Euzebya', 'Fervidicoccus', 'Frankia',
'Fusarium', 'Gadus', 'Gallus', 'Gemmata', 'Gopherus', 'Gossypium', 'Gouania',
'Helianthus', 'Ictalurus', 'Ktedonosporobacter', 'Legionella', 'Leishmania',
'Lepisosteus', 'Leptospira', 'Limnochorda', 'Malassezia', 'Manihot', 'Mariprofundus',
'Methanobacterium', 'Methanobrevibacter', 'Methanocaldococcus', 'Methanocella',
'Methanopyrus', 'Methanosarcina', 'Microcaecilia', 'Modestobacter', 'Monodelphis',
'Mus', 'Musa', 'Myripristis', 'Neisseria', 'Nitrosopumilus', 'Nitrososphaera',
'Nitrospira', 'Nymphaea', 'Octopus', 'Olea', 'Oncorhynchus', 'Ooceraea',
'Ornithorhynchus', 'Oryctolagus', 'Oryzias', 'Ostreococcus', 'Papaver', 'Perca',
'Phaeodactylum', 'Phyllostomus', 'Physcomitrium', 'Plasmodium', 'Podarcis', 'Pomacea',
'Populus', 'Prosthecochloris', 'Pseudomonas', 'Punica', 'Pyricularia', 'Pyrobaculum',
'Quercus', 'Rhinatrema', 'Rhopalosiphum', 'Roseiflexus', 'Rubrobacter', 'Rudivirus',
'Salarias', 'Salinisphaera', 'Sarcophilus', 'Schistosoma', 'Scleropages',
'Sedimentisphaera', 'Sesamum', 'Solanum', 'Sparus', 'Sphaeramia', 'Spodoptera',
'Sporisorium', 'Stanieria', 'Streptomyces', 'Strigops', 'Synechococcus', 'Takifugu',
'Thalassiosira', 'Theileria', 'Thermococcus', 'Thermogutta', 'Thermus', 'Tribolium',
'Trichoplusia', 'Ustilago', 'Vibrio', 'Vitis', 'Xenopus', 'Xiphophorus',
'Zymoseptoria'


bertax's Issues

How to train the "unknown" label during the training?

Hello, Bertax team.

I wonder how the "unknown" label should be trained at each taxonomic rank. In other words, how is a sequence identified as "unknown" during training?

Could you provide me with some references or citations?

Thanks a lot for any assistance.

On the inconsistency of Taxid and BERTax taxonomy labels and the calculation of evaluation metrics for AveP.

Hi!
I'm interested in your work and I'm trying to reproduce the results on the data you released, but I'm having some problems.

1. The released sequence data contains taxids. I used NCBI to map these taxids to taxonomic classifications and obtained the corresponding taxonomic level for each sequence. However, many of the resulting labels do not correspond to the labels in the BERTax model (5 superkingdoms, 44 phyla, 156 genera), and I have corrected some of them manually.

Although I have applied these corrections in the final dataset, the genus-level correction is somewhat difficult for the similar and non-similar datasets. Is this a genuine problem, and is there a possible solution?

2. Are the Accuracy and AveP metrics mentioned in the paper accuracy and precision as commonly defined? Is it possible to calculate the same metrics with `from sklearn.metrics import accuracy_score` and `from sklearn.metrics import precision_score`?

Thank you for your work.

Speeding up predictions

Hello,

Thank you for developing BERTax! It looks like a really great tool for taxonomic classification of sequences that are typically difficult to classify with tools that rely on big databases.

I was interested to see if BERTax could be used for classification of metagenomic sequencing reads, but it seems like it would be quite a bit slower than kmer based methods (Centrifuge, Kraken2) even with GPU acceleration (16 CPU threads (Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz): 6 reads/s; Nvidia Quadro RTX 5000 (Driver Version: 470.63.01; CUDA Version: 11.4): 20 reads/s).

Are there any plans to optimize BERTax for performing predictions on larger inputs?

I tried to modify the BERTax code to be a little more efficient on large inputs (reads in FASTQ) in PR peterk87#1 but I'm not familiar with Keras or Tensorflow, so I'm not sure how one would go about optimizing that code. The call to model.predict seems to be taking the most time by far.

For example, for a read of length 6092 split into 5 chunks:

seq2tokens: 0.792363 ms
process_bert_tokens_batch: 1.096281 ms
model.predict: 67.773608 ms
writing output: 1.32 ms

Total elapsed time of 70.986515 ms. Timings were obtained with time.time_ns. Although there may be optimizations that could be possible for input processing and formatting output, most of the time (>95%) is spent running model.predict.
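
A minimal sketch of how per-step timings like the ones above can be collected with time.time_ns. The `timed` helper and the toy workload are hypothetical and are not part of BERTax; the last lines only recompute the share of model.predict from the two numbers reported above.

```python
import time
from contextlib import contextmanager

timings_ms = {}


@contextmanager
def timed(label):
    """Record the wall-clock duration of the enclosed block in milliseconds."""
    start = time.time_ns()
    try:
        yield
    finally:
        timings_ms[label] = (time.time_ns() - start) / 1e6


# Stand-in workload; replace it with the call that should be measured.
with timed("toy_step"):
    sum(range(100_000))

# Share of model.predict in the total elapsed time reported above.
reported = {"model.predict": 67.773608, "total": 70.986515}
share = reported["model.predict"] / reported["total"]
print(f"model.predict share: {share:.1%}")
```

The computed share is consistent with the ">95%" figure quoted in the post.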

I noticed that in the bertax-visualize script, that the Keras model is converted into a PyTorch model:

https://github.com/f-kretschmer/bertax/blob/ae8cc568a2e66692e7663025906fda0016aa8b52/bertax/visualize.py#L29

I haven't tested whether using PyTorch and a converted model would help speed-up predictions. Maybe the Keras model could be converted to a Tensorflow model for less overhead per call to model.predict as per the following blogpost:

https://micwurm.medium.com/using-tensorflow-lite-to-speed-up-predictions-a3954886eb98

Unfortunately, I'm only familiar with NumPy and not familiar with Keras, Tensorflow or PyTorch. I have a bit of experience working with Cython and Numba for accelerating Python code, but using those may not be appropriate in this case.

Any speed-ups (or ideas for how to achieve speed-ups) would be extremely useful and appreciated and allow BERTax to be used on a wider range of datasets!

Thanks!
Peter

Output clarification

Hello, very nice tool! It ran smoothly.

I only have a clarification question about the output (during default running parameters).
I get the headers [id, superkingdom, (%), ...] in bertax.tsv file, does the percentage refer to the percentage of chunks classified as the respective superkingdom or is it a certainty estimate?
I used it to classify contigs, of which most were above 1500 nt, so most would be multiple chunks.

Best

Want to know the relationship between avep and confusion matrix

Hello, I would like to ask what the confusion matrix means. As shown in the figure below, does the 97% for Archaea refer to the proportion of all Archaea sequences? How can AveP be calculated from the confusion matrix?

I would also like to ask where I can find the precision and recall of each classification tool, because AveP is difficult to understand.
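
As a rough illustration of the relationship asked about here, per-class precision and recall can be derived from a confusion matrix of raw counts. The sketch assumes rows are true classes and columns are predicted classes; the counts are toy numbers invented for illustration and are not taken from the paper. Whether the unweighted mean of the per-class precisions corresponds to the AveP reported in the paper is not established on this page. If a matrix is row-normalized (percentages per true class), precision cannot be recovered without the class sizes.

```python
def precision_recall(matrix, labels):
    """Per-class precision and recall from a count matrix (rows: true, columns: predicted)."""
    stats = {}
    for k, label in enumerate(labels):
        tp = matrix[k][k]
        predicted_k = sum(row[k] for row in matrix)  # column sum: everything predicted as k
        true_k = sum(matrix[k])                      # row sum: everything that truly is k
        precision = tp / predicted_k if predicted_k else 0.0
        recall = tp / true_k if true_k else 0.0
        stats[label] = (precision, recall)
    return stats


labels = ["Archaea", "Bacteria", "Eukaryota"]
# Toy counts, invented for illustration only.
matrix = [
    [97, 2, 1],
    [3, 90, 7],
    [0, 8, 92],
]

stats = precision_recall(matrix, labels)
macro_precision = sum(p for p, _ in stats.values()) / len(stats)
for label, (p, r) in stats.items():
    print(f"{label}: precision={p:.3f} recall={r:.3f}")
print(f"macro-averaged precision: {macro_precision:.3f}")
```

In a matrix laid out this way, a diagonal value divided by its row sum is the recall of that class, which is what a statement such as "97% of Archaea" describes if the matrix is normalized per true class.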

organellar classification

I've been using BERTax all morning (via singularity) and am really liking it so far, but I have noticed one or two plastid contigs that were classified with high confidence as Bacteria for superkingdom, unknown for phylum. I checked the list of genomes you trained with, but didn't see any plastid or mitochondrial genomes included. Were they? If not, do you know if this might influence classification accuracy? I'd like to use this software for both draft assembly contig classification as well as raw long read classification, but working with photosynthetic microbes means a lot of my data are organellar in origin. Thanks!

License

Can you attach a license please?

Hi, can I run BERTax on my GPU?

Hi, I have successfully run BERTax on the CPU, but at a very limited speed.
I now have a GPU server with CUDA version 11.4.1 and CUDNN version 8.2.4, and I tried to run BERTax on it, but it failed.
After my troubleshooting, I surmise that it has something to do with the version of Tensorflow that BERTax uses.
Are you able to come up with a more detailed configuration method?

I found that CUDA version 11.4 needs to match Tensorflow version 2.6.

suggestion: input check

When a FASTQ file was fed by mistake instead of a FASTA file, there was no specific error/warning pointing to the problem.
It could be useful to add some checks.
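
One way such a check could look is sketched below. The helper names `sniff_sequence_format` and `require_fasta` are hypothetical and do not exist in BERTax; the sketch only inspects the first non-empty line, relying on the convention that FASTA headers start with ">" and FASTQ headers start with "@".

```python
def sniff_sequence_format(path):
    """Guess the format from the first non-empty line: '>' means FASTA, '@' means FASTQ."""
    with open(path, "r") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                return "fasta"
            if line.startswith("@"):
                return "fastq"
            return "unknown"
    return "empty"


def require_fasta(path):
    """Raise a descriptive error unless the file looks like FASTA."""
    fmt = sniff_sequence_format(path)
    if fmt != "fasta":
        raise ValueError(f"{path}: expected FASTA input, but the file looks like {fmt}")
```

Calling `require_fasta` before any prediction work starts would turn a silent failure into an immediate, explicit message.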

BERTax test with Bombus terrestris - Error messages

Hello,
Thanks for BERTax, it really fills a gap in RNA-Seq analyses workflows.

I have tested BERTax with some 'known' taxa in order to get used to it. I selected 100 cDNA sequences from Bombus terrestris which is also part of the BERTax reference genomes (GCA_000214255.1).
The output of a test set of 100 sequences, however, did not show B. terrestris as the most likely taxon. Actually, none of the 100 test sequences ended up in genus Bombus. (see attached output below, fasta and tsv files are not allowed as attachments, sorry)
Is there an explanation for this unexpected result?

In addition, I receive the following error message. Could this explain the issue described above?

2022-12-06 02:53:57.118535: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
/home/bt140047/miniconda3/envs/bertax/lib/python3.10/site-packages/keras/initializers/initializers_v2.py:120: UserWarning: The initializer VarianceScaling is unseeded and being called multiple times, which will return identical values each time (even if the initializer is unseeded). Please update your code to provide a seed to the initializer, or avoid using the same initalizer instance more than once.
warnings.warn(

Best regards

######################### bertax output ##############

id	superkingdom	phylum	genus
XR_002308984.1 cdna chromosome_group:Bter_1.0:B01:...	Eukaryota (100%)	Arthropoda (49%)	Ooceraea (47%)
XR_002307712.1 cdna chromosome_group:Bter_1.0:B01:...	Eukaryota (100%)	Arthropoda (97%)	Trichoplusia (28%)
XR_002308309.1 cdna chromosome_group:Bter_1.0:B13:...	Eukaryota (99%)	Arthropoda (59%)	Rhopalosiphum (36%)
XR_002308391.1 cdna chromosome_group:Bter_1.0:B14:...	Eukaryota (58%)	Mollusca (53%)	Crassostrea (50%)
XR_002308163.1 cdna chromosome_group:Bter_1.0:B11:...	Eukaryota (67%)	Arthropoda (66%)	Ooceraea (33%)
XR_002308164.1 cdna chromosome_group:Bter_1.0:B11:...	Eukaryota (99%)	Arthropoda (59%)	Ooceraea (31%)
XR_002308005.1 cdna chromosome_group:Bter_1.0:B09:...	Eukaryota (100%)	Arthropoda (30%)	Caenorhabditis (21%)
XM_012309198.2 cdna chromosome_group:Bter_1.0:B06:...	Eukaryota (69%)	Arthropoda (44%)	Ooceraea (46%)
XM_020863329.1 cdna chromosome_group:Bter_1.0:B06:...	Eukaryota (67%)	Arthropoda (36%)	Ooceraea (26%)
XM_003395929.3 cdna chromosome_group:Bter_1.0:B06:...	Eukaryota (98%)	Arthropoda (43%)	Ooceraea (30%)
XM_003395928.3 cdna chromosome_group:Bter_1.0:B06:...	Eukaryota (71%)	Arthropoda (35%)	Ooceraea (22%)
XM_012315222.2 cdna chromosome_group:Bter_1.0:B12:...	Eukaryota (72%)	Arthropoda (29%)	Nitrososphaera (17%)
XM_012315220.2 cdna chromosome_group:Bter_1.0:B12:...	Eukaryota (99%)	Apicomplexa (19%)	Solanum (19%)
XM_012315218.2 cdna chromosome_group:Bter_1.0:B12:...	Eukaryota (99%)	Apicomplexa (22%)	Phaeodactylum (17%)
XM_012315221.2 cdna chromosome_group:Bter_1.0:B12:...	Eukaryota (94%)	Streptophyta (24%)	Solanum (21%)
XM_012315219.2 cdna chromosome_group:Bter_1.0:B12:...	Eukaryota (99%)	Apicomplexa (21%)	Phaeodactylum (15%)
XM_012308017.2 cdna chromosome_group:Bter_1.0:B04:...	Eukaryota (100%)	Ascomycota (41%)	Drosophila (74%)
XM_020865912.1 cdna chromosome_group:Bter_1.0:B12:...	Eukaryota (99%)	Arthropoda (100%)	Ooceraea (94%)
XM_012317133.2 cdna chromosome_group:Bter_1.0:B15:...	Eukaryota (69%)	Arthropoda (59%)	Ooceraea (45%)
XM_003402617.3 cdna scaffold:Bter_1.0:GL899399:281...	Eukaryota (40%)	Arthropoda (35%)	Ooceraea (29%)
XM_012313924.2 cdna chromosome_group:Bter_1.0:B11:...	Eukaryota (98%)	Arthropoda (36%)	Ooceraea (22%)
XM_012313925.2 cdna chromosome_group:Bter_1.0:B11:...	Eukaryota (92%)	Arthropoda (34%)	Ooceraea (26%)
XM_012320838.2 cdna chromosome_group:Bter_1.0:B03:...	Eukaryota (98%)	Platyhelminthes (41%)	Schistosoma (35%)
XM_003394305.3 cdna chromosome_group:Bter_1.0:B03:...	Eukaryota (95%)	Arthropoda (76%)	Ooceraea (62%)
XM_020868488.1 cdna chromosome_group:Bter_1.0:B03:...	Eukaryota (99%)	Arthropoda (90%)	Ooceraea (53%)
XM_012320836.2 cdna chromosome_group:Bter_1.0:B03:...	Eukaryota (87%)	Arthropoda (75%)	Ooceraea (42%)
XM_012320837.2 cdna chromosome_group:Bter_1.0:B03:...	Eukaryota (95%)	Arthropoda (43%)	Ooceraea (40%)
XM_003400287.3 cdna chromosome_group:Bter_1.0:B13:...	Eukaryota (98%)	Arthropoda (94%)	Ooceraea (48%)
XM_003399441.2 cdna chromosome_group:Bter_1.0:B12:...	Bacteria (85%)	Bacteroidetes (81%)	unknown (35%)
XM_003402536.3 cdna scaffold:Bter_1.0:GL899322:191...	Eukaryota (100%)	Arthropoda (100%)	Apis (100%)
XM_012310587.2 cdna chromosome_group:Bter_1.0:B07:...	Eukaryota (61%)	Arthropoda (30%)	Acyrthosiphon (14%)
XM_012310588.2 cdna chromosome_group:Bter_1.0:B07:...	Eukaryota (69%)	Arthropoda (29%)	Ooceraea (10%)
XM_012310590.2 cdna chromosome_group:Bter_1.0:B07:...	Eukaryota (95%)	Arthropoda (59%)	Acyrthosiphon (22%)
XM_003401212.3 cdna chromosome_group:Bter_1.0:B15:...	Eukaryota (86%)	Arthropoda (46%)	Trichoplusia (19%)
XM_020862821.1 cdna chromosome_group:Bter_1.0:B01:...	Eukaryota (68%)	Arthropoda (37%)	Theileria (31%)
XM_003396613.3 cdna chromosome_group:Bter_1.0:B07:...	Eukaryota (99%)	Arthropoda (34%)	Crassostrea (31%)
XM_020865134.1 cdna chromosome_group:Bter_1.0:B10:...	Eukaryota (52%)	Uroviricota (37%)	Acyrthosiphon (10%)
XM_020867249.1 cdna chromosome_group:Bter_1.0:B16:...	Eukaryota (88%)	Ascomycota (22%)	Solanum (26%)
XM_003395306.3 cdna chromosome_group:Bter_1.0:B05:...	Eukaryota (61%)	Ascomycota (29%)	Ooceraea (54%)
XM_012313976.2 cdna chromosome_group:Bter_1.0:B11:...	Eukaryota (98%)	Arthropoda (47%)	Acyrthosiphon (17%)
XM_012313977.2 cdna chromosome_group:Bter_1.0:B11:...	Eukaryota (99%)	Arthropoda (63%)	Drosophila (33%)
XM_012313975.2 cdna chromosome_group:Bter_1.0:B11:...	Eukaryota (87%)	Arthropoda (43%)	Ooceraea (24%)
XM_020868361.1 cdna chromosome_group:Bter_1.0:B02:...	Eukaryota (86%)	Arthropoda (85%)	Ooceraea (62%)
XM_020868360.1 cdna chromosome_group:Bter_1.0:B02:...	Eukaryota (79%)	Arthropoda (75%)	Ooceraea (62%)
XM_012316912.2 cdna chromosome_group:Bter_1.0:B02:...	Eukaryota (90%)	Arthropoda (81%)	Ooceraea (60%)
XM_020868362.1 cdna chromosome_group:Bter_1.0:B02:...	Eukaryota (86%)	Arthropoda (74%)	Ooceraea (56%)
XM_012316908.2 cdna chromosome_group:Bter_1.0:B02:...	Eukaryota (87%)	Arthropoda (77%)	Ooceraea (64%)
XM_003395930.3 cdna chromosome_group:Bter_1.0:B06:...	Eukaryota (41%)	Mollusca (25%)	Crassostrea (31%)
XM_020862692.1 cdna chromosome_group:Bter_1.0:B04:...	Eukaryota (96%)	Arthropoda (57%)	Ooceraea (25%)
XM_012308144.2 cdna chromosome_group:Bter_1.0:B04:...	Eukaryota (98%)	Arthropoda (64%)	Ooceraea (38%)
XM_003395055.3 cdna chromosome_group:Bter_1.0:B04:...	Eukaryota (98%)	Arthropoda (85%)	Tribolium (29%)
XM_012319421.2 cdna chromosome_group:Bter_1.0:B02:...	Eukaryota (79%)	Apicomplexa (33%)	Theileria (22%)
XM_003393756.3 cdna chromosome_group:Bter_1.0:B02:...	Eukaryota (90%)	Apicomplexa (44%)	Theileria (29%)
XM_020868526.1 cdna chromosome_group:Bter_1.0:B03:...	Eukaryota (65%)	Chlorophyta (23%)	Ooceraea (23%)
XM_012313439.2 cdna chromosome_group:Bter_1.0:B11:...	Eukaryota (79%)	Arthropoda (53%)	Ooceraea (24%)
XM_020865500.1 cdna chromosome_group:Bter_1.0:B11:...	Eukaryota (79%)	Arthropoda (49%)	Ooceraea (24%)
XM_012313440.2 cdna chromosome_group:Bter_1.0:B11:...	Eukaryota (75%)	Mollusca (33%)	Ooceraea (24%)
XM_012315807.2 cdna chromosome_group:Bter_1.0:B14:...	Eukaryota (89%)	Arthropoda (87%)	Brassica (41%)
XM_012310523.2 cdna chromosome_group:Bter_1.0:B07:...	Eukaryota (72%)	Arthropoda (24%)	Ooceraea (24%)
XM_020863545.1 cdna chromosome_group:Bter_1.0:B07:...	Eukaryota (99%)	Arthropoda (39%)	Ooceraea (22%)
XM_012310525.2 cdna chromosome_group:Bter_1.0:B07:...	Eukaryota (99%)	Arthropoda (41%)	Beta (18%)
XM_003394110.3 cdna chromosome_group:Bter_1.0:B03:...	Eukaryota (100%)	Arthropoda (75%)	Ooceraea (43%)
XM_003399740.3 cdna chromosome_group:Bter_1.0:B12:...	Eukaryota (100%)	Chordata (73%)	Ciona (60%)
XM_012309585.2 cdna chromosome_group:Bter_1.0:B01:...	Viruses (89%)	Peploviricota (87%)	Ooceraea (55%)
XM_003393159.3 cdna chromosome_group:Bter_1.0:B01:...	Viruses (81%)	Peploviricota (68%)	Phaeodactylum (29%)
XM_012321104.2 cdna chromosome_group:Bter_1.0:B03:...	Eukaryota (50%)	Streptophyta (49%)	Solanum (40%)
XM_012321103.2 cdna chromosome_group:Bter_1.0:B03:...	Eukaryota (53%)	Streptophyta (43%)	Solanum (45%)
XM_003394620.3 cdna chromosome_group:Bter_1.0:B03:...	Eukaryota (50%)	Streptophyta (49%)	Solanum (35%)
XM_012313942.2 cdna chromosome_group:Bter_1.0:B11:...	Eukaryota (97%)	Chordata (32%)	Ooceraea (26%)
XM_012316212.1 cdna chromosome_group:Bter_1.0:B14:...	Eukaryota (99%)	Arthropoda (92%)	Apis (97%)
XM_012314813.2 cdna chromosome_group:Bter_1.0:B12:...	Eukaryota (98%)	Arthropoda (50%)	Apis (50%)
XM_003399636.3 cdna chromosome_group:Bter_1.0:B12:...	Eukaryota (100%)	Arthropoda (50%)	Apis (50%)
XM_020866083.1 cdna chromosome_group:Bter_1.0:B12:...	Viruses (57%)	Pisuviricota (54%)	Ooceraea (17%)
XM_012316045.2 cdna chromosome_group:Bter_1.0:B14:...	Eukaryota (68%)	Peploviricota (26%)	Olea (14%)
XM_003393799.3 cdna chromosome_group:Bter_1.0:B02:...	Eukaryota (71%)	Firmicutes (43%)	Acyrthosiphon (16%)
XM_003401455.3 cdna chromosome_group:Bter_1.0:B15:...	Eukaryota (100%)	Arthropoda (76%)	Ooceraea (54%)
XM_012308865.2 cdna chromosome_group:Bter_1.0:B05:...	Eukaryota (64%)	Arthropoda (46%)	Apis (24%)
XM_020866407.1 cdna chromosome_group:Bter_1.0:B14:...	Viruses (84%)	Pisuviricota (38%)	Ooceraea (26%)
XM_020866406.1 cdna chromosome_group:Bter_1.0:B14:...	Eukaryota (73%)	Apicomplexa (54%)	Theileria (15%)
XM_020866405.1 cdna chromosome_group:Bter_1.0:B14:...	Eukaryota (71%)	Apicomplexa (54%)	Ooceraea (18%)
XM_020866410.1 cdna chromosome_group:Bter_1.0:B14:...	Eukaryota (78%)	Apicomplexa (46%)	Acyrthosiphon (19%)
XM_020866409.1 cdna chromosome_group:Bter_1.0:B14:...	Viruses (80%)	Apicomplexa (31%)	Ooceraea (34%)
XM_020866408.1 cdna chromosome_group:Bter_1.0:B14:...	Eukaryota (83%)	Arthropoda (41%)	Ooceraea (38%)
XM_020866412.1 cdna chromosome_group:Bter_1.0:B14:...	Eukaryota (87%)	Apicomplexa (46%)	Ooceraea (34%)
XM_020866411.1 cdna chromosome_group:Bter_1.0:B14:...	Eukaryota (65%)	Apicomplexa (64%)	Olea (32%)
XM_003394548.3 cdna chromosome_group:Bter_1.0:B03:...	Eukaryota (66%)	Arthropoda (79%)	Apis (49%)
XM_020862794.1 cdna chromosome_group:Bter_1.0:B04:...	Archaea (64%)	Peploviricota (26%)	Methanobrevibacter (60%)
XM_012311541.2 cdna chromosome_group:Bter_1.0:B09:...	Eukaryota (81%)	Arthropoda (51%)	Apis (36%)
XM_003394183.3 cdna chromosome_group:Bter_1.0:B03:...	Eukaryota (100%)	Arthropoda (57%)	Ooceraea (53%)
XM_003399176.3 cdna chromosome_group:Bter_1.0:B11:...	Eukaryota (70%)	Arthropoda (62%)	Apis (45%)
XM_003399177.3 cdna chromosome_group:Bter_1.0:B11:...	Eukaryota (79%)	Arthropoda (73%)	Apis (48%)
XM_012313573.2 cdna chromosome_group:Bter_1.0:B11:...	Eukaryota (81%)	Arthropoda (77%)	Apis (49%)
XM_012313574.2 cdna chromosome_group:Bter_1.0:B11:...	Eukaryota (64%)	Arthropoda (54%)	Apis (34%)
XM_012311732.2 cdna chromosome_group:Bter_1.0:B09:...	Eukaryota (88%)	Arthropoda (51%)	Apis (34%)
XM_012309428.2 cdna chromosome_group:Bter_1.0:B06:...	Eukaryota (95%)	Arthropoda (35%)	Ooceraea (49%)
XM_003395969.3 cdna chromosome_group:Bter_1.0:B06:...	Eukaryota (99%)	Arthropoda (45%)	Apis (38%)
XM_012309794.2 cdna chromosome_group:Bter_1.0:B07:...	Eukaryota (75%)	Arthropoda (88%)	Ooceraea (27%)
XM_012309795.1 cdna chromosome_group:Bter_1.0:B07:...	Eukaryota (90%)	Apicomplexa (49%)	Plasmodium (50%)
XM_003396215.3 cdna chromosome_group:Bter_1.0:B07:...	Eukaryota (81%)	Arthropoda (73%)	Ooceraea (30%)
XM_020867575.1 cdna scaffold:Bter_1.0:GL898856:161...	Eukaryota (76%)	Streptophyta (43%)	Ooceraea (34%)

Failed to run the bertax tool on a slurm machine

I created a singularity image of bertax and tried to run this tool on a slurm HPC, but I keep getting the error message below. Any ideas?

My slurm is very simple -> bertax.sif bertax -o ${WORK_DIR}/output --sequence_split window sequence.fa

usage: bertax [-h] [-v] [-o FILE] [--conf_matrix_file FILE]
[--sequence_split {equal_chunks,window}] [--chunk_predictions]
[--running_window] [--running_window_stride STRIDE] [-s SIZE]
[-C NR] [--output_ranks RANK [RANK ...]] [--no_confidence]
[--batch_size BATCH_SIZE]
fasta
bertax: error: unrecognized arguments: sequence.fa

How to use with paired-end reads

Hello,
I have metagenomic paired-reads and would like to try the program. How do I use it?
my inputs: reads.R1.fq.gz and reads.R2.fq.gz
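
Since the Usage section describes (multi)FASTA input, one possible preprocessing step is to convert each gzipped FASTQ file to FASTA first. The sketch below is an assumption-laden illustration, not a BERTax feature: the function name is hypothetical, records are assumed to span exactly four lines, and the two mates of a pair would simply be classified as independent sequences. How predictions for mates should be combined is not addressed on this page.

```python
import gzip


def fastq_gz_to_fasta(fastq_gz_path, fasta_path, suffix=""):
    """Write every read of a gzipped FASTQ file as a FASTA record; return the read count."""
    count = 0
    with gzip.open(fastq_gz_path, "rt") as src, open(fasta_path, "w") as dst:
        while True:
            header = src.readline().rstrip("\n")
            if not header:
                break  # end of file
            sequence = src.readline().rstrip("\n")
            src.readline()  # '+' separator line
            src.readline()  # quality line, not needed for FASTA
            read_id = header[1:].split()[0]  # drop the leading '@' and any description
            dst.write(f">{read_id}{suffix}\n{sequence}\n")
            count += 1
    return count
```

For the inputs named above, this could be called as `fastq_gz_to_fasta("reads.R1.fq.gz", "reads.R1.fa", suffix="/1")` and again with `suffix="/2"` for the second file, so that mates remain distinguishable in the output.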

Limit number of threads?

bertax seems to use 10 threads by default, how can this be changed?

Also, are there any experiences classifying

short reads (up to 300bp)
long error-prone reads (PacBio, Nanopore)?

classification for genomes?

Dear bertax team,

I am wondering whether the query sequences could be genomes that are compared against database genomes to find the best/significant hits for classification. In theory, this is the same as when the queries are (short) sequences, right?

Thanks,

Jianshu

About PR plots

Hello, thank you for your reply. Do you have the specific values for the PR plots in the Supplement of the PNAS BERTax publication you mentioned? That is, would it be possible to give me the data of those plots? I want to use these data to see if they have successfully achieved

IndexError appeared when using the "--output_ranks" parameter

Hi,

I wanted to apply Bertax to classify several phages I had, and every time I run this program, it brings me the following error:

Traceback (most recent call last):
  File "/home/egtortuero/anaconda3/envs/bertax/bin/bertax", line 10, in <module>
    sys.exit(main())
  File "/home/egtortuero/anaconda3/envs/bertax/lib/python3.7/site-packages/bertax/bertax.py", line 113, in main
    for i in range(len(out_values[0]))]
  File "/home/egtortuero/anaconda3/envs/bertax/lib/python3.7/site-packages/bertax/bertax.py", line 113, in <listcomp>
    for i in range(len(out_values[0]))]
  File "/home/egtortuero/anaconda3/envs/bertax/lib/python3.7/site-packages/bertax/bertax.py", line 112, in <genexpr>
    max_field_lens = [max(o[i] for o in [list(map(len, o)) for o in out_values])
IndexError: list index out of range
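
The expression shown in the traceback indexes every output row up to the length of the first row, so it raises exactly this error as soon as a later row has fewer fields. The sketch below reproduces that behaviour with toy data and shows a tolerant alternative. Whether ragged rows are really what happens inside BERTax here (for example because requested ranks have no prediction) is an assumption, and the function names and sample values are invented for illustration.

```python
from itertools import zip_longest


def max_field_lens_strict(out_values):
    """Same shape as the expression in the traceback: assumes every row is as long as the first."""
    lens = [list(map(len, row)) for row in out_values]
    return [max(row[i] for row in lens) for i in range(len(out_values[0]))]


def max_field_lens_tolerant(out_values):
    """Column widths that tolerate ragged rows by treating missing fields as width 0."""
    lens = [list(map(len, row)) for row in out_values]
    return [max(col) for col in zip_longest(*lens, fillvalue=0)]


header = ["id", "superkingdom", "kingdom", "phylum"]
row = ["Phage_A", "Viruses (99%)", "Uroviricota (80%)"]  # toy data, one field short

try:
    max_field_lens_strict([header, row])
except IndexError as err:
    print("IndexError:", err)

print(max_field_lens_tolerant([header, row]))
```

A tolerant width computation only hides the symptom; if some requested ranks are not predicted by the model, validating the rank names up front would give a clearer message.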

I was using the Conda version of Bertax, and I ran it with the following command:

bertax Phage_A.fasta --output_ranks superkingdom kingdom phylum class family genus

Note that, when I tried to run it using the option before the FASTA file (as indicated in the help), the error was the following:

bertax --output_ranks superkingdom,kingdom,phylum,class,family,genus Phage_A.fasta
usage: bertax [-h] [-v] [-o FILE] [--conf_matrix_file FILE]
              [--sequence_split {equal_chunks,window}] [--chunk_predictions]
              [--running_window] [--running_window_stride STRIDE] [-s SIZE]
              [-C NR] [--output_ranks RANK [RANK ...]] [--no_confidence]
              [--batch_size BATCH_SIZE]
              fasta
bertax: error: the following arguments are required: fasta
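
The `RANK [RANK ...]` notation in the usage text is what argparse prints for an option that accepts one or more values. Assuming `--output_ranks` is declared with `nargs="+"` (an inference from the usage text, not confirmed from the BERTax source), the option greedily consumes every following non-option argument, including the FASTA file name. The toy parser below mirrors only the two relevant arguments and is not BERTax's parser.

```python
import argparse


def build_parser():
    """Toy parser with one positional and one multi-value option, as in the usage text."""
    parser = argparse.ArgumentParser(prog="toy-bertax")
    parser.add_argument("fasta")
    parser.add_argument("--output_ranks", nargs="+", metavar="RANK",
                        default=["superkingdom", "phylum", "genus"])
    return parser


def try_parse(argv):
    """Return the parsed namespace, or None if argparse rejects the arguments."""
    try:
        return build_parser().parse_args(argv)
    except SystemExit:
        return None


# Option first: the greedy RANK list also swallows the file name, so 'fasta' is missing.
print(try_parse(["--output_ranks", "superkingdom", "phylum", "Phage_A.fasta"]))

# File first: the positional is filled before the option starts collecting values.
print(try_parse(["Phage_A.fasta", "--output_ranks", "superkingdom", "phylum"]))
```

Under the same assumption, ranks have to be separated by spaces; a comma-separated string such as `superkingdom,phylum` would be treated as a single RANK value.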

Does it only work for retrieving the default "superkingdom phylum genus" output? What happens if I only want to know, for example, which phylum the sequence belongs to?

If you need more information from my side, please feel free to reply to me.

Sorry for any inconvenience, and thank you in advance.

Best,

Enrique

About the final model

Hello, I am very interested in your work. I'd like to ask: was the final model you provided, "big_trainingset_all_fix_classes_selection.h5", trained with the training set and the test set together? We look forward to hearing from you.

Sharing the sequences used for pre-train

Hi, is it possible to share the sequences you used for pre-training in FASTA format?