sophiagosselin / tani_tool

Script for creating a tANI distance matrix for use in phylogenetic reconstruction.

Home Page: https://doi.org/10.1093/sysbio/syab060

Perl 94.98% R 4.30% Shell 0.72%
average-nucleotide-identity phylogenetic-data phylogenetic-tree whole-genome-distance

tani_tool's Introduction

tANI_tool version 1.4.0

As of 04/03/2023 the new version of the code (tANI_tool.pl) is functional. I highly recommend using it instead of the legacy versions. If you do need or want one of the legacy packages, they are still available (see tANI_original for the version used in the original manuscript, or tANI_low_mem, which was released alongside it).

General Information

Pairwise whole-genome comparison via total average nucleotide identity (tANI). Nonparametric bootstrap capabilities included. Please cite:

Improving Phylogenies Based on Average Nucleotide Identity, Incorporating Saturation Correction and Nonparametric Bootstrap Support

Sophia Gosselin, Matthew S Fullmer, Yutian Feng, Johann Peter Gogarten

DOI: https://doi.org/10.1093/sysbio/syab060

tANI_tool.pl takes a set of whole genomes (or incomplete assemblies) and computes genome-genome distance values as described in https://doi.org/10.1093/sysbio/syab060. The script outputs matrices for the tANI (total average nucleotide identity) metric as per the paper, as well as for three other whole-genome metrics: AF (alignment fraction), jANI (jspecies-ANI), and a modified gANI (whole genome-ANI). It does so for the original genomes and, if requested, uses a nonparametric bootstrap approach to create bootstrapped matrices for these metrics.

Note that the gANI and AF metrics are not restricted to ORFs as in the original implementation (see https://doi.org/10.1093/nar/gkv657), but are instead applied to the whole genome broken into 1020 nt fragments.

Important Considerations

One may use genomes with any degree of completion; however, low-quality assemblies with contigs smaller than 1020 nt will lose a large amount of meaningful phylogenetic information. One should therefore always strive to use assemblies with as few of these small contigs as possible. In general, the higher the level of completion the better, and one should always be critical of phylogenies built from lower-quality assemblies. For the purposes of the original manuscript, anything below 80% completion was discarded, but better results are obtained with a more stringent cutoff (90% and up).
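To make the cost of small contigs concrete, here is a minimal Python sketch of fixed-size fragmentation. It is not the tool's code: the function name `fragment_contigs` is hypothetical, and it assumes non-overlapping 1020 nt windows with any shorter trailing piece discarded, which the text above implies but does not spell out.

```python
FRAGMENT_LEN = 1020  # fragment size stated in this README


def fragment_contigs(contigs, fragment_len=FRAGMENT_LEN):
    """Split each contig into non-overlapping fragments of fragment_len.

    Assumption (not confirmed by the README): a trailing piece shorter
    than fragment_len is discarded. Returns (fragments, discarded_nt).
    """
    fragments = []
    discarded = 0
    for seq in contigs:
        # Length of the contig that divides evenly into whole fragments.
        usable = len(seq) - len(seq) % fragment_len
        for start in range(0, usable, fragment_len):
            fragments.append(seq[start:start + fragment_len])
        discarded += len(seq) - usable
    return fragments, discarded


# Toy assembly: one 2500 nt contig and one 900 nt contig.
contigs = ["A" * 2500, "C" * 900]
fragments, discarded = fragment_contigs(contigs)
print(len(fragments), "fragments;", discarded, "nt discarded")
```

Under these assumptions the 900 nt contig contributes nothing at all, which is why assemblies dominated by short contigs lose so much signal.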

Be wary of comparisons between genomes of wildly different sizes (e.g. a 2MB genome vs a 6MB genome). Such differences can inflate AF results, and hence the tANI distance between these taxa.

If the output matrix reports a value of exactly 13 between two taxa, then there were no BLAST hits between the two genomes (for that specific query-subject pair) that passed one of the following cutoffs: coverage, percent identity, or e-value. If this is true for a single query across all subjects then that genome may simply be too divergent from the rest of your samples to get an accurate estimation of distance.
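A quick way to spot such pairs is to scan the matrix for the value 13. The Python sketch below is only an illustration: the helper names are hypothetical, and it assumes a whitespace-delimited matrix with a header row of taxon names followed by one row per taxon (name first, then values), with rows as queries and columns as subjects.

```python
import io

NO_HIT_VALUE = 13.0  # value described above for pairs with no passing BLAST hits


def read_matrix(handle):
    """Parse a whitespace-delimited distance matrix into nested dicts.

    Assumed layout: first line lists taxon names, each later line is
    '<taxon> <value> <value> ...' in the same column order.
    """
    lines = [line.split() for line in handle if line.strip()]
    header, rows = lines[0], lines[1:]
    return {row[0]: dict(zip(header, map(float, row[1:]))) for row in rows}


def no_hit_pairs(matrix, sentinel=NO_HIT_VALUE):
    """Return (row_taxon, column_taxon) pairs whose value equals the sentinel."""
    return [
        (query, subject)
        for query, row in matrix.items()
        for subject, value in row.items()
        if query != subject and value == sentinel
    ]


# Toy three-genome matrix; real input would be a file such as tANI_original.matrix.
example = io.StringIO(
    "gA gB gC\n"
    "gA 0 1.5 13\n"
    "gB 1.4 0 2.0\n"
    "gC 13 2.1 0\n"
)
pairs = no_hit_pairs(read_matrix(example))
print(pairs)
```

A genome that appears in such pairs against every other taxon is a candidate for removal from the dataset.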

You may also see values of 0 between sufficiently similar taxa within the bootstrapped outputs. These two genomes are not necessarily identical. A value of 0 could indicate that the randomly selected sample of fragments from the query genome was sufficiently similar to the subject genome AND that the randomly selected fragments collectively comprised a length longer than the original query genome. A tANI calculation on this type of sample would result in a negative tANI score; hence we output a value of 0 instead. Check the original non-bootstrapped matrix for an accurate assessment of the tANI between those two genomes.

Usage and Help Text.

To run, place the script in the same working directory as the genomes you wish to compare pairwise. Your genome files should be in FASTA format and have one of the following extensions: .fna, .fasta, .contig, .contigs. Be aware that input files within the home directory will be edited to remove special characters; however, the original unedited inputs can be found in "intermediates/unchanged_inputs" after initial setup.
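Before starting a run, it can help to confirm which files in the directory would count as input. The following Python sketch lists files by the extensions named above; the function name `find_genomes` is hypothetical, and it assumes the extension match is case-sensitive, which the README does not state.

```python
import tempfile
from pathlib import Path

# Extensions listed in this README as accepted genome inputs.
GENOME_EXTENSIONS = {".fna", ".fasta", ".contig", ".contigs"}


def find_genomes(directory="."):
    """Return sorted names of files whose extension is an accepted genome extension."""
    return sorted(
        path.name
        for path in Path(directory).iterdir()
        if path.is_file() and path.suffix in GENOME_EXTENSIONS
    )


# Demonstration on a throwaway directory so nothing real is touched.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.fna", "b.fasta", "notes.txt", "c.contigs"):
        (Path(tmp) / name).touch()
    found = find_genomes(tmp)

print(found)
```

Calling `find_genomes()` with no argument checks the current working directory instead.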

IMPORTANT: tANI tool has a checkpointing system. If your run is interrupted, simply rerun your original command in the starting directory, and the code will resume from the log files therein.

Note that the checkpointing system can cause issues if you attempt to rerun the script from a directory where tANI ran to completion (even if there were errors along the way). So if you do encounter a bug please try to rerun the program after removing the files created by tANI_tool.pl.

Dependencies:

perl v5.36.0 and up

BLAST v2.11.0 and up
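A missing BLAST executable on the PATH produces errors such as Can't exec "makeblastdb". The Python sketch below checks for the executables before a run. The list of tool names is an assumption drawn from the messages quoted in the issues (blastn, makeblastdb), and the sketch does not check version numbers.

```python
import shutil

# Assumed executable names; the README only says "perl" and "BLAST".
REQUIRED_TOOLS = ("perl", "blastn", "makeblastdb")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the subset of tools that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_tools()
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All required executables were found.")
```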

Usage:

perl tANI_tool.pl -id your_%_here -cv your_%_here -boot bootstrap_#_here

Optional Inputs:

[id]: Percent identity cutoff for inclusion of BLAST hit in tANI calculation. Default: .7

[cv]: Percent coverage cutoff for inclusion of BLAST hit in tANI calculation. Default: .7

[e]: Evalue cutoff for inclusion. Default: 1e-4

[task]: Setting BLAST uses for its search criteria (see -task in BLAST).

[boot OR bt]: Number of non-parametric tANI bootstraps. Default: 0

[v]: Verbosity level. 1 for key checkpoints only. 2 for all messages. Default: 0

[log OR l]: Name of file to print logs to. If none is provided program prints messages to screen only. Default: None

[t]: Thread count. Default will use half of available cores.

NOTE: Default ID, CV, and E values are set to match those of Gosselin et al. 2020.
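For scripted runs, the options above can be assembled into a command line programmatically. This Python sketch only builds and prints the command; the function name `build_command` is hypothetical, and it assumes that every option takes a single leading dash (as in the usage line) and that cutoffs are given as fractions, as the defaults of .7 suggest.

```python
import shlex


def build_command(identity=0.7, coverage=0.7, evalue="1e-4",
                  bootstraps=0, verbosity=0, threads=None, log=None):
    """Build an argument list for tANI_tool.pl from the documented options."""
    cmd = [
        "perl", "tANI_tool.pl",
        "-id", str(identity),
        "-cv", str(coverage),
        "-e", str(evalue),
        "-boot", str(bootstraps),
        "-v", str(verbosity),
    ]
    if threads is not None:  # otherwise the tool picks its own default
        cmd += ["-t", str(threads)]
    if log:  # otherwise messages go to the screen only
        cmd += ["-log", log]
    return cmd


cmd = build_command(bootstraps=100, verbosity=2, threads=8)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the directory that holds the script and the genomes.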

tani_tool's Issues

Calculating metrics -- Error

Hi
Thank you for the modification to the tool to calculate ANI. The most recent version is crashing with error:

1. Error using 193 sequences (clusters of a related family, Bacillaceae)
$perl tANI_tool.pl -boot 100 -v 2 -t 8
......................................
..........................................
Warning: [blastn] Query_2595 Anoxybacillus.. : Could not calculate ungapped Karlin-Altschul parameters due to an invalid query sequence or its translation. Please verify the query sequence(s) and/or filtering options

Loading fragment and genome lengths into memory.
Calculating metrics for all comparisons.
Killed

2. Current setup (and variations):
I have the following setup for the above:
-- perl versions installed with perlbrew (* is the selected version)

  * threaded-perl-5.39.4
    5.38.0t
    perl-5.38.0
    perl-5.22.0

-- blast is blast2.14+

3. Variations tested with same error
-- perl 5.38t or 5.38t
-- perl tANI_tool.pl -boot 100 -v 2 -t 1
-- perl tANI_tool.pl -v 2 -t 8

4. Ran without any errors
-- Legacy version of the tool with -Default
-- Sequences in smaller numbers (80 sequences per run)

Question related to the error
I understand that the warning for blastn is only a warning and not an error, so I have disregarded this as an error. The metrics comparisons being killed is not due to errors in the annotations of any sequence(s) as using smaller batches of sequences run without any errors and running all the sequences with the legacy version also works. Is this error due to memory issues?

buildtree_w_support.R error

Hi

  1. I am trying to build a tree with tANI/buildtree_w_support.R but am getting the following error:
    File "/home/bharat/opt/scripts/buildtree_w_support.R", line 34
    tree_orig <- fastme.bal(as.matrix(Comb_ANI_orig), nni = TRUE, spr = TRUE, tbr = TRUE)
    ^^
    SyntaxError: invalid syntax

  2. This is the output matrix file from tANI_tool.pl
    more tANI_original.matrix
    GCF_000020485_1_Halothermothrix_orenii_H_168 GCF_000144695_1_Acetohalobium_arabaticum_DSM_5501 GCF_000165465_1_Halanaerobium_praevalens_DSM_2228 GCF_000328625_1_Halobacteroides_hal
    obius_DSM_5150 GCF_000350165_1_Halanaerobium_saccharolyticum_subsp_saccharolyticum_DSM_6643 GCF_000379025_1_Orenia_marismortui_DSM_5156 GCF_000517025_1_Halonatronum_saccharophilum_DSM_138
    68 GCF_001693735_1_Orenia_metallireducens GCF_003991135_1_Anoxybacter_fermentans GCF_004366375_1_Halanaerobium_congolense GCF_016908315_1_Halanaerobacter_jeridensis GCF_0169086
    35_1_Sporohalobacter_salinus GCF_017751145_1_Iocasia_fonsfrigidae GCF_023223645_1_Natroniella_sulfidigena GCF_023227745_1_Natroniella_acetigena GCF_023227765_1_Fuchsiella_alkaliacetigenaGCF_900103135_1_Halarsenatibacter_silvermanii GCF_900114545_1_Halanaerobium_salsuginis GCF_900156285_1_Halanaerobium_kushneri GCF_900167185_1_Selenihalanaerobacter_shriftii Halocella_c
    ellulosilytica_Z10151
    GCF_000020485_1_Halothermothrix_orenii_H_168 0 5.42719414755492 5.27339752476991 5.13900577735181 4.91889682428554 5.28688026286792 5.22480818558226 5.21870183625652 5.05504683536394 5.08618195407372 6.17922569276882 5.68554272260153 4.32137265353101 5.97658140615949 5.79316724850669 5.5
    5064100318302 5.83685948712251 6.15171690178991 5.48579663599273 6.21777334637247 4.73290194585748
    GCF_000144695_1_Acetohalobium_arabaticum_DSM_5501 5.22977697920968 0 4.99460574614351 4.13335888475053 5.38215301126784 4.31152240722284 4.322112712
    1972 4.55706577561279 4.85211889150845 5.18880868258653 4.76762061746371 1.16713390191407 5.45492561201875 4.13136511024781 4.29746181486809 3.70576708471556 5.60926539960242 7.12770635401551 5.92655840638655 3.18255186079528 5.84946387570556
    GCF_000165465_1_Halanaerobium_praevalens_DSM_2228 5.0494486378808 5.1785038474866 0 4.51827562301592 1.45532033417509 4.54059704562318 5.0073970721596 4.915832248
    40662 4.93251319486679 1.62114470344695 4.84695162791813 4.59258719020369 4.77668181946024 4.95819078347316 4.80695009810314 5.15309336246751 6.03866328527756 2.26473625721482 1.77508092320195 5.28646291481397 4.53444449966808
    GCF_000328625_1_Halobacteroides_halobius_DSM_5150 5.01027900229848 4.15790817171283 4.54587285900051 0 5.06345534951391 3.04084319249756 3.594289963
    9929 3.32032466634403 5.18154504112614 5.00875072428634 2.99880036288647 3.53591926644309 5.34595862257716 3.38467145198544 3.41864627714474 4.39750687896356 5.85876690795458 5.78979941722217 5.54591665672004 3.84660769812252 5.51384901674395
    GCF_000350165_1_Halanaerobium_saccharolyticum_subsp_saccharolyticum_DSM_6643 5.64347764242949 6.36705982607198 1.67140533623113 5.32721707628692 0 5.364730812
    0887 5.84812266390071 5.58999631677835 6.34303586532699 1.08826470050259 5.69438537990136 6.20417746253622 5.36317025981638 5.74160706665105 6.63289373275234 6.20624389499493 7.13137206816804 2.39211853959209 1.50935662627996 5.78582429495539 4.89670991875761
    GCF_000379025_1_Orenia_marismortui_DSM_5156 6.38954892770081 5.12274263896281 5.35082239302344 3.44777585434625 5.6079842887115 0 3.22265220161547 2.0
    1977964019112 6.5144380238915 6.00983034000628 4.00143383638893 4.62607217641155 5.49698273370909 3.67404280194979 3.81869739560961 5.48054737220808 -7.29156493589522 6.04238577359753 6.05352092766261 4.06687049237067 5.74851816673853
    GCF_000517025_1_Halonatronum_saccharophilum_DSM_13868 6.03675460723025 5.37634630486879 5.73280708728867 3.94349634610338 5.90838475332845 3.0837291615334 0 3.11649868565618 6.08719051619049 5.90494468224008 4.20696097357645 5.08181070224786 5.59369812478689 3.79658050782216 3.75486844779661 5.5
    3490556393311 8.09267480591985 6.98316625641855 6.87083567321921 4.4375233836173 5.99846340591119
    GCF_001693735_1_Orenia_metallireducens 6.46639406627934 5.75194905338732 5.55400735505465 3.7031260151832 6.18939069683576 2.02669136623926 3.25923940921253 06.30292705284042 6.14967841490906 4.36583639037344 4.84006405884405 5.65878809541915 4.01561889331783 4.01813363691152 5.63442833099369 7.37418426869385 6.12076385191458 6.54311862701476 4.49424211125871 6.25463430752659
    GCF_003991135_1_Anoxybacter_fermentans 4.74690910859569 4.82187824756363 4.8883833810846 4.97281674868454 5.13721608390138 4.8701197492875 4.95974824310063 4.9
    854819743511 0 5.04359117221984 5.63050648595935 5.12522972275852 5.11938317454213 5.55237179502083 5.20098654137151 5.0229338648156 5.129588383
    39912 8.45520804938 5.90872506931094 5.31622111887265 5.23364628722329
    GCF_004366375_1_Halanaerobium_congolense 5.98955110808155 6.2604514976442 1.85153578001687 5.67659705791149 1.0739073790125 5.66470740981283 5.76945669587536 6.10337342505922 6.49784166761779 0 5.2493060533817 5.68623562956719 5.53117629294236 5.93655110865401 6.33051559762974 5.94838980719056 6.8
    2326496044066 2.45234928964674 1.50128782312665 5.776268029505 5.26128277236755
    GCF_016908315_1_Halanaerobacter_jeridensis 6.97255624976867 5.24276330055711 5.11595727925693 3.2601667042461 5.40253719676177 3.87456145516362 4.416038027
    83941 4.26907387970529 7.22750100947766 5.39455220726951 0 4.30977331650788 6.06496838860014 4.28861373013832 4.29609609499528 5.273398071
    13459 7.71337340387772 5.99192689284713 6.2742854377649 4.77408280361509 5.78659354839624
    GCF_016908635_1_Sporohalobacter_salinus 5.92138991156531 1.28377889315149 5.07979812930072 3.73621502379725 5.99687141083857 4.49272464547559 5.026439324
    95433 4.4187675773218 5.84732257357728 5.60959215576993 4.03377107881826 0 5.7294796769932 4.10089135647878 4.28079741017701 3.7497286330321 6.507396335
    22774 5.55155594281898 5.82836228869665 2.78578592126391 5.88072247724936
    GCF_017751145_1_Iocasia_fonsfrigidae 5.08589052538609 5.92842194959452 5.25449186419209 5.90059387106556 5.33059011953792 5.31773261241539 5.431231941
    41285 5.40595617551564 5.70083185997198 5.16912291096203 6.1095368617146 6.17344327470814 0 6.17699452893818 5.7856900746987 6.21358915582416 6.1
    7002807627955 5.46219017999117 5.48710115621626 6.28311711230175 4.01120368498446
    GCF_023223645_1_Natroniella_sulfidigena 6.43564384182322 4.33760621291998 5.35937366668737 3.44791346081003 5.67719923509583 3.39940249659795 3.778239977
    80478 3.6496325002299 6.85294004641536 5.69437347525921 3.94765241581839 4.24858090045723 6.67561632993509 0 0.888947197487396 4.33332953035918 7.85306567608134 6.05041806535516 6.70535274903426 3.9263269946723 6.04606451623752
    GCF_023227745_1_Natroniella_acetigena 6.46363385686148 4.53213205571597 5.25765951857965 3.45198050195129 6.7944059576188 3.57580091289788 3.74156268782947 3.74863677308531 6.40159891192807 5.93239906140925 4.06963063492591 4.14816108868729 5.99659578581931 0.881031378677709 0 4.4824461003956 7.6
    5001574686886 5.76282179331574 7.16315123170728 4.12855776641567 6.48911060952819
    GCF_023227765_1_Fuchsiella_alkaliacetigena 6.79706278592512 4.03090149899065 6.01442864556625 5.08009975973608 6.57803890251405 5.23941120951659 5.1
    9400112126083 5.51362426316836 6.25676829808174 6.32492052070217 5.23600466745516 3.7633853024253 6.68797033820372 4.33665064061862 4.50119903401353 07.50338081630952 6.44820696706795 6.86172448772546 3.8053426698498 6.77884676370499
    GCF_900103135_1_Halarsenatibacter_silvermanii 6.83939009466526 7.03158394922597 6.69297223350638 7.050498258831 7.13316101877039 6.84870932453453 7.127188179
    7688 7.13592647825729 6.7836080904502 6.89519334895543 8.34329725520641 7.03617406597374 6.71730578385775 8.05639060405807 8.0598753353692 7.088545704
    16909 0 7.55371598544032 8.02554396410642 7.71975032659935
    GCF_900114545_1_Halanaerobium_salsuginis 5.96584840228616 7.40239200840313 2.54566241886174 5.88236062400875 2.38716339177884 5.48750427656849 6.7
    2276774708612 5.53982690408673 7.3097799867566 2.45480455176437 5.62517603959162 6.02573927839677 5.19324059406067 5.7489811226521 5.99534378206794 6.4
    3127481866014 8.64777825965464 0 2.27533405368884 5.61017699937251 5.4038507895558
    GCF_900156285_1_Halanaerobium_kushneri 6.06476965951033 6.73598396041939 2.10225782170039 5.97354627213883 1.60357345387283 6.06664103551885 6.724033995
    85983 6.3575386348045 7.17601855510978 1.62433107876536 6.37269157992906 6.1506392887578 5.01909586125832 6.55326934257514 6.54957583076517 7.033995772
    18275 7.82860133564736 2.36002280417881 0 6.33121489538793 5.11655887949033
    GCF_900167185_1_Selenihalanaerobacter_shriftii 6.80372764491916 3.34628431602229 5.19816775450073 3.90641797837832 5.60296384362255 3.89199509048106 4.3
    5259229099756 4.10752834666094 5.68760864362928 5.5676918422553 4.40594771245625 2.89642432038028 5.75065881390809 3.89784290955971 4.37964432369067 3.83191541970133 8.14534829726091 5.66155807390397 6.13889760309082 0 5.58395462239915
    Halocella_cellulosilytica_Z10151 5.34800396613506 6.5205082278691 5.31147554484251 6.8228447043623 5.20754308630277 5.56452848956186 6.362727737116 6.334066732
    02368 6.47697360356491 5.55942407390316 5.82913134387534 6.49405916394614 4.28134559776523 6.38396742711611 6.42004770698986 7.1451009721801 7.8
    8495996300442 5.60095256944116 5.41155450558601 6.4554247484776 0

Your input and assistance appreciated

bootstrap matrices problem

Hi,

I encountered some problems running tANI_tool with bootstrap calculation enabled. The program finishes without any problems, but the bootstrap matrices are not actually matrices, as they contain just one column. The value reported is the same as the diagonal value in the original matrix. Moreover, all the bootstrap-generated files are identical. The tANI_original.matrix is fine, however. This happens in all the different output directories (AF, gANI, jANI and tANI).
I used the following command line: tANI_tool.pl --boot 100 -v 2 -t 1
I also tested the legacy version and it seems to work fine.
I was testing the program using just five genomes (GCF_014058425, GCF_018437225, GCF_018437235, GCF_018437265, GCF_018437275)
Attached you can find the tANI original file and one of the bootstrap files.

Thank you for your help

tANI_0.matrix.txt
tANI_original.matrix.txt

More information about the usage

Hello,
I'm trying to build a phylogenetic tree based on whole-genome sequences of several bacteria. However, I encountered the following errors when I ran the tANI_Matrix-master/tANI_low_mem.pl script, with the fasta files located in the same directory as the scripts.

Errors:
Can't exec "makeblastdb": not find the file or directory at /home/liuhongbin/soft/tANI_Matrix-master/tANI_low_mem.pl line 210.

I'm not sure how to set up the fasta files. Could you help me with these errors? Thanks!
