Giter Site home page Giter Site logo

benkhedher / fastani Goto Github PK

View Code? Open in Web Editor NEW

This project forked from parbliss/fastani

0.0 0.0 0.0 6.05 MB

Fast Whole-Genome Similarity (ANI) Estimation

License: Apache License 2.0

Makefile 0.36% Shell 0.84% M4 0.83% C++ 88.08% C 9.35% R 0.54%

fastani's Introduction

FastANI

Apache 2.0 License BioConda Install GitHub Downloads

FastANI is developed for fast alignment-free computation of whole-genome Average Nucleotide Identity (ANI). ANI is defined as mean nucleotide identity of orthologous gene pairs shared between two microbial genomes. FastANI supports pairwise comparison of both complete and draft genome assemblies. Its underlying procedure follows a similar workflow as described by Goris et al. 2007. However, it avoids expensive sequence alignments and uses Mashmap as its MinHash based sequence mapping engine to compute the orthologous mappings and alignment identity estimates. Based on our experiments with complete and draft genomes, its accuracy is on par with BLAST-based ANI solver and it achieves two to three orders of magnitude speedup. Therefore, it is useful for pairwise ANI computation of large number of genome pairs. More details about its speed, accuracy and potential applications are described here: "High Throughput ANI Analysis of 90K Prokaryotic Genomes Reveals Clear Species Boundaries".

Download and Compile

Clone the software from Github and follow INSTALL.txt to compile the code. There is also an option to download dependency-free binary for Linux or OSX through the latest release.

Usage Summary

  • Produce help page. Quickly check the software usage and available command line options.
$ ./fastANI -h
  • One to One. Compute ANI between single query and single reference genome:
$ ./fastANI -q [QUERY_GENOME] -r [REFERENCE_GENOME] -o [OUTPUT_FILE] 

Here QUERY_GENOME and REFERENCE_GENOME are the query genome assemblies in fasta or multi-fasta format.

  • One to Many. Compute ANI between single query genome and multiple reference genomes:
$ ./fastANI -q [QUERY_GENOME] --rl [REFERENCE_LIST] -o [OUTPUT_FILE]

For above use case, REFERENCE_LIST should be a file containing directory paths to reference genomes, one per line.

  • Many to Many. When there are multiple query genomes and multiple reference genomes:
$ ./fastANI --ql [QUERY_LIST] --rl [REFERENCE_LIST] -o [OUTPUT_FILE]

Again, QUERY_LIST and REFERENCE_LIST are files containing paths to genomes, one per line.

Output format. In all above use cases, OUTPUT_FILE will contain tab delimited row(s) with query genome, reference genome, ANI value, count of bidirectional fragment mappings, and total query fragments. Alignment fraction (wrt. the query genome) is simply the ratio of mappings and total fragments. Optionally, users can also get a second .matrix file with identity values arranged in a phylip-formatted lower triangular matrix by supplying --matrix parameter. NOTE: No ANI output is reported for a genome pair if ANI value is much below 80%. Such case should be computed at amino acid level.

Two genome assemblies are provided in data folder to do a quick test run.

We suggest users to do an adequate quality check of their input genome assemblies (both reference and query), especially the N50 be โ‰ฅ10 Kbp.

An Example Run

  • One to One. Here we compute ANI between Escherichia coli and Shigella flexneri genomes provided in the data folder.
$ ./fastANI -q data/Shigella_flexneri_2a_01.fna -r data/Escherichia_coli_str_K12_MG1655.fna -o fastani.out 

Expect output log in the following format in the console:

$ ./fastANI -q data/Shigella_flexneri_2a_01.fna -r data/Escherichia_coli_str_K12_MG1655.fna -o fastani.out 
>>>>>>>>>>>>>>>>>>
Reference = [data/Escherichia_coli_str_K12_MG1655.fna]
Query = [data/Shigella_flexneri_2a_01.fna]
Kmer size = 16
Fragment length = 3000
ANI output file = fastani.out
>>>>>>>>>>>>>>>>>>
....
....
INFO, skch::main, Time spent post mapping : 0.00310319 sec

Output is saved in file fastani.out, provided above using the -o option.

$ cat fastani.out
data/Shigella_flexneri_2a_01.fna data/Escherichia_coli_str_K12_MG1655.fna 97.7443 1305 1608

Above output implies that the ANI estimate between S. flexneri and E. coli genomes is 97.7443. Out of the total 1608 sequence fragments from S. flexneri genome, 1305 were aligned as orthologous matches.

Visualize Conserved Regions b/w Two Genomes

FastANI supports visualization of the reciprocal mappings computed between two genomes. Getting this visualization requires a one to one comparison using FastANI as discussed above, except an additional flag --visualize should be provided. This flag forces FastANI to output a mapping file (with .visual extension) that contains information of all the reciprocal mappings. Finally, an R script is provided in the repository which uses genoPlotR package to plot these mappings. Here we show an example run using two genomes: Bartonella quintana (GenBank: CP003784.1) and Bartonella henselae (NCBI Reference Sequence: NC_005956.1).

$ ./fastANI -q B_quintana.fna -r B_henselae.fna --visualize -o fastani.out
$ Rscript scripts/visualize.R B_quintana.fna B_henselae.fna fastani.out.visual

Using above commands, we get a plot file fastani.out.visual.pdf displayed below. Each red line segment denotes a reciprocal mapping between two genomes, indicating their evolutionary conserved regions.

Parallelization

FastANI (v1.1 onwards) supports multi-threading, see the help page on how to configure thread count. To parallelize FastANI beyond single compute node, users also have the choice to simply divide their reference database into multiple chunks, and execute them as parallel processes. We provide a script in the repository to randomly split the database for this purpose.

Troubleshooting

Users are welcome to report any issue or feedback related to FastANI by posting a Github issue.

fastani's People

Contributors

aphillippy avatar blankenberg avatar cj101192 avatar cjain7 avatar skoren avatar tcpan avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.