president's Introduction

No longer maintained! Check out: https://gitlab.com/RKIBioinformaticsPipelines/president

PRESIDENT: PaiRwisE Sequence IDENtiTy

Calculate pairwise nucleotide identity with respect to a reference sequence.

Given a reference and a query sequence (which can be fragmented), calculate the pairwise nucleotide identity of the query relative to the entire length of the reference sequence. Only informative nucleotides (A, T, G, C) are considered identical to each other.

Requirements:

conda create -y -n president -c bioconda python=3.8 mafft screed && conda activate president

Usage:

python pairwise_nucleotide_identity.py --query NC_045512.2.20mis.fasta --reference NC_045512.2.fasta -x 3000 -p 8

Output:

The script provides three different output metrics to assess the quality of the query sequence in comparison to the reference sequence. The rationale is that ambiguous bases (Ns) can affect sequence identity differently depending on how they are counted.

Running Mafft ...
a) 0.9987 nucleotide identity
b) 0.9994 nucleotide identity (excluding non-ACTG)
c) 20 non-ACTG symbols out of 29903

a) Percentage identity based on Hamming distance, counting gaps and Ns as mismatches

b) Percentage identity only based on A, T, C, G characters (ACTG matches / (len_query - nonACTGchars_query))

c) Count all characters in the query sequence that are not A, T, C, G
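
The three metrics above can be sketched in Python as follows. This is a minimal illustration and not the script's actual code. It assumes that the reference and query are already aligned to equal length (for example by MAFFT) and that gaps are written as `-`. The function name `identity_metrics` and the dictionary keys are hypothetical. Two further assumptions: metric a) divides by the alignment length (which equals the reference length when the reference contains no gaps), and `len_query` in metric b) is counted without gap characters.

```python
CANONICAL = set("ACGT")


def identity_metrics(ref_aln: str, query_aln: str) -> dict:
    """Compute metrics a), b) and c) from two aligned, equal-length sequences."""
    if not ref_aln or len(ref_aln) != len(query_aln):
        raise ValueError("aligned sequences must be non-empty and of equal length")
    ref_aln, query_aln = ref_aln.upper(), query_aln.upper()

    # Positions where both sequences carry the same canonical base.
    matches = sum(
        1 for r, q in zip(ref_aln, query_aln) if r == q and q in CANONICAL
    )

    # Query length without alignment gaps (assumption, see lead-in).
    query_len = sum(1 for q in query_aln if q != "-")

    # c) query symbols that are not A, C, G or T (gap characters excluded).
    non_acgt = sum(1 for q in query_aln if q != "-" and q not in CANONICAL)

    # a) every position that is not a canonical match counts as a mismatch,
    #    so gaps and Ns lower the identity.
    identity_all = matches / len(ref_aln)

    # b) matches relative to the canonical part of the query only.
    canonical_len = query_len - non_acgt
    identity_acgt = matches / canonical_len if canonical_len else 0.0

    return {
        "identity_all": identity_all,
        "identity_acgt": identity_acgt,
        "non_acgt": non_acgt,
    }


if __name__ == "__main__":
    # Toy alignment: one N and one substitution in the query.
    print(identity_metrics("ACGTACGTAC", "ACGTNCGTAA"))
```

In the toy alignment, the N lowers metric a) but is left out of the denominator of metric b), which is why the two identities differ.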

Notes:

  • nextstrain uses a quality threshold of < 3000 non-canonical nucleotides
  • Ns in the query are treated as mismatches; to ignore Ns, uncomment the corresponding line directly in the code
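
A quality filter in the spirit of the first note could look like the sketch below. The helper names `count_non_canonical` and `passes_threshold` are hypothetical, and the default of 3000 is taken from the note above rather than from the script itself. The sketch assumes an unaligned query sequence, so any symbol other than A, C, G or T counts as non-canonical.

```python
CANONICAL = set("ACGT")


def count_non_canonical(seq: str) -> int:
    """Count every symbol in seq that is not A, C, G or T (case-insensitive)."""
    return sum(1 for base in seq.upper() if base not in CANONICAL)


def passes_threshold(seq: str, max_non_canonical: int = 3000) -> bool:
    """Return True if seq has fewer than max_non_canonical non-ACGT symbols."""
    return count_non_canonical(seq) < max_non_canonical
```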

ANI definition:

president's Issues

Calculated output metrics

The script should provide three different output metrics to assess the quality of the query sequence in comparison to the reference sequence. The rationale is that ambiguous bases (Ns) can be counted differently when calculating sequence identity.

We basically aim to answer three questions:
a) Is the input query sequence the same species as the reference sequence (high ANI)? --> percent sequence identity
b) How is the quality in regard to ambiguous Ns?
c) Coverage?

a)

  • count positions in query that match the reference (ACGT)
  • matches / query_length
  • ... or in other words: (query_length - hamming_distance_ignoring_Ns) / query_length

b)

  • count number of Ns in query sequence
  • (query_number_Ns + query_gaps_in_aln) / query_length

c)

  • count number of matches and mismatches in aln (ACGT), gaps and Ns do not count!
  • (matches + mismatches) / reference_length
  • ... or in other words: (query_length - query_number_Ns_and_gaps) / reference_length
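
The three proposed quantities can be written down as a short Python sketch. The issue text leaves some definitions open, so several choices here are assumptions: `query_length` is taken to be the aligned query length including gaps, `reference_length` is the number of non-gap reference positions, and matches and mismatches are counted only where both sequences carry A, C, G or T. The function name `proposed_metrics` and the result keys are hypothetical.

```python
CANONICAL = set("ACGT")


def proposed_metrics(ref_aln: str, query_aln: str) -> dict:
    """Sketch of metrics a), b) and c) for two aligned sequences."""
    if not ref_aln or len(ref_aln) != len(query_aln):
        raise ValueError("aligned sequences must be non-empty and of equal length")
    pairs = list(zip(ref_aln.upper(), query_aln.upper()))

    # Positions where both sequences carry a canonical base.
    informative = [(r, q) for r, q in pairs if r in CANONICAL and q in CANONICAL]
    matches = sum(1 for r, q in informative if r == q)
    mismatches = len(informative) - matches

    query_ns = sum(1 for _, q in pairs if q == "N")
    query_gaps = sum(1 for _, q in pairs if q == "-")

    # Assumption: query_length is the aligned length, gaps included.
    query_length = len(pairs)
    reference_length = sum(1 for r, _ in pairs if r != "-")
    if reference_length == 0:
        raise ValueError("reference must contain at least one non-gap symbol")

    return {
        # a) matches / query_length
        "identity": matches / query_length,
        # b) (query_number_Ns + query_gaps_in_aln) / query_length
        "ambiguity": (query_ns + query_gaps) / query_length,
        # c) (matches + mismatches) / reference_length
        "coverage": (matches + mismatches) / reference_length,
    }


if __name__ == "__main__":
    # Toy alignment: one N, one gap and one substitution in the query.
    print(proposed_metrics("ACGTACGTAC", "ACGTN-GTAA"))
```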
