Giter Site home page Giter Site logo

lens's Introduction

Lens

Algorithm for uncovering genomic variants and haplotypes in metagenomes from long reads. Lens is described in:

High-resolution structure of the human microbiome revealed with synthetic long reads.
Volodymyr Kuleshov, Chao Jiang, Wenyu Zhou, Fereshteh Jahanbani, Serafim Batzoglou, Michael Snyder. 
Nature Biotechnology, 2015.

The human microbiome is composed of thousands of bacterial stains; these strains may greatly differ in their virulence or in their response to antibiotics.

Lens is a software package that detects strain variation between organisms of the same bacterial species directly from metagenomic samples based on long read data. Lens assembles and reports these variants in the form of long haplotypes.

At the core of Lens is a greedy phasing algorithm that generalizes single-individual haplotyping algorithms to the setting where there the number of underlying haplotypes may be greater than two, and moreover may be unknown.

Installation

To install Lens, simply clone the git repo:

git clone [...];
cd lens;

Requirements

Lens requires python-2.7 and the package pysam version >=0.8.2. The easiest way to install it using pip install pysam.

Testing the installation

To test whether Lens works, we have provided a small testing package in lens/test:

cd lens/test;
make run;

This executes Lens on a small .bam containing real synthetic long reads. The reults is a file called haplotypes.txt, which should contain 6 haplotypes recovered from the long read data.

Running Lens

Lens takes as input a .bam file with long reads aligned to a reference genome or to an assembled genomic contig. We recommend using bwa mem for generating this file.

Step 0: BAM filtering

We suggest running Lens only on reads with high mapping scores (>=30). We also suggest using reads that align across their entire length (that aren't clipped). We filter for such reads using:

python filter_by_cigar.py -i input.bam -o filtered.bam;
samtools view -b -h -q 30 filtered.bam > filtered.q30.bam
Step 1: Variant calling

Our first step is to call variants in the reference:

python make_variants.py \
  -b filtered.q30.bam \
  -o variants.pos \
  --coverage-threshold 3 \
  --frequency-threshold 0.1 \
  --qscore-threshold 15 \
  --indels

This will find all variants (including indels) that have at least 3 reads supporting them and have an allele frequency of at least 0.1 (these are the default parameters). Also, only reads that have a qscore of 15 or more at a given position will be used for calculating support.

Step 2: Variant calling in reads

Next, we compile the reads that support the variants we have just identified. The result of this step is a file that lists for each read all the variants that the read covers.

python make_reads.py \
  -b filtered.q30.bam \
  -v variants.pos \
  -r variants.reads \
  --qscore-threshold 15

The format of variants.reads is:

<ctg id>   <read id>   <pos1:allele1>   <pos2:allele2>  ...
Step 3: Assembling reads into haplotypes

Finally, we assemble the reads at their overlapping variants into bacterial haplotypes.

python detect_subspecies.py \
  -b filtered.q30.bam \
  -r variants.reads \
  -k haplotypes.txt \
  --cov-cutoff 2 \
  --cov-cutoff-percentage 0.75 \
  --similarity-cutoff 1.0 \
  --overlap-cutoff 2

This will report only haplotypes that have a coverage of at least two over at least 75% of their length. Two reads will be connected into a haplotype if they overlap at at least two variants and are identical at both of these positions.

The format of haplotypes.txt is:

<ctg id>   <start pos>   <end pos>   <coverage>   <variants>

Feedback

Please send feedback to Volodymyr Kuleshov

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.