architect's Introduction

Architect

Architect is a genomic scaffolder aimed at synthetic long read and read-cloud sequencing technologies such as Illumina TruSeq Synthetic Long Reads or the 10X platform.

We describe Architect in detail in the following paper.

V. Kuleshov et al., Genome assembly from synthetic long read clouds, Bioinformatics (2016) 32 (12): i216-i224.

Requirements

Architect is implemented in Python and requires

  • pysam >= 0.82
  • networkx >= 1.10
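
Before running Architect, it may help to confirm that both dependencies can be imported. The snippet below is a minimal sketch that is not part of Architect itself: the helper name missing_packages is made up for illustration, and it only checks that the packages can be found, not their versions.

```python
import importlib.util

# Packages listed under Requirements; version checks are left out of this sketch.
REQUIRED = ("pysam", "networkx")


def missing_packages(names=REQUIRED):
    """Return the names from `names` that cannot be located for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages: " + ", ".join(missing))
    else:
        print("All required packages were found.")
```

If a package is reported missing, installing it with your usual Python package manager should be enough.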

Installation

To install Architect, clone this repo and add it to your PYTHONPATH.

git clone https://github.com/kuleshov/architect.git
cd architect
export PYTHONPATH=$PYTHONPATH:`pwd`

You may now run the program as python /path/to/architect.py.

Input data

Architect takes as input:

  • Genomic contigs in fasta format assembled using a standard (short-read) assembler.
  • A mapping of read clouds to contigs in bam format.
  • Optionally, an alignment of paired-end reads to the contigs.

Usage

Architect is run as follows.

usage: architect.py scaffold [-h] --fasta FASTA [--edges EDGES] --containment
                             CONTAINMENT --out OUT [--min-ctg-len MIN_CTG_LEN]
                             [--cut-tip-len CUT_TIP_LEN]
                             [--pe-abs-thr PE_ABS_THR]
                             [--pe-rel-thr PE_REL_THR]
                             [--pe-rc-rel-thr PE_RC_REL_THR]
                             [--rc-abs-thr RC_ABS_THR]
                             [--rc-rel-edge-thr RC_REL_EDGE_THR]
                             [--rc-rel-prun-thr RC_REL_PRUN_THR] [--log LOG]

optional arguments:
  -h, --help            show this help message and exit
  --fasta FASTA         Input scaffolds/contigs
  --edges EDGES         Known paired-end or read cloud connections
  --containment CONTAINMENT
                        Container hits and various meta-data
  --out OUT             Prefix for the ouput files
  --min-ctg-len MIN_CTG_LEN
                        Discard contigs smaller than this length (def: 0)
  --cut-tip-len CUT_TIP_LEN
                        Cut tips smaller than this length
  --pe-abs-thr PE_ABS_THR
                        Threshold for absolute support when pruning paired-end
                        edges
  --pe-rel-thr PE_REL_THR
                        Threshold for relative support when pruning paired-end
                        edges
  --pe-rc-rel-thr PE_RC_REL_THR
                        Threshold for relative support for read-cloud /
                        paired-end pruning
  --rc-abs-thr RC_ABS_THR
                        Minimum support for create read-cloud based edge
  --rc-rel-edge-thr RC_REL_EDGE_THR
                        Threshold for relative support when creating read-
                        cloud based edges
  --rc-rel-prun-thr RC_REL_PRUN_THR
                        Threshold for relative support when pruning read-cloud
                        based edges
  --log LOG             Save stdout to log file

A containment file encodes container hits in the genome. The edges file encodes paired-end read information. The fasta file contains the pre-assembled contigs.
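
As an illustration of how the three inputs fit into a single call, the sketch below assembles a scaffold command from the flags listed in the usage text. The helper name build_scaffold_command, all file names, and the example value for --min-ctg-len are placeholders chosen for this example, not defaults or recommendations from Architect.

```python
import shlex


def build_scaffold_command(fasta, containment, out, edges=None,
                           script="/path/to/architect.py", extra=None):
    """Build the argument list for `architect.py scaffold`.

    Only flags that appear in the usage text are used here.
    `extra` maps additional documented flags to their values.
    """
    cmd = ["python", script, "scaffold",
           "--fasta", fasta,
           "--containment", containment,
           "--out", out]
    if edges is not None:
        # The edges file is optional according to the usage text.
        cmd += ["--edges", edges]
    for flag, value in (extra or {}).items():
        cmd += [flag, str(value)]
    return cmd


if __name__ == "__main__":
    cmd = build_scaffold_command(
        fasta="contigs.fasta",              # placeholder file name
        containment="contigs.containment",  # placeholder file name
        out="scaffolds",                    # placeholder output prefix
        edges="contigs.edges.tsv",          # placeholder file name
        extra={"--min-ctg-len": 1000},      # illustrative value only
    )
    print(shlex.join(cmd))
    # To run it for real: import subprocess; subprocess.run(cmd, check=True)
```

Building the command as a list keeps paths with spaces intact when it is later passed to subprocess.run.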

The input files are generated from bam alignments using scripts in the /bam folder.

The user may tune Architect via various parameters described above. At the moment, we have set their defaults to values that were found to work well with Illumina TruSeq Synthetic Long Read datasets.

Results

We used Architect to assemble the genomes of D. melanogaster and C. elegans as well as two gut metagenomic datasets. Architect took as input standard short-read assemblies augmented with raw short-read clouds from the Illumina TSLR technology, subsampled to various depths. We found that the scaffolder produced up to 5x improvements in contig contiguity without increasing the misassembly rate, while using 4-20x less sequencing data.

Documentation

For more information on running Architect, have a look at the wiki.

architect's People

Contributors

kuleshov

Forkers

dayedepps

architect's Issues

A couple questions

Hi Volodymyr,

I had a couple of questions. First, our SLR reads came split by their well/barcode. Should I map the reads for each of these back to our contigs one by one and then combine the results into a single file before running bam2containment.py?

Second, is a script included that generates the TSV file for paired-end data?

Thanks for any input you can provide.

Sincerely,

Ian

Improvement on 10X assembly using Supernova

Hello Volodymyr,

I did a 10X Genomics genome assembly using Supernova, the program released by 10X Genomics. I want to use Architect to improve the quality of the assembly. I used BWA-MEM to align all the paired-end reads, the same set used for making the assembly, to the assembly. The alignment rate was 96%. However, when I used pe-connections.py to calculate the edges, the program reported 0 connections. Here is the output:
#Strange reads (skipped): 0
#Reads whose mate was not good: 3839735
#Read paired that were used: 0
#Connections established: 0
Do you have any idea why this happened?

Thank you

Xing Wu
Graduate Student
University of Illinois at Urbana Champaign

question regarding format

Hi,
I am trying to compare this tool with other similar scaffolding tools for 10X Chromium data.
I have a draft genome assembled with discovar using 2x250 PCR-free PE reads, and I recently got a set of 10X Chromium barcoded cloud reads. I mapped them using the LINKS-arcs approach: briefly, extracting the read barcodes from R1, appending them to the end of the read names, then mapping with bwa using standard settings. I can see that I need to modify the script bam2containment.py to parse the names correctly. I just wanted to know if there is any constraint on the way the barcode/well information needs to be stored (i.e. does it need to be a numeric value, or is it OK to use the barcode sequence?)

Thank you!
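
For readers facing the same name-parsing question, here is a minimal sketch of pulling a barcode off the end of a read name and, in case numeric identifiers turn out to be needed, mapping each barcode sequence to an integer. It assumes the barcode is appended after a final underscore, which is a guess at the naming described above. Whether bam2containment.py needs numeric well IDs is not stated in this thread, and split_barcode and BarcodeIndex are hypothetical names, not part of Architect.

```python
def split_barcode(read_name, sep="_"):
    """Split 'NAME<sep>BARCODE' into (NAME, BARCODE), using the last separator.

    The '_' separator is an assumption about how barcodes were appended.
    """
    name, found, barcode = read_name.rpartition(sep)
    if not found or not barcode:
        raise ValueError("no barcode suffix in read name: %r" % read_name)
    return name, barcode


class BarcodeIndex:
    """Assign consecutive integer IDs to barcodes in order of first appearance."""

    def __init__(self):
        self._ids = {}

    def id_for(self, barcode):
        # setdefault returns the existing ID, or stores and returns the next free one.
        return self._ids.setdefault(barcode, len(self._ids))


if __name__ == "__main__":
    index = BarcodeIndex()
    for read_name in ("readA_ACGTACGT", "readB_TTGGCCAA", "readC_ACGTACGT"):
        name, barcode = split_barcode(read_name)
        print(name, barcode, index.id_for(barcode))
```

Keeping the sequence-to-integer mapping in one place makes it easy to switch between the two representations once the expected format is confirmed.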
