mdcao / npscarf

npscarf's Introduction

npScarf: Scaffolding and Completing Assemblies in Real-time Fashion

npScarf (jsa.np.npscarf) is a program that scaffolds and completes draft genome assemblies in real time with Oxford Nanopore sequencing. For microbial datasets, the pipeline can run on a computing cluster as well as on a laptop computer. It also facilitates the real-time analysis of positional information such as gene ordering and the detection of genes from mobile elements (plasmids and genomic islands).

Note: npScarf is no longer maintained. npGraph is under development as its replacement.

Installation

Dependencies: The pipeline requires the following software to be installed:

SPAdes >= 3.5
bwa >= 0.7.11
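
A quick way to confirm that an installed tool meets a minimum version is to compare dotted version strings. The sketch below is only an illustration: version_ge is a hypothetical helper name, it assumes a sort that supports -V (version sort, as in GNU coreutils), and the installed version passed to it (3.13.0) is a made-up value that you would replace with whatever your tool reports.

```shell
# version_ge INSTALLED MINIMUM
# Succeeds (exit 0) if INSTALLED is the same as or newer than MINIMUM.
# Assumes `sort -V` (version sort) is available, as in GNU coreutils.
version_ge() {
    # After version-sorting both strings, the smaller one comes first;
    # INSTALLED is acceptable when that smaller one is MINIMUM.
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Example: 3.13.0 is a hypothetical installed SPAdes version; 3.5 is the
# minimum listed above.
if version_ge 3.13.0 3.5; then
    echo "SPAdes version is recent enough"
else
    echo "SPAdes version is too old" >&2
fi
```

The comparison is deliberately separate from however the version string is obtained, since each tool reports its version in its own way.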

Quick installation guide:

$ git clone https://github.com/mdcao/japsa
$ cd japsa
$ make install \
   [INSTALL_DIR=~/.usr/local \] 
   [MXMEM=7000m \] 
   [SERVER=true \] 
   [JLP=/usr/lib/jni:/usr/lib/R/site-library/rJava/jri]

The npScarf module is bundled with the Japsa package. Details of installation (including for Windows) and usage of Japsa can be found in its documentation hosted on ReadTheDocs. To run npScarf in real time, npReader and, in particular, the HDF library need to be installed properly. Please refer to the installation instructions in the npReader repository.

Tutorial

This tutorial walks through how to use npScarf to complete a genome assembly of the K. pneumoniae ATCC BAA-2146 (Kpn2146) bacterial strain using Illumina and nanopore sequencing data.

Primary data sources:

Illumina sequencing data: It is essential that the reads are trimmed to remove all adaptors. Low-quality bases should also be removed. We make available the sequencing data for the Kpn2146 sample, sequenced with Illumina MiSeq and trimmed with trimmomatic: file1 and file 2.
Nanopore sequencing data: The raw data (before base-calling) of Kpn2146 can be obtained from ENA with run accession ERR868296.

Intermediate data are also made available as you walk through the tutorial.

Processing

Assemble the Illumina data with SPAdes using 16 threads in parallel. The --careful option helps to reduce errors, but the improvement is not significant, so it is safe to drop it from the command if you want to save running time.

$ spades.py --careful --pe1-1 Kp2146_paired_1.fastq.gz --pe1-2 Kp2146_paired_2.fastq.gz -o spades -t 16

The resulting contigs file of interest is spades/contigs.fasta. The contig list is then sorted with:

$ jsa.seq.sort -r -n --input spades/contigs.fasta --output Kp2146_spades.fasta

The assembly of the Illumina data (using SPAdes 3.5) of Kpn2146 is made available here.

Create the bwa index for the Illumina assembly:

$ bwa index Kp2146_spades.fasta
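
Before aligning, it can be worth confirming that indexing actually finished. The sketch below assumes the usual set of index files that bwa index writes next to the reference (.amb, .ann, .bwt, .pac and .sa); has_bwa_index is a hypothetical helper name and is not part of bwa or Japsa.

```shell
# has_bwa_index REFERENCE
# Succeeds if every index file that `bwa index` normally creates next to
# REFERENCE exists and is non-empty.
has_bwa_index() {
    for ext in amb ann bwt pac sa; do
        [ -s "$1.$ext" ] || return 1
    done
    return 0
}

# Example with the reference name used in this tutorial.
if has_bwa_index Kp2146_spades.fasta; then
    echo "bwa index found"
else
    echo "bwa index missing or incomplete: run 'bwa index' first" >&2
fi
```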

In batch mode, where all nanopore data have already been sequenced and base-called, the scaffolding can be done with the command:

$ bwa mem -t 10 -k11 -W20 -r10 -A1 -B1 -O1 -E1 -L0 -a -Y Kp2146_spades.fasta Kp2146_ONT.fastq  | jsa.np.npscarf -input - -format sam -seq Kp2146_spades.fasta -prefix Kp2146-batch

The nanopore sequencing data for the Kpn2146 sample in FASTQ format are made available here.

In real-time mode, assuming the base-called data from the Metrichor service are stored in the folder Downloads, the pipeline can be run with the following command:

$ jsa.np.npreader  --realtime --folder Downloads --fail --stat --number --output - \
 | bwa mem -t 10 -k11 -W20 -r10 -A1 -B1 -O1 -E1 -L0 -a -Y -K 3000 Kp2146_spades.fasta -  \
 | jsa.np.npscarf -realtime -input - -format sam -seq Kp2146_spades.fasta -prefix Kp2146-realtime > log.out 2>&1

The processing can be distributed over a network cluster by using the streaming utilities provided in the Japsa package. Information can be found here and here, and examples are here.

Detailed Usage

A summary of npScarf usage can be obtained by invoking the --help option:

jsa.np.npscarf --help

Note: options with a single dash or a double dash (GNU style) are all acceptable and equivalent, provided no ambiguity is introduced. For example, one can instead call

jsa.np.npscarf -help

or even

jsa.np.npscarf -h

since help is the only option of this command that starts with h.
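
The abbreviation rule can be pictured as unique-prefix matching. The sketch below is not Japsa's actual option parser; resolve_option is a hypothetical helper, and the option names in the example are only those that appear in commands shown in this document, not the tool's complete list.

```shell
# resolve_option ABBREV OPTION...
# Prints the single OPTION that ABBREV identifies: an exact match wins,
# otherwise ABBREV must be a prefix of exactly one OPTION.
# Fails (non-zero exit, no output) when nothing or several options match.
resolve_option() {
    abbrev=$1
    shift
    match=""
    count=0
    for opt in "$@"; do
        if [ "$opt" = "$abbrev" ]; then
            printf '%s\n' "$opt"
            return 0
        fi
        case $opt in
            "$abbrev"*) match=$opt; count=$((count + 1)) ;;
        esac
    done
    if [ "$count" -eq 1 ]; then
        printf '%s\n' "$match"
        return 0
    fi
    return 1
}

# "h" identifies only "help", so this prints: help
resolve_option h help input format seqFile prefix realtime read

# "re" matches both "realtime" and "read", so it is rejected as ambiguous.
resolve_option re help input format seqFile prefix realtime read \
    || echo "ambiguous or unknown option: re" >&2
```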

WARNING: Always check the help output before running npScarf, since the structure and parameter list of the command can change significantly between versions.

Input

npScarf takes two files as required input:

jsa.np.npscarf -seq <draft> -input <input> -format sam

<draft> is the FASTA file containing the pre-assemblies. Normally this is the output from running SPAdes on Illumina MiSeq paired-end reads.

<input> contains SAM/BAM formatted alignments between the <draft> file and the <nanopore> FASTA/FASTQ file of long-read data. We recommend BWA-MEM as the aligner, with the following fixed parameter set:

bwa mem -k11 -W20 -r10 -A1 -B1 -O1 -E1 -L0 -a -Y <draft> <nanopore> > <bam>

Starting from the newest versions of npScarf, BWA-MEM is integrated into the command for convenience. The input file is therefore no longer limited to SAM/BAM: you can also provide long reads in FASTQ/FASTA format together with BWA-MEM arguments. For example, instead of explicitly piping SAM/BAM data from BWA-MEM like:

bwa mem -k11 -W20 -r10 -A1 -B1 -O1 -E1 -L0 -a -Y <draft> <nanopore> \
| jsa.np.npscarf -input - -format sam -seq <draft> > log.out 2>&1

you can run:

jsa.np.npscarf -bwaExe=</path/to/BWA> -bwaThread=<#threads> -input <nanopore> -format fastq -seq <draft> > log.out 2>&1

For that reason, it is important to specify the format of the input file if it is SAM/BAM (the default is FASTA/FASTQ). You do not have to specify the location of the BWA executable if it is already on your PATH.
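
Because a wrong or missing -format changes how the input is interpreted, a wrapper script may want to derive the value from the file name. The sketch below is one possible approach: guess_format is a hypothetical helper, it only returns the -format values that appear in this document (sam, bam, fastq), and treating .fq as FASTQ is an assumption based on common naming practice.

```shell
# guess_format FILE
# Prints a value for -format based on the file extension, limited to the
# values that appear in this document (sam, bam, fastq). Fails for anything
# else, including "-" (standard input), where the format must be given by hand.
guess_format() {
    case $1 in
        *.sam)          echo sam ;;
        *.bam)          echo bam ;;
        *.fastq | *.fq) echo fastq ;;
        *)              return 1 ;;
    esac
}

# Example with a hypothetical input file name; the command is only printed.
input=alignments.sam
if fmt=$(guess_format "$input"); then
    echo "would run: jsa.np.npscarf -input $input -format $fmt -seq Kp2146_spades.fasta"
else
    echo "cannot guess the format of $input; pass -format explicitly" >&2
fi
```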

Output

The npScarf output is specified by the -prefix option. The default prefix is 'out'. Normally the tool generates two files, prefix.fin.fasta and prefix.fin.japsa, which contain the resulting scaffolds in FASTA and JAPSA format.
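
After a run, a script can check that both result files were written. The sketch below assumes only the naming scheme described above; check_outputs is a hypothetical helper name.

```shell
# check_outputs [PREFIX]
# Succeeds if both result files described above exist and are non-empty.
# PREFIX defaults to "out", the default of the -prefix option.
check_outputs() {
    prefix=${1:-out}
    for f in "$prefix.fin.fasta" "$prefix.fin.japsa"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f" >&2
            return 1
        fi
    done
    return 0
}

# Example: verify the results of the batch run from the tutorial.
if check_outputs Kp2146-batch; then
    echo "scaffolding results are in place"
fi
```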

In real-time mode, if any annotation analysis is enabled, a file named prefix.anno.japsa is generated instead. This file contains the features detected after scaffolding.

Real-time scaffolding

To run npScarf in streaming mode:

jsa.np.npscarf -realtime [options]

In this mode, the <bam> file is processed block by block. The block size (number of BAM/SAM records) can be controlled through the options -read and -time.

Streaming mode is intended for the case where the input <nanopore> file is itself retrieved as a stream. npReader is the module that provides such data from the fast5 files returned by the real-time base-calling cloud service Metrichor. One can run:

jsa.np.npreader -realtime -folder c:\Downloads\ -fail -output - | \
bwa mem -t 10 -k11 -W20 -r10 -A1 -B1 -O1 -E1 -L0 -a -Y -K 3000 <draft> - 2> /dev/null | \
jsa.np.npscarf -realtime -input - -format sam -seq <draft> > log.out 2>&1

or if you have the whole set of Nanopore long reads already and want to emulate the streaming mode:

jsa.np.timeEmulate -s 100 -i <nanopore> -output - | \
bwa mem -t 10 -k11 -W20 -r10 -A1 -B1 -O1 -E1 -L0 -a -Y -K 3000 <draft> - 2> /dev/null | \
jsa.np.npscarf -realtime -input - -format sam -seq <draft> > log.out 2>&1

Note that jsa.np.timeEmulate relies on the timestamp field in the read name line to decide the order in which data are streamed. So if your input <nanopore> file already contains this field, you have to sort it:

jsa.seq.sort -i <nanopore> -o <nanopore-sorted> -sortKey=timestamp

or, if your file does not have timestamp data yet, you can add it manually. For example:

cat <nanopore> | awk 'BEGIN{time=0.0}NR%4==1{printf "%s timestamp=%.2f\n", $0, time; time++}NR%4!=1{print}' \
> <nanopore-with-time>

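To decide which of the two cases applies, one can count the reads that still lack a timestamp. The sketch below is an illustration under the same assumption as the awk command above, namely plain FASTQ with exactly four lines per record; count_untimed is a hypothetical helper name.

```shell
# count_untimed FASTQ
# Prints the number of reads whose header line lacks a "timestamp=" field.
# Assumes plain 4-line FASTQ records, so header lines are lines 1, 5, 9, ...
count_untimed() {
    awk 'NR % 4 == 1 && !/timestamp=/ { n++ } END { print n + 0 }' "$1"
}
```

A result of 0 means that every read already carries a timestamp and the file only needs sorting; any other value means that the timestamps have to be added first.
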
Real-time annotation

The tool includes a use case for streaming annotation. One can provide a database of antibiotic resistance genes and/or origins of replication in FASTA format for the analysis of gene ordering and/or plasmid identification, respectively:

jsa.np.timeEmulate -s 100 -i <nanopore> -output - | \
bwa mem -t 10 -k11 -W20 -r10 -A1 -B1 -O1 -E1 -L0 -a -Y -K 3000 <draft> - 2> /dev/null | \
jsa.np.npscarf -realtime -input - -format sam -seq <draft> -resistGene <resistDB.fasta> -oriRep <origDB.fasta> > log.out 2>&1

Or one can input any annotation in GFF 3.0 format:

jsa.np.npscarf -realtime -input - -format sam -seq <draft> -genes <genesList.GFF> > log.out 2>&1

Assembly graph

npScarf can read the assembly graph information from SPAdes during gap-filling to make the results more precise. This function is still under development, and the results might deviate slightly from the stable version in terms of the number of final contigs:

jsa.np.npscarf -input <input> -format <format> -seq <draft> -spades <spades output folder> > log.out 2>&1

Citation

Please cite npScarf if you find it useful for your research:

Cao, M.D., Nguyen, H.S., et al. Scaffolding and Completing Genome Assemblies in Real-time with Nanopore Sequencing. Nature Communications 8, Article number: 14515 (2017). doi:10.1038/ncomms14515.

Data and results from npScarf presented in the paper are made available following this link. The QUAST analyses of results from npScarf and competing methods are also presented for K. pneumoniae ATCC BAA-2146, K. pneumoniae ATCC 13883, E. coli K12 MG1655 (http://data.genomicsresearch.org/Projects/npScarf/results/QUAST/EcK12S/report.html), S. Typhi H58 (http://data.genomicsresearch.org/Projects/npScarf/results/QUAST/StH58/report.html) and S. cerevisiae W303 (http://data.genomicsresearch.org/Projects/npScarf/results/QUAST/W303/report.html).

License

See Japsa license

npscarf's Issues

run report

Is there a way to get an output from the run showing things like which contigs were joined? Also, is the analysis done with repeat masking, and if not, how can we run it with repeat masking?

Thanks for the npscarf tool

run npScarf crashes

Hello,
I ran npScarf with nanopore reads to scaffold a SPAdes assembly.
My shell script is:

$myexe/jsa.seq.sort -r -n --input $inputfa --output sort_spades.fasta &> $outfile
bwa index sort_spades.fasta &>> $outfile
bwa mem -t 20 -x ont2d -a -Y sort_spades.fasta $reads >sort_spades.fasta.sam
$myexe/jsa.np.npscarf --input sort_spades.fasta.sam --seqFile sort_spades.fasta --format sam --spadesDir ../output_spades_rmLong/ --verbose --prefix=npScarf &>> $outfile

It crashes with error:

#Sort list of bridges
Starting scaffolding.......
Extending 0 to the rear
Last of scaffold 0 extention is on contig 0 (1-NODE_1_length_44147_cov_8.432349): iterating among 57 bridges
...10-NODE_10_length_15495_cov_30.258980
Exception in thread "main" java.lang.NullPointerException
at japsa.bio.hts.scaffold.ContigBridge.display(ContigBridge.java:777)
at japsa.bio.hts.scaffold.ScaffoldGraphDFS.walk2(ScaffoldGraphDFS.java:378)
at japsa.bio.hts.scaffold.ScaffoldGraphDFS.connectBridges(ScaffoldGraphDFS.java:261)
at japsa.tools.bio.np.NPScarfCmd.main(NPScarfCmd.java:286)

SPADES alternative

Hello,

I have some previously generated drafts from another assembly pipeline, which don't contain the required information in the FASTA headers.

I'd like to ask if there is any other way to generate such headers, because I can't go back and redo assembly from scratch.

Thanks in advance,
Pedro

npScarf for a metagenome bin?

I assume that npScarf is designed for single-species bacterial samples. Still, I want to check how it works for a bacterium in my metagenomics sample. With Illumina MiSeq data, I did the assembly with SPAdes and then binned the contigs using MetaBAT.
I tried to adapt the workflow for one of the good bins:
using bin1.fasta as spades.fasta
mapping the nanopore reads to bin1.fasta to create a SAM file
jsa.np.npscarf -input ONT.sam --spadesDir='spades' -format sam -seq bin1.fasta -prefix bin1_spades > a.log
However, unlike with the example dataset, this takes a long time, and the output a.log becomes so large that I have to kill the job. Could you please tell me whether it is OK to run npScarf this way on a metagenome bin?

Velvet contig: WARN japsa.tools.bio.np.NPScarfCmd - Not found any legal SPAdes output folder,

Hello:
I had a contig file, Mtctgs.fasta, which I got from Velvet. It looks like:
>NODE_10906_length_1007_cov_71.668320
ATCGACTAGCTAGCTACGTCAGCATGCTAGCTCAGCTACGACTAGCATCAGCTCG

a bam file [email protected] computed from:
bwa mem [email protected] pacbio.fasta

And I ran this:
jsa.np.npscarf -seq Mtctgs.fasta -input [email protected] -format bam -seq Mtctgs.fasta > Mtctgs-out

But I still got this warning and didn't get any output:

[main] WARN japsa.tools.bio.np.NPScarfCmd - Not found any legal SPAdes output folder, assembly graph thus not included!

Exception in thread "main" java.lang.IndexOutOfBoundsException: Index -1 out of bounds for length 381
at java.base/jdk.internal.util.Preconditions.outOfBounds(Preconditions.java:64)
at java.base/jdk.internal.util.Preconditions.outOfBoundsCheckIndex(Preconditions.java:70)
at java.base/jdk.internal.util.Preconditions.checkIndex(Preconditions.java:248)
at java.base/java.util.Objects.checkIndex(Objects.java:372)
at java.base/java.util.ArrayList.get(ArrayList.java:458)
at japsa.bio.hts.scaffold.ScaffoldGraph.makeConnections2(ScaffoldGraph.java:347)
at japsa.tools.bio.np.NPScarfCmd.main(NPScarfCmd.java:284)

Then I checked the code and found the following. Does that mean the spades.out folder has to contain graphFile and pathFile?

if(spadesFolder !=null && graphFile.exists() && pathFile.exists())
	LOG.info("===> Use assembly graph and path from SPAdes!");
else{
	LOG.warn("Not found any legal SPAdes output folder, assembly graph thus not included!");
	spadesFolder=null;
}

Hope for your reply
Yunxiali

about sort the contig list in step Processing

@hsnguyen hello!
In the "Processing" step, after assembling the Illumina data with spades.py, I need to use jsa.seq.sort to sort the contig list. I am a little confused about this operation. If I skip it, will there be any serious consequences, or will the contig data set become inaccurate? Is it feasible to use the contig data set directly after spades.py? I ask because when I install Japsa, the specified paths to libjhdf5 always cause problems.
Looking forward to your reply, thanks.

npScarf crashes

Hi,
I am running npScarf on a few nanopore and PacBio datasets and it is generally working fine.
On a PacBio dataset though it crashes with error:
Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: -1, Size: 1
at java.util.LinkedList.checkElementIndex(LinkedList.java:555)
at java.util.LinkedList.remove(LinkedList.java:525)
at japsa.bio.hts.scaffold.ScaffoldGraph.joinScaffold(ScaffoldGraph.java:494)
at japsa.bio.hts.scaffold.ScaffoldGraphDFS.walk2(ScaffoldGraphDFS.java:309)
at japsa.bio.hts.scaffold.ScaffoldGraphDFS.connectBridges(ScaffoldGraphDFS.java:99)
at japsa.tools.bio.np.GapCloserCmd.main(GapCloserCmd.java:139)
The full report is in the attached file. It seems the problem occurs when it tries to connect contigs
122, 147 and 116 to Scaffold 79, and indeed if I take out 122 or 116 from the initial fasta everything
works fine, but I couldn't figure out where the problem is, can you help me figure out what is wrong or what should I look at to start with?

The pipeline I am running is:
jsa.seq.sort -r -n --input spades.fasta --output sort_spades.fasta
bwa index sort_spades.fasta
bwa mem -t 10 -k11 -W20 -r10 -A1 -B1 -O1 -E1 -L0 -a -Y sort_spades.fasta pacbio.fastq | jsa.np.gapcloser -b - -seq sort_spades.fasta --verbose --output=npScarf.fasta

where spades.fasta is my initial assembly from spades, and pacbio.fastq the fastq with the pacbio reads,
and I am using the japsa package: Version 1.6-01c, Built on Thu Jun 23 10:10:20 BST 2016 with javac 1.7.0_80.
Please let me know if you need additional info.

Thank you a lot!
Fran

output.txt

running program with .bam and scaffold

Hi,

I'm curious if it's possible to run npScarf with ONT reads aligned to a reference genome (consisting of unique contigs). I've aligned the long reads with minimap2 to generate an indexed and sorted .bam file.

It was my thought that all that was needed to accomplish this was to supply the .bam file along with the original contigs.fasta file which the reads were aligned to. Perhaps there is more? This was my command:

BAM=/path/to/my/sample.bam
CONTIGS=/path/to/my/contigs.fa

jsa.np.npscarf --input $BAM --format bam --seqFile $CONTIGS > log.out 2>&1

This generated a series of files (out.fin.fasta, out.fin.japsa, and the .log file), yet the number of unique scaffolds (and total nucleotides) was orders of magnitude smaller than in the original contigs.fa file.

In addition, there was also a series of warnings and errors in the log file:

[main] WARN japsa.tools.bio.np.NPScarfCmd - Not found any legal SPAdes output folder, assembly graph thus not included!
#Sort list of bridges
========================== START =============================
  contig   0  ======>     0   57033 contig_0 
Size = 1 sequence
============================ END ===========================
...
many more of these 
...
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException
	at java.base/java.lang.System.arraycopy(Native Method)
	at java.base/java.util.Arrays.copyOfRange(Arrays.java:4030)
	at japsa.seq.Sequence.subSequence(Sequence.java:272)
	at japsa.bio.hts.scaffold.ContigBridge$Connection.filling(ContigBridge.java:1201)
	at japsa.bio.hts.scaffold.Scaffold.viewSequence(Scaffold.java:438)
	at japsa.bio.hts.scaffold.ScaffoldGraph.printSequences(ScaffoldGraph.java:965)
	at japsa.tools.bio.np.NPScarfCmd.main(NPScarfCmd.java:291)

I'm wondering what to make of the errors. Is each of these exceptions a product of my not providing an assembly graph? Are there additional arguments I should be passing to the jsa.np.npscarf command?
