


Nullarbor

Pipeline to generate complete public health microbiology reports from sequenced isolates

⚠️ This documents the current Nullarbor 2.x version; previous 1.x is here

Motivation

Public health microbiology labs receive batches of bacterial isolates whenever there is a suspected outbreak. In modernised labs, each of these isolates will be whole genome sequenced, typically on an Illumina or Ion Torrent instrument. Each of these WGS samples needs to be quality checked for coverage, contamination and correct species. Genotyping (eg. MLST) and resistome characterisation are also required. Finally, a phylogenetic tree needs to be generated to show the relationship and genomic distance between the strains. All this information is then combined with epidemiological information (metadata for each sample) to assess the situation and inform further action.

Example reports

Feel free to browse some example reports.

Pipeline

Limitations

Nullarbor currently only supports Illumina paired-end sequencing data; single-end reads, whether from Illumina or Ion Torrent, are not supported. All jobs are run on a single compute node; there is no support yet for distributing the work across a high-performance cluster.

Per isolate

  1. Clean reads
    • remove adaptors, low quality bases and reads (Trimmomatic)
  2. Species identification
    • k-mer analysis against known genome database (Kraken, Kraken2, Centrifuge)
  3. De novo assembly
    • User can select (SKESA, SPAdes, Megahit, shovill, Velvet)
  4. Annotation
    • Add features to the assembly (Prokka)
  5. MLST
    • From assembly w/ automatic scheme detection (mlst + PubMLST)
  6. Resistome
  7. Virulome
  8. Variants
    • From reads aligned to reference (snippy)

Per isolate set

  1. Core genome SNPs
  2. Infer core SNP phylogeny
  3. Pan genome
    • From annotated contigs (Roary)
  4. Report
    • Summary isolate information (HTML + Plotly.JS + DataTables + PhyloCanvas)
    • More detailed per isolate pages (COMING SOON)

Installation

You need to install the software and the databases separately.

Software

Conda

Install Conda or Miniconda:

conda install -c conda-forge -c bioconda -c defaults nullarbor

Homebrew (coming soon)

Install Homebrew (macOS) or LinuxBrew (Linux).

brew install brewsci/bio/nullarbor

Source

This is the hardest way to install Nullarbor.

cd $HOME
git clone https://github.com/tseemann/nullarbor.git

# keep running this command and installing stuff until it says everything is correct
./nullarbor/bin/nullarbor.pl --check

# For Perl modules (eg. YAML::Tiny), use one of the following methods
apt-get install yaml-tiny-perl  # ubuntu/debian
yum install perl-YAML-Tiny      # centos/redhat
cpan YAML::Tiny
cpanm YAML::Tiny

Databases

Kraken

You need to install a Kraken database (~8 GB).

wget https://ccb.jhu.edu/software/kraken/dl/minikraken_20171019_8GB.tgz
tar -C $HOME -zxvf minikraken_20171019_8GB.tgz

Kraken 2

You need to install a Kraken2 database (~8 GB).

wget ftp://ftp.ccb.jhu.edu/pub/data/kraken2_dbs/minikraken2_v2_8GB_201904_UPDATE.tgz
tar -C $HOME -zxvf minikraken2_v2_8GB_201904_UPDATE.tgz

Centrifuge

Install a Centrifuge database (~8 GB):

wget ftp://ftp.ccb.jhu.edu/pub/infphilo/centrifuge/data/p_compressed+h+v.tar.gz
mkdir $HOME/centrifuge-db
tar -C $HOME/centrifuge-db -zxvf p_compressed+h+v.tar.gz

Set global database locations

Then add the following to your $HOME/.bashrc so Nullarbor can find the databases:

export KRAKEN_DEFAULT_DB=$HOME/minikraken_20171019_8GB
export KRAKEN2_DEFAULT_DB=$HOME/minikraken2_v2_8GB_201904_UPDATE
export CENTRIFUGE_DEFAULT_DB=$HOME/centrifuge-db/p_compressed+h+v

You should be good to go now. When you first run Nullarbor it will let you know of any missing dependencies or databases.

Usage

Check dependencies

Nullarbor does a self-check of all binaries, Perl modules and databases:

nullarbor.pl --check

Create a 'samples' file (TAB)

This is a file, one line per isolate, with 3 tab-separated columns: ID, R1, R2.

Isolate1	/data/reads/Isolate1_R1.fq.gz	/data/reads/Isolate1_R2.fq.gz
Isolate2	/data/reads/Isolate2_R1.fq      /data/reads/Isolate2_R2.fq
Isolate3	/data/old/s_3_1_sequence.txt	/data/old/s_3_2_sequence.txt
Isolate3b	/data/reads/Isolate3b_R1.fastq	/data/reads/Isolate3b_R2.fastq

Choose a reference genome (FASTA, GENBANK)

This is just a regular FASTA or GENBANK file. Try to choose a reference that is phylogenomically similar to your isolates.
If you use a GENBANK or EMBL file, the annotations will be used by Snippy to annotate SNPs.

Generate the run folder

This command will create a new folder with a Makefile in it:

nullarbor.pl --name PROJNAME --mlst saureus --ref US300.fna --input samples.tab --outdir OUTDIR

This will check that everything is okay. One of the last lines it prints is the command you need to run to actually perform the analysis, e.g.

Run the pipeline with: nice make -j 4 -C OUTDIR

So you can just cut and paste that:

nice make -j 4 -C OUTDIR

The -C option just tells make to change into the OUTDIR folder first, so you could do this instead:

cd OUTDIR
make -j 4

View the report

firefox OUTDIR/report/index.html

Here are some example reports.

See some options

Once set up, a Nullarbor folder can be used in a few different ways. See what's available with this command:

make help

Advanced usage

Quick preview mode

You should not do a full run the first time, because the data will probably contain outliers and QC failures. To build a quick "rough" tree:

make preview

This will create a mini-report in the same report/ folder. Use this to identify outliers, then comment them out of (or delete them from) the --input file. Then type the following to regenerate the report for a second round of inspection:

make again
make preview

When you are happy with the result, proceed with the full analysis:

make again
make

Prefilling data

Often you want to perform multiple analyses where some of the isolates have been used in previous Nullarbor runs. It is wasteful to recompute results you already have. The --prefill option allows you to "copy" existing result files into a new Nullarbor folder before commencing the run.

To set it up, add a prefill section to nullarbor.conf as follows:

# nullarbor.conf
prefill:
        contigs.fa: /home/seq/MDU/QC/{ID}/contigs.fa

The {ID} will be replaced with each isolate ID in your --input TAB file, and the contigs.fa will be copied from the source path specified. This prevents Nullarbor from having to re-assemble the reads.
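
For example, with the config above and an isolate named Isolate1, the source path expands to /home/seq/MDU/QC/Isolate1/contigs.fa. If your nullarbor.conf is not the one bundled with the install, a sketch of pointing Nullarbor at it explicitly (placeholder names from the earlier example; you could also set NULLARBOR_CONF):

# hypothetical invocation re-using existing assemblies via prefill
nullarbor.pl --name PROJNAME --mlst saureus --ref US300.fna \
  --input samples.tab --conf nullarbor.conf --outdir OUTDIR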

Using different components

Nullarbor 2.x has a plugin system for assembly and tree building. These can be changed using the --assembler and --treebuilder options.

Read trimming is off by default because most reads are now provided pre-trimmed, and re-trimming consumes considerable extra disk space. To trim Illumina adaptors, use the --trim option.
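
A sketch of a run that overrides the defaults, assuming the plugin names match the skesa and iqtree tools in the Dependencies list below (other options as in the earlier example):

# hypothetical example: choose assembler and tree builder, enable adaptor trimming
nullarbor.pl --name PROJNAME --mlst saureus --ref US300.fna --input samples.tab --outdir OUTDIR \
  --assembler skesa --treebuilder iqtree --trim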

Removing isolates from an existing run

After examining the report from your initial analysis, it is common to spot some outliers or bad data. In this case you want to remove those isolates from the analysis while minimising the amount of recomputation needed.

Just go to the original --input TAB file and either (1) remove the offending lines, or (2) add a # symbol to "comment out" each line so it will be ignored by Nullarbor.
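
For example, a samples.tab where one outlier (the isolate names and paths are placeholders) has been commented out:

Isolate1	/data/reads/Isolate1_R1.fq.gz	/data/reads/Isolate1_R2.fq.gz
#Isolate2	/data/reads/Isolate2_R1.fq.gz	/data/reads/Isolate2_R2.fq.gz
Isolate3	/data/reads/Isolate3_R1.fq.gz	/data/reads/Isolate3_R2.fq.gz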

Then go back into the Nullarbor folder and type make again; it should build a new report. Assemblies and SNPs won't be redone, but the tree-builder and pan-genome components will need to run again.

Adding isolates to an existing run

As per "Removing isolates" above, you can also add more isolates to your original --input TAB file when you want to expand the analysis. Then just type make again and it should only recalculate what it needs to, saving a lot of computation.

Immediate start

If you don't want to cut and paste the make .... instructions to start the analysis, just add the --run option to your nullarbor.pl command.
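
For example, this sketch (same placeholder options as earlier) creates the run folder and starts the analysis in one step:

nullarbor.pl --name PROJNAME --mlst saureus --ref US300.fna --input samples.tab --outdir OUTDIR --run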

Influential environment variables

  • NULLARBOR_CONF - default --conf, the path to nullarbor.conf
  • NULLARBOR_CPUS - default --cpus
  • NULLARBOR_ASSEMBLER - default --assembler tool
  • NULLARBOR_TREEBUILDER - default --treebuilder tool
  • NULLARBOR_TAXONER - default --taxoner tool
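
A sketch of setting these in $HOME/.bashrc; the values below are illustrative choices, not Nullarbor's defaults:

export NULLARBOR_CONF=$HOME/nullarbor.conf
export NULLARBOR_CPUS=16
export NULLARBOR_ASSEMBLER=skesa
export NULLARBOR_TREEBUILDER=iqtree
export NULLARBOR_TAXONER=kraken2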

Dependencies

Nullarbor has many dependencies, so you are best off using a package manager to install it. Type nullarbor.pl --check to see what you need.

Perl: Bio::Perl Time::Piece List::Util Path::Tiny YAML::Tiny Moo SVG Text::CSV List::MoreUtils IO::File

Tools: seqtk trimmomatic prokka roary mlst abricate seqret skesa megahit spades shovill snippy snp-dists newick-utils iqtree fasttree quicktree kraken kraken2 centrifuge

Databases: minikraken centrifuge-bacvirhum

Note that these are only the immediate dependencies and that the tools listed above will depend on various other tools, Perl modules, and Python modules.

Etymology

The Nullarbor is a huge treeless plain that spans the area between south-west and south-east Australia. It comes from the Latin "nullus" (no) and "arbor" (tree), or "no trees". As this software will generate a tree, there is an element of Australian irony in the name.

Issues

Submit problems to the Issues Page

License

GPL 2.0

Citation

Seemann T, Goncalves da Silva A, Bulach DM, Schultz MB, Kwong JC, Howden BP. Nullarbor Github https://github.com/tseemann/nullarbor

nullarbor's People

Contributors

andersgs, drpowell, tseemann


nullarbor's Issues

--force option sometimes re-performs mapping and SNP calls

I want to run nullarbor on a subset of my samples that have already been run through nullarbor.
For organisational purposes, and so external parties can easily view just that subset, I want to regenerate the nullarbor webpage report; otherwise I would have just run snippy-core.

I created a new project folder, with symlinks to the sample folders in the main nullarbor directory. I created a new samples.tab file with the list of isolates in the subset for analysis and generated a new makefile from this.

However, when running nullarbor from this new project folder with --force, nullarbor repeats the mapping and SNP calling process. Why is this? The reference is unchanged.

skewer taking a long time ...

Is it usual for skewer to take 8.5 minutes to clip reads? (514 seconds for 5,619,342 MiSeq reads using 64 CPUs.) I thought I remembered skewer running much faster than that, e.g. 30 seconds previously.

Installing Nullarbor on Bio-linux 8.0 (Ubuntu 14.10)

I just installed Bio-linux and Nullarbor on a couple of my students' laptops. The nullarbor installation had a few issues, and the issues were the same on each computer (different specs).
Here is what I did to get nullarbor installed on each machine; maybe this can help someone else.

Blast install fails.

brew install blast --without-check
brew install nullarbor

openssl install fails

brew remove curl
brew install curl
brew install nullarbor

librsvg install fails

sudo apt-get install libgtk-3-dev
brew install librsvg
brew install nullarbor

nullarbor install breaks

Dear Group
This issue appears to be very similar to issue #47
I am trying to install nullarbor on an Ubuntu 14.04 machine.
The "brew install nullarbor" command produces compile errors. Examining ~/.cache/Homebrew/Logs/blast/02.make, it appears the Boost headers cannot be found: the gcc command does not seem to list their locations (-I). They are installed on the system (/usr/include/boost) as well as within linuxbrew (brew install boost reports "Warning: boost-1.60.0_1 already installed").
Is there a way to include extra compile paths (i.e. -I) in the brew command?

regards
Simon

make abricate fails

When trying to run a job for the first time, if one only wants abricate results, just running make abricate crashes.

It searches for the clipped reads, and when it doesn't find them, it crashes. For example:

  Makefile:658: recipe for target '2015-22510/R1.fq.gz' failed

nullarbor.pl is not passing --cpus value to Snippy and FastTree

kraken --threads 64 --preload --quick --paired 2012-10754/R1.fq.gz 2012-10754/R2.fq.gz | kraken-report > 2012-10754/kraken.tab
Loading database... complete.
4078498 sequences (1972.19 Mbp) processed in 98.247s (2490.8 Kseq/m, 1204.43 Mbp/m).
  3924156 sequences classified (96.22%)
  154342 sequences unclassified (3.78%)
snippy --force --outdir 2012-10753/2012-10753 --ref ref.fa --R1 2012-10753/R1.fq.gz --R2 2012-10753/R2.fq.gz
[12:50:38] This is snippy 2.6
[12:50:38] Written by Torsten Seemann <[email protected]>
[12:50:38] Obtained from https://github.com/tseemann/snippy
[12:50:38] Detected operating system: linux
[12:50:38] Enabling bundled linux tools.
[12:50:38] Found bwa - /bio/linuxbrew/bin/bwa
[12:50:38] Found samtools - /bio/linuxbrew/bin/samtools
[12:50:38] Found tabix - /bio/linuxbrew/bin/tabix
[12:50:38] Found bgzip - /bio/linuxbrew/bin/bgzip
[12:50:38] Found parallel - /bio/linuxbrew/bin/parallel
[12:50:38] Found freebayes - /bio/linuxbrew/bin/freebayes
[12:50:38] Found freebayes-parallel - /bio/linuxbrew/bin/freebayes-parallel
[12:50:38] Found fasta_generate_regions.py - /bio/linuxbrew/bin/fasta_generate_regions.py
[12:50:38] Found vcffilter - /bio/linuxbrew/bin/vcffilter
[12:50:38] Found vcfstreamsort - /bio/linuxbrew/bin/vcfstreamsort
[12:50:38] Found vcfuniq - /bio/linuxbrew/bin/vcfuniq
[12:50:38] Found vcffirstheader - /bio/linuxbrew/bin/vcffirstheader
[12:50:38] Found vcf-consensus - /bio/linuxbrew/bin/vcf-consensus
[12:50:38] Found snippy-vcf_to_tab - /home/tseemann/git/snippy/bin/snippy-vcf_to_tab
[12:50:38] Found snippy-vcf_report - /home/tseemann/git/snippy/bin/snippy-vcf_report
[12:50:38] Using reference: /home/jkwong1/testing/nullarbor/test/ref.fa
[12:50:38] Will use 8 CPU cores.
[12:50:38] Using read file: /home/jkwong1/testing/nullarbor/test/2012-10753/R1.fq.gz
[12:50:38] Using read file: /home/jkwong1/testing/nullarbor/test/2012-10753/R2.fq.gz
[12:50:38] Creating folder: 2012-10753/2012-10753
...
FastTree -gtr -nt core.aln > tree.newick
FastTree Version 2.1.8 Double precision (No SSE3), OpenMP (8 threads)
Alignment: core.aln

Brew installable

brew install fails due to a dependency of ImageMagick:

Clean Ubuntu 14.04 install.

...
sudo apt-get -y install build-essential curl git m4 ruby texinfo libbz2-dev libcurl4-openssl-dev libexpat-dev libncurses-dev zlib1g-dev python-pip libpng-dev unzip flex bison python-dev libpng-dev pkg-config libcairo2-dev perl-doc expect

sudo perl -MCPAN -e "CPAN::Shell->notest('install', 'Bio::Perl')"
sudo cpan -i Moo
sudo cpan -i Spreadsheet::Read
sudo cpan -i SVG::Graph
...
brew install nullarbor

Installing dependencies for nullarbor: libcroco, librsvg, imagemagick
==> Installing nullarbor dependency: libcroco
==> Downloading http://ftp.gnome.org/pub/GNOME/sources/libcroco/0.6/libcroco-0.6.8.tar.xz
Already downloaded: /home/vagrant/.cache/Homebrew/libcroco-0.6.8.tar.xz
==> ./configure --prefix=/home/vagrant/.linuxbrew/Cellar/libcroco/0.6.8 --disable-Bsymbolic
installed software in a non-standard prefix.

Alternatively, you may set the environment variables CROCO_CFLAGS

Log is at: https://gist.github.com/4c13e3e099957a6c4cc9

nullarbor quits without completing

nullarbor stops without error before finishing.

final output is

[15:05:58] Walltime used: 0.62 minutes
[15:05:58] If you use this result please cite the Prokka paper:
[15:05:58] Seemann T (2014) Prokka: rapid prokaryotic genome annotation. Bioinformatics. 30(14):2068-9.
[15:05:58] Type 'prokka --citation' for more details.
[15:05:58] Share and enjoy!
make: Leaving directory

Roary has not been run and no report has been compiled

Abricate table doesn't handle duplicate genes in report!

It's hashing on 'GENE', but that may not be unique!

eg.

/home/tseemann/tmp/6008.fna     gi|384860682|ref|NC_017341.1|   897649  898380  erm(A)  1-732/732       =============== 0       100.00  99.86
/home/tseemann/tmp/6008.fna     gi|384860682|ref|NC_017341.1|   1733079 1733810 erm(A)  1-732/732       =============== 0       100.00  99.86

snippy error

Hi Torsten,

The latest brew recipe for nullarbor throws an error when running snippy. I believe this was previously flagged and fixed (tseemann/snippy#45)

Changes to the snippy code (tseemann/snippy@a8dc9b2) are not present in the version of snippy (v2.9) packaged with the nullarbor brew recipe.

Changing the code results in an error with freebayes-parallel

### freebayes-parallel reference/ref.txt 4 -p 1 -q 20 -m 60 --min-coverage 10 -V -f reference/ref.fa snps.bam > snps.raw.vcf

parallel: Error: --tollef has been retired.
parallel: Error: Remove --tollef or use --gnu to override --tollef.
terminate called after throwing an instance of 'std::out_of_range'
what(): basic_string::substr
terminate called after throwing an instance of 'std::out_of_range'
what(): basic_string::substr
/home/smrtanalysis/.linuxbrew/bin/freebayes-parallel: line 40: 22492 Exit 255 ( cat $regionsfile | parallel -k -j $ncpus "$command --region {}" )
22493 Done | vcffirstheader
22495 Aborted (core dumped) | vcfstreamsort -w 1000
22497 Aborted (core dumped) | vcfuniq

Add SNP density plot to report

Plot density of (core) SNPs across reference genome to ensure it is uniformly distributed.

Could possibly do a statistical test to check and alert the reader.

error 1

nice make -j 1 -C /media/sf_linuxpasty/data/ahmedtest4 [10:10AM]
make: Entering directory `/media/sf_linuxpasty/data/ahmedtest4'
mkdir -p Isolate1
any2fasta.pl /media/sf_linuxpasty/data/Pm70.fna > ref.fa
samtools faidx ref.fa
fq --quiet --ref ref.fa /media/sf_linuxpasty/data/P1234_1.fastq.gz /media/sf_linuxpasty/data/P1234_2.fastq.gz > Isolate1/yield.dirty.tab
Calculating depth, using size 2295190
trimmomatic PE -threads 3 /media/sf_linuxpasty/data/P1234_1.fastq.gz /media/sf_linuxpasty/data/P1234_2.fastq.gz Isolate1/R1.fq.gz /dev/null Isolate1/R2.fq.gz /dev/null ILLUMINACLIP:/home/manager/.linuxbrew/Cellar/nullarbor/1.01/bin/../conf/trimmomatic.fa:1:30:11 LEADING:10 TRAILING:10 MINLEN:30
TrimmomaticPE: Started with arguments:
-threads 3 /media/sf_linuxpasty/data/P1234_1.fastq.gz /media/sf_linuxpasty/data/P1234_2.fastq.gz Isolate1/R1.fq.gz /dev/null Isolate1/R2.fq.gz /dev/null ILLUMINACLIP:/home/manager/.linuxbrew/Cellar/nullarbor/1.01/bin/../conf/trimmomatic.fa:1:30:11 LEADING:10 TRAILING:10 MINLEN:30
Using PrefixPair: 'AGATGTGTATAAGAGACAG' and 'AGATGTGTATAAGAGACAG'
Using PrefixPair: 'TACACTCTTTCCCTACACGACGCTCTTCCGATCT' and 'GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT'

Using Long Clipping Sequence: 'GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAG'

Using Long Clipping Sequence: 'TTTTTTTTTTAATGATACGGCGACCACCGAGATCTACAC'

Using Long Clipping Sequence: 'TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG'

Using Long Clipping Sequence: 'TTTTTTTTTTCAAGCAGAAGACGGCATACGA'

Using Long Clipping Sequence: 'CTGTCTCTTATACACATCTGACGCTGCCGACGA'

Using Long Clipping Sequence: 'AGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG'

Using Long Clipping Sequence: 'AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC'

Using Long Clipping Sequence: 'TACACTCTTTCCCTACACGACGCTCTTCCGATCT'

Using Long Clipping Sequence: 'GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT'

Using Long Clipping Sequence: 'AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA'

Using Long Clipping Sequence: 'AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT'

Using Long Clipping Sequence: 'AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAG'

Skipping duplicate Clipping Sequence: 'AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC'

Using Long Clipping Sequence: 'AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT'

Using Long Clipping Sequence: 'AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTG'

Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATCGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATCT'

Skipping duplicate Clipping Sequence: 'AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA'

Using Long Clipping Sequence: 'CTGTCTCTTATACACATCTCCGAGCCCACGAGAC'

Using Long Clipping Sequence: 'AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT'

ILLUMINACLIP: Using 2 prefix pairs, 17 forward/reverse sequences, 0 forward only sequences, 0 reverse only sequences

Quality encoding detected as phred33

Input Read Pairs: 507695 Both Surviving: 507446 (99.95%) Forward Only Surviving: 138 (0.03%)
Reverse Only Surviving: 96 (0.02%) Dropped: 15 (0.00%)

TrimmomaticPE: Completed successfully

fq --quiet --ref ref.fa Isolate1/R1.fq.gz Isolate1/R2.fq.gz > Isolate1/yield.clean.tab

Calculating depth, using size 2295190
kraken --threads 3 --preload --paired Isolate1/R1.fq.gz Isolate1/R2.fq.gz | kraken-report > Isolate1/kraken.tab

Loading database... complete.

507446 sequences (153.26 Mbp) processed in 23.622s (1288.9 Kseq/m, 389.28 Mbp/m).
490768 sequences classified (96.71%)
16678 sequences unclassified (3.29%)
rm -f -r Isolate1/megahit
mkdir -p Isolate1
megahit --min-count 3 --k-list 21,31,41,53,75,97,111,127 -t 3 --memory 0.5 -1 Isolate1/R1.fq.gz -2 Isolate1/R2.fq.gz --out-dir Isolate1/megahit --min-contig-len 500
7.0Gb memory in total.

Using: 3.852Gb.

MEGAHIT v1.0.3

--- [Wed Apr 6 10:15:46 2016] Start assembly. Number of CPU threads 3 ---

--- [Wed Apr 6 10:15:46 2016] k list: 21,31,41,53,75,97,111,127 ---

make: *** [Isolate1/contigs.fa] Error 1

make: Leaving directory `/media/sf_linuxpasty/data/ahmedtest4'

Hidden requirement 'fa'

I can't really figure out how to install fa -- please help!

make
../bin/nullarbor.pl --outdir ./t --ref data/ref.fa --input data/data.tab  --force --mlst saureus --name NullTest
[15:14:21] Hello root
[15:14:21] This is nullarbor.pl 0.5
[15:14:21] Send complaints to Torsten Seemann <[email protected]>
[15:14:21] Found 'kraken' => /opt/kraken/kraken
[15:14:21] Found 'snippy' => /opt/snippy-2.6/bin/snippy
[15:14:21] Found 'mlst' => /usr/local/bin/mlst
[15:14:21] Found 'abricate' => /opt/abricate/bin/abricate
[15:14:21] Found 'megahit' => /usr/local/bin/megahit
[15:14:21] Found 'nw_order' => /usr/local/bin/nw_order
[15:14:21] Found 'nw_display' => /usr/local/bin/nw_display
[15:14:21] Found 'trimal' => /usr/local/bin/trimal
[15:14:21] Found 'FastTree' => /usr/local/bin/FastTree
[15:14:21] Found 'fq' => /opt/build/nullarbor/bin/fq
[15:14:21] Could not find 'fa'. Please install it and ensure it is in the PATH.

Without fa on-hand, I commented it out and tried the test run. I am guessing that this error is also related:

    [16:22:53] Loading pre-masked/aligned sequences...
    [16:22:53] 1/4  genome01 coverage 0/68250 = 0.00%
    [16:22:53] 2/4  genome02 coverage 0/68250 = 0.00%
    [16:22:53] 3/4  genome03 coverage 0/68250 = 0.00%
    [16:22:53] 4/4  genome04 coverage 0/68250 = 0.00%
    [16:22:53] Patching variant sites into whole genome alignment...
    [16:22:53] Constructing alignment object for core.full.aln

    --------------------- WARNING ---------------------
    MSG: Got a sequence without letters. Could not guess alphabet
    ---------------------------------------------------

    --------------------- WARNING ---------------------
    MSG: Got a sequence without letters. Could not guess alphabet
    ---------------------------------------------------

    --------------------- WARNING ---------------------
    MSG: Got a sequence without letters. Could not guess alphabet
    ---------------------------------------------------

    --------------------- WARNING ---------------------
    MSG: Got a sequence without letters. Could not guess alphabet
    ---------------------------------------------------
    [16:22:53] Writing 'fasta' alignment to core.full.aln
    [16:22:53] Writing core SNP table
    [16:22:53] Found 0 core SNPs from 0 variant sites.
    [16:22:53] Saved SNP table: core.tab
    [16:22:53] Constructing alignment object for core.aln
    [16:22:53] Writing 'fasta' alignment to core.aln
    [16:22:53] Done.
    trimal -in core.full.aln -out core.nogaps.aln -nogaps

    WARNING: Removing sequence 'Reference' composed only by gaps
    WARNING: Removing sequence 'genome01' composed only by gaps
    WARNING: Removing sequence 'genome02' composed only by gaps
    WARNING: Removing sequence 'genome03' composed only by gaps
    WARNING: Removing sequence 'genome04' composed only by gaps


    WARNING: Output alignment has not been generated. It is empty.

    mlst --scheme saureus genome01/contigs.fa > genome01/mlst.tab
    mlst --scheme saureus genome02/contigs.fa > genome02/mlst.tab
    mlst --scheme saureus genome03/contigs.fa > genome03/mlst.tab
    mlst --scheme saureus genome04/contigs.fa > genome04/mlst.tab
    (head -n 1 genome01/mlst.tab && tail -q -n +2 genome01/mlst.tab genome02/mlst.tab genome03/mlst.tab genome04/mlst.tab) > mlst.tab
    make: *** No rule to make target `genome01/denovo.tab', needed by `denovo.tab'.  Stop.
    make: Leaving directory `/opt/build/nullarbor/test/t'

can't install blast for nullarbor

linuxmint@linuxmint ~/nullarbor $ brew install nullarbor
==> Installing nullarbor from tseemann/bioinformatics-linux
==> Installing dependencies for tseemann/bioinformatics-linux/nullarbor: blast, bedtools, cd-hit, mcl, mafft, libxml2, gettext, lib
==> Installing tseemann/bioinformatics-linux/nullarbor dependency: blast
==> Downloading ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/2.2.31/ncbi-blast-2.2.31+-src.tar.gz
Already downloaded: /home/linuxmint/.cache/Homebrew/blast-2.2.31.tar.gz
==> Patching
patching file c++/include/corelib/ncbimtx.inl
==> ./configure --prefix=/home/linuxmint/.linuxbrew/Cellar/blast/2.2.31_1 --libdir=/home/linuxmint/.linuxbrew/Cellar/blast/2.2.31_1/libexe
==> make
Last 15 lines from /home/linuxmint/.cache/Homebrew/Logs/blast/02.make:
^
compilation terminated.
make[3]: *** [test_boost.o] Error 1
make[3]: Leaving directory `/tmp/blast20160303-61606-9bhsht/ncbi-blast-2.2.31+-src/c++/ReleaseMT/build/corelib'
FAILED: src/corelib/Makefile.test_boost.lib
make[3]: Entering directory `/tmp/blast20160303-61606-9bhsht/ncbi-blast-2.2.31+-src/c++/ReleaseMT/build/corelib'
/bin/rm -f libtest_boost.a .test_boost.dep .libtest_boost.a.stamp
/bin/rm -f /tmp/blast20160303-61606-9bhsht/ncbi-blast-2.2.31+-src/c++/ReleaseMT/lib/libtest_boost.a /tmp/blast20160303-61606-9bhsht/ncbi-blast-2.2.31+-src/c++/ReleaseMT/status/.test_boost.dep
/tmp/blast20160303-61606-9bhsht/ncbi-blast-2.2.31+-src/c++/ReleaseMT/lib/libtest_boost-static.a /tmp/blast20160303-61606-9bhsht/ncbi-blast-2.2.31+-src/c++/ReleaseMT/status/.test_boost-static.dep
make[3]: Leaving directory `/tmp/blast20160303-61606-9bhsht/ncbi-blast-2.2.31+-src/c++/ReleaseMT/build/corelib'
make[2]: *** [all.nonusr] Error 2
make[2]: Leaving directory `/tmp/blast20160303-61606-9bhsht/ncbi-blast-2.2.31+-src/c++/ReleaseMT/build/corelib'
make[1]: *** [all_r.real] Error 5
make[1]: Leaving directory `/tmp/blast20160303-61606-9bhsht/ncbi-blast-2.2.31+-src/c++/ReleaseMT/build'
make: *** [all] Error 2

READ THIS: https://github.com/Linuxbrew/linuxbrew/blob/master/share/doc/homebrew/Troubleshooting.md#troubleshooting
If reporting this issue please do so at (not Homebrew/homebrew):
https://github.com/Homebrew/homebrew-science/issues

assembly error using --accurate results in duplicate contig

When running nullarbor using --accurate, assembly occasionally results in a duplicate contig:
[screenshot: duplicate-contig]

Running spades.py manually with the same settings (--careful --only-assembler --cov-cutoff auto) produces the same duplication in the resulting scaffolds.fasta file, but not in the contigs.fasta file.

This only occurred in 1 out of 22 sequencing QC runs for the Listeria monocytogenes strain EGD-e. No idea why it didn't occur in the others.

Run nullarbor components separately

More of a request rather than an issue:

Will it be possible to run each of the components separately?
Eg. reading a list of samples in samples.tab, suppose all the reads were already clipped, and already had de novo assemblies and MLST, but you wanted to re-analyse a subset of the isolates using a different reference.
Could there be an option to run read metrics, snippy and snippy-core? I know I can run them separately, but would it be possible through the nullarbor command-line?

Thanks.

What does Please set KRAKEN_DEFAULT_DB appropriately mean?

manager@bl8vbox[data] nullarbor.pl --name African --mlst pmultocida_rirdc --ref Pm70.fna --input samples.tab --outdir ahmedtest
[09:24:38] Hello manager
[09:24:38] This is nullarbor.pl 1.01
[09:24:38] Send complaints to Torsten Seemann [email protected]
[09:24:38] Using reference genome: /media/sf_linuxpasty/data/Pm70.fna
[09:24:38] Loaded 1 isolates: Isolate1
[09:24:38] Found 'mlst' => /home/manager/.linuxbrew/bin/mlst
[09:24:38] Found 114 MLST schemes
[09:24:38] Using scheme: pmultocida_rirdc
[09:24:38] Making output folder: /media/sf_linuxpasty/data/ahmedtest
[09:24:38] Found 'convert' => /home/manager/.linuxbrew/bin/convert
[09:24:38] Found 'pandoc' => /usr/bin/pandoc
[09:24:38] Found 'head' => /usr/bin/head
[09:24:38] Found 'cat' => /bin/cat
[09:24:38] Found 'install' => /usr/bin/install
[09:24:38] Found 'env' => /usr/bin/env
[09:24:38] Found 'nl' => /usr/bin/nl
[09:24:38] Found 'date' => /bin/date
[09:24:38] Found 'trimmomatic' => /home/manager/.linuxbrew/bin/trimmomatic
[09:24:38] Found 'prokka' => /home/manager/.linuxbrew/bin/prokka
[09:24:38] Found 'roary' => /usr/local/bin/roary
[09:24:38] Found 'kraken' => /home/manager/.linuxbrew/bin/kraken
[09:24:38] Found 'snippy' => /home/manager/.linuxbrew/bin/snippy
[09:24:38] Found 'mlst' => /home/manager/.linuxbrew/bin/mlst
[09:24:38] Found 'abricate' => /home/manager/.linuxbrew/bin/abricate
[09:24:38] Found 'megahit' => /home/manager/.linuxbrew/bin/megahit
[09:24:38] Found 'spades.py' => /home/manager/.linuxbrew/bin/spades.py
[09:24:38] Found 'nw_order' => /home/manager/.linuxbrew/bin/nw_order
[09:24:38] Found 'nw_display' => /home/manager/.linuxbrew/bin/nw_display
[09:24:38] Found 'FastTree' => /home/manager/.linuxbrew/bin/FastTree
[09:24:38] Found 'fq' => /home/manager/.linuxbrew/bin/fq
[09:24:38] Found 'fa' => /home/manager/.linuxbrew/bin/fa
[09:24:38] Found 'afa-pairwise.pl' => /home/manager/.linuxbrew/bin/afa-pairwise.pl
[09:24:38] Found 'any2fasta.pl' => /home/manager/.linuxbrew/bin/any2fasta.pl
[09:24:38] Found 'roary2svg.pl' => /home/manager/.linuxbrew/bin/roary2svg.pl
[09:24:38] Found Perl module: Data::Dumper
[09:24:38] Found Perl module: Moo
[09:24:38] Found Perl module: Bio::SeqIO
[09:24:38] Found Perl module: File::Copy
[09:24:38] Found Perl module: Time::Piece
[09:24:38] Found Perl module: YAML::Tiny
[09:24:39] Parsed version '1.0' from 'MEGAHIT v1.0.3'
[09:24:39] Parsed version '3.0' from 'snippy 3.0'
[09:24:40] Parsed version '1.12' from 'prokka 1.12-beta'
[09:24:41] Parsed version '3.6' from '3.6.0'
[09:24:42] Parsed version '2.1' from 'mlst 2.1'
[09:24:42] Please set KRAKEN_DEFAULT_DB appropriately.
