voutcn / megahit

562 stars · 36 watchers · 134 forks · 3.13 MB

Ultra-fast and memory-efficient (meta-)genome assembler

Home Page: http://www.ncbi.nlm.nih.gov/pubmed/25609793

License: GNU General Public License v3.0

Languages: C++ 81.25%, C 14.76%, Python 3.47%, CMake 0.47%, Dockerfile 0.05%
Topics: bioinformatics, metagenomics, succinct-data-structures, genomics, genome-assembly

megahit's People

Contributors

aquaskyline, ctb, epruesse, jayrbolton, lgautier, rekado, shaman-narayanasamy, sjaenick, voutcn


megahit's Issues

Insufficient memory warning

I repeatedly receive this warning in the terminal:

[B::ReadInputFile WRANING] No enough memory to hold all the reads. Only the first 37292817 reads are kept.

Is this supposed to happen? FYI, I am using the version with the "long read" bug fix.

Confusion between "-t threads" and $OMP_NUM_THREADS

If I choose to use more than 8 threads I get this warning:

[WARNING I::InitGlobalData] Number of threads reset to omp_get_max_threads()=8

This occurs because my $OMP_NUM_THREADS is 8.

Perhaps you could default -t to be equal to getenv("OMP_NUM_THREADS") if that variable exists?
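
A minimal sketch of that suggestion for the Python driver script; the function name and the 0 sentinel here are hypothetical, not megahit's actual code:

    import os

    def default_num_threads():
        # Fall back to $OMP_NUM_THREADS when -t is not given, so the value
        # passed to the binaries agrees with the OpenMP runtime's limit.
        env = os.environ.get("OMP_NUM_THREADS")
        if env and env.isdigit():
            return int(env)
        return 0  # sentinel: let the binaries use omp_get_max_threads()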

Support --prefix option for output filenames?

This is a feature request for a --prefix option to control the naming of all the output files in the output folder.

For example, if --prefix Ecoli then we get files Ecoli.log and Ecoli.contigs.fa and so on.

(Not sure about opts.txt for your checkpointing)
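
A sketch of how the driver could apply such a prefix when naming outputs; the function and argument names are hypothetical:

    import os

    def output_path(out_dir, name, prefix=None):
        # With --prefix Ecoli: output_path("out", "contigs.fa", "Ecoli")
        # gives "out/Ecoli.contigs.fa"; without a prefix, names are unchanged.
        base = "%s.%s" % (prefix, name) if prefix else name
        return os.path.join(out_dir, base)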

"builder build" error: 'std::bad_alloc'

Hi,
I've been trying to run Megahit with the following parameters:
megahit -m 92e9 --k-min 27 --k-max 127 --k-step 10 -o megahitout --continue
-1 [list of R1 libraries]
-2 [list of R2 libraries]

I allocate 16 processes and 96 GB of memory for this on our cluster, but I get a 'std::bad_alloc' error when it reaches k = 27. Any idea why?

See the end of the log below:
--- [Fri Oct 2 17:42:55 2015] Building graph for k = 27 ---
/home/z3382651/bin/megahit/megahit_sdbg_build seq2sdbg --host_mem 70000000000 --mem_flag 1 --gpu_mem 0 --output_prefix megahitout/tmp/k27/27 --num_cpu_threads 16 -k 27 --kmer_from 0 --num_edge_files 5 --input_prefix megahitout/tmp/k27/27 --need_mercy
[sdbg_builder.cpp : 341] Host memory to be used: 70000000000
[sdbg_builder.cpp : 342] Number CPU threads: 16
[cx1.h : 450] Preparing data...
[cx1_seq2sdbg.cpp : 406] Number edges: 6770942229
[cx1_seq2sdbg.cpp : 446] Bases to reserve: 236982978008, number contigs: 0, number multiplicity: 8463677786
[cx1_seq2sdbg.cpp : 452] Before reading, sizeof seq_package: 59245744512, multiplicity vector: 8463677786
[cx1_seq2sdbg.cpp : 467] Adding mercy edges...
terminate called after throwing an instance of 'St9bad_alloc'
what(): std::bad_alloc
Error occurs when running "builder build" for k = 27; please refer to megahitout/log for detail
[Exit code -6]
MEGAHIT v1.0.2
--- [Wed Oct 7 01:29:17 2015] Start assembly. Number of CPU threads 16 ---
--- [Wed Oct 7 01:29:17 2015] k list: 27,37,47,57,67,77,87,97,107,117,127 ---
--- [Wed Oct 7 01:29:17 2015] Building graph for k = 27 ---
/home/z3382651/bin/megahit/megahit_sdbg_build seq2sdbg --host_mem 70000000000 --mem_flag 1 --gpu_mem 0 --output_prefix megahitout/tmp/k27/27 --num_cpu_threads 16 -k 27 --kmer_from 0 --num_edge_files 5 --input_prefix megahitout/tmp/k27/27 --need_mercy
[sdbg_builder.cpp : 341] Host memory to be used: 70000000000
[sdbg_builder.cpp : 342] Number CPU threads: 16
[cx1.h : 450] Preparing data...
[cx1_seq2sdbg.cpp : 406] Number edges: 6770942229
[cx1_seq2sdbg.cpp : 446] Bases to reserve: 236982978008, number contigs: 0, number multiplicity: 8463677786
[cx1_seq2sdbg.cpp : 452] Before reading, sizeof seq_package: 59245744512, multiplicity vector: 8463677786
[cx1_seq2sdbg.cpp : 467] Adding mercy edges...
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
Error occurs when running "builder build" for k = 27; please refer to megahitout/log for detail
[Exit code -6]

250bp PE MiSeq reads assembly: which k-max?

Hello,

I'm working on soil metagenomic data; I have 250 bp PE MiSeq reads to assemble. After trimming, the average length is 190 bp.
I have performed an assembly with --k-min 27 --k-max 127 --k-step 10.

I have 2 questions:

  • Is k-max 127 long enough in this case? Is there a "rule" to determine k-max?
  • Can we set k-max to a higher value? Does it make any sense?

Thanks for your help,

Error: Bucket 65536 too large for lv1: 0 > 0

Hi,

I am very keen to use megahit to assemble some large metagenome data sets, but I have been unable to get it to run on my data, which consists of paired reads between 80 and 151 bp due to trimming. The program crashes quite soon with the error detailed below. Any help would be greatly appreciated.

Thanks,
Chris Quince

MEGAHIT v1.0.2
--- [Sun Oct 11 20:41:17 2015] Start assembly. Number of CPU threads 192 ---
--- [Sun Oct 11 20:41:17 2015] k list: 21,41,61,81,99 ---
--- [Sun Oct 11 20:41:17 2015] Converting reads to binaries ---
/home/chris/bin/megahit_asm_core buildlib B7_Assembly/tmp/reads.lib B7_Assembly/tmp/reads.lib
[read_lib_functions-inl.h : 209] Lib 0 (B7_12.csv): interleaved, 0 reads, 0 max length
[utils.h : 124] Real: 0.0060 user: 0.0027 sys: 0.0000 maxrss: 5636
--- [Sun Oct 11 20:41:17 2015] Extracting solid (k+1)-mers for k = 21 ---
cmd: /home/chris/bin/megahit_sdbg_build count -k 21 -m 2 --host_mem 2921736624537 --mem_flag 1 --gpu_mem 0 --output_prefix B7_Assembly/tmp/k21/21 --num_cpu_threads 192 --num_output_threads 64 --read_lib
_file B7_Assembly/tmp/reads.lib
[sdbg_builder.cpp : 114] Host memory to be used: 2921736624537
[sdbg_builder.cpp : 115] Number CPU threads: 192
[cx1.h : 450] Preparing data...
[read_lib_functions-inl.h : 253] Before reading, sizeof seq_package: 12
[read_lib_functions-inl.h : 258] After reading, sizeof seq_package: 12
[cx1_kmer_count.cpp : 104] 0 reads, 0 max read length
[cx1.h : 457] Preparing data... Done. Time elapsed: 0.0002
[cx1.h : 464] Preparing partitions and initialing global data...
[cx1_kmer_count.cpp : 195] 2 words per substring, 2 words per edge
[cx1_kmer_count.cpp : 332] Memory for reads: 12
[cx1_kmer_count.cpp : 333] max # lv.1 items = 0
[cx1.h : 480] Preparing partitions and initialing global data... Done. Time elapsed: 0.1580
[cx1.h : 486] Start main loop...
Bucket 65536 too large for lv1: 0 > 0
Error occurs when running "sdbg_builder count/read2sdbg", please refer to B7_Assembly/log for detail
[Exit code 1]

assembly restart clarification

Hi,
I have a general question regarding the megahit assembler. Since it assembles k-mer sizes one after another, I was wondering if it is possible to restart a killed assembly and have it run to completion. I ask because our cluster has a wallclock limit, at which point it terminates the running job. It would be a nice feature to be able to restart from where it left off.

Thanks,
bfoster
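
For reference, the --continue flag used in the "builder build" report above is described under "Overwrite output folder" below as a checkpointing mechanism. A rough sketch of the kind of resume logic involved; the checkpoint file name and function are hypothetical, not megahit's actual format:

    import os

    def next_stage(out_dir, stages):
        # Skip every stage already recorded in the checkpoint file and
        # resume at the first unfinished one.
        checkpoint = os.path.join(out_dir, "checkpoints.txt")
        done = set()
        if os.path.exists(checkpoint):
            done = set(open(checkpoint).read().split())
        for stage in stages:
            if stage not in done:
                return stage
        return None  # everything already finished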

0.3.2b - error occurs when running "sdbg_builder count/read2sdbg"

I'm using the latest git 0.3.2b and it seems to be reporting the old binary name?

--- [Sat Jul  4 11:46:34 2015] Extracting solid (k+1)-mers for k = 21 ---
Error occurs when running "sdbg_builder count/read2sdbg", please refer to ./megahit_out/log for detail

Should it be megahit_sdbg_build count?

Command line defaults in megahit - maybe $ENV variables?

Megahit has a lot of compulsory command-line options that are difficult for users to choose.

Ideally --memory and --max-read-len would not be necessary, but I understand you would like to keep them. Could they be set via environment variables? E.g.:

export MEGAHIT_MEMORY=1E10
export MEGAHIT_MAXREADLEN=602
export MEGAHIT_THREADS=8

Or possibly just a single variable:

export MEGAHIT_OPTIONS="--cpu-only -t 8 -m 1E10 -l 602 --min-count 3"

Also, is there a way to choose --cpu-only automatically if megahit wasn't compiled with GPU support?
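
A sketch of the single-variable idea, assuming the driver assembles its argument list before parsing; MEGAHIT_OPTIONS is the variable proposed above, and the function name is hypothetical:

    import os
    import shlex
    import sys

    def argv_with_env_defaults():
        # Prepend options from $MEGAHIT_OPTIONS so that explicit
        # command-line arguments, parsed later, override them.
        env_opts = shlex.split(os.environ.get("MEGAHIT_OPTIONS", ""))
        return env_opts + sys.argv[1:]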

Broken backward compatibility with -o option

Has something changed in 0.2.1 since 0.2.0a?

I now get this message, which has broken my pipelines:

Output directory xxxxx/ already exists, please change the parameter -o to another value to avoid overwriting.

Can you add a --force option to allow writing into an existing folder?
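
A minimal sketch of the requested behaviour in the driver, assuming a hypothetical --force flag:

    import os
    import sys

    def prepare_output_dir(out_dir, force=False):
        # Refuse to reuse an existing output directory unless --force is given.
        if os.path.exists(out_dir) and not force:
            sys.exit("Output directory %s already exists; "
                     "use --force to overwrite." % out_dir)
        if not os.path.exists(out_dir):
            os.makedirs(out_dir)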

intermediate_contigs/XXX.final.contigs.fa are all 0 bytes long

Are these zero-length files expected?

2608 k21.addi.fa
4400 k21.contigs.fa
   0 k21.final.contigs.fa
2184 k21.local.fa
2636 k41.addi.fa
3712 k41.contigs.fa
   0 k41.final.contigs.fa
 940 k41.local.fa
2252 k61.addi.fa
2936 k61.contigs.fa
   0 k61.final.contigs.fa
 284 k61.local.fa
1080 k81.addi.fa
2780 k81.contigs.fa
   0 k81.final.contigs.fa
 136 k81.local.fa
   0 k99.addi.fa
2748 k99.contigs.fa
   0 k99.final.contigs.fa

200bp lower limit?

Hi. We ran two metagenomes (Illumina HiSeq 2500, trimmed and quality-checked) with megahit, and in both cases the resulting sequence lengths started at exactly 200 bp. Is this supposed to happen or did we do something wrong?
I would expect a) lots of sequences not to assemble, hence lots of short sequences at the low end of the spectrum, and b) the smaller assembled sequences to have (random?) varying lengths.
Thanks.

instability in HEAD

I work for the US DOE Joint Genome Institute, where we assemble hundreds of metagenomes and metatranscriptomes. One of our analysts checked out the repo from HEAD a few weeks ago and had no problems running MEGAHIT. Since then, I and other analysts have checked out from HEAD, and we have had problems running MEGAHIT. Would it be possible to tag a stable revision, or provide a stable build as a download? We would really like to use MEGAHIT for our assemblies, but we are concerned about the stability of the master branch.

Using long reads in MEGAHIT

Dear authors,

I would like your opinion about using contigs or long reads (in FASTA format) with MEGAHIT. Is the software, in its current state, suitable for such input?

I look forward to hearing your thoughts.

Best regards,
Shaman

Memory Setting Error

Hi, I've searched all over the issues and online elsewhere and I can't find a reference to this, perhaps you can help.

When I try this command:

megahit --cpu-only -m 60e9 -max_read_len 250 -r input.fastq

I get this error message:

Traceback (most recent call last):
  File "/opt/software/megahit/20141001--GCC-4.4.5/bin/megahit", line 584, in <module>
    sys.exit(main())
  File "/opt/software/megahit/20141001--GCC-4.4.5/bin/megahit", line 513, in main
    host_mem = long(value)
ValueError: invalid literal for long() with base 10: '60e9'

Could I have compiled it incorrectly, or do I need to adjust the memory setting? Memory itself should not be an issue, as I am on a large shared server. I don't understand what the error about the '60e9' string means.

Thanks so much for your time!
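
The traceback shows the driver converting the -m value with long(value), which rejects scientific notation such as '60e9'. Converting through float first would accept it; the same long(float(value)) pattern appears in the "setting memory" snippet near the end of this page. A minimal sketch (Python 2, to match the traceback; the function name is hypothetical):

    def parse_host_mem(value):
        # Accept plain integers ("60000000000") as well as scientific
        # notation ("60e9") by converting through float first;
        # long(value) alone rejects the "e" notation.
        return long(float(value))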

Singlet sequences

Works great so far, but I don't see a file (or a command-line option to produce a file) with the singlets, or any annotation file recording which reads went into which contigs (which would allow me to extract the singlet reads).

Distribution of indegree and outdegree for contigs

Operating system

2.6.32-431.29.2.el6.x86_64

Compiler

gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-4)

Version

megahit/0.1.4-1

Dataset

Cow Rumen (SRS150404)

Command

./megahit -m 500000000000 --cpu-only -l 300 -o megahit-test-2015-01-23-2 \
    --input-cmd "zcat SRS150404/*.fastq.gz" \
    --k-min 21 --k-max 99 --k-step 10 \
    --min-count 2 \
    --num-cpu-threads 32 \
    | tee Log

Output

[boisvert@bigmem foo]$ grep '>' megahit-test-2015-01-23-2/tmp/k21.final.contigs.fa|grep -v "_in_0_out_0"|grep -v loop |wc -l
0
[boisvert@bigmem foo]$ grep '>' megahit-test-2015-01-23-2/tmp/k31.final.contigs.fa|grep -v "_in_0_out_0"|grep -v loop |wc -l
0
[boisvert@bigmem foo]$ grep '>' megahit-test-2015-01-23-2/tmp/k41.final.contigs.fa|grep -v "_in_0_out_0"|grep -v loop |wc -l
1
[boisvert@bigmem foo]$ grep '>' megahit-test-2015-01-23-2/tmp/k41.final.contigs.fa|grep -v "_in_0_out_0"|grep -v loop

contig_41_185619613_length_5

Question

According to their names, all my (intermediate) contigs have an in-degree of 0 and an out-degree of 0, or they are loops.

I would expect some DNA repeats to produce an in-degree of 2 and/or an out-degree of 2.

Also, palindrome vertices are loops with one vertex, right?

Building the code appears broken on OS X

Hi,

Running make terminates prematurely with:

mkdir -p ./bin/
g++-4.9 -O3 -Wall -funroll-loops -march=core2 -fomit-frame-pointer -maccumulate-outgoing-args -fprefetch-loop-arrays -static-libgcc -fopenmp -g -std=c++0x -lm -mpopcnt -c succinct_dbg.cpp -o succinct_dbg.o
In file included from succinct_dbg.cpp:29:0:
mem_file_checker-inl.h: In function 'FILE* OpenFileAndCheck64(const char*, const char*)':
mem_file_checker-inl.h:31:37: error: 'fopen64' was not declared in this scope
     if ((fp = fopen64(filename, mode)) == NULL){
                                     ^

From what I gather, fopen64 is not standard C++.

error

Hi,

I'm having a little trouble with megahit on some big soil data; hoping there might be an easy solution.

$ time megahit --k-min 27 --k-max 87 --k-step 10 -t44 -m 912000000000 -l 270 --input-cmd './cmd' --cpu-only
Number of CPU threads 44
[Thu Oct  2 03:20:55 2014]: Building graph for k = 27
[Thu Oct  2 21:14:49 2014]: Assembling contigs from SdBG for k = 27
[Fri Oct  3 22:20:03 2014]: Extracting iterative edges from k = 27 to 37
Error occurs when running iterator for k = 27 to k = 37 
write error code 32
pigz abort: write error on <stdout>

real    3404m54.755s
user    86936m39.940s
sys     3029m56.860s

The ./cmd script is quite lengthy, as there are lots of long paths, but basically it simply un-gzips many files:

pigz -cd reads1.fq.gz reads2.fq.gz reads3.fq.gz #etc.

Any ideas?
Thanks,
ben

unclear input requirement

Hi,
I am unsure whether megahit simply ignores paired-end info, or whether the paired-end reads need to be interleaved into a single file. The latter can easily be done using another tool from your group, fq2fa, from IDBA.
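
For what it's worth, interleaving can also be done with a few lines of Python; a minimal sketch for well-formed 4-line-per-record FASTQ (an illustration, not megahit code):

    def interleave_fastq(r1_path, r2_path, out_path):
        # Write read1/read2 records alternately into a single file.
        with open(r1_path) as r1, open(r2_path) as r2, open(out_path, "w") as out:
            while True:
                rec1 = [r1.readline() for _ in range(4)]
                rec2 = [r2.readline() for _ in range(4)]
                if not rec1[0] or not rec2[0]:
                    break
                out.writelines(rec1)
                out.writelines(rec2)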

Building the code broken on OS X (part II)

Hi,

After fixing issue #5, make is choking. The first errors are below:

In file included from cx1_functions.cpp:34:0:
sdbg_builder_util.h:101:5: error: 'pthread_spinlock_t' does not name a type
     pthread_spinlock_t lv1_items_scanning_lock;  
     ^
sdbg_builder_util.h:171:5: error: 'pthread_barrier_t' does not name a type
     pthread_barrier_t output_barrier;
     ^
cx1_functions.cpp: In function 'void phase1::InitGlobalData(global_data_t&)':
cx1_functions.cpp:368:32: error: 'struct global_data_t' has no member named 'lv1_items_scanning_lock'
     pthread_spin_init(&globals.lv1_items_scanning_lock, 0);
                                ^
cx1_functions.cpp:368:58: error: 'pthread_spin_init' was not declared in this scope
     pthread_spin_init(&globals.lv1_items_scanning_lock, 0);
                                                          ^
cx1_functions.cpp: In function 'void* phase1::Lv1ScanToFillOffsetsThread(void*)':
cx1_functions.cpp:457:38: error: 'struct global_data_t' has no member named 'lv1_items_scanning_lock'
           pthread_spin_lock(&globals.lv1_items_scanning_lock); \

Print final.contigs.fa assembly statistics at end of log

It would be very useful to have the assembly statistics at the end of the log.

For example, the `<<< HERE` line below:

--- [Sun Jun 21 11:18:12 2015] Assembling contigs from SdBG for k = 99 ---
--- [Sun Jun 21 11:18:17 2015] Merging to output final contigs ---
--- [Sun Jun 21 11:18:17 2015] 129 contigs, min 203 bp, max 127311 bp, N50 53121 bp --- <<< HERE
--- [Sun Jun 21 11:18:17 2015] ALL DONE. Time elapsed: 67.086014 seconds ---

fasta header annotations?

Hi,
I have run an assembly with megahit on a very sparse metagenome. It was fast and gave me an assembly similar to that from the CLC Genomics assembler. :-)

Now I am trying to understand the FASTA headers in the contig file, so I can decide what to do with the 1.7 million contigs.
e.g.:

k99_5 flag=0 multi=2.0435 len=213

I see my sequences have flags 0, 1, and 3. What do these flags mean?

Then there is "multi": what is its relation to the k-mer counts and the coverage? I have not been able to find this documented.

Error when running "sdbg_builder count/read2sdbg"

I'm using v0.3.3-a on a node with 128 GB of RAM and get this error. I saw that there were a few other issues with this step, but the version update didn't seem to help for me. Any idea what this error is about?

MEGAHIT v0.3.3-a
--- [Mon Jul 20 13:43:28 2015] Start assembly. Number of CPU threads 64 ---
--- [Mon Jul 20 13:43:28 2015] k list: 21,41,61,81,99 ---
--- [Mon Jul 20 13:43:28 2015] Converting reads to binaries ---
/p/home/mstein/tools/megahit_v0.3.3-a/megahit_asm_core buildlib ./megahit_out/tmp/reads.lib ./megahit_out/tmp/reads.lib
[read_lib_functions-inl.h : 194] Lib 0 (ALAn-2A_R1.trim.fastq.gz,ALAn-2A_R2.trim.fastq.gz): pe, 74864708 reads, 101 max length
[utils.h : 114] Real: 80.5623 user: 51.8952 sys: 4.7203 maxrss: 153216
--- [Mon Jul 20 13:44:49 2015] Extracting solid (k+1)-mers for k = 21 ---
cmd: /p/home/mstein/tools/megahit_v0.3.3-a/megahit_sdbg_build count -k 21 -m 2 --host_mem 121800974745 --mem_flag 1 --gpu_mem 0 --output_prefix ./megahit
_out/tmp/k21 --num_cpu_threads 64 --num_output_threads 21 --read_lib_file ./megahit_out/tmp/reads.lib
[sdbg_builder.cpp : 102] Host memory to be used: 121800974745
[sdbg_builder.cpp : 103] Number CPU threads: 64
[cx1.h : 313] Preparing data...
[cx1_kmer_count.cpp : 101] 74864708 reads, 101 max read length
[cx1.h : 320] Preparing data... Done. Time elapsed: 10.1978
[cx1.h : 327] Preparing partitions and initialing global data...
[cx1_kmer_count.cpp : 195] 2 words per substring, 2 words per edge
[cx1_kmer_count.cpp : 228] Memory for reads: 2470583652
[cx1_kmer_count.cpp : 229] max # lv.1 items = 788624446
[cx1_kmer_count.cpp : 230] max # lv.2 items = 3809570
[cx1.h : 339] Preparing partitions and initialing global data... Done. Time elapsed: 2.6485
[cx1.h : 345] Start main loop...
[cx1.h : 366] Lv1 scanning from bucket 0 to 3359
[cx1.h : 378] Lv1 scanning done. Large diff: 0. Time elapsed: 2.6994
Error occurs when running "sdbg_builder count/read2sdbg", please refer to ./megahit_out/log for detail
[Exit code -11]

Getting started

Hi,

I cloned the repo and built megahit without problems, but I am having some trouble getting a test run to complete (or even start) without errors. It's not quite clear from the docs whether you require or support FASTA, FASTQ, and/or fastq.gz, so I've tried both FASTQ and FASTA, as follows:

python ../megahit -r uazx.sample.fa -m 60000000000 -l 150

Which produces the following error

Number of CPU threads 32
[Tue Sep 30 09:03:47 2014]: Extracting solid (k+1)-mers for k = 21
sdbg_builder_cpu: cx1_functions.cpp:192: void phase1::ReadInputFile(global_data_t&): Assertion `packed_reads != __null' failed.
Error occurs when running "builder count"

Can you advise me on a correct command line and on any specific input requirements (e.g. paired, interleaved, FASTA, etc.)? Thanks in advance.

Error occurs when assembling contigs for k=???

Hello,
When I try to run megahit to do an assembly, there is always an error:
Error occurs when assembling contigs for k=25
[Exit code -4]
The k-mer setting is the --k-min value, and I have tried different odd numbers from 15 to 25. None worked!
I checked the log file; it seems to read in and parse all the reads fine.
My compilation went well, with no errors or warnings displayed. Both the self-compiled and the prebuilt binaries, for v0.2.1 and v0.3.0-beta, gave the same error. See the following log:

MEGAHIT 0.3.0-beta
--- [Fri Jun 19 16:01:28 2015] Start assembly. Number of CPU threads 12 ---
--- [Fri Jun 19 16:01:28 2015] k list: 25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55,57,59,61,63,65,67,69,71,73,75,77,79,81,83,85,87,89,91,93,95,97,99,101,103,105,107,109,111,113,115,117,119,121,123 ---
--- [Fri Jun 19 16:01:28 2015] Extracting solid (k+1)-mers for k = 25 ---
cmd: /home/yifang/download-software/megahit_v0.3.0-beta_Linux_CUDA6.5_sm30/megahit_sdbg_build count -k 25 -m 2 --host_mem 128699280384 --mem_flag 1 --gpu_mem 0 --output_prefix S03/tmp/k25 --num_cpu_threads 12 --num_output_threads 4 --read_lib_file S03/tmp/reads.lib
[sdbg_builder.cpp : 101] Host memory to be used: 128699280384
[sdbg_builder.cpp : 102] Number CPU threads: 12
[cx1.h : 313] Preparing data...
[cx1_kmer_count.cpp : 101] 414255 reads, 245 max read length
[cx1.h : 320] Preparing data... Done. Time elapsed: 0.6206
[cx1.h : 327] Preparing partitions and initialing global data...
[cx1_kmer_count.cpp : 185] 2 words per substring, 3 words per edge
[cx1_kmer_count.cpp : 218] Memory for reads: 23971940
[cx1_kmer_count.cpp : 219] max # lv.1 items = 9636690
[cx1_kmer_count.cpp : 220] max # lv.2 items = 2097152
[cx1.h : 339] Preparing partitions and initialing global data... Done. Time elapsed: 0.1014
[cx1.h : 345] Start main loop...
[cx1.h : 366] Lv1 scanning from bucket 0 to 3127
[cx1.h : 378] Lv1 scanning done. Large diff: 0. Time elapsed: 0.1263
[cx1.h : 437] Lv1 fetching & sorting done. Time elapsed: 0.5203
[cx1.h : 366] Lv1 scanning from bucket 3127 to 7636
[cx1.h : 378] Lv1 scanning done. Large diff: 0. Time elapsed: 0.1240
[cx1.h : 437] Lv1 fetching & sorting done. Time elapsed: 0.5040
[cx1.h : 366] Lv1 scanning from bucket 7636 to 12575
[cx1.h : 378] Lv1 scanning done. Large diff: 0. Time elapsed: 0.1198
[cx1.h : 437] Lv1 fetching & sorting done. Time elapsed: 0.5035
[cx1.h : 366] Lv1 scanning from bucket 12575 to 19039
[cx1.h : 378] Lv1 scanning done. Large diff: 0. Time elapsed: 0.1209
[cx1.h : 437] Lv1 fetching & sorting done. Time elapsed: 0.5023
[cx1.h : 366] Lv1 scanning from bucket 19039 to 27609
[cx1.h : 378] Lv1 scanning done. Large diff: 0. Time elapsed: 0.1221
[cx1.h : 437] Lv1 fetching & sorting done. Time elapsed: 0.4988
[cx1.h : 366] Lv1 scanning from bucket 27609 to 35970
[cx1.h : 378] Lv1 scanning done. Large diff: 0. Time elapsed: 0.1223
[cx1.h : 437] Lv1 fetching & sorting done. Time elapsed: 0.5012
[cx1.h : 366] Lv1 scanning from bucket 35970 to 50072
[cx1.h : 378] Lv1 scanning done. Large diff: 0. Time elapsed: 0.1252
[cx1.h : 437] Lv1 fetching & sorting done. Time elapsed: 0.4976
[cx1.h : 366] Lv1 scanning from bucket 50072 to 65536
[cx1.h : 378] Lv1 scanning done. Large diff: 0. Time elapsed: 0.1154
[cx1.h : 437] Lv1 fetching & sorting done. Time elapsed: 0.2479
[cx1.h : 450] Main loop done. Time elapsed: 4.7557
[cx1.h : 456] Postprocessing...
[cx1_kmer_count.cpp : 636] Total number of candidate reads: 1414(3176)
[cx1_kmer_count.cpp : 646] Total number of solid edges: 273854
[cx1_kmer_count.cpp : 657] Output reads to binary file...
[cx1.h : 464] Postprocess done. Time elapsed: 0.2201
[utils.h : 114] Real: 5.7013 user: 40.9000 sys: 3.0360 maxrss: 203620
--- [Fri Jun 19 16:01:34 2015] Building graph for k = 25 ---
/home/yifang/download-software/megahit_v0.3.0-beta_Linux_CUDA6.5_sm30/megahit_sdbg_build seq2sdbg --host_mem 128699280384 --mem_flag 1 --gpu_mem 0 --output_prefix S03/tmp/k25 --num_cpu_threads 12 -k 25 --kmer_from 25 --num_edge_files 4 --input_prefix S03/tmp/k25 --need_mercy
[sdbg_builder.cpp : 310] Host memory to be used: 128699280384
[sdbg_builder.cpp : 311] Number CPU threads: 12
[cx1.h : 313] Preparing data...
[cx1_seq2sdbg.cpp : 383] Adding mercy edges...
[cx1_seq2sdbg.cpp : 361] Number of reads: 1414, Number of mercy edges: 25872
[cx1_seq2sdbg.cpp : 390] Done. Time elapsed: 0.2468
[cx1.h : 320] Preparing data... Done. Time elapsed: 0.2638
[cx1.h : 327] Preparing partitions and initialing global data...
[cx1_seq2sdbg.cpp : 516] 3 words per substring, num sequences: 299726, words per dummy node ($v): 2
[cx1_seq2sdbg.cpp : 549] Memory for sequence: 2547704
[cx1_seq2sdbg.cpp : 550] max # lv.1 items = 2097152
[cx1_seq2sdbg.cpp : 551] max # lv.2 items = 2097152
[cx1.h : 339] Preparing partitions and initialing global data... Done. Time elapsed: 0.0355
[cx1.h : 345] Start main loop...
[cx1.h : 366] Lv1 scanning from bucket 0 to 390625
[cx1.h : 378] Lv1 scanning done. Large diff: 0. Time elapsed: 0.1099
[cx1.h : 437] Lv1 fetching & sorting done. Time elapsed: 0.1334
[cx1.h : 450] Main loop done. Time elapsed: 0.3226
[cx1.h : 456] Postprocessing...
[cx1_seq2sdbg.cpp : 962] Number of $ A C G T A- C- G- T-:
[cx1_seq2sdbg.cpp : 965] 10125 158135 139137 136721 157921 4730 4171 4165 4596
[cx1_seq2sdbg.cpp : 977] Total number of edges: 619701
[cx1_seq2sdbg.cpp : 978] Total number of ONEs: 591914
[cx1_seq2sdbg.cpp : 979] Total number of v$ edges: 10125
[cx1_seq2sdbg.cpp : 980] Total number of $v edges: 10125
[cx1.h : 464] Postprocess done. Time elapsed: 0.0211
[utils.h : 114] Real: 0.6439 user: 1.4920 sys: 0.6360 maxrss: 408056
--- [Fri Jun 19 16:01:35 2015] Assembling contigs from SdBG for k = 25 ---
cmd: /home/yifang/download-software/megahit_v0.3.0-beta_Linux_CUDA6.5_sm30/megahit_asm_core assemble -s S03/tmp/k25 -o S03/tmp/k25 -t 12 --max_tip_len -1 --min_standalone 368 --prune_level 2 --merge_len 20 --merge_similar 0.98 --low_local_ratio 0.2 --min_depth 2
[assembler.cpp : 155] Loading succinct de Bruijn graph: S03/tmp/k25
Error occurs when assembling contigs for k = 25
[Exit code -4]

Could you please give me some advice on this problem?
Thanks a lot!
Yifang

Installation with GPU capability

Hi,
I'm trying to install MEGAHIT with GPU support (make use_gpu=1).

So far it hasn't worked. I solved some issues with my version of CUDA, but now I am getting an error with the following .cpp file: .lv2_gpu_functions_4B_sm300_nvvm_6.5_abi_x86_64.cpp

In function:
`device_stub__ZN3cub22RadixSortUpsweepKernelINS_23DeviceRadixSortDispatchILb0EjjiE16PtxUpsweepnelINS_23DeviceRadixSortDispatchILb0EjjiE16PtxUpsweepPolicyELb0EjiEEvPT1_PT2_S6_iibNS_13GridEvenShareIS6_EPolicyELb0EjiEEvPT1_PT2_S6_iibNS_13GridEvenShareIS6_EE(unsigned int, int, int, int, int, bool, cub::GridE(unsigned int_, int_, int, int, int, bool, cub::GridEvenShare&) [clone .constprop.84]'

undefined reference to `cudaSetupArgument'; undefined reference to `cudaLaunch'

Has anyone encountered the same errors?

Access to the assembly graph?

I would like to visualize the megahit assembly graph in Bandage.

But to do so, the author @rrwick needs to write a parser for an assembly graph dump, or for a FASTG output format.

Does megahit have the ability to save a graph?

source tree is missing citycrc.h

On line 632 of city.cpp you have:

#ifdef __SSE4_2__

#include <citycrc.h>

However citycrc.h is not in the source tree on github.

Copying the file from the city hash source repo appears to fix the problem:
https://code.google.com/p/cityhash/source/browse/trunk/src/citycrc.h

If you were expecting CityHash to be installed separately, it would be good to mention that in the build documentation (though it seems you don't, because you have city.cpp and city.h in the source tree).

Cheers,
Bernie.

Installation error

Hi,

I downloaded megahit v0.2.1, and after unpacking, when I run the make command I get the following error.

g++ -O3 -Wall -funroll-loops -fprefetch-loop-arrays -fopenmp -std=c++0x -static-libgcc -lm -mpopcnt -c succinct_dbg.cpp -o succinct_dbg.o
cc1plus: error: unrecognized command line option "-std=c++0x"
make: *** [succinct_dbg.o] Error 1

Any help with why this is happening would be greatly appreciated.

Thanks,

Bibaswan

Overwrite output folder

Hi,

This is not really a big issue.

I would like to know if there is an option to force overwriting of the output directory. Perhaps I missed this in the documentation somewhere.

At the moment there is no specific option, and I work around it by using the --continue option (which is actually for checkpointing purposes). This is fine for me, but maybe some users would like to restart their assembly and keep the folder name :)

-Shaman-

Error occurs when assembling contigs for k = 101

I have a small metagenomic dataset which is causing an error with megahit version: 0.1 beta

Here is my command line:
python megahit --k-min 81 --k-max 101 --k-step 4 --cpu-only -m 100000000000 -l 150 -r out.p.fa

I get the following error:
[Wed Nov 12 16:10:32 2014]: Extracting solid (k+1)-mers for k = 81
[Wed Nov 12 16:11:04 2014]: Building graph for k = 81
[Wed Nov 12 16:11:10 2014]: Assembling contigs from SdBG for k = 81
[Wed Nov 12 16:11:13 2014]: Extracting iterative edges from k = 81 to 85
[Wed Nov 12 16:11:23 2014]: Building graph for k = 85
[Wed Nov 12 16:11:24 2014]: Assembling contigs from SdBG for k = 85
[Wed Nov 12 16:11:24 2014]: Extracting iterative edges from k = 85 to 89
[Wed Nov 12 16:11:27 2014]: Building graph for k = 89
[Wed Nov 12 16:11:27 2014]: Assembling contigs from SdBG for k = 89
[Wed Nov 12 16:11:27 2014]: Extracting iterative edges from k = 89 to 93
[Wed Nov 12 16:11:31 2014]: Building graph for k = 93
[Wed Nov 12 16:11:31 2014]: Assembling contigs from SdBG for k = 93
[Wed Nov 12 16:11:31 2014]: Extracting iterative edges from k = 93 to 97
[Wed Nov 12 16:11:34 2014]: Building graph for k = 97
[Wed Nov 12 16:11:34 2014]: Assembling contigs from SdBG for k = 97
[Wed Nov 12 16:11:35 2014]: Extracting iterative edges from k = 97 to 101
[Wed Nov 12 16:11:38 2014]: Building graph for k = 101
[Wed Nov 12 16:11:38 2014]: Assembling contigs from SdBG for k = 101
Error occurs when assembling contigs for k = 101

Support for option like "--k-only"

There are various applications where you want to use a single k value for a fast, rough assembly.

An option like --k-only 51 would be helpful, instead of --k-min 51 --k-max 51.

Not very important, just an idea.
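
A sketch of how the wrapper might translate such a flag into its existing k list logic; the function and argument names are hypothetical:

    def build_k_list(k_min, k_max, k_step, k_only=None):
        # --k-only K would be shorthand for --k-min K --k-max K:
        # a single-k assembly with no iterative refinement.
        if k_only is not None:
            return [k_only]
        return range(k_min, k_max + 1, k_step)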

SuccinctDBG error in v0.2.0

Hello,
I tried installing v0.2.0 and received the error below. I was able to install v.0.1.4 without any problems.

$ make CXX=/usr/gcc-4.9.2/bin/g++-4.9.2
/usr/gcc-4.9.2/bin/g++-4.9.2 -O3 -Wall -funroll-loops -fprefetch-loop-arrays -fopenmp -std=c++0x -static-libgcc -lm -mpopcnt assembler.cpp rank_and_select.o succinct_dbg.o assembly_algorithms.o branch_group.o options_description.o unitig_graph.o compact_sequence.o -lz -o megahit_assemble
Undefined symbols for architecture x86_64:
  "std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >::__init(char const*, unsigned long)", referenced from:
      SuccinctDBG::LoadFromFile(char const*) in succinct_dbg.o
  "std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >::append(char const*)", referenced from:
      SuccinctDBG::LoadFromFile(char const*) in succinct_dbg.o
  "std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >::~basic_string()", referenced from:
      SuccinctDBG::LoadFromFile(char const*) in succinct_dbg.o
ld: symbol(s) not found for architecture x86_64
collect2: error: ld returned 1 exit status
make: *** [megahit_assemble] Error 1

I'm on OSX 10.9.5 and just installed GCC 4.9.2.

Thanks,
Mia

Compiler warning "ignoring packed attribute"

Is this affecting optimization?

gcc (Homebrew gcc 5.1.0) 5.1.0

In file included from hash_map.h:17:0,
                 from unitig_graph.h:29,
                 from assembler.cpp:33:
hash_table.h: In instantiation of 'struct HashTableNode<std::pair<long int, unsigned int> >':
hash_table.h:602:22:   required from 'void HashTable<Value, Key, HashFunc, ExtractKey, EqualKey>::clear() [with Value = std::pair<long int, unsigned int>; Key = long int; HashFunc = Hash<long int>; ExtractKey = Select1st<std::pair<long int, unsigned int> >; EqualKey = std::equal_to<long int>]'
hash_table.h:277:14:   required from 'HashTable<Value, Key, HashFunc, ExtractKey, EqualKey>::~HashTable() [with Value = std::pair<long int, unsigned int>; Key = long int; HashFunc = Hash<long int>; ExtractKey = Select1st<std::pair<long int, unsigned int> >; EqualKey = std::equal_to<long int>]'
hash_map.h:32:7:   required from here
hash_table.h:50:7: warning: ignoring packed attribute because of unpacked non-POD field 'std::pair<long int, unsigned int> HashTableNode<std::pair<long int, unsigned int> >::value'

Is multi-threaded megahit deterministic?

If I run megahit multiple times with, say, 64 threads, will I get a different assembly each time?

I.e., is the algorithm deterministic?

(Velvet was not deterministic in scaffolding PE reads, but gave similar results)

Error occurs when running "sdbg_builder count/read2sdbg"

I encountered this error running Megahit 0.3.0-beta-3:
MEGAHIT 0.3.0-beta3
--- [Fri Jun 26 11:20:33 2015] Start assembly. Number of CPU threads 16 ---
--- [Fri Jun 26 11:20:33 2015] k list: 31,41,51,61,71,81,91,101,111,121 ---
--- [LIB INFO] interleaved PE reads: unmerged/S.174_CBK.fastq.gz
--- [LIB INFO] interleaved PE reads: unmerged/S.178_YBM.fastq.gz
--- [LIB INFO] interleaved PE reads: unmerged/S.180_BBC.fastq.gz
--- [LIB INFO] interleaved PE reads: unmerged/S.182_BBI.fastq.gz
--- [LIB INFO] interleaved PE reads: unmerged/S.184_BBB.fastq.gz
--- [LIB INFO] interleaved PE reads: unmerged/S.186_BBA.fastq.gz
--- [LIB INFO] interleaved PE reads: unmerged/S.188_CBA.fastq.gz
--- [LIB INFO] interleaved PE reads: unmerged/S.190_CBC.fastq.gz
--- [LIB INFO] interleaved PE reads: unmerged/S.41_BBB.fastq.gz
--- [LIB INFO] interleaved PE reads: unmerged/S.42_YBB.fastq.gz
--- [LIB INFO] interleaved PE reads: unmerged/S.43_YBM.fastq.gz
--- [LIB INFO] Single-end reads: merged/S.174_CBK.fastq.gz
--- [LIB INFO] Single-end reads: merged/S.178_YBM.fastq.gz
--- [LIB INFO] Single-end reads: merged/S.180_BBC.fastq.gz
--- [LIB INFO] Single-end reads: merged/S.182_BBI.fastq.gz
--- [LIB INFO] Single-end reads: merged/S.184_BBB.fastq.gz
--- [LIB INFO] Single-end reads: merged/S.186_BBA.fastq.gz
--- [LIB INFO] Single-end reads: merged/S.188_CBA.fastq.gz
--- [LIB INFO] Single-end reads: merged/S.190_CBC.fastq.gz
--- [LIB INFO] Single-end reads: merged/S.41_BBB.fastq.gz
--- [LIB INFO] Single-end reads: merged/S.42_YBB.fastq.gz
--- [LIB INFO] Single-end reads: merged/S.43_YBM.fastq.gz
--- [Fri Jun 26 11:20:33 2015] Extracting solid (k+1)-mers for k = 31 ---
Error occurs when running "sdbg_builder count/read2sdbg"
[Exit code 255]

The program was run with these settings:
$HOME/lib/megahit-0.3.0-beta3/megahit -r
merged/S.174_CBK.fastq.gz,
merged/S.178_YBM.fastq.gz,
merged/S.180_BBC.fastq.gz,
merged/S.182_BBI.fastq.gz,
merged/S.184_BBB.fastq.gz,
merged/S.186_BBA.fastq.gz,
merged/S.188_CBA.fastq.gz,
merged/S.190_CBC.fastq.gz,
merged/S.41_BBB.fastq.gz,
merged/S.42_YBB.fastq.gz,
merged/S.43_YBM.fastq.gz
--12 unmerged/S.174_CBK.fastq.gz,
unmerged/S.178_YBM.fastq.gz,
unmerged/S.180_BBC.fastq.gz,
unmerged/S.182_BBI.fastq.gz,
unmerged/S.184_BBB.fastq.gz,
unmerged/S.186_BBA.fastq.gz,
unmerged/S.188_CBA.fastq.gz,
unmerged/S.190_CBC.fastq.gz,
unmerged/S.41_BBB.fastq.gz,
unmerged/S.42_YBB.fastq.gz,
unmerged/S.43_YBM.fastq.gz
-m 220e9
--k-min 31
--k-max 121
--k-step 10
-t 16 \

Any insight would be appreciated.

Question about building SdBG of next k.

Hi,
I have trouble understanding how megahit builds the graph for k > k_min. It seems megahit does not read from the original read files for k > k_min; instead it reads in a contig file, a multi file, and a remaining-reads file. What are the multi file and the remaining-reads file? Why do these files suffice to build the graph for the next k?
thanks

Broken contigs?

Thank you for adding FASTG support to MEGAHIT! I gave it a try on a simple dataset (synthetic reads from a plasmid sequence), and used Bandage to compare the results to Velvet at the same k-mer.

Here's the 61mer Velvet graph:
[image: velvet_61]

And here's the 61mer MEGAHIT graph:
[image: megahit_61]

You can see that while they have the same overall structure, some contigs in the MEGAHIT graph are broken into pieces, resulting in 'dead ends'. It is most notable for one of the long contigs on the left side of the images. This results in contigs that are not as long as they could be.

The data is fairly ideal (error-free reads and decent read depth), so I'm not sure why this would be happening. I could email you the reads if you wanted to try it for yourselves.

error while compiling extract_pe_reads.cpp

Trying to make on a CentOS release 6.7 system:
everything "makes" fine (megahit, megahit_asm_core, and megahit_sdbg_build are all built), but
make megahit_toolkit breaks on the compilation of tools/extract_pe_reads.cpp:

g++ -g -O2 -Wall -Wno-unused-function -Wno-array-bounds -D__STDC_FORMAT_MACROS -funroll-loops -fprefetch-loop-arrays -fopenmp -I. -std=c++0x -static-libgcc -m64 -mpopcnt -DPACKAGE_VERSION=""v1.0.3-5-g2efa1ec"" tools/toolkit.cpp tools/contigs_to_fastg.cpp tools/read_stat.cpp tools/trim_low_qual_tail.cpp tools/filter_by_len.cpp tools/extract_pe_reads.cpp -lm -lz -lpthread -o megahit_toolkit
tools/extract_pe_reads.cpp: In function ‘int main_extract_pe(int, char*)’:
tools/extract_pe_reads.cpp:106: error: expected initializer before ‘:’ token
tools/extract_pe_reads.cpp:110: error: could not convert ‘kseq_destroy(seq)’ to ‘bool’
tools/extract_pe_reads.cpp:111: error: expected ‘)’ before ‘;’ token
make: *** [megahit_toolkit] Error 1

I'm running: g++ (GCC) 4.4.7 20120313 (Red Hat 4.4.7-16)

setting memory

I noticed that megahit has an auto-detect feature for specifying GPU RAM. I was wondering if it would be feasible to add the same for the -m option. For example, I added the following to my local copy of megahit:

    elif option in ("-m", "--memory"):
        if value == "auto":
            # use 90% of the memory currently available on this host
            opt.host_mem = available_mem() * 0.90
        else:
            opt.host_mem = long(float(value))

def available_mem():
    # estimate available memory in bytes from `free` (Linux) or `vm_stat` (OS X)
    if sys.platform.find("linux") != -1:
        return long(float(os.popen("free").readlines()[1].split()[1]) * 1024)
    elif sys.platform.find("darwin") != -1:
        return long(float(os.popen("vm_stat").readlines()[1].split()[2]) * 4096)
    else:
        raise Usage(megahit_version_str + '\nos not determined for "-m auto" memory calculation.')

I am not sure if that is the best way to get available memory, but an auto feature helps usability when I am sending megahit jobs to different hardware.
