Comments (11)
If -l0
is not specified, hifiasm -p prefix --high-het
will enable -l2
. Hifiasm with -l2
performs purge_dup steps so it should have bigger 50.
from hifiasm.
Well, hifiasm ends up merging several scaffolds because none are originally over 25 Mbp such that I think 4 or so are over 25 Mbp with hig-het. I can provide the assembly stats tomorrow. I guess I don’t understand how purge_dups can end up merging scaffolds.
from hifiasm.
Okay, so I changed the title of the post as I realized that I incorrectly described the problem in the title and the initial post.
$ /genetics/elbers/bin/hifiasm/hifiasm -o wm.asm -t75 -l0 ../HiFi-reads-rm-icecream.fasta.gz
$ awk '/^S/{print ">"$2"\n"$3}' wm.asm.p_ctg.gfa | fold > wm.asm.p_ctg.l0.fa
# note that N50 is a number and not a length with bbstats.sh
# note that L50 is a length and not a number with bbstats.sh
$ /opt/bbmap/bbstats.sh wm.asm.p_ctg.l0.fa
A C G T N IUPAC Other GC GC_stdev
0.3364 0.1635 0.1636 0.3365 0.0000 0.0000 0.0000 0.3271 0.0378
Main genome scaffold total: 1926
Main genome contig total: 1926
Main genome scaffold sequence total: 1089.879 MB
Main genome contig sequence total: 1089.879 MB 0.000% gap
Main genome scaffold N/L50: 89/3.043 MB
Main genome contig N/L50: 89/3.043 MB
Main genome scaffold N/L90: 570/287.908 KB
Main genome contig N/L90: 570/287.908 KB
Max scaffold length: 23.994 MB
Max contig length: 23.994 MB
Number of scaffolds > 50 KB: 1235
% main genome in scaffolds > 50 KB: 98.07%
Minimum Number Number Total Total Scaffold
Scaffold of of Scaffold Contig Contig
Length Scaffolds Contigs Length Length Coverage
-------- -------------- -------------- -------------- -------------- --------
All 1,926 1,926 1,089,878,911 1,089,878,911 100.00%
2.5 KB 1,926 1,926 1,089,878,911 1,089,878,911 100.00%
5 KB 1,924 1,924 1,089,869,399 1,089,869,399 100.00%
10 KB 1,924 1,924 1,089,869,399 1,089,869,399 100.00%
25 KB 1,734 1,734 1,085,989,968 1,085,989,968 100.00%
50 KB 1,235 1,235 1,068,806,588 1,068,806,588 100.00%
100 KB 964 964 1,049,179,798 1,049,179,798 100.00%
250 KB 610 610 991,796,034 991,796,034 100.00%
500 KB 384 384 910,161,360 910,161,360 100.00%
1 MB 237 237 803,609,777 803,609,777 100.00%
2.5 MB 114 114 616,698,085 616,698,085 100.00%
5 MB 41 41 360,473,780 360,473,780 100.00%
10 MB 11 11 158,615,155 158,615,155 100.00%
$ /genetics/elbers/bin/hifiasm/hifiasm -o wm.asm -t75 --high-het ../HiFi-reads-rm-icecream.fasta.gz
$ awk '/^S/{print ">"$2"\n"$3}' wm.asm.p_ctg.gfa | fold > wm.asm.p_ctg.high-het.fa
$ /opt/bbmap/bbstats.sh wm.asm.p_ctg.high-het.fa
A C G T N IUPAC Other GC GC_stdev
0.3361 0.1637 0.1639 0.3363 0.0000 0.0000 0.0000 0.3276 0.0448
Main genome scaffold total: 825
Main genome contig total: 825
Main genome scaffold sequence total: 784.534 MB
Main genome contig sequence total: 784.534 MB 0.000% gap
Main genome scaffold N/L50: 19/11.155 MB
Main genome contig N/L50: 19/11.155 MB
Main genome scaffold N/L90: 134/552.325 KB
Main genome contig N/L90: 134/552.325 KB
Max scaffold length: 47.052 MB
Max contig length: 47.052 MB
Number of scaffolds > 50 KB: 518
% main genome in scaffolds > 50 KB: 98.84%
Minimum Number Number Total Total Scaffold
Scaffold of of Scaffold Contig Contig
Length Scaffolds Contigs Length Length Coverage
-------- -------------- -------------- -------------- -------------- --------
All 825 825 784,534,196 784,534,196 100.00%
2.5 KB 825 825 784,534,196 784,534,196 100.00%
5 KB 823 823 784,524,684 784,524,684 100.00%
10 KB 823 823 784,524,684 784,524,684 100.00%
25 KB 730 730 782,625,060 782,625,060 100.00%
50 KB 518 518 775,437,403 775,437,403 100.00%
100 KB 379 379 765,444,834 765,444,834 100.00%
250 KB 229 229 741,486,833 741,486,833 100.00%
500 KB 140 140 709,699,869 709,699,869 100.00%
1 MB 93 93 677,986,980 677,986,980 100.00%
2.5 MB 61 61 623,843,181 623,843,181 100.00%
5 MB 36 36 533,469,808 533,469,808 100.00%
10 MB 21 21 420,184,726 420,184,726 100.00%
25 MB 5 5 174,385,273 174,385,273 100.00%
So it appears that hifiasm with --high-het
or without -l0
is merging scaffolds?
Unless there is something wrong with the gfa file during the conversion with awk '/^S/{print ">"$2"\n"$3}' wm.asm.p_ctg.gfa | fold > wm.asm.p_ctg.high-het.fa
EDIT
with -l1
, one does not get merged scaffolds
$ /genetics/elbers/bin/hifiasm/hifiasm -o wm.asm -t75 -l1 ../HiFi-reads-rm-icecream.fasta.gz
$ awk '/^S/{print ">"$2"\n"$3}' wm.asm.p_ctg.gfa | fold > wm.asm.p_ctg.l1.fa
$ /opt/bbmap/bbstats.sh wm.asm.p_ctg.l1.fa
A C G T N IUPAC Other GC GC_stdev
0.3362 0.1637 0.1638 0.3363 0.0000 0.0000 0.0000 0.3275 0.0395
Main genome scaffold total: 1403
Main genome contig total: 1403
Main genome scaffold sequence total: 943.349 MB
Main genome contig sequence total: 943.349 MB 0.000% gap
Main genome scaffold N/L50: 69/3.603 MB
Main genome contig N/L50: 69/3.603 MB
Main genome scaffold N/L90: 396/358.325 KB
Main genome contig N/L90: 396/358.325 KB
Max scaffold length: 23.994 MB
Max contig length: 23.994 MB
Number of scaffolds > 50 KB: 920
% main genome in scaffolds > 50 KB: 98.44%
Minimum Number Number Total Total Scaffold
Scaffold of of Scaffold Contig Contig
Length Scaffolds Contigs Length Length Coverage
-------- -------------- -------------- -------------- -------------- --------
All 1,403 1,403 943,348,525 943,348,525 100.00%
2.5 KB 1,403 1,403 943,348,525 943,348,525 100.00%
5 KB 1,401 1,401 943,339,013 943,339,013 100.00%
10 KB 1,401 1,401 943,339,013 943,339,013 100.00%
25 KB 1,272 1,272 940,699,597 940,699,597 100.00%
50 KB 920 920 928,669,349 928,669,349 100.00%
100 KB 735 735 915,092,089 915,092,089 100.00%
250 KB 474 474 872,365,387 872,365,387 100.00%
500 KB 306 306 811,809,855 811,809,855 100.00%
1 MB 206 206 739,587,945 739,587,945 100.00%
2.5 MB 106 106 584,871,864 584,871,864 100.00%
5 MB 40 40 352,486,974 352,486,974 100.00%
10 MB 11 11 158,615,155 158,615,155 100.00%
from hifiasm.
My 2 cents; To see this clear, you are saying that its not just the n50 that increased. Increased n50 makes sense with purging the assembly. (if you would take an NG50, it should stay the same, right?)
However, your problem is that you reach ~616Mb with 140 contigs with your "default -l0" assembly, and you reach ~623Mb with 61 contigs for your "high-het -l2" assembly.
I don't think persee the high-het option is 'merging' contigs, although I don't know the algorithm, it can also be that because of different alignment parameters (which we can't see, or are not exposed) it just makes the assembler making bigger contigs.
Maybe the unitig joining step is different, maybe the unitigs themselves are already different?
Since the high-het option is experimental, I think it would help chhylp123 if you could check somehow if these bigger contigs are actually right, or that there are contain more errors.
Do you agree that It would make life easier if you would run 2 assemblies, with just enabling/disabling the --high-het option (leaving -l identical)?
from hifiasm.
@HenrivdGeest thanks for you input. Regarding:
I think it would help chhylp123 if you could check somehow if these bigger contigs are actually right, or that there are contain more errors.
I am not sure how to do this with our current setup of a new genome that has never been assembled before with only HiFi reads other than examining the read alignments in IGV or something like that- something that I cannot do given time constraints at the moment.
Regarding:
Do you agree that It would make life easier if you would run 2 assemblies, with just enabling/disabling the --high-het option (leaving -l identical)?
Excellent suggestion, please see below:
$ /genetics/elbers/bin/hifiasm/hifiasm -o wm.asm -t75 -l2 --high-het ../HiFi-reads-rm-icecream.fasta.gz
Reads has been loaded.
Loading ma_hit_ts from disk...
ma_hit_ts has been read.
Loading ma_hit_ts from disk...
ma_hit_ts has been read.
[M::ha_assemble::61.378*1.00] ==> loaded corrected reads and overlaps from disk
Writing raw unitig GFA to disk...
Writing processed unitig GFA to disk...
[M::purge_dups] purge duplication coverage threshold: 35
[M::purge_dups] purge duplication coverage threshold: 35
[M::adjust_utg_by_primary] primary contig coverage range: [22, infinity]
Writing primary contig GFA to disk...
Writing alternate contig GFA to disk...
Inconsistency threshold for low-quality regions in BED files: 70%
[M::main] Version: 0.13-r308
[M::main] CMD: /genetics/elbers/bin/hifiasm/hifiasm -o wm.asm -t75 -l2 --high-het ../HiFi-reads-rm-icecream.fasta.gz
[M::main] Real time: 439.650 sec; CPU: 488.492 sec; Peak RSS: 12.760 GB
$ awk '/^S/{print ">"$2"\n"$3}' wm.asm.p_ctg.gfa | fold > wm.asm.p_ctg.l2-high-het.fa
$ /opt/bbmap/bbstats.sh wm.asm.p_ctg.l2-high-het.fa
A C G T N IUPAC Other GC GC_stdev
0.3361 0.1637 0.1639 0.3363 0.0000 0.0000 0.0000 0.3276 0.0448
Main genome scaffold total: 825
Main genome contig total: 825
Main genome scaffold sequence total: 784.534 MB
Main genome contig sequence total: 784.534 MB 0.000% gap
Main genome scaffold N/L50: 19/11.155 MB
Main genome contig N/L50: 19/11.155 MB
Main genome scaffold N/L90: 134/552.325 KB
Main genome contig N/L90: 134/552.325 KB
Max scaffold length: 47.052 MB
Max contig length: 47.052 MB
Number of scaffolds > 50 KB: 518
% main genome in scaffolds > 50 KB: 98.84%
Minimum Number Number Total Total Scaffold
Scaffold of of Scaffold Contig Contig
Length Scaffolds Contigs Length Length Coverage
-------- -------------- -------------- -------------- -------------- --------
All 825 825 784,534,196 784,534,196 100.00%
2.5 KB 825 825 784,534,196 784,534,196 100.00%
5 KB 823 823 784,524,684 784,524,684 100.00%
10 KB 823 823 784,524,684 784,524,684 100.00%
25 KB 730 730 782,625,060 782,625,060 100.00%
50 KB 518 518 775,437,403 775,437,403 100.00%
100 KB 379 379 765,444,834 765,444,834 100.00%
250 KB 229 229 741,486,833 741,486,833 100.00%
500 KB 140 140 709,699,869 709,699,869 100.00%
1 MB 93 93 677,986,980 677,986,980 100.00%
2.5 MB 61 61 623,843,181 623,843,181 100.00%
5 MB 36 36 533,469,808 533,469,808 100.00%
10 MB 21 21 420,184,726 420,184,726 100.00%
25 MB 5 5 174,385,273 174,385,273 100.00%
$ /genetics/elbers/bin/hifiasm/hifiasm -o wm.asm -t75 -l2 ../HiFi-reads-rm-icecream.fasta.gz
Reads has been loaded.
Loading ma_hit_ts from disk...
ma_hit_ts has been read.
Loading ma_hit_ts from disk...
ma_hit_ts has been read.
[M::ha_assemble::63.147*1.00] ==> loaded corrected reads and overlaps from disk
Writing raw unitig GFA to disk...
Writing processed unitig GFA to disk...
^[[B^[[B[M::purge_dups] purge duplication coverage threshold: 35
[M::purge_dups] purge duplication coverage threshold: 35
[M::adjust_utg_by_primary] primary contig coverage range: [22, infinity]
Writing primary contig GFA to disk...
Writing alternate contig GFA to disk...
Inconsistency threshold for low-quality regions in BED files: 70%
[M::main] Version: 0.13-r308
[M::main] CMD: /genetics/elbers/bin/hifiasm/hifiasm -o wm.asm -t75 -l2 ../HiFi-reads-rm-icecream.fasta.gz
[M::main] Real time: 394.678 sec; CPU: 443.092 sec; Peak RSS: 12.759 GB
$ awk '/^S/{print ">"$2"\n"$3}' wm.asm.p_ctg.gfa | fold > wm.asm.p_ctg.l2-no-high-het.fa
$ /opt/bbmap/bbstats.sh wm.asm.p_ctg.l2-no-high-het.fa
A C G T N IUPAC Other GC GC_stdev
0.3361 0.1637 0.1639 0.3363 0.0000 0.0000 0.0000 0.3276 0.0448
Main genome scaffold total: 825
Main genome contig total: 825
Main genome scaffold sequence total: 784.534 MB
Main genome contig sequence total: 784.534 MB 0.000% gap
Main genome scaffold N/L50: 19/11.155 MB
Main genome contig N/L50: 19/11.155 MB
Main genome scaffold N/L90: 134/552.325 KB
Main genome contig N/L90: 134/552.325 KB
Max scaffold length: 47.052 MB
Max contig length: 47.052 MB
Number of scaffolds > 50 KB: 518
% main genome in scaffolds > 50 KB: 98.84%
Minimum Number Number Total Total Scaffold
Scaffold of of Scaffold Contig Contig
Length Scaffolds Contigs Length Length Coverage
-------- -------------- -------------- -------------- -------------- --------
All 825 825 784,534,196 784,534,196 100.00%
2.5 KB 825 825 784,534,196 784,534,196 100.00%
5 KB 823 823 784,524,684 784,524,684 100.00%
10 KB 823 823 784,524,684 784,524,684 100.00%
25 KB 730 730 782,625,060 782,625,060 100.00%
50 KB 518 518 775,437,403 775,437,403 100.00%
100 KB 379 379 765,444,834 765,444,834 100.00%
250 KB 229 229 741,486,833 741,486,833 100.00%
500 KB 140 140 709,699,869 709,699,869 100.00%
1 MB 93 93 677,986,980 677,986,980 100.00%
2.5 MB 61 61 623,843,181 623,843,181 100.00%
5 MB 36 36 533,469,808 533,469,808 100.00%
10 MB 21 21 420,184,726 420,184,726 100.00%
25 MB 5 5 174,385,273 174,385,273 100.00%
And you are correct that I did "assume" that hifiasm was merging contigs/scaffolds without actually verifying that.
from hifiasm.
What is the input data? Haplotig duplicate purging at the -l2 level joins haplotigs from different haplotypes which is disallowed at -l0. For heterozygous samples, -l2 is highly recommended. This is why tools like purge_dups exist.
from hifiasm.
It is a Dipteran fly with an estimated 1.5% heterozygosity (based on using genomescope on the HiFi reads - Illumina reads from 5 other individuals suggest similar heterozygosity levels) with ~30x HiFi read coverage.
from hifiasm.
GenomeScope seems to overestimate heterozygosity at our hand. The real heterozygosity is probably closer to 1%. At this level, definitely use -l2. As I remember, you need even higher heterozygosity for --high-het to kick in.
from hifiasm.
Ok. If I run purge_dups manually on the p_ctg from -l0 output, I don’t end up with larger scaffolds, so l am a little confused.
from hifiasm.
Haplotig duplicate purging at the -l2 level joins haplotigs from different haplotypes
Standalone purge_dups doesn't do this.
from hifiasm.
Ok thank you so much!
from hifiasm.
Related Issues (20)
- Output interpretation with HiFi+ONT+HiC with inbred samples + `-l0` HOT 1
- low BUSCO scores HOT 1
- Mitigate Overlapping Sequence Assignments in Haplotypes HOT 3
- Help!!! Segmentation fault (core dumped) HOT 1
- Question about the depth of ONT ultra-long reads HOT 1
- Homotetraploid, super-large genome, with different parameters, the size of p_utg varies greatly? HOT 1
- setting K parameter in yak HOT 2
- how to make the correct genome size estimation for allotetraploid species? HOT 2
- Possible missing one haplotype in human assemblies HOT 2
- No haploid.gfa files output in trio-binning mode HOT 3
- Hifi + Hi-c + ONT assembly fails
- In Trio-binning, always more on hap1 despite (almost) same sequences for paternal and maternal
- discontinuous assembly with shorter pacbio hifi reads but high coverage HOT 2
- Is x20 of Hifi data enough to construct draft assembly of 6.5Gb genome? HOT 1
- line 8: 110334 Aborted(core dumped) HOT 1
- Ultra Long intergration failed: no output for UL kmer counting HOT 3
- missing 8Mb sequences in the assembly HOT 5
- Empty haplotype 2 gfa files by ONT integration HOT 1
- Basic Question About HiFi Input HOT 3
- Spend too long times to run hifiasm HOT 1
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from hifiasm.