Giter Site home page Giter Site logo

Comments (11)

chhylp123 avatar chhylp123 commented on May 30, 2024

If -l0 is not specified, hifiasm -p prefix --high-het will enable -l2. Hifiasm with -l2 performs purge_dup steps so it should have bigger 50.

from hifiasm.

jelber2 avatar jelber2 commented on May 30, 2024

Well, hifiasm ends up merging several scaffolds because none are originally over 25 Mbp such that I think 4 or so are over 25 Mbp with hig-het. I can provide the assembly stats tomorrow. I guess I don’t understand how purge_dups can end up merging scaffolds.

from hifiasm.

jelber2 avatar jelber2 commented on May 30, 2024

Okay, so I changed the title of the post as I realized that I incorrectly described the problem in the title and the initial post.

$ /genetics/elbers/bin/hifiasm/hifiasm -o wm.asm -t75 -l0 ../HiFi-reads-rm-icecream.fasta.gz
$ awk '/^S/{print ">"$2"\n"$3}' wm.asm.p_ctg.gfa | fold > wm.asm.p_ctg.l0.fa

# note that N50 is a number and not a length with bbstats.sh
# note that L50 is a length and not a number with bbstats.sh

$ /opt/bbmap/bbstats.sh wm.asm.p_ctg.l0.fa
A	C	G	T	N	IUPAC	Other	GC	GC_stdev
0.3364	0.1635	0.1636	0.3365	0.0000	0.0000	0.0000	0.3271	0.0378

Main genome scaffold total:         	1926
Main genome contig total:           	1926
Main genome scaffold sequence total:	1089.879 MB
Main genome contig sequence total:  	1089.879 MB  	0.000% gap
Main genome scaffold N/L50:         	89/3.043 MB
Main genome contig N/L50:           	89/3.043 MB
Main genome scaffold N/L90:         	570/287.908 KB
Main genome contig N/L90:           	570/287.908 KB
Max scaffold length:                	23.994 MB
Max contig length:                  	23.994 MB
Number of scaffolds > 50 KB:        	1235
% main genome in scaffolds > 50 KB: 	98.07%


Minimum 	Number        	Number        	Total         	Total         	Scaffold
Scaffold	of            	of            	Scaffold      	Contig        	Contig  
Length  	Scaffolds     	Contigs       	Length        	Length        	Coverage
--------	--------------	--------------	--------------	--------------	--------
    All 	         1,926	         1,926	 1,089,878,911	 1,089,878,911	 100.00%
 2.5 KB 	         1,926	         1,926	 1,089,878,911	 1,089,878,911	 100.00%
   5 KB 	         1,924	         1,924	 1,089,869,399	 1,089,869,399	 100.00%
  10 KB 	         1,924	         1,924	 1,089,869,399	 1,089,869,399	 100.00%
  25 KB 	         1,734	         1,734	 1,085,989,968	 1,085,989,968	 100.00%
  50 KB 	         1,235	         1,235	 1,068,806,588	 1,068,806,588	 100.00%
 100 KB 	           964	           964	 1,049,179,798	 1,049,179,798	 100.00%
 250 KB 	           610	           610	   991,796,034	   991,796,034	 100.00%
 500 KB 	           384	           384	   910,161,360	   910,161,360	 100.00%
   1 MB 	           237	           237	   803,609,777	   803,609,777	 100.00%
 2.5 MB 	           114	           114	   616,698,085	   616,698,085	 100.00%
   5 MB 	            41	            41	   360,473,780	   360,473,780	 100.00%
  10 MB 	            11	            11	   158,615,155	   158,615,155	 100.00%


$ /genetics/elbers/bin/hifiasm/hifiasm -o wm.asm -t75 --high-het ../HiFi-reads-rm-icecream.fasta.gz
$ awk '/^S/{print ">"$2"\n"$3}' wm.asm.p_ctg.gfa | fold > wm.asm.p_ctg.high-het.fa

$ /opt/bbmap/bbstats.sh wm.asm.p_ctg.high-het.fa 
A	C	G	T	N	IUPAC	Other	GC	GC_stdev
0.3361	0.1637	0.1639	0.3363	0.0000	0.0000	0.0000	0.3276	0.0448

Main genome scaffold total:         	825
Main genome contig total:           	825
Main genome scaffold sequence total:	784.534 MB
Main genome contig sequence total:  	784.534 MB  	0.000% gap
Main genome scaffold N/L50:         	19/11.155 MB
Main genome contig N/L50:           	19/11.155 MB
Main genome scaffold N/L90:         	134/552.325 KB
Main genome contig N/L90:           	134/552.325 KB
Max scaffold length:                	47.052 MB
Max contig length:                  	47.052 MB
Number of scaffolds > 50 KB:        	518
% main genome in scaffolds > 50 KB: 	98.84%


Minimum 	Number        	Number        	Total         	Total         	Scaffold
Scaffold	of            	of            	Scaffold      	Contig        	Contig  
Length  	Scaffolds     	Contigs       	Length        	Length        	Coverage
--------	--------------	--------------	--------------	--------------	--------
    All 	           825	           825	   784,534,196	   784,534,196	 100.00%
 2.5 KB 	           825	           825	   784,534,196	   784,534,196	 100.00%
   5 KB 	           823	           823	   784,524,684	   784,524,684	 100.00%
  10 KB 	           823	           823	   784,524,684	   784,524,684	 100.00%
  25 KB 	           730	           730	   782,625,060	   782,625,060	 100.00%
  50 KB 	           518	           518	   775,437,403	   775,437,403	 100.00%
 100 KB 	           379	           379	   765,444,834	   765,444,834	 100.00%
 250 KB 	           229	           229	   741,486,833	   741,486,833	 100.00%
 500 KB 	           140	           140	   709,699,869	   709,699,869	 100.00%
   1 MB 	            93	            93	   677,986,980	   677,986,980	 100.00%
 2.5 MB 	            61	            61	   623,843,181	   623,843,181	 100.00%
   5 MB 	            36	            36	   533,469,808	   533,469,808	 100.00%
  10 MB 	            21	            21	   420,184,726	   420,184,726	 100.00%
  25 MB 	             5	             5	   174,385,273	   174,385,273	 100.00%

So it appears that hifiasm with --high-het or without -l0 is merging scaffolds?

Unless there is something wrong with the gfa file during the conversion with awk '/^S/{print ">"$2"\n"$3}' wm.asm.p_ctg.gfa | fold > wm.asm.p_ctg.high-het.fa

EDIT
with -l1, one does not get merged scaffolds

$ /genetics/elbers/bin/hifiasm/hifiasm -o wm.asm -t75 -l1 ../HiFi-reads-rm-icecream.fasta.gz
$ awk '/^S/{print ">"$2"\n"$3}' wm.asm.p_ctg.gfa | fold > wm.asm.p_ctg.l1.fa

$ /opt/bbmap/bbstats.sh wm.asm.p_ctg.l1.fa 
A	C	G	T	N	IUPAC	Other	GC	GC_stdev
0.3362	0.1637	0.1638	0.3363	0.0000	0.0000	0.0000	0.3275	0.0395

Main genome scaffold total:         	1403
Main genome contig total:           	1403
Main genome scaffold sequence total:	943.349 MB
Main genome contig sequence total:  	943.349 MB  	0.000% gap
Main genome scaffold N/L50:         	69/3.603 MB
Main genome contig N/L50:           	69/3.603 MB
Main genome scaffold N/L90:         	396/358.325 KB
Main genome contig N/L90:           	396/358.325 KB
Max scaffold length:                	23.994 MB
Max contig length:                  	23.994 MB
Number of scaffolds > 50 KB:        	920
% main genome in scaffolds > 50 KB: 	98.44%


Minimum 	Number        	Number        	Total         	Total         	Scaffold
Scaffold	of            	of            	Scaffold      	Contig        	Contig  
Length  	Scaffolds     	Contigs       	Length        	Length        	Coverage
--------	--------------	--------------	--------------	--------------	--------
    All 	         1,403	         1,403	   943,348,525	   943,348,525	 100.00%
 2.5 KB 	         1,403	         1,403	   943,348,525	   943,348,525	 100.00%
   5 KB 	         1,401	         1,401	   943,339,013	   943,339,013	 100.00%
  10 KB 	         1,401	         1,401	   943,339,013	   943,339,013	 100.00%
  25 KB 	         1,272	         1,272	   940,699,597	   940,699,597	 100.00%
  50 KB 	           920	           920	   928,669,349	   928,669,349	 100.00%
 100 KB 	           735	           735	   915,092,089	   915,092,089	 100.00%
 250 KB 	           474	           474	   872,365,387	   872,365,387	 100.00%
 500 KB 	           306	           306	   811,809,855	   811,809,855	 100.00%
   1 MB 	           206	           206	   739,587,945	   739,587,945	 100.00%
 2.5 MB 	           106	           106	   584,871,864	   584,871,864	 100.00%
   5 MB 	            40	            40	   352,486,974	   352,486,974	 100.00%
  10 MB 	            11	            11	   158,615,155	   158,615,155	 100.00%

from hifiasm.

HenrivdGeest avatar HenrivdGeest commented on May 30, 2024

My 2 cents; To see this clear, you are saying that its not just the n50 that increased. Increased n50 makes sense with purging the assembly. (if you would take an NG50, it should stay the same, right?)
However, your problem is that you reach ~616Mb with 140 contigs with your "default -l0" assembly, and you reach ~623Mb with 61 contigs for your "high-het -l2" assembly.
I don't think persee the high-het option is 'merging' contigs, although I don't know the algorithm, it can also be that because of different alignment parameters (which we can't see, or are not exposed) it just makes the assembler making bigger contigs.
Maybe the unitig joining step is different, maybe the unitigs themselves are already different?
Since the high-het option is experimental, I think it would help chhylp123 if you could check somehow if these bigger contigs are actually right, or that there are contain more errors.
Do you agree that It would make life easier if you would run 2 assemblies, with just enabling/disabling the --high-het option (leaving -l identical)?

from hifiasm.

jelber2 avatar jelber2 commented on May 30, 2024

@HenrivdGeest thanks for you input. Regarding:
I think it would help chhylp123 if you could check somehow if these bigger contigs are actually right, or that there are contain more errors. I am not sure how to do this with our current setup of a new genome that has never been assembled before with only HiFi reads other than examining the read alignments in IGV or something like that- something that I cannot do given time constraints at the moment.

Regarding:
Do you agree that It would make life easier if you would run 2 assemblies, with just enabling/disabling the --high-het option (leaving -l identical)? Excellent suggestion, please see below:

$ /genetics/elbers/bin/hifiasm/hifiasm -o wm.asm -t75 -l2 --high-het ../HiFi-reads-rm-icecream.fasta.gz
Reads has been loaded.
Loading ma_hit_ts from disk... 
ma_hit_ts has been read.
Loading ma_hit_ts from disk... 
ma_hit_ts has been read.
[M::ha_assemble::61.378*1.00] ==> loaded corrected reads and overlaps from disk
Writing raw unitig GFA to disk... 
Writing processed unitig GFA to disk... 
[M::purge_dups] purge duplication coverage threshold: 35
[M::purge_dups] purge duplication coverage threshold: 35
[M::adjust_utg_by_primary] primary contig coverage range: [22, infinity]
Writing primary contig GFA to disk... 
Writing alternate contig GFA to disk... 
Inconsistency threshold for low-quality regions in BED files: 70%
[M::main] Version: 0.13-r308
[M::main] CMD: /genetics/elbers/bin/hifiasm/hifiasm -o wm.asm -t75 -l2 --high-het ../HiFi-reads-rm-icecream.fasta.gz
[M::main] Real time: 439.650 sec; CPU: 488.492 sec; Peak RSS: 12.760 GB

$ awk '/^S/{print ">"$2"\n"$3}' wm.asm.p_ctg.gfa | fold > wm.asm.p_ctg.l2-high-het.fa

$ /opt/bbmap/bbstats.sh wm.asm.p_ctg.l2-high-het.fa
A	C	G	T	N	IUPAC	Other	GC	GC_stdev
0.3361	0.1637	0.1639	0.3363	0.0000	0.0000	0.0000	0.3276	0.0448

Main genome scaffold total:         	825
Main genome contig total:           	825
Main genome scaffold sequence total:	784.534 MB
Main genome contig sequence total:  	784.534 MB  	0.000% gap
Main genome scaffold N/L50:         	19/11.155 MB
Main genome contig N/L50:           	19/11.155 MB
Main genome scaffold N/L90:         	134/552.325 KB
Main genome contig N/L90:           	134/552.325 KB
Max scaffold length:                	47.052 MB
Max contig length:                  	47.052 MB
Number of scaffolds > 50 KB:        	518
% main genome in scaffolds > 50 KB: 	98.84%


Minimum 	Number        	Number        	Total         	Total         	Scaffold
Scaffold	of            	of            	Scaffold      	Contig        	Contig  
Length  	Scaffolds     	Contigs       	Length        	Length        	Coverage
--------	--------------	--------------	--------------	--------------	--------
    All 	           825	           825	   784,534,196	   784,534,196	 100.00%
 2.5 KB 	           825	           825	   784,534,196	   784,534,196	 100.00%
   5 KB 	           823	           823	   784,524,684	   784,524,684	 100.00%
  10 KB 	           823	           823	   784,524,684	   784,524,684	 100.00%
  25 KB 	           730	           730	   782,625,060	   782,625,060	 100.00%
  50 KB 	           518	           518	   775,437,403	   775,437,403	 100.00%
 100 KB 	           379	           379	   765,444,834	   765,444,834	 100.00%
 250 KB 	           229	           229	   741,486,833	   741,486,833	 100.00%
 500 KB 	           140	           140	   709,699,869	   709,699,869	 100.00%
   1 MB 	            93	            93	   677,986,980	   677,986,980	 100.00%
 2.5 MB 	            61	            61	   623,843,181	   623,843,181	 100.00%
   5 MB 	            36	            36	   533,469,808	   533,469,808	 100.00%
  10 MB 	            21	            21	   420,184,726	   420,184,726	 100.00%
  25 MB 	             5	             5	   174,385,273	   174,385,273	 100.00%

$ /genetics/elbers/bin/hifiasm/hifiasm -o wm.asm -t75 -l2 ../HiFi-reads-rm-icecream.fasta.gz
Reads has been loaded.
Loading ma_hit_ts from disk... 
ma_hit_ts has been read.
Loading ma_hit_ts from disk... 
ma_hit_ts has been read.
[M::ha_assemble::63.147*1.00] ==> loaded corrected reads and overlaps from disk
Writing raw unitig GFA to disk... 
Writing processed unitig GFA to disk... 
^[[B^[[B[M::purge_dups] purge duplication coverage threshold: 35
[M::purge_dups] purge duplication coverage threshold: 35
[M::adjust_utg_by_primary] primary contig coverage range: [22, infinity]
Writing primary contig GFA to disk... 
Writing alternate contig GFA to disk... 
Inconsistency threshold for low-quality regions in BED files: 70%
[M::main] Version: 0.13-r308
[M::main] CMD: /genetics/elbers/bin/hifiasm/hifiasm -o wm.asm -t75 -l2 ../HiFi-reads-rm-icecream.fasta.gz
[M::main] Real time: 394.678 sec; CPU: 443.092 sec; Peak RSS: 12.759 GB

$ awk '/^S/{print ">"$2"\n"$3}' wm.asm.p_ctg.gfa | fold > wm.asm.p_ctg.l2-no-high-het.fa

$ /opt/bbmap/bbstats.sh wm.asm.p_ctg.l2-no-high-het.fa
A	C	G	T	N	IUPAC	Other	GC	GC_stdev
0.3361	0.1637	0.1639	0.3363	0.0000	0.0000	0.0000	0.3276	0.0448

Main genome scaffold total:         	825
Main genome contig total:           	825
Main genome scaffold sequence total:	784.534 MB
Main genome contig sequence total:  	784.534 MB  	0.000% gap
Main genome scaffold N/L50:         	19/11.155 MB
Main genome contig N/L50:           	19/11.155 MB
Main genome scaffold N/L90:         	134/552.325 KB
Main genome contig N/L90:           	134/552.325 KB
Max scaffold length:                	47.052 MB
Max contig length:                  	47.052 MB
Number of scaffolds > 50 KB:        	518
% main genome in scaffolds > 50 KB: 	98.84%


Minimum 	Number        	Number        	Total         	Total         	Scaffold
Scaffold	of            	of            	Scaffold      	Contig        	Contig  
Length  	Scaffolds     	Contigs       	Length        	Length        	Coverage
--------	--------------	--------------	--------------	--------------	--------
    All 	           825	           825	   784,534,196	   784,534,196	 100.00%
 2.5 KB 	           825	           825	   784,534,196	   784,534,196	 100.00%
   5 KB 	           823	           823	   784,524,684	   784,524,684	 100.00%
  10 KB 	           823	           823	   784,524,684	   784,524,684	 100.00%
  25 KB 	           730	           730	   782,625,060	   782,625,060	 100.00%
  50 KB 	           518	           518	   775,437,403	   775,437,403	 100.00%
 100 KB 	           379	           379	   765,444,834	   765,444,834	 100.00%
 250 KB 	           229	           229	   741,486,833	   741,486,833	 100.00%
 500 KB 	           140	           140	   709,699,869	   709,699,869	 100.00%
   1 MB 	            93	            93	   677,986,980	   677,986,980	 100.00%
 2.5 MB 	            61	            61	   623,843,181	   623,843,181	 100.00%
   5 MB 	            36	            36	   533,469,808	   533,469,808	 100.00%
  10 MB 	            21	            21	   420,184,726	   420,184,726	 100.00%
  25 MB 	             5	             5	   174,385,273	   174,385,273	 100.00%

And you are correct that I did "assume" that hifiasm was merging contigs/scaffolds without actually verifying that.

from hifiasm.

lh3 avatar lh3 commented on May 30, 2024

What is the input data? Haplotig duplicate purging at the -l2 level joins haplotigs from different haplotypes which is disallowed at -l0. For heterozygous samples, -l2 is highly recommended. This is why tools like purge_dups exist.

from hifiasm.

jelber2 avatar jelber2 commented on May 30, 2024

It is a Dipteran fly with an estimated 1.5% heterozygosity (based on using genomescope on the HiFi reads - Illumina reads from 5 other individuals suggest similar heterozygosity levels) with ~30x HiFi read coverage.

from hifiasm.

lh3 avatar lh3 commented on May 30, 2024

GenomeScope seems to overestimate heterozygosity at our hand. The real heterozygosity is probably closer to 1%. At this level, definitely use -l2. As I remember, you need even higher heterozygosity for --high-het to kick in.

from hifiasm.

jelber2 avatar jelber2 commented on May 30, 2024

Ok. If I run purge_dups manually on the p_ctg from -l0 output, I don’t end up with larger scaffolds, so l am a little confused.

from hifiasm.

lh3 avatar lh3 commented on May 30, 2024

Haplotig duplicate purging at the -l2 level joins haplotigs from different haplotypes

Standalone purge_dups doesn't do this.

from hifiasm.

jelber2 avatar jelber2 commented on May 30, 2024

Ok thank you so much!

from hifiasm.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.