Are partial contigs with half-reduced CCS read coverage misassembled? No, on the contrary, very long and novel tandem repeat sequences have been assembled

chhylp123 commented on May 29, 2024
Are partial contigs with half-reduced CCS read coverage misassembled? No, on the contrary, very long and novel tandem repeat sequences have been assembled

from hifiasm.

Comments (11)

chhylp123 commented on May 29, 2024

Could you please zoom in on the utg graph around this 20-Mb region? I'd like to see what the subgraph looks like. Also, could you please show the following numbers from the hifiasm log?

[M::ha_pt_gen] peak_hom: []; peak_het: []
[M::purge_dups] purge duplication coverage threshold: []
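
For anyone pulling these numbers out of a long log, a minimal Python sketch follows. It assumes the log contains lines worded exactly like the two templates above; the function name `extract_coverage_stats` and the sample text are illustrative and not part of hifiasm.

```python
import re

# Patterns mirror the two log lines quoted in this thread; other hifiasm
# versions may word them differently (assumption).
PEAK_RE = re.compile(r"\[M::ha_pt_gen\] peak_hom: (-?\d+); peak_het: (-?\d+)")
PURGE_RE = re.compile(
    r"\[M::purge_dups\] purge duplication coverage threshold: (-?\d+)"
)


def extract_coverage_stats(log_text):
    """Return the last reported peak_hom, peak_het and purge threshold.

    Values stay None when the corresponding line is absent from the log.
    """
    stats = {"peak_hom": None, "peak_het": None, "purge_threshold": None}
    for line in log_text.splitlines():
        match = PEAK_RE.search(line)
        if match:
            stats["peak_hom"] = int(match.group(1))
            stats["peak_het"] = int(match.group(2))
        match = PURGE_RE.search(line)
        if match:
            stats["purge_threshold"] = int(match.group(1))
    return stats


# Sample text in the same format as the log lines reported in this thread.
example_log = (
    "[M::ha_pt_gen] peak_hom: 25; peak_het: -1\n"
    "[M::purge_dups] purge duplication coverage threshold: 31\n"
)
stats = extract_coverage_stats(example_log)
print(stats)
```

In practice `log_text` would be read from wherever the hifiasm log was redirected when the assembly was run.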

mydongan commented on May 29, 2024

Thanks! I aligned a 5-Mb sequence from the 20-Mb region to all utg FASTA sequences and found that it mapped to utg000017l (47M).
image

image

The requested values from the hifiasm log are:
[M::ha_pt_gen] peak_hom: 25; peak_het: -1
[M::purge_dups] purge duplication coverage threshold: 31

lh3 commented on May 29, 2024

Based on the mapping of genetic markers, can you assign this 20Mb to other chromosomes?

mydongan commented on May 29, 2024

Thank you, Dr. Li! Strangely, this 20-Mb region does not have any genetic markers.

lh3 commented on May 29, 2024

A few more things to try:

  • Blast pieces from this 20Mb region against the "nt" database and check the top hits.
  • Run RepeatMasker to check the repeat content.
  • When you map genetic markers, do you see any hits to this 20Mb or do most hits here have ambiguous mappings?

mydongan commented on May 29, 2024

Thank you very much for your suggestions!

  • I BLASTed 5 Mb retrieved from this 20Mb region against the nt database, and all of the top 10 hits are sequences from the same plant as mine, so we can exclude sequence contamination. Further, I also aligned this 5-Mb sequence to a high-quality reference genome (contig N50 47 Mb) and found that it partially mapped to many unanchored scaffolds.
  • I have done repeat annotation with EDTA, but ran only the LTR annotation step of the pipeline. I found that LTR density is lower in this 20-Mb region of chr7.
    image
  • Thank you for reminding me. I had filtered the markers and retained only uniquely mapped genetic markers, so I mistakenly concluded that this 20Mb region had no genetic markers. I rechecked the markers and found that many of them map ambiguously to this region, and these markers are located in different linkage groups.
  • Maybe this region is rDNA or another repeat element?

lh3 commented on May 29, 2024
  • Is your sample an inbred diploid, i.e., two sets of nearly identical chromosomes?
  • You should check for rDNA and centromeric satellites in this 20Mb.
  • Run HiCanu and see how HiCanu assembles this region.

chhylp123 commented on May 29, 2024
  • If your sample is inbred, it would be better to disable purge_dups using '-l0'.
  • To find the unitigs in r_utg that correspond to this 20Mb region, a better way is to find the reads in this region (the A-lines in p_ctg) and then grep for them in r_utg. I assume the region corresponds to the tangle between utg000017l and utg000018l. If it is a potential misassembly, the safe option is to drop the 20Mb region from p_ctg, cutting at the boundaries of the tangle.
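
The read lookup described above can be sketched in Python as follows. This is a rough illustration, not hifiasm code: it assumes A-lines are tab-separated with the columns `A`, contig or unitig name, start position on the contig, strand, read name, read start, read end, followed by optional tags. The contig name `ptg000007l`, the read names, and the coordinates are made up for the example.

```python
from collections import Counter


def parse_a_lines(gfa_lines):
    """Yield (seq_name, start_on_seq, read_name, read_span) from GFA A-lines.

    Assumed columns: A, seq name, start on seq, strand, read name,
    read start, read end, then optional tags.
    """
    for line in gfa_lines:
        if not line.startswith("A\t"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 7:
            continue  # skip malformed lines
        span = int(fields[6]) - int(fields[5])
        yield fields[1], int(fields[2]), fields[4], span


def reads_in_region(p_ctg_lines, contig, start, end):
    """Names of reads whose placement on `contig` overlaps [start, end)."""
    reads = set()
    for name, pos, read, span in parse_a_lines(p_ctg_lines):
        if name == contig and pos < end and pos + span > start:
            reads.add(read)
    return reads


def unitigs_for_reads(r_utg_lines, reads):
    """Count, per unitig, how many of the given reads it contains."""
    hits = Counter()
    for name, _pos, read, _span in parse_a_lines(r_utg_lines):
        if read in reads:
            hits[name] += 1
    return hits


# Hypothetical toy data: three 15-kb reads tiled along one primary contig.
p_ctg = [
    "S\tptg000007l\t*\tLN:i:35000",
    "A\tptg000007l\t0\t+\tread1\t0\t15000\tid:i:1\tHG:A:a",
    "A\tptg000007l\t10000\t+\tread2\t0\t15000\tid:i:2\tHG:A:a",
    "A\tptg000007l\t20000\t-\tread3\t0\t15000\tid:i:3\tHG:A:a",
]
r_utg = [
    "A\tutg000017l\t0\t+\tread1\t0\t15000\tid:i:1\tHG:A:a",
    "A\tutg000017l\t10000\t+\tread2\t0\t15000\tid:i:2\tHG:A:a",
    "A\tutg000018l\t0\t-\tread3\t0\t15000\tid:i:3\tHG:A:a",
]

region_reads = reads_in_region(p_ctg, "ptg000007l", 16000, 30000)
unitig_hits = unitigs_for_reads(r_utg, region_reads)
print(sorted(region_reads), dict(unitig_hits))
```

With real data, the two line lists would come from opening the p_ctg and r_utg GFA files; unitigs with many hits are the candidates for the region.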

mydongan commented on May 29, 2024

Thanks, all!

Yes, it is an inbred haploid; heterozygosity was 0.232% in the genome survey analysis, and I assembled the genome using "-l0".

After repeat annotation, 85% of this region was annotated as the 180-bp knob repeat, a tandem repeat found in plants.
image
This region has therefore not been assembled in previous studies, which shows that HiFi reads and hifiasm are very efficient and accurate for assembling long tandem repeats. Thank you all again!
Furthermore, I ran a nucmer alignment of utg000017l against itself, and we can also see that the terminal 11 Mb are tandem repeats.
image
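
As a toy illustration of why a self-alignment exposes a tandem array, the sketch below scans a sequence for the shift at which it best matches itself. This is not nucmer and not the analysis actually run here; it is a naive shift-identity scan, and the 7-bp unit is a made-up stand-in for the 180-bp knob repeat.

```python
def shift_identity(seq, shift):
    """Fraction of positions i where seq[i] == seq[i + shift]."""
    compared = len(seq) - shift
    if compared <= 0:
        return 0.0
    matches = sum(1 for i in range(compared) if seq[i] == seq[i + shift])
    return matches / compared


def best_period(seq, max_period):
    """Return (shift, identity) for the smallest shift with the top identity."""
    best_shift, best_score = 0, -1.0
    for shift in range(1, max_period + 1):
        score = shift_identity(seq, shift)
        if score > best_score:  # strict '>' keeps the smallest shift on ties
            best_shift, best_score = shift, score
    return best_shift, best_score


# Hypothetical 7-bp unit repeated 20 times.
unit = "ACGGTCA"
period, identity = best_period(unit * 20, max_period=20)
print(period, identity)
```

On a perfect tandem array the identity is 1.0 at the unit length and its multiples, which is the same signal that appears as parallel off-diagonal lines in a self-alignment dot plot.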

However, I still do not understand why the CCS read coverage is reduced by half in this region.

lh3 commented on May 29, 2024

As someone was referring to this issue, I have reread the thread. I am seeing:

  • A 20Mb region on the chr7 scaffold that has half of the expected coverage.
  • The first 5Mb in this 20Mb is located at the end of utg000017l.

If this description is right, this is not a contig misassembly. You have an inbred diploid genome. One possibility is that this region is diverged between the two haplotypes although the rest of the genome is nearly homozygous. The solution is to remove the diverged copy from the primary assembly. By the way, when you scaffolded the contigs, have you discarded prefix.a_ctg.gfa?
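
To make "half of the expected coverage" concrete, here is a small Python sketch that flags windows whose mean read depth sits near half of the homozygous peak. It assumes per-window depths have already been computed by some other tool; the window values, the 25% tolerance, and the function name `flag_half_coverage` are illustrative choices, and `peak_hom=25` is the value reported in the log earlier in this thread.

```python
def flag_half_coverage(window_depths, peak_hom, tolerance=0.25):
    """Return indices of windows whose depth is close to peak_hom / 2.

    A window is flagged when its depth lies within +/- tolerance
    (as a fraction) of half the homozygous coverage peak.
    """
    target = peak_hom / 2
    low = target * (1 - tolerance)
    high = target * (1 + tolerance)
    return [i for i, depth in enumerate(window_depths) if low <= depth <= high]


# Hypothetical mean depths for six consecutive windows along a scaffold.
depths = [26.0, 24.5, 13.0, 12.1, 11.8, 25.2]
flagged = flag_half_coverage(depths, peak_hom=25)
print(flagged)
```

A long run of flagged windows in an otherwise homozygous assembly would be consistent with the diverged-haplotype explanation above, although it does not prove it.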

mydongan commented on May 29, 2024

Maybe you are right: this repeat region with half coverage may have diverged rapidly between the two haplotypes. Yes, I only used prefix.p_ctg.gfa for further assembly.
