TypeError issue about hapo-g (CLOSED)

aldendirks commented on June 26, 2024
TypeError issue

from hapo-g.

Comments (12)

aldendirks commented on June 26, 2024

The error appeared when launching Hapo-G:

cat ~/logs/polish-phase-59191521.log

Checking dependencies...
	Found bwa
	Found samtools

Non alphanumeric characters detected in fasta headers. Renaming sequences.

Generating bwa index...
Done in 19 seconds

Launching mapping on genome...
Done in 244 seconds

Indexing the BAM file...
Done in 40 seconds

Fragmenting the genome into 36 chunks of 692,460 bases (depending of scaffold sizes)
Done in 0 seconds

Extracting bam for each chunk
Done in 52 seconds

Launching Hapo-G on each chunk

bistace commented on June 26, 2024

Hello,

I think that you must have downloaded Hapo-G just when I made changes. If you clone the master branch, everything should be fixed (I don't encounter this error on my side anymore).

aldendirks commented on June 26, 2024

Thank you. Do I need to delete the old version and rebuild, or is there a command that updates it without a rebuild? Sorry, I'm not very familiar with git.

aldendirks commented on June 26, 2024

According to the timestamps, I installed it after the latest changes were made on GitHub. When I run git pull origin master I get the message "Already up to date."

aldendirks commented on June 26, 2024

For good measure I decided to rerun it and I got the same error message.

bistace commented on June 26, 2024

I just reran it too and could not reproduce the issue. I have pushed changes anyway to try to fix the problem. Could you please run git pull and launch Hapo-G again?

aldendirks commented on June 26, 2024

It works! Thank you.

aldendirks commented on June 26, 2024

A somewhat unrelated question: the output consists of hapog.changes and hapog.fasta. I thought the genome might be separated into haplotigs. I see that hapog.changes has 65823 lines, of which 183 are homo and 65640 are hetero. (By the way, I polished with NextPolish beforehand, as recommended in the paper.) But the FASTA file looks like a single unphased genome. Is there a way to separate the haplotigs? Otherwise, what does the hapog.fasta file represent?

bistace commented on June 26, 2024

Hapo-G is polishing software, which means that it corrects the errors that remain after a genome assembly. The main selling point of Hapo-G is that it can correct phasing errors in the contigs, but it is not an assembler: it will not try to add any new sequences.

As an example, let's consider the following heterozygous genome:

maternal hap.: ACCGTTA
paternal hap.: ATCGTGA

We can see that they differ by two bases: a C in 2nd position is associated with a T in 6th position in the maternal haplotype, while a T in 2nd position is associated with a G in 6th position in the paternal haplotype.
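As a small illustration of the example above, the differing positions can be listed with a few lines of Python. This is only a toy comparison of the two example strings; the function name `heterozygous_sites` is made up for this sketch and is not part of Hapo-G.

```python
maternal = "ACCGTTA"
paternal = "ATCGTGA"


def heterozygous_sites(hap_a, hap_b):
    """Return (1-based position, base in hap_a, base in hap_b) for each difference."""
    if len(hap_a) != len(hap_b):
        raise ValueError("this toy comparison assumes equal-length haplotypes")
    return [
        (pos, a, b)
        for pos, (a, b) in enumerate(zip(hap_a, hap_b), start=1)
        if a != b
    ]


# Lists the two heterozygous sites of the example: positions 2 and 6.
print(heterozygous_sites(maternal, paternal))
```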

Now there are five cases:

  1. the genome assembler did a perfect job and output both haplotypes without any errors. We do not have anything to do in this case, and both sequences will be written unchanged to the hapog.fasta file.
  2. the genome assembler output only one of the two haplotypes but the phasing is correct. As an example, it could have output the maternal haplotype with a C in 2nd position and a T in 6th. Again, in this case we do not have to change anything, and the input sequence will be written unchanged to the hapog.fasta file.
  3. the genome assembler output only one of the two haplotypes but this time the phasing is not correct. As an example, a C in 2nd position but a G in 6th. In this case, we will replace the G by a T, as the maternal haplotype was the one we encountered first in the sequence. It is this corrected sequence that will be written to hapog.fasta; the original sequence is discarded.
  4. the genome assembler output both haplotypes but with phasing errors. In this case, we try to correct these issues and write the corrected sequences to the hapog.fasta file.
  5. the genome assembler output any number of sequences and there are base errors (insertions, deletions, etc.) that are not necessarily related to phasing errors. We try to fix them and write the corrected sequences to the hapog.fasta file.
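Case 3 can be sketched as a toy function. This is not Hapo-G's algorithm: Hapo-G works from read alignments, whereas this sketch assumes the two true haplotypes are already known and of equal length. The name `rephase_to_first_haplotype` is hypothetical.

```python
def rephase_to_first_haplotype(seq, hap_a, hap_b):
    """Make every heterozygous site of `seq` agree with the haplotype that
    `seq` matches at its first heterozygous site (toy version of case 3)."""
    if not (len(seq) == len(hap_a) == len(hap_b)):
        raise ValueError("toy example assumes equal-length sequences")
    # 0-based indices where the two haplotypes differ.
    het = [i for i, (a, b) in enumerate(zip(hap_a, hap_b)) if a != b]
    chosen = None
    for i in het:
        if seq[i] == hap_a[i]:
            chosen = hap_a
            break
        if seq[i] == hap_b[i]:
            chosen = hap_b
            break
    if chosen is None:
        return seq  # no site matches either haplotype: leave the input as-is
    corrected = list(seq)
    for i in het:
        corrected[i] = chosen[i]
    return "".join(corrected)


maternal = "ACCGTTA"
paternal = "ATCGTGA"
# C in 2nd position but G in 6th: the G is replaced by a T.
print(rephase_to_first_haplotype("ACCGTGA", maternal, paternal))
```

A correctly phased input (case 2) passes through this function unchanged, which mirrors the behaviour described in the list.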

In your case, there were 65640 changes impacting heterozygous sites (these could be cases 3, 4 or 5) in your assembly and 183 impacting homozygous sites (case 5). If you want more information on the hapog.changes file, we will add a paragraph to the Readme, but in the meantime you can read #16.
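For anyone who wants to reproduce such counts, here is a minimal sketch. It assumes that each line of hapog.changes carries a whitespace-separated `homo` or `hetero` token, which matches how the counts were described in this thread but is not a confirmed description of the file format. The example lines below are invented for illustration and are not real Hapo-G output.

```python
from collections import Counter


def count_change_types(lines):
    """Count lines tagged 'hetero' or 'homo' (assumed whitespace-separated tokens)."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if "hetero" in fields:
            counts["hetero"] += 1
        elif "homo" in fields:
            counts["homo"] += 1
        elif fields:
            counts["other"] += 1  # non-empty line without a recognised tag
    return counts


# Hypothetical lines, only to exercise the function.
example = [
    "contig_1\t120\thetero",
    "contig_1\t431\thomo",
    "contig_2\t88\thetero",
]
counts = count_change_types(example)
print(counts["hetero"], counts["homo"])
```

On a real file, the same function can be fed an open file handle, for example `count_change_types(open("hapog.changes"))`. As a consistency check on the numbers quoted above, 183 + 65640 = 65823, the total line count reported.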

aldendirks commented on June 26, 2024

Thank you very much for this explanation, extremely helpful! So, since I used an assembler that did not attempt to separate out haplotigs (Flye on Nanopore long reads), I have a single, mixed "haplotype" sequence that is the result of collapsing heterozygous positions in the diploid genome. Hapo-G polishes the contigs with haplotype awareness resulting in – in my situation – 226 contigs that are a mix of "maternal" and "paternal" haplotypes. Is this correct? Can I assume that the sequences in each contig (mostly?) belong to the same haplotype, but from contig to contig I cannot know if Hapo-G chose to phase according to the "maternal" or the "paternal" haplotype (since we have no ability in the first place to know which haplotype a read belongs to)? Therefore, in my case, the final polished genome is a product of both haplotypes, with any given position being correct but belonging to the maternal or paternal haplotype, and not knowing which one.

bistace commented on June 26, 2024

> Thank you very much for this explanation, extremely helpful!

You're welcome, that was a pleasure!

> So, since I used an assembler that did not attempt to separate out haplotigs (Flye on Nanopore long reads), I have a single, mixed "haplotype" sequence that is the result of collapsing heterozygous positions in the diploid genome. Hapo-G polishes the contigs with haplotype awareness resulting in – in my situation – 226 contigs that are a mix of "maternal" and "paternal" haplotypes. Is this correct?

Yes, this is correct.

> Can I assume that the sequences in each contig (mostly?) belong to the same haplotype

That will depend on how heterozygous the genome is. If the heterozygous SNPs are separated by more than a read pair length (for Illumina paired-end reads, this is R1 size + R2 size, so around 300 bases), phasing will be difficult because we cannot chain reads of the same haplotype. If the heterozygous sites are separated by less than a read pair length, you can assume that large chunks of the genome will be correctly phased, with (or, in the best case, without) switches to the other haplotype.

In most cases, you will be in this situation:

Large chunk of hap1 - Homozygous region - Large chunk of hap2 - Homozygous region - Large chunk of hap1 ... etc
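The spacing rule described above can be sketched as follows. This is a simplified illustration, not Hapo-G's implementation: it only groups heterozygous positions into blocks whose neighbours are closer than a read pair span, using the approximate 300-base figure mentioned above as the default. The function name and the example positions are made up.

```python
def phase_blocks(het_positions, read_pair_span=300):
    """Group heterozygous positions into blocks in which consecutive sites are
    closer than `read_pair_span`, so a single read pair could link them."""
    blocks = []
    for pos in sorted(het_positions):
        if blocks and pos - blocks[-1][-1] < read_pair_span:
            blocks[-1].append(pos)  # close enough to chain with the previous site
        else:
            blocks.append([pos])  # gap too large: start a new block
    return blocks


# Hypothetical positions along one contig.
positions = [100, 250, 380, 1200, 1350, 5000]
print(phase_blocks(positions))  # [[100, 250, 380], [1200, 1350], [5000]]
```

Each block corresponds to a stretch that can be phased consistently; between blocks, the chosen haplotype may switch.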

If you have not already done so, I recommend that you run GenomeScope to get an estimate of the heterozygosity rate of the genome.

> but from contig to contig I cannot know if Hapo-G chose to phase according to the "maternal" or the "paternal" haplotype (since we have no ability in the first place to know which haplotype a read belongs to)? Therefore, in my case, the final polished genome is a product of both haplotypes, with any given position being correct but belonging to the maternal or paternal haplotype, and not knowing which one.

Yes, you are again correct.

aldendirks commented on June 26, 2024

Thanks again, very helpful!
