Comments (12)
The error appeared when launching Hapo-G:
cat ~/logs/polish-phase-59191521.log
Checking dependencies...
Found bwa
Found samtools
Non alphanumeric characters detected in fasta headers. Renaming sequences.
Generating bwa index...
Done in 19 seconds
Launching mapping on genome...
Done in 244 seconds
Indexing the BAM file...
Done in 40 seconds
Fragmenting the genome into 36 chunks of 692,460 bases (depending of scaffold sizes)
Done in 0 seconds
Extracting bam for each chunk
Done in 52 seconds
Launching Hapo-G on each chunk
from hapo-g.
Hello,
I think that you must have downloaded Hapo-G just when I made changes. If you clone the master branch, everything should be fixed (I don't encounter this error on my side anymore).
from hapo-g.
Thank you. Do I need to delete the old version and rebuild, or is there a command that just updates it without a rebuild? Sorry, I'm not very familiar with git.
from hapo-g.
I installed it more recently than the changes made on GitHub, according to the timestamp. When I run git pull origin master, I get the message "Already up to date."
from hapo-g.
For good measure I decided to rerun it and I got the same error message.
from hapo-g.
I just reran it too and did not hit the issue. I have nevertheless pushed changes to try to fix the problem. Could you please run git pull and launch Hapo-G again?
from hapo-g.
It works! Thank you.
from hapo-g.
Something of an unrelated question: the output is hapog.changes and hapog.fasta. I thought the genome might be separated into haplotigs. I see that hapog.changes has 65823 lines, of which 183 are homo and 65640 are hetero. (By the way, I polished with NextPolish beforehand, as recommended in the paper.) But the FASTA file looks to be a single unphased genome. Is there a way to separate the haplotigs? Otherwise, what does the hapog.fasta file represent?
from hapo-g.
Hapo-G is polishing software, which means that it corrects the errors that remain after a genome assembly. The main selling point of Hapo-G is that it can correct phasing errors in the contigs, but it is not an assembler: it will not try to add any new sequences.
As an example, let's consider the following heterozygous genome:
maternal hap.: ACCGTTA
paternal hap.: ATCGTGA
We can see that they differ by two bases: a C in 2nd position is associated with a T in 6th position in the maternal haplotype, while a T in 2nd position is associated with a G in 6th position in the paternal haplotype.
Now there are five cases:
- the genome assembler did a perfect job and outputted both haplotypes without any errors. We do not have anything to do in this case, and both sequences will be outputted unchanged in the hapog.fasta file.
- the genome assembler outputted only one of the two haplotypes but the phasing is correct. As an example, it could have outputted the maternal haplotype with a C in 2nd position and a T in 6th. Again in this case, we do not have to change anything and the input sequence will be outputted without any change in the hapog.fasta file.
- the genome assembler outputted only one of the two haplotypes but this time the phasing is not correct. As an example, a C in 2nd position but a G in 6th. In this case, we will replace the G by a T, as the maternal haplotype was the one we encountered first in the sequence. It is this corrected sequence that will be outputted in hapog.fasta; the original sequence is discarded.
- the genome assembler outputted both haplotypes but with phasing errors. In this case, we try to correct these issues and output corrected sequences to the hapog.fasta file.
- the genome assembler outputted any number of sequences and there are base errors (insertions, deletions, etc.) that are not necessarily related to phasing errors. We try to fix them and output the corrected sequences to the hapog.fasta file.
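To make case 3 concrete, here is a minimal Python sketch of the toy example above. It is only an illustration and not Hapo-G's actual algorithm: Hapo-G works from aligned reads, whereas this sketch assumes both true haplotypes are already known. The function names het_sites and fix_phasing are hypothetical.

```python
MATERNAL = "ACCGTTA"
PATERNAL = "ATCGTGA"


def het_sites(hap_a, hap_b):
    """Return the 0-based positions where the two haplotypes differ."""
    return [i for i, (a, b) in enumerate(zip(hap_a, hap_b)) if a != b]


def fix_phasing(seq, hap_a, hap_b):
    """Toy correction: follow the haplotype matched at the first
    heterozygous site and apply its alleles at every other such site."""
    sites = het_sites(hap_a, hap_b)
    chosen = None
    for i in sites:
        if seq[i] == hap_a[i]:
            chosen = hap_a
            break
        if seq[i] == hap_b[i]:
            chosen = hap_b
            break
    if chosen is None:
        # No heterozygous site matches either haplotype: leave as is.
        return seq
    fixed = list(seq)
    for i in sites:
        fixed[i] = chosen[i]
    return "".join(fixed)


# The haplotypes differ at the 2nd and 6th bases (0-based indices 1 and 5).
print(het_sites(MATERNAL, PATERNAL))               # [1, 5]

# Case 3: C in 2nd position but G in 6th, so the G is replaced by a T.
print(fix_phasing("ACCGTGA", MATERNAL, PATERNAL))  # ACCGTTA
```

A correctly phased input (case 2) is returned unchanged by the same function.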
In your case, there were 65640 changes impacting heterozygous sites (could be cases 3, 4 or 5) in your assembly and 183 impacting homozygous sites (case 5). If you want more info on the hapog.changes file, we will add a paragraph in the Readme but in the meantime, you can read #16.
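For anyone who wants to reproduce this tally, the following Python sketch counts the two kinds of changes. It assumes that every line of hapog.changes carries either the token homo or the token hetero as a whitespace-separated field. That layout is an assumption made for illustration, so please check it against your own file; the example lines are invented placeholders, not real Hapo-G output.

```python
from collections import Counter


def count_change_types(lines):
    """Count lines tagged 'hetero' or 'homo' (assumed whitespace-separated tokens)."""
    counts = Counter()
    for line in lines:
        tokens = line.split()
        if "hetero" in tokens:
            counts["hetero"] += 1
        elif "homo" in tokens:
            counts["homo"] += 1
        elif tokens:
            # Non-empty line without either tag.
            counts["other"] += 1
    return counts


# Hypothetical placeholder lines, only to exercise the function.
example = [
    "ctg1 10 A T hetero",
    "ctg1 52 G C homo",
    "ctg2 7 T - hetero",
]
print(count_change_types(example))

# On a real run, pass the open file instead:
# with open("hapog.changes") as fh:
#     print(count_change_types(fh))
```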
from hapo-g.
Thank you very much for this explanation, extremely helpful! So, since I used an assembler that did not attempt to separate out haplotigs (Flye on Nanopore long reads), I have a single, mixed "haplotype" sequence that is the result of collapsing heterozygous positions in the diploid genome. Hapo-G polishes the contigs with haplotype awareness resulting in – in my situation – 226 contigs that are a mix of "maternal" and "paternal" haplotypes. Is this correct? Can I assume that the sequences in each contig (mostly?) belong to the same haplotype, but from contig to contig I cannot know if Hapo-G chose to phase according to the "maternal" or the "paternal" haplotype (since we have no ability in the first place to know which haplotype a read belongs to)? Therefore, in my case, the final polished genome is a product of both haplotypes, with any given position being correct but belonging to the maternal or paternal haplotype, and not knowing which one.
from hapo-g.
> Thank you very much for this explanation, extremely helpful!
You're welcome, it was a pleasure!
> So, since I used an assembler that did not attempt to separate out haplotigs (Flye on Nanopore long reads), I have a single, mixed "haplotype" sequence that is the result of collapsing heterozygous positions in the diploid genome. Hapo-G polishes the contigs with haplotype awareness resulting in – in my situation – 226 contigs that are a mix of "maternal" and "paternal" haplotypes. Is this correct?
Yes, this is correct.
> Can I assume that the sequences in each contig (mostly?) belong to the same haplotype
That will depend on how heterozygous the genome is. If the heterozygous SNPs are separated by more than a read pair length (in the case of Illumina paired-end reads, this is R1 size + R2 size, so around 300 bases), the phasing will be complex because we cannot chain reads of the same haplotype. If the heterozygous sites are separated by less than a read pair length, then you can assume that you will have large chunks of the genome correctly phased, with (or, in the best case, without) switches to the other haplotype.
In most cases, you will be in this situation:
Large chunk of hap1 - Homozygous region - Large chunk of hap2 - Homozygous region - Large chunk of hap1 ... etc
If you have not already done so, I recommend that you run GenomeScope to get an estimate of the heterozygosity rate of the genome.
> but from contig to contig I cannot know if Hapo-G chose to phase according to the "maternal" or the "paternal" haplotype (since we have no ability in the first place to know which haplotype a read belongs to)? Therefore, in my case, the final polished genome is a product of both haplotypes, with any given position being correct but belonging to the maternal or paternal haplotype, and not knowing which one.
Yes, you are again correct.
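The read pair argument above can be sketched in a few lines of Python. This is a simplified illustration and not what Hapo-G does internally: it groups heterozygous positions into blocks whenever consecutive sites are no more than one read pair length apart. The 300-base default comes from the R1 + R2 estimate above, while the function name phase_blocks and the example positions are made up.

```python
def phase_blocks(het_positions, max_span=300):
    """Group heterozygous positions into blocks that could be chained.

    Two consecutive sites fall in the same block when their distance is
    at most max_span (an approximation of the read pair length).
    """
    blocks = []
    for pos in sorted(het_positions):
        if blocks and pos - blocks[-1][-1] <= max_span:
            blocks[-1].append(pos)
        else:
            # Gap too large to bridge with one read pair: start a new block.
            blocks.append([pos])
    return blocks


# Invented positions: two gaps exceed 300 bases, so three blocks result.
sites = [100, 250, 380, 1200, 1350, 5000]
print(phase_blocks(sites))  # [[100, 250, 380], [1200, 1350], [5000]]
```

Under this simplification, each block is a stretch where phasing can be kept consistent, and each gap between blocks is a place where a switch to the other haplotype may occur.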
from hapo-g.
Thanks again, very helpful!
from hapo-g.