institut-de-genomique / hapo-g

Home Page: http://www.genoscope.cns.fr/hapog/

License: Other


hapo-g's Introduction

Hapo-G - Haplotype-Aware Polishing of Genomes

Hapo-G (pronounced like apogee) is a tool that aims to improve the quality of genome assemblies by polishing the consensus with accurate reads.

Publication: link

If you run into trouble using or installing the software, please open an issue by clicking here.

Dependencies

Hapo-G depends on some software and libraries:

  • GCC and G++ (Hapo-G has been tested with GCC 4.9.2 and GCC 7.3.0)
  • Autoconf version 2.69 or later (to build HTSlib)
  • Python3 (minimum version 3.6)
  • HTSlib
  • BioPython
  • BWA
  • Samtools

Installation

Installation with conda

conda create -n hapog
conda activate hapog
conda install -c bioconda hapog

Installing from Github

First, clone this repository:

git clone https://github.com/institut-de-genomique/HAPO-G hapog

If htslib is already installed on your system, go to the next section, Build with existing htslib. If you want Hapo-G to download and install htslib for you, go to Build with a new htslib.

Build with existing htslib

Building with an existing htslib ensures that Hapo-G and Samtools are using the same version of the library and should reduce compatibility issues. To build with an existing htslib, do:

cd hapog
bash build.sh -l path_to_htslib

If samtools is already installed on your system at /home/user/samtools, htslib is probably installed at /home/user/samtools/htslib.

Build with a new htslib

Hapo-G can download and compile htslib for you. To do so, run:

cd hapog
bash build.sh

If everything went as expected, a Hapo-G binary was created in build/ and a symlink to it in the bin/ folder.

Using Hapo-G

Before running Hapo-G, you should make sure that BWA and Samtools are in your $PATH:

which bwa
which samtools

Standard pipeline

You can launch Hapo-G by using the Python3 script in its root directory:

python3 HAPOG_ROOT/hapog.py \
  --genome assembly.fasta \   # Fasta file of the genome to polish
  --pe1 R1.fastq.gz \         # Illumina R1 reads in .fastq or .fastq.gz, can be given multiple times
  --pe2 R2.fastq.gz \         # Illumina R2 reads in .fastq or .fastq.gz, can be given multiple times
  -o polishing \              # Output directory
  -t 36 \                     # Number of threads to use
  -u                          # Include unpolished sequences in the output

NOTE: If you installed Hapo-G using conda, you can invoke it by directly running hapog.

Skipping the mapping step

The mapping step can be skipped if a sorted BAM file is provided via the -b switch. Please verify that your fasta headers don't contain any non-alphanumeric characters (- and _ are accepted) before launching Hapo-G.

IMPORTANT: The BAM file should not contain secondary alignments; these could lead to non-ACTG characters being introduced in the consensus sequence. You can use Minimap2's --secondary=no option to produce a SAM file without secondary alignments.
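As a quick pre-flight check before using -b, a short script along these lines can flag offending fasta headers. This is a sketch, not part of Hapo-G; the helper name check_fasta_headers is our own:

```python
import re

def check_fasta_headers(path):
    """Return fasta headers containing characters other than
    alphanumerics, '-' and '_' (the characters Hapo-G accepts)."""
    offending = []
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                name = line[1:].strip()
                if re.search(r"[^A-Za-z0-9_-]", name):
                    offending.append(name)
    return offending
```

Any header it reports should be renamed before running with -b; note that the standard pipeline (without -b) renames such sequences itself, as its "Non alphanumeric characters detected in fasta headers. Renaming sequences." log line shows.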

A typical command line with a bam file would look like this:

python3 HAPOG_ROOT/hapog.py \
  --genome assembly.fasta \   # Fasta file of the genome to polish
  -b mapping.sorted.bam \     # Sorted BAM file
  -o polishing \              # Output directory
  -t 36 \                     # Number of threads to use
  -u                          # Include unpolished sequences in the output

Output files

hapog_results/hapog.fasta

The corrected sequences. Hapo-G parses the read alignments to the genome and focuses on phasing errors (i.e. the assembly switched from one haplotype to the other) and base errors (insertions, deletions, mismatches) that may or may not be related to phasing errors. Remember to include the -u flag to tell Hapo-G to also output sequences that had no reads mapped and thus could not be polished.

Hapo-G will not add any new contigs or scaffolds to the assembly if, for example, one of the haplotypes is missing from the input assembly file. Instead, it will correct the haplotype that is present in the input file and output a corrected version of the sequence that is phased as well as possible with the data at hand.

As an example, let’s consider the following heterozygous genome:

maternal hap.: ACCGTTA
paternal hap.: ATCGTGA

If the assembler outputted a version with one phasing error (the C in 2nd position is associated with a G in 6th position) and one deletion in 4th position:

assembly: ACC-TGA

Then, if Hapo-G was able to correct all the errors, the hapog.fasta file will contain:

hapog.fasta: ACCGTTA
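The toy example above can be replayed with a few lines of Python. This is purely illustrative (it is not Hapo-G's algorithm), just a position-by-position comparison of the assembly against the two haplotypes:

```python
def classify_positions(maternal, paternal, assembly):
    """For each position, report which haplotype the assembly base matches."""
    statuses = []
    for m, p, a in zip(maternal, paternal, assembly):
        if a == "-":
            statuses.append("deletion")
        elif a == m and a == p:
            statuses.append("both haplotypes")
        elif a == m:
            statuses.append("maternal allele")
        elif a == p:
            statuses.append("paternal allele")
        else:
            statuses.append("mismatch")
    return statuses

# The C at position 2 matches the maternal haplotype while the G at
# position 6 matches the paternal one: a phasing error, plus the
# deletion at position 4.
print(classify_positions("ACCGTTA", "ATCGTGA", "ACC-TGA"))
```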

hapog_results/hapog.changes

This tabulated file gives more information about what Hapo-G did to the input assembly. It has eight columns:

  • Name of the input sequence where the change was made
  • Position in the input sequence where the change was made
  • Nucleotide(s) at the position given in column 2
  • Nucleotide(s) that will replace the nucleotide(s) shown in column 3
  • Name of the read used as the current template. In the Hapo-G algorithm, we try to follow a read for as long as possible so as not to switch haplotypes; if an error is found in the template read, we switch to a different read of the same haplotype
  • hetero if the change is only present in one of the two possible haplotypes (i.e. a phasing error), homo if the change is present in both haplotypes
  • Ratio of reads from the same haplotype as the template read that validate the change
  • Ratio of reads from both haplotypes that validate the change

Here is an example:

Contig_1	1000	ref=A	read=TA	readname=read_2	hetero	ratio1=0.7419	ratio2=0.4237
Contig_1	2000	ref=T	read=G	readname=read_2	homo	ratio1=0.8142	ratio2=0.8323

We can see that on the contig Contig_1, Hapo-G found a phasing error (hetero on the first line) and replaced an A at position 1000 with TA. This change was validated by 74.19% of the reads of the same haplotype as the template read (ratio1) and by 42.37% of reads regardless of which haplotype they belong to (ratio2). Hapo-G also found a mismatch at position 2000 (homo) and replaced a T with a G. This change was validated by 83.23% of reads regardless of haplotype (ratio2).
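For downstream filtering (e.g. keeping only well-supported changes), each line of hapog.changes can be parsed with a few lines of Python. This is a sketch assuming the eight-column, whitespace-separated layout described above; the helper name parse_change is our own:

```python
def parse_change(line):
    """Split one hapog.changes line into its eight columns."""
    seq, pos, ref, read, readname, zygosity, r1, r2 = line.split()
    return {
        "sequence": seq,
        "position": int(pos),
        "ref": ref.split("=", 1)[1],          # e.g. "ref=A" -> "A"
        "read": read.split("=", 1)[1],
        "readname": readname.split("=", 1)[1],
        "zygosity": zygosity,                 # "hetero" or "homo"
        "ratio1": float(r1.split("=", 1)[1]),
        "ratio2": float(r2.split("=", 1)[1]),
    }

change = parse_change(
    "Contig_1\t1000\tref=A\tread=TA\treadname=read_2\thetero"
    "\tratio1=0.7419\tratio2=0.4237"
)
```

A change could then, for instance, be kept only when its ratio1 exceeds some threshold of your choosing.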

Acknowledgements

Some CMake files have been taken and/or modified from several projects, and we would like to thank their authors.

hapo-g's People

Contributors: bistace

hapo-g's Issues

Error with --threads 1 - persisting

Hi,

just found that #21 is still an issue in the latest version, in the case where sequence names contain non-alphanumeric characters. The reason is that when creating a simple symbolic link to the input fasta, that file doesn't have the sanitized sequence names, so you'll get something like [E::faidx_adjust_position] The sequence "Contig0" was not found in logs/hapog_chunks_1.e.

ERROR: Hapo-G didn't finish successfully, exit code: 127

Hello,

I am trying to use Hapo-G by feeding it a bam
python3.7 ../hapog/hapog.py --genome AAC_HP2.canu_r1_m3.RENAMED.fa -b AAC_HP2.canu_r1_m3.illumina.sorted.bam -o AAC_HP2_hg1 -t 6 -u

Everything appears to go well until the 'Launching Hapo-G on each chunk' step.
Looking at the logs, the hapo_chunks error files contain this message
"home/hapog/bin/hapog: error while loading shared libraries: libhts.so.2: cannot open shared object file: No such file or directory"

I see that this problem can come from a problem installing samtools; however, running samtools in the previous step (and I use it constantly) works fine, so it's only in this step that it cannot find the library.
And the chunks for each bam are there and appear ok

Seen this problem before?
Thanks

long read vs short read polishing

Hi,
I was reading through your paper while running a few rounds of hapo-g polishing and noticed that the paper and github only mention polishing with short reads, but there is an option for using long-read data in the hapog.py script. I was wondering how effective long-read polishing is with hapo-g?

How many times to polish an assembly

Dear author,

I am wondering if it's recommended to polish an assembly more than once consecutively to get the best results with Hapo-G? Or once is enough? In my case I assembled my plant diploid genome with Canu using Nanopore data sequenced on R9.4.1 pores and basecalled with Guppy 4.0.11; and I have ~250X assembly coverage with 150PE Illumina data for polishing with Hapo-G. Thank you.

Romulo

PackageNotFoundError

Hi,

I have installed Hapo-G through bioconda on our cluster using the instructions of the README and everything went just fine. However, as soon as I try Hapo-G using hapog or hapog --help, I get the following error:

hapog

I would really appreciate if you could help :)

Thank you!
Guillaume

ERROR: Hapo-G didn't finish successfully, exit code: -9

Hi, I am relatively new to bioinformatics and I was hoping to use your tool to improve my assembly draft. I have a 690MB draft assembly and raw reads of 260x coverage. The tool was running for more than 3 days and then it stopped, citing:

Command: hapog.py --genome Draft.fa --single /AC_SP6/PromethION.fq --threads 8
ERROR: Hapo-G didn't finish successfully, exit code: -9
Faulty command: miniconda3/envs/SLR/bin/build/hapog -b chunks_bam/chunks_2.bam -f chunks/chunks_2.fasta -o hapog_chunks/chunks_2.fasta -c hapog_chunks/chunks_2.changes

I am led to believe that this is related to an out-of-memory error the server gave for the run. I was running on 450 GB of memory and 16 cores; the command is below. I would like to know if there is a way to work around memory limits. If this issue is due to the size of the data fed to the tool, would using a subset of the sequencing data fix it? And would it be possible to resume the run instead of starting over?

Thank you for your help

output, adding sequences from reads, and ratios?

Hello, and thank you for this nifty program; it's become essential in resolving a concerning issue with our assemblies. I was wondering if you could provide some insight into how HAPO-G makes decisions about re-incorporating read data? Below you'll find some insertions that have been introduced; I was wondering how reliable they are given the ratios, and what the ratios mean. Thank you.

Scaff_3	493505	ref=A	read=ataA	readname=A00406:87:HFW2HDSXY:3:1360:1515:14231	hetero	ratio1=0.7419	ratio2=0.4237
Scaff_3	918557	ref=G	read=atttG	readname=A00406:87:HFW2HDSXY:3:2556:25852:32142	hetero	ratio1=0.8154	ratio2=0.5510
Scaff_3	928348	ref=A	read=attccA	readname=A00406:87:HFW2HDSXY:3:1674:28185:6277	hetero	ratio1=0.4722	ratio2=0.2250

Explanation of hapog.changes

Could you please explain the changes in the hapog.changes file? The Hapo-G version is 1.3.3.

case1
contig_1 2112 ref=T read=C readname=xxx homo ratio1=1.0000 ratio2=1.0000
Does case1 mean C is homo among reads and different from ref?

case2
contig_1 93010 ref=T read=- readname=xxx hetero ratio1=0.2692 ratio2=0.2692
Does case2 mean - is hetero among reads, say -/T ?

case3
contig_1 119614 ref=G read=gG readname=xxx homo ratio1=1.0000 ratio2=1.0000
What does low-case g mean in the case3?

Thanks

Question: how HAPO-G handles N in reads?

Hi HAPO-G team,

Thank you very much for developing this tool. I am just wondering how HAPO-G handles Ns in the reads? I know nextpolish would introduce Ns into the assembly after polishing if there are any Ns in the reads, https://nextpolish.readthedocs.io/en/latest/FAQ.html#why-does-the-contig-n50-of-polished-genome-become-shorter-or-why-does-the-polished-genome-contains-some-extra-n. What will happen if there are Ns in reads after HAPO-G polishing?

Thank you very much and look forward to your explanation.

Best wishes,
Yutang

TypeError issue

Hello,

I'm receiving this error when running Hapo-G:

Traceback (most recent call last):
  File "/home/adirks/apps/hapog/hapog.py", line 208, in <module>
    pipeline.launch_hapog(args.hapog_bin, args.hapog_threads)
  File "/home/adirks/apps/hapog/hapog/pipeline.py", line 202, in launch_hapog
    while len([p for p in procs if p.poll() is None]) >= parallel_jobs:
TypeError: '>=' not supported between instances of 'int' and 'str'

Any help would be much appreciated!

conda running code

Hi,

I installed HAPO-G in conda
Which command should I use to run it?

With this code:
module load python
source activate assembly-Y
python3 HAPOG_ROOT/hapog.py --genome /users/PHS0338/jpac1984/data/SuperExt.scaff.fasta \ # Fasta file of the genome to polish
-b ../MaSuRCA-4.0.5/TGI_myse/SuperExt.scaff.fasta.alignSorted.bam -o myse -t 48

I got the following log:
python3: can't open file 'HAPOG_ROOT/hapog.py': [Errno 2] No such file or directory
/var/spool/slurmd/job5228454/slurm_script: line 12: -b: command not found

Thanks;

ERRORs when using newest version on conda

Hi,

When I use the newest HAPO-G version on conda, I get the following errors:
Traceback (most recent call last):
  File "/Bio/User/kxie/software/anaconda3/envs/hapog/bin/hapog", line 33, in <module>
    sys.exit(load_entry_point('hapog==1.3.4', 'console_scripts', 'hapog')())
  File "/Bio/User/kxie/software/anaconda3/envs/hapog/lib/python3.10/site-packages/hapog/cli.py", line 190, in main
    pipeline.launch_hapog(args.hapog_bin)
  File "/Bio/User/kxie/software/anaconda3/envs/hapog/lib/python3.10/site-packages/hapog/pipeline.py", line 172, in launch_hapog
    subprocess.Popen(
  File "/Bio/User/kxie/software/anaconda3/envs/hapog/lib/python3.10/subprocess.py", line 969, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/Bio/User/kxie/software/anaconda3/envs/hapog/lib/python3.10/subprocess.py", line 1845, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: '/Bio/User/kxie/software/anaconda3/envs/hapog/lib/python3.10/site-packages/hapog_build/hapog'

Do you know why?

Best,
Kun

error using the -b option

When I run the following command line

hapog.py --genome reference.fasta -b alignment.bam -o HapoG -t 16 -u

I get the following error

Checking dependencies...
	Found BWA.
	Found Samtools.

Indexing the BAM file...

ERROR: Couldn't index bam file
Command '['samtools', 'index', 'bam/aln.sorted.bam']' returned non-zero exit status 127.

The HapoG directory is created, the HapoG/bam/aln.sorted.bam symbolic link exists. What should I change?

Conda install fails

Hello,
I get the following error when I use conda install. Can you help me solve it? Thanks!

UnsatisfiableError: The following specifications were found to be incompatible with your system:

  • feature:/linux-64::__glibc==2.35=0
  • hapog -> libgcc-ng[version='>=9.3.0'] -> __glibc[version='>=2.17']

Compiling Error

I received various errors about not finding pthread, curl, and crypto references. I think your CMakeLists.txt can be improved by adding:

target_link_libraries(hapog pthread)
target_link_libraries(hapog curl)
target_link_libraries(hapog crypto)

I did this and was able to get it to compile successfully.

Just to confirm whether I get the right output files

Hello, I ran HAPO-G yesterday for polishing, and it looks like it finished successfully (I got the message at the end: "Thanks for using Hapo-G, have a great day :-)"). But I don't quite understand the output files; I attach a picture below to show. I don't see the polished fasta file. If assembly.fasta is the correct file, its size seems a bit small (and I think this is a soft link to the original assembly, right? It appeared immediately when I started the run). So may I ask where the polished assembly is? Thanks!
Screenshot 2022-02-28 at 12 59 31

How do I get a vcf file?

Hi,

I need the vcf file produced by HAPO-G.
Does HAPO-G have the ability to produce a VCF-like file that contains information about which bases have changed?

Error with --threads 1

Hi!
Found an error while writing a galaxy tool for Hapo-G:

ln: failed to create symbolic link 'chunks/chunks_1.fasta': No such file or directory
ln: failed to create symbolic link 'chunks_bam/chunks_1.bam': No such file or directory

This happens when running with the option --threads 1, any other value or omitting the option works well (default value seems to be 8)

I'll set to 2 by default in galaxy, but I guess it would be good to fix it at some point

hapog introduce non-codon character in the polished sequence

Hi, I have used hapo-g for a while now, and recently found that, for two of my assemblies, hapo-g added '=' characters to the genome sequence.

I didn't notice it until I used an annotation lift-over tool to do the annotation transfer, and the tool complained about the non-nucleotide characters. So I checked the hapog.changes file, and I found some lines like the ones below:

Chr2_RagTag     4520897 ref=G   read=C  readname=m64045_210203_183949/19792785/ccs      hetero  ratio1=0.4444   ratio2=0.2660
Chr2_RagTag     4521143 ref=T   read==  readname=m64045_210203_183949/161679341/ccs     hetero  ratio1=0.1429   ratio2=0.2840
Chr2_RagTag     4521158 ref=C   read==  readname=m64045_210203_183949/161679341/ccs     hetero  ratio1=0.1111   ratio2=0.2840
Chr2_RagTag     4521179 ref=G   read==  readname=m64045_210203_183949/161679341/ccs     hetero  ratio1=0.1429   ratio2=0.2740
Chr2_RagTag     4521193 ref=T   read==  readname=m64045_210203_183949/161679341/ccs     hetero  ratio1=0.1429   ratio2=0.2720
Chr2_RagTag     4521219 ref=A   read==  readname=m64045_210203_183949/161679341/ccs     hetero  ratio1=0.1250   ratio2=0.3000
Chr2_RagTag     4521267 ref=C   read==  readname=m64045_210203_183949/161679341/ccs     hetero  ratio1=0.1429   ratio2=0.2800
Chr2_RagTag     4521292 ref=A   read==  readname=m64045_210203_183949/161679341/ccs     hetero  ratio1=0.1429   ratio2=0.2660
Chr2_RagTag     4521331 ref=A   read==  readname=m64045_210203_183949/161679341/ccs     hetero  ratio1=0.1250   ratio2=0.2520
Chr2_RagTag     4521347 ref=A   read==  readname=m64045_210203_183949/161679341/ccs     hetero  ratio1=0.1429   ratio2=0.2680
Chr2_RagTag     4521406 ref=T   read==  readname=m64045_210203_183949/161679341/ccs     hetero  ratio1=0.1429   ratio2=0.2700
Chr2_RagTag     4521415 ref=T   read==  readname=m64045_210203_183949/161679341/ccs     hetero  ratio1=0.1429   ratio2=0.2440
Chr2_RagTag     4521656 ref=A   read=G  readname=A00478:143:HY7CFDRXX:2:2252:6705:34648 hetero  ratio1=0.5000   ratio2=0.3580
Chr2_RagTag     4521672 ref=C   read=G  readname=A00478:143:HY7CFDRXX:2:2219:11894:20744        hetero  ratio1=0.7692   ratio2=0.2260
Chr2_RagTag     4521869 ref=A   read=C  readname=m64045_210203_183949/120783781/ccs     hetero  ratio1=0.0500   ratio2=0.2360
Chr2_RagTag     4522031 ref=G   read=T  readname=A00478:143:HY7CFDRXX:2:2175:6623:6230  hetero  ratio1=0.1250   ratio2=0.2600

May I ask what I should do about it? Thanks!

Error in bwa mem and samtools sort, return code: 256

Hello,

I am having some issues running HAPO-G. I have tried both the conda installation and compiling it separately, and in both cases I get the same error when running the pipeline.

Launching mapping on genome...
Error in bwa mem and samtools sort, return code: 256
Faulty command: bash -c 'bwa mem -t 36 assembly.fasta /path/to/read_R1 /path/to/read_R2 2> logs/bwa_mem.e | samtools sort -m 5G -@ 36 -o bam/aln.sorted.bam - 2> logs/samtools_sort.e'

When I look at the log files, it looks like an issue arose in bwa_mem.e:

[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 2434218 sequences (360000259 bp)...
[M::process] read 2432188 sequences (360000049 bp)...
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (20, 600263, 17, 19)
[M::mem_pestat] analyzing insert size distribution for orientation FF...
[M::mem_pestat] (25, 50, 75) percentile: (133, 253, 595)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 1519)
[M::mem_pestat] mean and std.dev: (315.11, 307.58)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 1981)
[M::mem_pestat] analyzing insert size distribution for orientation FR...
[M::mem_pestat] (25, 50, 75) percentile: (185, 234, 290)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 500)
[M::mem_pestat] mean and std.dev: (239.67, 78.97)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 605)
[M::mem_pestat] analyzing insert size distribution for orientation RF...
[M::mem_pestat] (25, 50, 75) percentile: (61, 100, 6386)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 19036)
[M::mem_pestat] mean and std.dev: (1621.82, 2653.72)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 25361)
[M::mem_pestat] analyzing insert size distribution for orientation RR...
[M::mem_pestat] (25, 50, 75) percentile: (159, 181, 443)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 1011)
[M::mem_pestat] mean and std.dev: (324.58, 259.23)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 1362)
[M::mem_pestat] skip orientation FF
[M::mem_pestat] skip orientation RF
[M::mem_pestat] skip orientation RR
[M::mem_process_seqs] Processed 2434218 reads in 446.364 CPU sec, 13.348 real sec
[fputs] Broken pipe

I am unsure exactly how to proceed. Do you have any suggestions?

ERROR: Hapo-G didn't finish successfully, exit code: -8

Hi,
Ran Hapo-G successfully twice; on the third round I get this error.

hapog.py \
	--genome /home/jon/Working_Files/hapo-g/round_2/hapog_results/hapog.fasta \
	--single /home/jon/Working_Files/sea_cuke_species_data/apostichopus_japonicus/apostichopus_japonicus_raw_genome_seq_data/pacbio_SRR6282347.1.fastq \
	--output /home/jon/Working_Files/hapo-g/round_3 \
	--threads 70 \
	-u

Checking dependencies...
	Found BWA.
	Found Samtools.

Non alphanumeric characters detected in fasta headers. Renaming sequences.

Launching mapping on genome...
Done in 2809 seconds

Indexing the BAM file...
Done in 286 seconds

Fragmenting the genome into 70 chunks of 11,489,433 bases (depending of scaffold sizes)
Done in 11 seconds

Extracting bam for each chunk
Done in 60 seconds

Launching Hapo-G on each chunk
ERROR: Hapo-G didn't finish successfully, exit code: -8
Faulty command: /home/jon/miniconda3/envs/hapog/bin/build/hapog -b chunks_bam/chunks_69.bam -f chunks/chunks_69.fasta -o hapog_chunks/chunks_69.fasta -c hapog_chunks/chunks_69.changes

Memory issue

Hi HAPO-G team,

I hope you are well and thank you very much again for your nice work.

Recently, while using hapog a lot to polish my assemblies, I met a problem that is the same as the issue described in #5. At the error correction step, bam files were processed in parallel with the same number of jobs as the number of threads set for read alignment. My machine has 60 CPUs and 900 Gb of memory; however, with 60 CPUs, hapog used more than 900 Gb of memory, which caused it to stop.

Maybe a way to avoid large memory usage is to not submit all the correction jobs in one go; for example, on my machine, first process the first 30 jobs and then the next 30. So, is it possible to add an option to hapog to control the number of bam files processed in parallel, to avoid large memory usage?

I am using Illumina paired-end short reads to correct ONT-based assemblies with the latest hapog.

Look forward to your reply and thank you very much.

Oh, by the way, I found that in the bwa mem command, the memory usage for samtools sort is set to 5 Gb, which might be too much for machines with limited memory.

Best wishes,
Yutang

exit code: -6

Hi,
I am trying to polish an ONT assembly with HiFi reads, but get this error

Launching Hapo-G on each chunk
ERROR: Hapo-G didn't finish successfully, exit code: -6
Faulty command: /path to hapog/build/hapog -b chunks_bam/chunks_1.bam -f chunks/chunks_1.fasta -o hapog_chunks/chunks_1.fasta -c hapog_chunks/chunks_1.changes

I am running hapo-g with the following command
python3 /path to hapog/hapog.py --genome genome.fasta --single hifi-reads.fastq -o 1 -t 16 -u
bwa and samtools are both installed and included in the PATH.

Thanks.

Question: how about polishing a phased diploid genome assembly?

Hi HAPO-G team,

Last night I started HAPO-G to polish a phased diploid genome assembly made from ONT reads, around 4.4 Gb in size, using 60x Illumina short reads; it finished successfully after just 16 hours. I have to say I really like HAPO-G; it amazed me with its nicely organized output folder structure and its speed.

However, I want to confirm something with you:

  • First, is it good practice to polish a diploid assembly with short reads from whole-genome sequencing? I polished the diploid assembly instead of each haplotype separately because I thought that when mapping reads to a phased diploid assembly, reads would be mapped to the right haplotype they came from, so that fewer phase switches would occur after polishing. What do you think?

  • I checked the hapog.changes file from my last run with the phased diploid assembly and found that most (93% = lines with homo / all lines) of the changes were classified as homo. I guess this suggests that my assembly is well phased (at least most alleles were separated) and that reads did map to the corresponding haplotype correctly. What do you think? By the way, for the homo variations, ratio1 and ratio2 are mostly 1. I guess the ratio means the allele frequency described in your paper, but why are there two ratios?

Thank you very much again for making this brilliant tool and look forward to your reply.

Best wishes,
Yutang

ModuleNotFoundError

I was trying to run hapo-g on a cluster on four nodes, each with 128GB of RAM. I installed hapog through conda as written in the Readme.md. However, it always gives me an error at the line

from Bio import SeqIO

ModuleNotFoundError: No module named 'Bio'

even though all the required libraries and modules are there.

What could be the possible reason for this and how should it be fixed?

Any input will be much appreciated!

Minimum coverage for correction?

Hi, sorry, I couldn't see this anywhere, but I just wanted to check: does HapoG take depth into account when doing correction? I have a Nanopore assembly of ~400mb from 40x coverage and short reads of around 20x coverage. I'm aware that the short reads aren't as high depth as commonly used, but unfortunately it was all that was available!

I tried Pilon, which reports when depth is too low to call a correction; I just wanted to check whether HapoG does a similar QC before correction?

Thanks, have to say this is an awesome program, the run-time savings alone are so nice!

Polish before or after Redundans

Hello,
I was wondering if I should use Hapog on my assembly before using Redundans or after using Redundans (which does gap closing and returns a homozygous assembly)?
Thank you.

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb1

Hi there,

I used Hapo-G on my genome, starting with a bam file created by bwa-mem2. I installed Hapo-G via conda. My command was:

hapog.py \
    --genome flye/assembly.fasta \
    -b flye/paired.mapping.bam \
    -o 1_hapo_g/ -t 5 -u 

The resulting file hapog_results/hapog.fasta mostly contains the polished sequences, but it has some nonsense lines at the ends of some contigs which break downstream analyses. For example:

GAAGTGCTCAAGGTCCCTTCTTTATACTCCACCACTCTCGTGTTTATCGTCCCGAACCTT
:
00887:409:HN3LMDRXY:2:2259:2293:9940TC!<90><B7><B2><F5>MU <B1><86>^A

or

GGATGCTCGTTTGTCCATTGTTTGTCCATCTAAG<B1><86>^A

The error message was:

Traceback (most recent call last):
  File "/home/OXFORDNANOLABS/lly/miniconda3/envs/hapog/bin/hapog.py", line 161, in <module>
    pipeline.include_unpolished(args.input_genome)
  File "/mmfs1/uslinuxhome/lly/miniconda3/envs/hapog/bin/lib/pipeline.py", line 246, in include_unpolished
    for line in open("hapog_results/hapog.fasta"):
  File "/home/OXFORDNANOLABS/lly/miniconda3/envs/hapog/lib/python3.10/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb1 in position 3574: invalid start byte
