Hello dear developers, I am using Hapo-G for the first time to polis

Hello, thanks for your response, I found something like this: <div class="snippet-

High RAM consumption in plant genome about hapo-g HOT 10 CLOSED

enriquepola1996 commented on June 29, 2024

High RAM consumption in plant genome

from hapo-g.

Comments (10)

bistace commented on June 29, 2024 1

Hello,

the hapog.py script is just a big wrapper around the hapog binary that is written in C. It seems that the mapping step went as expected so it is possible to launch the hapog binary on each chunk individually and concatenate the results at the end.

If you installed HAPO-G via conda, you should find a binary called hapog_bin that you can launch on each chunk. If you compiled it from source, it is located in hapog_build/hapog. Here is something that should work (not tested so there may be some syntax errors):

set -euo pipefail

# Path to the hapog binary. For conda it is hapog_bin, 
# for manual installation it is the full path to hapog_build/hapog
HAPOG=REPLACE_ME

# Look in the hapog/chunks_bam output directory
# Indicate how many chunks there are
CHUNKS=REPLACE_ME

# The output directory of the run that failed due to RAM
# We want to use the mappings that were generated
cd YOUR_PREVIOUS_HAPOG_OUTPUT_DIRECTORY

for chunk in {0..${CHUNKS}}
do
    ${HAPOG} -b chunks_bam/${chunk}.bam \
        -f chunks/${chunk}.fasta \
        -o hapog_chunks/${chunk}.fasta \
        -c hapog_chunks/${chunk}.changes \
        1> logs/hapog_${chunk}.o 2> logs/hapog_${chunk}.e
done

# Everything went as expected, concatenate the results
mkdir -p hapog_results
cat hapog_chunks/*.fasta > hapog_results/hapog.fasta
cat hapog_chunks/*.changes > hapog_results/hapog.changes

Do not hesitate to come back to me if you encounter any problem or to say if that worked!

from hapo-g.

enriquepola1996 commented on June 29, 2024 1

Hello, thank you again very much for your support and yes, I polished a softmask genome. I will recover the missing scaffolds as indicated.

My assembly statistics are as follows:

Non polish genome
Statistics without reference	
# contigs	11948
# contigs (>= 0 bp)	11948
# contigs (>= 1000 bp)	11815
Largest contig	6559239
Total length	2675680810
Total length (>= 0 bp)	2675680810
Total length (>= 1000 bp)	2675578986
N50	581467
N90	109645
auN	885971
L50	1210
L90	5264
GC (%)	38.18
Mismatches	
# N's per 100 kbp	0
# N's	0


Polish genome (one round with Hapo-G)
Statistics without reference	
# contigs	10303
# contigs (>= 0 bp)	10303
# contigs (>= 1000 bp)	10197
Largest contig	6560227
Total length	2604764496
Total length (>= 0 bp)	2604764496
Total length (>= 1000 bp)	2604683834
N50	609613
N90	130055
auN	908215
L50	1151
L90	4748
GC (%)	38.35
Mismatches	
# N's per 100 kbp	0
# N's	65

It is certainly quite fragmented.

from hapo-g.

bistace commented on June 29, 2024 1

Thank you for reporting back!

from hapo-g.

enriquepola1996 commented on June 29, 2024

Hello, thanks for your response, I found something like this:

[epola@server hapog]$ ls
LICENSE.md README.md bin build.sh conda_files hapog hapog.py hapog_build requirements.txt setup.cfg setup.py src

[epola@server hapog]$ cd hapog_build/
[epola@server hapog_build]$ ls
CMakeCache.txt CMakeFiles Makefile cmake_install.cmake hapog

This correct?

My data:

[epola@server Hapo-G]$ ls
assembly.fasta  bam  chunks  chunks_bam  cmds  hapog_chunks  logs
[epola@server Hapo-G]$ ls chunks_bam/
chunks_1.bam   chunks_12.bam  chunks_15.bam  chunks_18.bam  chunks_20.bam  chunks_5.bam  chunks_8.bam
chunks_10.bam  chunks_13.bam  chunks_16.bam  chunks_19.bam  chunks_3.bam   chunks_6.bam  chunks_9.bam
chunks_11.bam  chunks_14.bam  chunks_17.bam  chunks_2.bam   chunks_4.bam   chunks_7.bam

I was thinking of launching something like this:

#!/bin/bash
#PBS -N HapoG_Chunks
#PBS -l walltime=700:00:00
#PBS -l nodes=1:ppn=20,vmem=700gb
#PBS -o hapog_chunks_output.log
#PBS -e hapog_chunks_error.log
#PBS -q ensam

# Load the necessary modules
module load hapog/1.3.7
module load samtools/1.9
module load bwa/0.7.17

# Change to the working directory
cd $PBS_O_WORKDIR

set -euo pipefail

# Path to the hapog binary
HAPOG=/data/software/hapog/hapog_build/hapog

#Indicate how many chunks there are
CHUNKS=20

# The output directory of the run that failed due to RAM
# We want to use the mappings that were generated
cd  user/Hapo-G


for chunk in $(seq 1 ${CHUNKS})
do
 ${HAPOG} -b chunks_bam/chunks_${chunk}.bam \
 -f chunks/chunks_${chunk}.fasta \
 -o hapog_chunks/chunks_${chunk}.fasta \
 -c hapog_chunks/chunks_${chunk}.changes \
 1> logs/hapog_${chunk}.o 2> logs/hapog_${chunk}.e
donated

# Everything went as expected, concatenate the results
mkdir -p hapog_results
cat hapog_chunks/*.fasta > hapog_results/hapog.fasta
cat hapog_chunks/*.changes > hapog_results/hapog.changes

What do you think? Thank so much for your help.

from hapo-g.

enriquepola1996 commented on June 29, 2024

Should I delete the hapog_chunks folder before running it?

from hapo-g.

bistace commented on June 29, 2024

Everything looks good, you can keep the folder as it will not create it by itself.

from hapo-g.

enriquepola1996 commented on June 29, 2024

Hello,

Thank you very much for the support, my process has ended successfully. I just have a couple of questions, is there a problem if I do the polishing on a previously masked genome? I'm seeing that the hapog.fasta result is an unmasked genome, it changed. I have masked the genome again, but I just want to make sure there is no problem putting a masked genome in for polishing. I saw that the number of scaffolds decreased by 10% and the N50 increased slightly.
I would also like to see what is the best way to see if the polishing was successful, what metrics could give me a signal.

I greatly appreciated your responses and again thank you very much for providing solutions to users.

from hapo-g.

enriquepola1996 commented on June 29, 2024

Also the other question I have, is it advisable to do several rounds of polishing with Hapo-G?

from hapo-g.

bistace commented on June 29, 2024

Hello,

by masked genome, do you mean that the masked bases are in lower-case or Ns? If it is the former then yes, HAPO-G outputs everything as upper-case. In the case of Ns, then it should not have changed anything.

Regarding the number of scaffolds decreasing, as you are not using the wrapper hapog.py script, the unpolished sequence are not written to the resulting fasta file. These unpolished sequences are the ones that had no alignments in the BAM file. To get them back, you would need to see which ones are missing from your original fasta file and copy them to the results. This is done like this in the wrapper (the genome parameter is your original assembly fasta file path):

def include_unpolished(genome):
    initial_contig_names = set()
    if os.path.exists("correspondance.txt"):
        with open("correspondance.txt") as corr:
            for line in corr:
                _, initial = line.strip("\n").split("\t")
                initial_contig_names.add(initial)
    else:
        with open(genome) as genome_file:
            for line in genome_file:
                if line.startswith(">"):
                    initial_contig_names.add(line[1:].strip("\n"))

    polished_contig_names = set()
    with open("hapog_results/hapog.fasta") as fasta:
        for line in fasta:
            if line.startswith(">"):
                contig_name = line[1:].strip("\n").replace("_polished", "")
                polished_contig_names.add(contig_name)

    with open("hapog_results/hapog.fasta", "a") as out:
        genome_file = open(genome)
        for record in SeqIO.parse(genome_file, "fasta"):
            if record.description.replace("_polished", "") not in polished_contig_names:
                out.write(record.format("fasta"))
        genome_file.close()

If we deal with Nanopore data, we routinely do one round of medaka polishing with Nanopore reads and two rounds of HAPO-G with Illumina reads.

I am a bit curious about your genome assembly stats as we never reach these high RAM levels, even on very large genomes. Could you please tell me what are the N50 and maximum size of your sequences?

from hapo-g.

enriquepola1996 commented on June 29, 2024

Thank you very much, the program ran well, you can close this issue.

from hapo-g.

High RAM consumption in plant genome about hapo-g HOT 10 CLOSED

Comments (10)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent