
lbcb-sci / herro


HERRO is a highly accurate, haplotype-aware deep learning tool for error correction of Nanopore R10.4.1 or R9.4.1 reads (a read length of >= 10 kbp is recommended).

License: Other

Rust 90.85% Shell 4.81% Python 2.94% Dockerfile 1.40%

herro's People

Contributors

andrewzhang217, dehui333, dominikstanojevic, jelber2, msikic, nkkarpov


herro's Issues

Memory errors?

Hi,

I'm trying to run HERRO using an A100 card, and it's not clear why it is running out of memory. I was running with -b 128, so I've dropped that down.

I'm running with singularity. Any suggestions?

Thanks

[00:01:26] Processing 1/? batch _
[>---------------------------------------] 93/90774 [W manager.cpp:340] Warning: FALLBACK path has been taken inside: runCudaFusionGroup. This is an indication that codegen Failed for some reason.
To debug try disable codegen fallback path via setting the env variable export PYTORCH_NVFUSER_DISABLE=fallback
(function runCudaFusionGroup)
thread '<unnamed>' panicked at src/inference.rs:172:64:
called Result::unwrap() on an Err value: Torch("The following operation failed in the TorchScript interpreter.\nTraceback of TorchScript (most recent call last):\nRuntimeError: The following operation failed in the TorchScript interpreter.\nTraceback of TorchScript, serialized code (most recent call last):\n File "code/torch/model.py", line 36, in fallback_cuda_fuser\n x0 = torch.permute(x, [0, 3, 1, 2])\n qn = self.qn\n sliced_sequences_concatenated = (qn).forward(x0, target_positions, lengths, )\n ~~~~~~~~~~~ <--- HERE\n fc2 = self.fc2\n _1 = (fc2).forward(sliced_sequences_concatenated, )\n File "code/torch/transformer.py", line 16, in forward\n _0 = torch.torch.nn.utils.rnn.pad_sequence\n context_read = self.context_read\n x0 = (context_read).forward(x, )\n ~~~~~~~~~~~~~~~~~~~~~ <--- HERE\n context_pos = self.context_pos\n x1 = (context_pos).forward(x0, )\n File "code/torch/torch/nn/modules/container.py", line 15, in forward\n _2 = getattr(self, "2")\n input0 = (_0).forward(input, )\n input1 = (_1).forward(input0, )\n ~~~~~~~~~~~ <--- HERE\n return (_2).forward(input1, )\n def len(self: torch.torch.nn.modules.container.Sequential) -> int:\n File "code/torch/torch/nn/modules/batchnorm.py", line 35, in forward\n weight = self.weight\n bias = self.bias\n _3 = _0(input, running_mean, running_var, weight, bias, bn_training, 0.10000000000000001, 1.0000000000000001e-05, )\n ~~ <--- HERE\n return _3\n def _check_input_dim(self: torch.torch.nn.modules.batchnorm.BatchNorm2d,\n File "code/torch/torch/nn/functional.py", line 52, in batch_norm\n else:\n pass\n _6 = torch.batch_norm(input, weight, bias, running_mean, running_var, training, momentum, eps, True)\n ~~~~~~~~~~~~~~~~ <--- HERE\n return _6\ndef relu(input: Tensor,\n\nTraceback of TorchScript, original code (most recent call last):\n File "/raid/scratch/stanojevicd/projects/haec-BigBird/model.py", line 157, in fallback_cuda_fuser\n sliced_sequences_concatenated = torch.cat(encoded)'''\n x = x.permute((0, 3, 1, 2))\n sliced_sequences_concatenated = self.qn(x, target_positions, lengths)\n ~~~~~~~ <--- HERE\n \n # list of tensors of shape (selected_token_number, 1) -> (selected_token_number)\n File "/raid/scratch/stanojevicd/projects/haec-BigBird/transformer.py", line 36, in forward\n def forward(self, x: Tensor, target_positions: List[Tensor],\n lengths: Tensor) -> Tensor:\n x = self.context_read(x) # [B, I, L, R] -> [B, 128, L, R]\n ~~~~~~~~~~~~~~~~~ <--- HERE\n x = self.context_pos(x) # [B, 128, L, R] -> [B, 256, L, 1]\n x = x.squeeze(-1).transpose(1, 2) # [B, L, 256]\n File "/home/stanojevicd/miniforge3/envs/haec/lib/python3.11/site-packages/torch/nn/modules/container.py", line 215, in forward\n def forward(self, input):\n for module in self:\n input = module(input)\n ~~~~~~ <--- HERE\n return input\n File "/home/stanojevicd/miniforge3/envs/haec/lib/python3.11/site-packages/torch/nn/modules/batchnorm.py", line 171, in forward\n used for normalization (i.e. in eval mode when buffers are not None).\n """\n return F.batch_norm(\n ~~~~~~~~~~~~ <--- HERE\n input,\n # If buffers are not to be tracked, ensure that they won't be updated\n File "/home/stanojevicd/miniforge3/envs/haec/lib/python3.11/site-packages/torch/nn/functional.py", line 2478, in batch_norm\n _verify_batch_size(input.size())\n\n return torch.batch_norm(\n ~~~~~~~~~~~~~~~~ <--- HERE\n input, weight, bias, running_mean, running_var, training, momentum, eps, torch.backends.cudnn.enabled\n )\nRuntimeError: CUDA out of memory. 
Tried to allocate 4.64 GiB (GPU 0; 9.50 GiB total capacity; 4.74 GiB already allocated; 1.93 GiB free; 7.06 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF\n\n")
note: run with RUST_BACKTRACE=1 environment variable to display a backtrace
Aborted (core dumped)
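For what it's worth, a minimal sketch of the workaround that usually helps with this class of failure, assuming the CUDA out-of-memory report at the end of the trace is the root cause. max_split_size_mb is a documented PyTorCH_CUDA_ALLOC_CONF allocator option; the file and directory names below are placeholders, and -b 32 is only an example starting point:

# Reduce allocator fragmentation (documented PYTORCH_CUDA_ALLOC_CONF option)
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512

# Placeholder paths; halve -b again if the allocation still fails
singularity run --nv herro.sif inference --read-alns batched_alignments \
    -m model_v0.1.pt -b 32 reads.fastq.gz corrected.fasta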

Download of Singularity image (current link is dead)

Hi there

Is there an alternative location from which one can download herro.sif?

the given link (wget http://metals.zesoi.fer.hr:9080/herro/herro.sif) is not working for me

I also could not download the models from that location, but I was able to download them from Zenodo.

As an alternative I tried compiling it, but I am having trouble with Rust (error: linking with cc failed: exit status: 1). I cannot use dorado correct because the data I want to correct is R9.4.1, and dorado only includes the HERRO model for R10.4.1.

Thanks
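Not an official distribution channel, but one way to sidestep the dead link is to rebuild the image locally; this is a sketch assuming Docker is available and that the Dockerfile at the repository root still builds:

# Build the image from the repository's Dockerfile, then convert it to SIF
git clone https://github.com/lbcb-sci/herro.git && cd herro
docker build -t herro:local .
singularity build herro.sif docker-daemon://herro:local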

Error-correction step error

Hello,
I'm trying to run the error-correction step, but it doesn't work; I get the following error:

thread '<unnamed>' panicked at /herro/src/inference.rs:197:70:
Cannot load model.: Torch("Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx\nException raised from device_count_impl at ../c10/cuda/CUDAFunctions.cpp:44 (most recent call first):\nframe #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits, std::allocator >) + 0x6b (0x2af52f05a6bb in /libs/libtorch/lib/libc10.so)\nframe #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) + 0xc9 (0x2af52f055769 in /libs/libtorch/lib/libc10.so)\nframe #2: c10::cuda::device_count_ensure_non_zero() + 0xd8 (0x2af52f64afe8 in /libs/libtorch/lib/libc10_cuda.so)\nframe #3: + 0x103931a (0x2af4d1e3931a in /libs/libtorch/lib/libtorch_cuda.so)\nframe #4: + 0x2c30f36 (0x2af4d3a30f36 in /libs/libtorch/lib/libtorch_cuda.so)\nframe #5: + 0x2c30ffb (0x2af4d3a30ffb in /libs/libtorch/lib/libtorch_cuda.so)\nframe #6: at::_ops::empty_strided::redispatch(c10::DispatchKeySet, c10::ArrayRefc10::SymInt, c10::ArrayRefc10::SymInt, c10::optionalc10::ScalarType, c10::optionalc10::Layout, c10::optionalc10::Device, c10::optional) + 0x1fb (0x2af518eb71fb in /libs/libtorch/lib/libtorch_cpu.so)\nframe #7: + 0x25ebc75 (0x2af5191ebc75 in /libs/libtorch/lib/libtorch_cpu.so)\nframe #8: at::_ops::empty_strided::call(c10::ArrayRefc10::SymInt, c10::ArrayRefc10::SymInt, c10::optionalc10::ScalarType, c10::optionalc10::Layout, c10::optionalc10::Device, c10::optional) + 0x168 (0x2af518ef2328 in /libs/libtorch/lib/libtorch_cpu.so)\nframe #9: + 0x1701f5f (0x2af518301f5f in /libs/libtorch/lib/libtorch_cpu.so)\nframe #10: at::native::_to_copy(at::Tensor const&, c10::optionalc10::ScalarType, c10::optionalc10::Layout, c10::optionalc10::Device, c10::optional, bool, c10::optionalc10::MemoryFormat) + 0x17e3 (0x2af5186a6cf3 in /libs/libtorch/lib/libtorch_cpu.so)\nframe #11: + 0x27d3603 (0x2af5193d3603 in /libs/libtorch/lib/libtorch_cpu.so)\nframe #12: at::_ops::_to_copy::redispatch(c10::DispatchKeySet, at::Tensor const&, c10::optionalc10::ScalarType, c10::optionalc10::Layout, c10::optionalc10::Device, c10::optional, bool, c10::optionalc10::MemoryFormat) + 0x103 (0x2af518b93c83 in /libs/libtorch/lib/libtorch_cpu.so)\nframe #13: + 0x25f01c8 (0x2af5191f01c8 in /libs/libtorch/lib/libtorch_cpu.so)\nframe #14: at::_ops::_to_copy::redispatch(c10::DispatchKeySet, at::Tensor const&, c10::optionalc10::ScalarType, c10::optionalc10::Layout, c10::optionalc10::Device, c10::optional, bool, c10::optionalc10::MemoryFormat) + 0x103 (0x2af518b93c83 in /libs/libtorch/lib/libtorch_cpu.so)\nframe #15: + 0x3a66271 (0x2af51a666271 in /libs/libtorch/lib/libtorch_cpu.so)\nframe #16: + 0x3a6681b (0x2af51a66681b in /libs/libtorch/lib/libtorch_cpu.so)\nframe #17: at::_ops::_to_copy::call(at::Tensor const&, c10::optionalc10::ScalarType, c10::optionalc10::Layout, c10::optionalc10::Device, c10::optional, bool, c10::optionalc10::MemoryFormat) + 0x201 (0x2af518c16651 in /libs/libtorch/lib/libtorch_cpu.so)\nframe #18: at::native::to(at::Tensor const&, c10::Device, c10::ScalarType, bool, bool, c10::optionalc10::MemoryFormat) + 0xfd (0x2af5186a505d in /libs/libtorch/lib/libtorch_cpu.so)\nframe #19: + 0x29a5612 (0x2af5195a5612 in /libs/libtorch/lib/libtorch_cpu.so)\nframe #20: at::_ops::to_device::call(at::Tensor const&, c10::Device, c10::ScalarType, bool, bool, c10::optionalc10::MemoryFormat) + 0x1c1 (0x2af518d95cd1 in /libs/libtorch/lib/libtorch_cpu.so)\nframe #21: torch::jit::Unpickler::readInstruction() + 0x1719 
(0x2af51b766789 in /libs/libtorch/lib/libtorch_cpu.so)\nframe #22: torch::jit::Unpickler::run() + 0xa8 (0x2af51b767988 in /libs/libtorch/lib/libtorch_cpu.so)\nframe #23: torch::jit::Unpickler::parse_ivalue() + 0x2e (0x2af51b76953e in /libs/libtorch/lib/libtorch_cpu.so)\nframe #24: torch::jit::readArchiveAndTensors(std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&, std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&, std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&, c10::optional<std::function<c10::StrongTypePtr (c10::QualifiedName const&)> >, c10::optional<std::function<c10::intrusive_ptr<c10::ivalue::Object, c10::detail::intrusive_target_default_null_typec10::ivalue::Object > (c10::StrongTypePtr, c10::IValue)> >, c10::optionalc10::Device, caffe2::serialize::PyTorchStreamReader&, c10::Type::SingletonOrSharedTypePtrc10::Type (*)(std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&), std::shared_ptrtorch::jit::DeserializationStorageContext) + 0x529 (0x2af51b7241a9 in /libs/libtorch/lib/libtorch_cpu.so)\nframe #25: + 0x4b08c4b (0x2af51b708c4b in /libs/libtorch/lib/libtorch_cpu.so)\nframe #26: + 0x4b0b04b (0x2af51b70b04b in /libs/libtorch/lib/libtorch_cpu.so)\nframe #27: torch::jit::import_ir_module(std::shared_ptrtorch::jit::CompilationUnit, std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&, c10::optionalc10::Device, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits, std::allocator >, std::__cxx11::basic_string<char, std::char_traits, std::allocator >, std::hash<std::__cxx11::basic_string<char, std::char_traits, std::allocator > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits, std::allocator > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits, std::allocator > const, std::__cxx11::basic_string<char, std::char_traits, std::allocator > > > >&, bool, bool) + 0x3a2 (0x2af51b70f6c2 in /libs/libtorch/lib/libtorch_cpu.so)\nframe #28: torch::jit::import_ir_module(std::shared_ptrtorch::jit::CompilationUnit, std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&, c10::optionalc10::Device, bool) + 0x92 (0x2af51b70fa42 in /libs/libtorch/lib/libtorch_cpu.so)\nframe #29: torch::jit::load(std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&, c10::optionalc10::Device, bool) + 0xd1 (0x2af51b70fb71 in /libs/libtorch/lib/libtorch_cpu.so)\nframe #30: + 0x1de74e (0x55867173674e in /usr/bin/herro)\nframe #31: + 0xf3e9c (0x55867164be9c in /usr/bin/herro)\nframe #32: + 0xd2758 (0x55867162a758 in /usr/bin/herro)\nframe #33: + 0xd9d0c (0x558671631d0c in /usr/bin/herro)\nframe #34: + 0xf4a96 (0x55867164ca96 in /usr/bin/herro)\nframe #35: + 0x145375 (0x55867169d375 in /usr/bin/herro)\nframe #36: + 0x94ac3 (0x2af52f337ac3 in /lib/x86_64-linux-gnu/libc.so.6)\nframe #37: clone + 0x44 (0x2af52f3c8a04 in /lib/x86_64-linux-gnu/libc.so.6)\n")

I ran the following command: singularity exec --bind $DIR $DIR/bin/herro_v0.1.sif herro inference --read-alns batch_aln -m herro_model_v0.1.pt -b 64 herro_split.fastq.gz herro_cor.fasta

Do you know what is the origin of the error?
Thanks.
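One observation: the singularity exec call above does not pass the GPU into the container, which would produce exactly this "Found no NVIDIA driver" message even on a GPU node. A hedged fix, assuming the host driver itself works (nvidia-smi succeeds on the host), is to add the --nv flag:

# --nv mounts the host NVIDIA driver libraries and device files into the container
singularity exec --nv --bind $DIR $DIR/bin/herro_v0.1.sif herro inference \
    --read-alns batch_aln -m herro_model_v0.1.pt -b 64 \
    herro_split.fastq.gz herro_cor.fasta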

Heterozygous SNP counts for HG002 sample

Hi,
I was wondering if it is possible to obtain summary stats on the number of heterozygous SNPs for the HG002 sample using corrected reads, and to compare them with PacBio HiFi or GIAB data against CHM13-T2T or GRCh38. This would indicate whether there is overcorrection in favour of a single allele at a locus. I understand that you are preparing a manuscript and this may be included in it. Awaiting the results eagerly.
Cheers

RUST_BACKTRACE=1

Hi, I get the following error when running HERRO. I checked the FASTQ file, and there do not appear to be any non-canonical bases (only A, T, C, G). Could you help me troubleshoot this error?
Thank you.

singularity run --nv herro.sif inference --read-alns batches_of_alignments -t 5 -d 0,1 -m model_R9_v0.1.pt -b 32 cyc.cor.fastq.gz cyc.cor.herro.fasta

[W graph_fuser.cpp:108] Warning: operator() profile_node %1243 : int[] = prim::profile_ivalue(%1241)
 does not have profile information (function operator())
thread '<unnamed>' panicked at /herro/src/haec_io.rs:144:9:
Out of bounds for 2-bit sequence decoding.
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
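A quick, tool-free check of the input named in the command above; if this prints nothing, non-canonical bases are probably not what triggers the 2-bit decoding panic (this is only a diagnostic sketch, not a fix):

# Print up to five FASTQ sequence lines containing anything other than A, C, G, T
zcat cyc.cor.fastq.gz | awk 'NR % 4 == 2 && /[^ACGTacgt]/ { print; n++ } n == 5 { exit }'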

Growing differences between "dorado correct" and HERRO

Several new versions of Dorado have been released recently, with improvements to "dorado correct". At the same time, the HERRO code hasn't been changing. Am I right that users should use "dorado correct" rather than HERRO?

`RuntimeError: CUDA error: an illegal memory access was encountered`

I tried this command

apptainer run --nv \                                                                 
    --bind "/scratch":"/output" \
    --bind "/lizardfs/guarracino/ratty/sequencing-data/HXB10/ONT":"/input" \
    --bind "/lizardfs/guarracino/git/herro/resources":"/models" \
    herro.sif inference -d 0,1 -t 8 -b 32 -m /models/model_v0.1.pt /input/HXB10.fq.gz /output/HXB10.herro.fq.gz

I got this error:

INFO:    underlay of /etc/localtime required more than 50 (77) bind mounts
INFO:    underlay of /usr/bin/nvidia-debugdump required more than 50 (308) bind mounts
[00:03:31] Parsed 430454 reads.                                                                                                                                          
[00:52:46] Processing 1/? batch ⠄                                                                                                                                        
[00:52:46] Processing 1/? batch ⡀                                                                                                                                        
[00:55:02] Processing 1/? batch ⠁
[>---------------------------------------] 2/94965                                                                                                                       [W graph_fuser.cpp:108] Warning: operator() profile_node %1243 : int[] = prim::profile_ivalue(%1241)
[00:55:12] Processing 1/? batch ⠂
[>---------------------------------------] 19/94965                                                                                                                      thread '<unnamed>' panicked at src/inference.rs:172:64:
called `Result::unwrap()` on an `Err` value: Torch("The following operation failed in the TorchScript interpreter.\nTraceback of TorchScript, serialized code (most recent call last):\n  File \"code/__torch__/transformer.py\", line 26, in forward\n      _2 = torch.select(x2, 0, i)\n      _3 = annotate(List[Optional[Tensor]], [tp])\n      _4 = torch.append(batch, torch.index(_2, _3))\n                               ~~~~~~~~~~~ <--- HERE\n    x3 = _0(batch, True, 0., )\n    _5 = __torch__.transformer.create_mask(lengths, )\n\nTraceback of TorchScript, original code (most recent call last):\n  File \"/raid/scratch/stanojevicd/projects/haec-BigBird/transformer.py\", line 40, in forward\n        x = x.squeeze(-1).transpose(1, 2)  # [B, L, 256]\n    \n        batch = [x[i, tp] for i, tp in enumerate(target_positions)]\n                 ~~~~~~~~ <--- HERE\n        x = nn.utils.rnn.pad_sequence(batch, True)\n        mask = create_mask(lengths).to(device=x.device)\nRuntimeError: CUDA error: an illegal memory access was encountered\nCUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.\nFor debugging consider passing CUDA_LAUNCH_BLOCKING=1.\nCompile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.\n\n")
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Aborted (core dumped)

Can you help me?
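A debugging sketch rather than a fix, following the hints already printed in the trace: with synchronous kernel launches the reported location becomes trustworthy, and restricting the run to one GPU with a smaller batch helps separate memory pressure from a genuine kernel bug. The paths are the ones from the command above; the -d/-b values are only examples:

# Surface CUDA errors at the failing call instead of a later API call
# (exported on the host; inherited by the container by default)
export CUDA_LAUNCH_BLOCKING=1

apptainer run --nv \
    --bind "/scratch":"/output" \
    --bind "/lizardfs/guarracino/ratty/sequencing-data/HXB10/ONT":"/input" \
    --bind "/lizardfs/guarracino/git/herro/resources":"/models" \
    herro.sif inference -d 0 -t 8 -b 16 -m /models/model_v0.1.pt \
    /input/HXB10.fq.gz /output/HXB10.herro.fq.gz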

Observation: Reads get split at certain intervals more often.

Hi,
I have a query regarding the following observation I made with HERRO-corrected reads:

Reads get split at certain intervals more frequently.

In the following plots of read length distributions we are comparing
(A) Raw reads dataset vs preprocessed reads (the reads from HERRO's preprocess.sh script; done to separate the Porechop and duplex-tools effects from HERRO inference).
(B) Raw reads dataset vs hifiasm-ec reads dataset.
(C) Raw reads dataset vs HERRO-corrected reads dataset.
(D) Herro-corrected read length distribution with bin size set so that we can observe the spikes in the distribution.

[Screenshot: read-length distribution plots (A)-(D); please open the image in a new tab to zoom.]

We can observe in figure (D) above that there are spikes at certain bins (denoting more reads of that particular length falling into that bin), and they appear at approximately regular intervals. This was not seen in the raw reads, only in the HERRO error-corrected reads.

Is the more frequent splitting of reads at certain intervals due to the GPU or to the model?

Experiment Information:
Tools :

  1. NanoPlot(https://github.com/wdecoster/NanoPlot) was used to get the read statistics.
  2. Minimap2: Used to create the small read dataset of raw reads mapping to HG002 chr19 (both haplotypes).
  3. Hifiasm: To get the Hifiasm-ec reads and compare the HERRO reads against them.

Data:

  1. We created a small read dataset of all reads mapping to HG002 chr19 (both haplotypes) from the ONT read data available from https://labs.epi2me.io/giab-2023.05/, listed via aws --no-sign-request s3 ls s3://ont-open-data/giab_2023.05/analysis/hg002/hac/PAO89685.pass.cram. We refer to this small read dataset as the raw reads here.

Edit: Removed read statistics table due to errors.

Thank You,
Bikram Kumar Panda
CDS, IISc
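For anyone who wants to reproduce a plot like panel (D) without NanoPlot, a small sketch that bins read lengths from the corrected FASTA (the file name is a placeholder, and the 1 kb bin size is an arbitrary choice that makes the spikes visible):

# Read lengths from a FASTA, binned into 1 kb buckets
awk '/^>/ { if (len) print len; len = 0; next } { len += length($0) } END { if (len) print len }' herro_corrected.fasta \
  | awk '{ bin = int($1 / 1000) * 1000; count[bin]++ } END { for (b in count) print b, count[b] }' \
  | sort -n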

hifiasm options

I've tried HERRO on one of our read sets (with the R9.4.1 model), and when I compare the reads before and after correction by aligning them to the reference assembly, the result is very convincing (very few INDEL errors left). But when I assemble the reads with hifiasm 0.19.8, I lose close to all reads (95%) during the first hifiasm correction step (comparing the first and second k-mer histograms in the hifiasm logs). This is not the case when I assemble HiFi reads.

Which hifiasm options should I change to limit this phenomenon?
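Not an authoritative answer, just the kind of experiment worth trying, assuming the corrected reads really are near HiFi accuracy: reduce hifiasm's own error-correction effort and give it an explicit homozygous coverage, since an overly aggressive correction round or a mis-estimated coverage peak could both discard reads. Both -r (correction rounds) and --hom-cov are standard hifiasm options; the values and the input file name below are examples only:

# Example only; set --hom-cov to the expected homozygous coverage of your dataset
hifiasm -o asm -t 32 -r 1 --hom-cov 30 herro_corrected.fasta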

Access to the models

Hi,
I would like to try HERRO, as it looks very efficient. However, like many of us, I don't work on AWS (and don't have an account). Is it possible to provide another way to get access to the models and the image?

For the models, I would recommend using Zenodo (https://zenodo.org/). It is free, datasets can be up to 50 GB, and your dataset will get a DOI and a direct download link. For the container image, I would recommend storing it on Docker Hub (or the new GitHub system); it will make it accessible to and compatible with Docker, Singularity, and Apptainer users, and it will provide a version history.
Thanks for your program,
Best.

Cannot load model

Hi, I'm having the following problem when running with singularity:

INFO: Converting SIF file to temporary sandbox...
WARNING: underlay of /etc/localtime required more than 50 (77) bind mounts
WARNING: underlay of /usr/bin/nvidia-smi required more than 50 (308) bind mounts
thread '<unnamed>' panicked at /herro/src/inference.rs:197:70:
Cannot load model.: Torch("open file failed because of errno 2 on fopen: , file path: ../herro_models/model_R9_v0.1/model_R9_v0.1.pt\nException raised from RAIIFile at ../caffe2/serialize/file_adapter.cc:27 (most recent call first):\nframe #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits, std::allocator >) + 0x6b (0x7f40e285a6bb in /libs/libtorch/lib/libc10.so)\nframe #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&) + 0xbf (0x7f40e28555ef in /libs/libtorch/lib/libc10.so)\nframe #2: caffe2::serialize::FileAdapter::RAIIFile::RAIIFile(std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&) + 0x134 (0x7f40e6552f84 in /libs/libtorch/lib/libtorch_cpu.so)\nframe #3: caffe2::serialize::FileAdapter::FileAdapter(std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&) + 0x41 (0x7f40e65535f1 in /libs/libtorch/lib/libtorch_cpu.so)\nframe #4: caffe2::serialize::PyTorchStreamReader::PyTorchStreamReader(std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&) + 0x7f (0x7f40e6550a6f in /libs/libtorch/lib/libtorch_cpu.so)\nframe #5: torch::jit::import_ir_module(std::shared_ptrtorch::jit::CompilationUnit, std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&, c10::optionalc10::Device, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits, std::allocator >, std::__cxx11::basic_string<char, std::char_traits, std::allocator >, std::hash<std::__cxx11::basic_string<char, std::char_traits, std::allocator > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits, std::allocator > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits, std::allocator > const, std::__cxx11::basic_string<char, std::char_traits, std::allocator > > > >&, bool, bool) + 0x28d (0x7f40e770f5ad in /libs/libtorch/lib/libtorch_cpu.so)\nframe #6: torch::jit::import_ir_module(std::shared_ptrtorch::jit::CompilationUnit, std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&, c10::optionalc10::Device, bool) + 0x92 (0x7f40e770fa42 in /libs/libtorch/lib/libtorch_cpu.so)\nframe #7: torch::jit::load(std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&, c10::optionalc10::Device, bool) + 0xd1 (0x7f40e770fb71 in /libs/libtorch/lib/libtorch_cpu.so)\nframe #8: + 0x1de74e (0x55620c9d674e in herro)\nframe #9: + 0xf3e9c (0x55620c8ebe9c in herro)\nframe #10: + 0xd2758 (0x55620c8ca758 in herro)\nframe #11: + 0xd9d0c (0x55620c8d1d0c in herro)\nframe #12: + 0xf4a96 (0x55620c8eca96 in herro)\nframe #13: + 0x145375 (0x55620c93d375 in herro)\nframe #14: + 0x94ac3 (0x7f40e266bac3 in /lib/x86_64-linux-gnu/libc.so.6)\nframe #15: clone + 0x44 (0x7f40e26fca04 in /lib/x86_64-linux-gnu/libc.so.6)\n")
note: run with RUST_BACKTRACE=1 environment variable to display a backtrace
Aborted (core dumped)
INFO: Cleaning up image...

HD_DIR=$(pwd)
singularity run --nv --bind $HD_DIR:$HD_DIR herro.sif herro inference --read-alns ./040919_Agr_pod -m ../herro_models/model_R9_v0.1/model_R9_v0.1.pt -t 20 -b 6 ./040919_Agr_pod.prefix.fastq.gz ./040919_Agr_pod_corrected.fasta
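One thing that stands out: errno 2 ("No such file or directory") together with the relative path ../herro_models/... suggests the model file simply is not visible inside the container, because only $HD_DIR is bound. A hedged sketch, assuming the models sit one level above the working directory on the host: bind that directory as well and pass an absolute path.

HD_DIR=$(pwd)
MODEL_DIR=$(readlink -f ../herro_models/model_R9_v0.1)

# Bind both the working directory and the model directory into the container
singularity run --nv --bind "$HD_DIR":"$HD_DIR" --bind "$MODEL_DIR":"$MODEL_DIR" \
    herro.sif herro inference --read-alns ./040919_Agr_pod \
    -m "$MODEL_DIR/model_R9_v0.1.pt" -t 20 -b 6 \
    ./040919_Agr_pod.prefix.fastq.gz ./040919_Agr_pod_corrected.fasta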

How about Ploidy?

Hi,
I'm wondering whether this method would also be applicable to datasets of polyploid species. Any thoughts in that direction?

Thanks!

`RUST_BACKTRACE=1`

Hello, I am currently running Herro in a singularity exec container and I receive the message below:

called `Result::unwrap()` on an `Err` value: Custom { kind: UnexpectedEof, error: "incomplete frame" }
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace.

I ran HERRO previously (a month or so ago) and it worked properly, but now I receive this message when running the program.

Please assist as soon as possible!

Best Regards,

Laura
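UnexpectedEof / "incomplete frame" typically points at a truncated or partially written compressed input rather than at the program itself; that is an assumption, but it is cheap to check before anything else. A sketch with placeholder names (the zstd test is only relevant if the alignment batches are zstd-compressed, which is itself an assumption here):

# A truncated gzip input fails this integrity test
gzip -t reads.fastq.gz && echo "fastq OK"

# Test each alignment batch; regenerate any that report errors
for f in batched_alignments/*; do
    zstd -t "$f" 2>/dev/null || echo "possibly truncated: $f"
done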

HERRO's 10Kb input recommendation

Hi

I have been using HERRO since last weekend, when ONT included it in their Dorado 0.7 basecaller (I am running it through Dorado). I have R10.4.1 ONT data from bees (some of them are male, i.e. haploid).

First I had a CUDA out-of-memory error some 618m into the analysis (time /opt/dorado-0.7.0-linux-x64/bin/dorado correct --verbose --threads 40 --device 'cuda:0' -b 56 --infer-threads 2 -m /opt/dorado-0.7.0-linux-x64/bin/herro-v1 $INPUTFOLDER/$INPUT.gz > $SPECIES.dorado.sup430.2kbQ90.herro-v1.fa).

After setting PYTORCH_CUDA_ALLOC_CONF and the batch size to auto, it worked (but took 2 days, basically as long as basecalling, using 19 GB out of 24 on my RTX 3090):
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:4096
time /opt/dorado-0.7.0-linux-x64/bin/dorado correct --verbose --threads 60 --device 'cuda:0' -b 0 --infer-threads 2 -m /opt/dorado-0.7.0-linux-x64/bin/herro-v1 $INPUTFOLDER/$INPUT.gz > $SPECIES.dorado.sup430.2kbQ90.herro-v1.fa

The resulting assemblies (various Flye and hifiasm runs) look very promising, including some chromosomes assembled T2T where previously they weren't. The assemblies are longer, but with similar, very good N50 and other stats, i.e. I am quite happy with this so far.

Just for our data processing, we had a question regarding reads shorter than 10 kb.

Should we remove these entirely? We usually have a lot of reads around that length, i.e. many that are 2-10 kb, while only a certain fraction is above 10 kb. The amount of 10 kb+ reads is sometimes very good, so filtering is no problem.

  1. Is it better to run HERRO on a filtered dataset (10 kb+) than on all reads (including the small ones)?
  2. If the small reads are included, are they also corrected, or do they remain unmodified?
  3. Is it detrimental to leave shorter reads in?

I'm just trying to understand the potential risks/biases that are possible by not pruning the datasets (sometimes the input datasets are not ideally distributed, i.e. too short on average, too few long reads)

Thanks for your recommendation
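Not an official recommendation, only a sketch of how the >= 10 kb pre-filter could be applied explicitly with seqkit before correction, so the short fraction is set aside rather than silently dropped; the cutoff matches the 10 kb figure discussed above and the file names are placeholders:

# Keep reads >= 10 kb for correction; shorter reads go to a separate file
seqkit seq -m 10000 reads.fastq.gz -o reads.ge10k.fastq.gz
seqkit seq -M 9999 reads.fastq.gz -o reads.lt10k.fastq.gz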

Failed conda env create

Hi,

Thank you for your work on HERRO, I'm very eager to give it a shot. I'm facing some issues with the first step of the installation which is to install the dependencies with conda:

$ conda env create --file scripts/herro-env.yml                                                                                                                                                      
Collecting package metadata: done
Solving environment: failed

ResolvePackageNotFound:
  - libuuid=2.38.1
  - ld_impl_linux-64=2.40
  - ca-certificates=2023.7.22
  - libsqlite=3.44.0
  - openssl=3.1.4
  - libstdcxx-ng=13.2.0
  - libnsl=2.0.1
  - tk=8.6.13
  - libgomp=13.2.0
  - libgcc-ng=13.2.0
  - wheel=0.41.3

As you can see, a number of dependencies needed to install the other dependencies cannot be found (which is somewhat funny, IMO, given that conda is supposed to automate this for you). Anyhow, I think the problem is really the exact versions required for these packages, because if I look at all versions available for the package tk:

$ conda search -f tk
Loading channels: done
# Name                       Version           Build  Channel
tk                            8.5.13               0  pkgs/free
tk                            8.5.15               0  pkgs/free
tk                            8.5.18               0  pkgs/free
tk                            8.5.19               0  conda-forge
tk                            8.5.19               1  conda-forge
tk                            8.5.19               2  conda-forge
tk                             8.6.6               0  conda-forge
tk                             8.6.6               1  conda-forge
tk                             8.6.6               2  conda-forge
tk                             8.6.6               3  conda-forge
tk                             8.6.6               4  conda-forge
tk                             8.6.6               5  conda-forge
tk                             8.6.7               0  conda-forge
tk                             8.6.7      h5979e9b_1  pkgs/main
tk                             8.6.7      hc745277_3  pkgs/main
tk                             8.6.8               0  conda-forge
tk                             8.6.8      h84994c4_0  conda-forge
tk                             8.6.8   h84994c4_1000  conda-forge
tk                             8.6.8      ha92aebf_0  conda-forge
tk                             8.6.8      hbc83047_0  pkgs/main
tk                             8.6.9   h84994c4_1000  conda-forge
tk                             8.6.9   h84994c4_1001  conda-forge
tk                             8.6.9      ha92aebf_0  conda-forge
tk                             8.6.9   hed695b0_1002  conda-forge
tk                             8.6.9   hed695b0_1003  conda-forge
tk                            8.6.10      h21135ba_1  conda-forge
tk                            8.6.10      hbc83047_0  pkgs/main
tk                            8.6.10      hed695b0_0  conda-forge
tk                            8.6.10      hed695b0_1  conda-forge
tk                            8.6.11      h1ccaba5_0  pkgs/main
tk                            8.6.11      h1ccaba5_1  pkgs/main
tk                            8.6.11      h21135ba_0  conda-forge
tk                            8.6.11      h27826a3_1  conda-forge
tk                            8.6.12      h1ccaba5_0  pkgs/main
tk                            8.6.12      h27826a3_0  conda-forge
tk                            8.6.14      h39e8969_0  pkgs/main

There is no version 8.6.13 as required by the YAML file. I tried to include those dependencies in the pip subsection of the YAML, but it did not help. According to this GitHub issue, it is possible that some build versions only exist on macOS but not on Linux; could that be the problem here? They advise exploring the --no-builds option of conda env export.

Overall, I was also wondering whether it wouldn't be simpler to provide a pre-built Singularity image. I am using Singularity because I don't have sudo on my system, but building the Singularity image requires sudo rights.

Thank you for the help!
Guillaume
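A workaround sketch rather than a fix of the recipe itself, assuming the exact conda version pins are the only blocker (which is what the ResolvePackageNotFound list suggests): strip the trailing =VERSION from the conda entries and let the solver pick available builds. The sed pattern only touches single-= conda pins, not the pip == pins:

cp scripts/herro-env.yml herro-env.relaxed.yml
sed -i -E 's/^([[:space:]]*- [A-Za-z0-9_.-]+)=[0-9][^=]*$/\1/' herro-env.relaxed.yml
conda env create --file herro-env.relaxed.yml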

herro inference error

Dear,
I encountered some issues in the final step of the herro pipeline.
In the last command (herro inference), the following error emerges:

[00:00:01] Parsed 2313 reads.                                                                                                                                                                              
[00:00:05] Processing 1/? batch ⢀
[>---------------------------------------] 4/698                                                                                                                                                           
thread '<unnamed>' panicked at src/inference.rs:172:64:
called `Result::unwrap()` on an `Err` value: Torch("Expected at most 4 argument(s) for operator 'forward', but received 5 argument(s). Declaration: forward(__torch__.model.PositionClassifier self, Tensor bases, Tensor qualities, Tensor[] target_positions) -> Tensor\nException raised from checkAndNormalizeInputs at /libtorch/include/ATen/core/function_schema_inl.h:383 (most recent call first):\nframe #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6b (0x7fc7aa65a6bb in /libs/libtorch/lib/libc10.so)\nframe #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xbf (0x7fc7aa6555ef in /libs/libtorch/lib/libc10.so)\nframe #2: void c10::FunctionSchema::checkAndNormalizeInputs<c10::Type>(std::vector<c10::IValue, std::allocator<c10::IValue> >&, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, c10::IValue, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, c10::IValue> > > const&) const + 0x45b (0x55f489b0f2eb in herro)\nframe #3: torch::jit::Method::operator()(std::vector<c10::IValue, std::allocator<c10::IValue> >, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, c10::IValue, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, c10::IValue> > > const&) const + 0x173 (0x7fc7af01b663 in /libs/libtorch/lib/libtorch_cpu.so)\nframe #4: <unknown function> + 0x1ea9cd (0x55f489b0f9cd in herro)\nframe #5: <unknown function> + 0xaadaa (0x55f4899cfdaa in herro)\nframe #6: <unknown function> + 0xa9133 (0x55f4899ce133 in herro)\nframe #7: <unknown function> + 0xd2a11 (0x55f4899f7a11 in herro)\nframe #8: <unknown function> + 0xd9d0c (0x55f4899fed0c in herro)\nframe #9: <unknown function> + 0xf4a96 (0x55f489a19a96 in herro)\nframe #10: <unknown function> + 0x145375 (0x55f489a6a375 in herro)\nframe #11: <unknown function> + 0x94ac3 (0x7fc7aa46bac3 in /lib/x86_64-linux-gnu/libc.so.6)\nframe #12: <unknown function> + 0x126850 (0x7fc7aa4fd850 in /lib/x86_64-linux-gnu/libc.so.6)\n")
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Aborted (core dumped)

I was wondering if anybody can help me.

Best
Mario

"Processed 0 reads" but exit code still 0

Hi,

I'm trying to run HERRO in a simple Nextflow pipeline. The pipeline completes without errors, but herro inference has actually not discovered any reads:

[00:00:37] Processed 0 reads.

Exit code is 0.

Wouldn't this be better reported as a failure, with exit code 1? The output file sizes are 0, of course.

Read features are created in the herro_features dir and passed to inference as herro_features.

Commands - is there an obvious error here?

herro features -t 32 761339-20220711_1557_2B_PAM34256_44784ad0_ont_filt.fastq herro_features

herro inference -t 1 -b 64 -m /data/herro/model_R9_v0.1.pt --read-alns herro_features 761339-20220711_1557_2B_PAM34256_44784ad0_ont_filt.fastq 20220711_1557_2B_PAM34256_44784ad0_ont_filt-corrected_ont.fasta

Thanks
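For comparison, the invocations elsewhere in these issues pass --read-alns a directory of batched minimap2 alignments produced by create_batched_alignments.sh rather than the output of herro features, so one (unverified) suspicion is that inference finds no usable alignments here. A sketch of that alignment-based path; read_ids.txt is a hypothetical placeholder and the script's argument order may need checking against the repository documentation:

# Batched all-vs-all alignments (script from the HERRO repository)
scripts/create_batched_alignments.sh 761339-20220711_1557_2B_PAM34256_44784ad0_ont_filt.fastq read_ids.txt 32 batched_alignments

herro inference --read-alns batched_alignments -t 8 -b 64 -m /data/herro/model_R9_v0.1.pt \
    761339-20220711_1557_2B_PAM34256_44784ad0_ont_filt.fastq \
    20220711_1557_2B_PAM34256_44784ad0_ont_filt-corrected_ont.fasta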

Out of bounds for 2-bit sequence decoding.

Getting the following error:

❯ nice nice singularity run --nv herro.sif inference -m model_R9_v0.1.pt -b 64 --read-alns batched_alignments out.fastq.gz corrected
[00:44:38] Parsed 9283361 reads.
[00:03:06] Processing 1/? batch ⠠
[>---------------------------------------] 3/68175
[W graph_fuser.cpp:108] Warning: operator() profile_node %1243 : int[] = prim::profile_ivalue(%1241)
[00:04:51] Processing 1/? batch ⢀
[>---------------------------------------] 844/68175
thread '<unnamed>' panicked at /herro/src/haec_io.rs:144:9:
Out of bounds for 2-bit sequence decoding.
[00:04:51] Processing 1/? batch ⠐
[>---------------------------------------] 844/68175                                                                                      0:     0x5567db11c2b1 - std::backtrace_rs::backtrace::libunwind::trace::ha637c64ce894333a
                               at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/../../backtrace/src/backtrace/libunwind.rs:104:5
   1:     0x5567db11c2b1 - std::backtrace_rs::backtrace::trace_unsynchronized::h47f62dea28e0c88d
                               at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
   2:     0x5567db11c2b1 - std::sys_common::backtrace::_print_fmt::h9eef0abe20ede486
                               at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/sys_common/backtrace.rs:67:5
   3:     0x5567db11c2b1 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::hed7f999df88cc644
                               at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/sys_common/backtrace.rs:44:22
   4:     0x5567db067390 - core::fmt::rt::Argument::fmt::h1539a9308b8d058d
                               at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/core/src/fmt/rt.rs:142:9
   5:     0x5567db067390 - core::fmt::write::h3a39390d8560d9c9
                               at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/core/src/fmt/mod.rs:1120:17
   6:     0x5567db11a3ff - std::io::Write::write_fmt::h5fc9997dfe05f882
                               at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/io/mod.rs:1762:15
   7:     0x5567db11c094 - std::sys_common::backtrace::_print::h894006fb5c6f3d45
                               at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/sys_common/backtrace.rs:47:5
   8:     0x5567db11c094 - std::sys_common::backtrace::print::h23a2d212c6fff936
                               at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/sys_common/backtrace.rs:34:9
   9:     0x5567db11d4a7 - std::panicking::default_hook::{{closure}}::h8a1d2ee00185001a
  10:     0x5567db11d205 - std::panicking::default_hook::h6038f2eba384e475
                               at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/panicking.rs:292:9
  11:     0x5567db11da40 - std::panicking::rust_panic_with_hook::h2b5517d590cab22e
                               at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/panicking.rs:779:13
  12:     0x5567db11d7b9 - std::panicking::begin_panic_handler::{{closure}}::h233112c06e0ef43e
13:     0x5567db11c776 - std::sys_common::backtrace::__rust_end_short_backtrace::h6e893f24d7ebbff8
                               at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/sys_common/backtrace.rs:170:18
  14:     0x5567db11d572 - rust_begin_unwind
                               at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/panicking.rs:645:5
  15:     0x5567dafff975 - core::panicking::panic_fmt::hbf0e066aabfa482c
                               at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/core/src/panicking.rs:72:14
  16:     0x5567db094cd6 - herro::features::extract_features::{{closure}}::h3588956aabde7bf6
  17:     0x5567db0944a4 - core::slice::sort::insertion_sort_shift_left::h6a8cb7ba2acaadd2
  18:     0x5567db093da6 - core::slice::sort::merge_sort::ha0f0096f1486e1d5
  19:     0x5567db0ac7b8 - herro::features::extract_features::h00ea577d98be430b
  20:     0x5567db0b5a6d - std::sys_common::backtrace::__rust_begin_short_backtrace::hd2c9f1af20faea53
  21:     0x5567db0d0be6 - core::ops::function::FnOnce::call_once{{vtable.shim}}::h44881a552781c61d
  22:     0x5567db121375 - <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once::hc7eafaff61e32df9
                               at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/alloc/src/boxed.rs:2007:9
  23:     0x5567db121375 - <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once::h6ba4a5de48dd2304
                               at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/alloc/src/boxed.rs:2007:9
  24:     0x5567db121375 - std::sys::unix::thread::Thread::new::thread_start::he469335aef763e45
                               at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/sys/unix/thread.rs:108:17
  25:     0x7f0a5d66bac3 - <unknown>
  26:     0x7f0a5d6fca04 - __clone
  27:                0x0 - <unknown>
Aborted (core dumped)

About the read length filter: will it be a problem if I include reads shorter than 10 kb?

Hello

I have used HERRO on several insect whole-genome sequencing datasets and obtained amazing results.
HERRO + hifiasm generated more contiguous assemblies with higher BUSCO scores compared to Flye, while demanding less memory.

  • Question
    It seems that preprocess.sh omits reads shorter than 10,000 bp.
    Is that because it's necessary, or is it fine to include shorter reads when assembling relatively small genomes?

Thanks

Question Regarding Error Rate Evaluation Using bamConcordance

Dear Lbcb-sci,

I am currently working on evaluating error correction using the bamConcordance script as described in your paper. In the appendix, you have provided instructions on how to use this script. However, the output files from the script contain error rates for each read individually.

I am particularly interested in understanding how you computed the overall mismatchBp, nonHpInsertionBp, nonHpDeletionBp, hpInsertionBp, and hpDeletionBp values for the entire dataset. I have attempted to calculate these metrics by summing all the errors and dividing by the total number of bases, as well as by first calculating the error rate for each read and then computing the overall error rate. However, neither approach matches the results provided in your paper for the A. thaliana dataset.

Could you please explain the method you used to calculate these overall error metrics? Your guidance would be greatly appreciated.

Thank you for your time and assistance.

Best regards,
Yichen

location of scripts in singularity container

Hi,

Thanks for creating this tool; I'm curious to see how it performs on my dataset.

I'm using the prebuilt Singularity container; however, I cannot find the scripts required for preprocessing inside the container. The required binaries for seqkit and Porechop are also not available. Is this intentional?

Regards
Judith

model training

Hi
Will you share the code of the HERRO model and the training script? I'm wondering how to set S in the final part of the model, which consists of two classification heads. In addition, the window size is 4096, which is the number of bases of the target read; since there may be an insertion at each position, the exact length of each window will be larger than 4096. Is this right?

Core dumped at herro inference

Hello there,

I am trying to run HERRO on quite a big dataset (a 4 Gb plant genome, ~72x depth). I have already done the all-vs-all alignment step, but now I am struggling with the inference step.

The command I am using is:

herro="$herro_dir/herro.sif"   ## herro_dir -> /path/to/herro_cloned_repository
mnt_alns="/data/out_mappings"
mnt_reads="/data/ont_reads.fastq.gz"

singularity run --nv $herro inference -t 64 -m /herro/model_v0.1.pt --read-alns $mnt_alns -b 128 $mnt_reads /results/corrected_reads.fasta

But I am getting this error

Error log
thread '<unnamed>' panicked at src/inference.rs:172:64:
called `Result::unwrap()` on an `Err` value: Torch("The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
  File \"code/__torch__/model.py\", line 31, in forward
    target_positions: List[Tensor]) -> Tuple[Tensor, Tensor]:
    embedding = self.embedding
    bases_embeds = (embedding).forward(bases, )
                    ~~~~~~~~~~~~~~~~~~ <--- HERE
    _0 = [bases_embeds, torch.unsqueeze(qualities, -1)]
    x = torch.cat(_0, -1)
  File \"code/__torch__/torch/nn/modules/sparse.py\", line 18, in forward
    _0 = __torch__.torch.nn.functional.embedding
    weight = self.weight
    _1 = _0(input, weight, 11, None, 2., False, False, )
         ~~ <--- HERE
    return _1
  File \"code/__torch__/torch/nn/functional.py\", line 37, in embedding
  else:
    input0 = input
  _3 = torch.embedding(weight, input0, padding_idx0, scale_grad_by_freq, sparse)
       ~~~~~~~~~~~~~~~ <--- HERE
  return _3
def batch_norm(input: Tensor,

Traceback of TorchScript, original code (most recent call last):
  File \"/raid/scratch/stanojevicd/projects/haec-BigBird/model.py\", line 118, in forward
        '''
        # (batch_size, sequence_length, num_alignment_rows, bases_embedding_size)
        bases_embeds = self.embedding(bases)
                       ~~~~~~~~~~~~~~ <--- HERE
    
        # concatenate base qualities to embedding vectors
  File \"/home/stanojevicd/miniforge3/envs/haec/lib/python3.11/site-packages/torch/nn/modules/sparse.py\", line 162, in forward
    def forward(self, input: Tensor) -> Tensor:
        return F.embedding(
               ~~~~~~~~~~~ <--- HERE
            input, self.weight, self.padding_idx, self.max_norm,
            self.norm_type, self.scale_grad_by_freq, self.sparse)
  File \"/home/stanojevicd/miniforge3/envs/haec/lib/python3.11/site-packages/torch/nn/functional.py\", line 2233, in embedding
        # remove once script supports set_grad_enabled
        _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
           ~~~~~~~~~~~~~~~ <--- HERE
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Given the final lines of the log, I thought it was a stochastic error, but I ran it again and got the same result, so it seems consistent. Do you have any idea what could be happening?

Thanks in advance.
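"No kernel image is available for execution on the device" generally means the libtorch build inside the container was not compiled for this GPU's compute capability (typically an older card). A quick check, assuming a driver recent enough to support the compute_cap query field:

# Compare each GPU's compute capability with the CUDA architectures the
# container's libtorch build supports
nvidia-smi --query-gpu=name,compute_cap --format=csv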

Checkpointing for minimap2 alignments

Hello,

I just thought I'd request a feature - would you consider implementing some kind of checkpointing for the minimap2 alignments prior to error correction (i.e. the "create_batched_alignments.sh" step)? I suspect this can be the most computationally expensive step of this pipeline for large datasets. I just had a pretty long job killed by our cluster before it could finish, and it's a bit devastating to have to relaunch from the beginning.

Thanks,
Chris

Is HERRO suitable for 1 kb to 2 kb read-length data?

I am doing 16S rRNA amplicon-based ONT sequencing.

My target region is only of 1kb to 2kb in length.

Per sample reads is around 50K.

Will HERRO give me good error-filtered estimates?

LD_LIBRARY_PATH may not be well configured in the Singularity image

Hi,

Thank you for your interesting tool.

I got the following error when I executed herro inference in Singularity.

$ singularity run --nv herro.sif inference <args>
[00:01:02] Parsed 393084 reads.                                                                                                                                  
[00:00:16] Processing 1/? batch ⠠
[>---------------------------------------] 2/75403                                                                                                               
Could not load library libcudnn_cnn_infer.so.8. Error: libcuda.so: cannot open shared object file: No such file or directory
Aborted

So I entered the bash console first, configured the LD_LIBRARY_PATH environment variable inside, and executed herro inference as below.

$ singularity exec --nv herro.sif bash
Singularity> export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-11.7/compat/
Singularity> herro inference <args>

Then, it worked.

Best
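The same workaround can be applied without an interactive shell by letting Singularity inject the variable; a sketch, assuming the compat directory inside the image is the one shown above:

# SINGULARITYENV_* variables are exported into the container environment
export SINGULARITYENV_LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/cuda-11.7/compat/"
singularity run --nv herro.sif inference <args>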

Models inaccessible

Hi,
Thanks for the great work.
I would like to use HERRO again on an old dataset; unfortunately, the models are unavailable. Can you make them available again?
Please consider uploading them on Zenodo, as it would make them more accessible, and you can even do versioning.
Thanks,
All the best.
Quentin

batch.py syntax error

I'm getting the following syntax error from the batch.py script:

File "./batch.py", line 43
if (idx := rids.get(tname, None)) is not None:
^
SyntaxError: invalid syntax
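The := assignment expression on that line requires Python 3.8 or newer, so this is almost certainly the script being run with an older interpreter rather than a bug in batch.py. A quick check and a hedged workaround, assuming the repository's conda environment (named herro in other reports here) ships a new enough Python:

# An interpreter older than 3.8 cannot parse ":="
python --version

# Run the script explicitly with the herro environment's Python
conda run -n herro python batch.py <args>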

Installation issues during conda env create

Hey herro team,
I'd like to test HERRO, but unfortunately I'm facing issues with installation step 1, the conda environment. The build fails when trying to download Porechop via pip. Do you see anything wrong with my setup or the way I am installing HERRO?

  • Error message:
⚠️ Expand error log here

$ conda env create --file scripts/herro-env.yml
Retrieving notices: ...working... done
Channels:
 - conda-forge
 - bioconda
 - defaults
Platform: linux-64
Collecting package metadata (repodata.json): done
Solving environment: done
Downloading and Extracting Packages:
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
Installing pip dependencies: \ Username for 'https://github.com'-/ Password for 'https://[email protected]'-/
Ran pip subprocess with arguments: ['/path/to/miniconda3/envs/herro/bin/python', '-m', 'pip', 'install', '-U', '-r', '/path/to/herro/scripts/condaenv.qnqwyn6k.requirements.txt', '--exists-action=b']
Pip subprocess output:
Collecting git+https://github.com/dehui333/Porechop.git (from -r /path/to/herro/scripts/condaenv.qnqwyn6k.requirements.txt (line 31))
  Cloning https://github.com/dehui333/Porechop.git to /local/job_5014742/pip-req-build-a2scptry
Pip subprocess error:
  Running command git clone --filter=blob:none --quiet https://github.com/dehui333/Porechop.git /local/job_5014742/pip-req-build-a2scptry
  remote: Support for password authentication was removed on August 13, 2021.
  remote: Please see https://docs.github.com/get-started/getting-started-with-git/about-remote-repositories#cloning-with-https-urls for information on currently recommended modes of authentication.
  fatal: Authentication failed for 'https://github.com/dehui333/Porechop.git/'
  error: subprocess-exited-with-error
  × git clone --filter=blob:none --quiet https://github.com/dehui333/Porechop.git /local/job_5014742/pip-req-build-a2scptry did not run successfully.
  │ exit code: 128
  ╰─> See above for output.
  note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error
× git clone --filter=blob:none --quiet https://github.com/dehui333/Porechop.git /local/job_5014742/pip-req-build-a2scptry did not run successfully.
│ exit code: 128
╰─> See above for output.
note: This error originates from a subprocess, and is likely not a problem with pip.
CondaEnvException: Pip failed

where each /path/to was manually censored from the log.

  • Details:

In general, I have never before been asked for password authentication when installing a conda env. So I inspected the herro-env.yml file and found the line git+https://github.com/dehui333/Porechop.git in the pip section. I haven't seen such an install instruction before, but, assuming it works in general, I tried to look up the URL. However, if I copy and paste the Porechop URL into my browser, I get Error 404: Not Found. Also, when using GitHub's global search, there is (as of today) no public repository Porechop by a user dehui333.

  • System and setup:
NAME="Ubuntu"
VERSION="20.04.5 LTS (Focal Fossa)"
conda 24.1.2

Poor assembly metrics with hifiasm v0.19.8

Thank you for providing herro.

I've tested it on four ONT datasets, one of which is public.
The number of reads and nucleotides after correction is very variable, ranging in my case from 2 to 20% of the reads and 2 to 60% of the nucleotides. What do these metrics look like in the cases you've tested?
The set which did not work has the highest coverage. Is there a coverage limit to respect?
For the sets that did work, I tried a hifiasm (v0.19.8) assembly, but in all cases the metrics were poor. The hifiasm log shows that there are remaining errors which are not removed by the 3 correction cycles.

For example, for the public dataset, the data can be found at
https://www.ncbi.nlm.nih.gov/bioproject/781898

Number of k-mers found once in the read set = errors

grep 'ha_hist_line' slurm-7727010.out | grep ' 1:'
[M::ha_hist_line]     1: ****************************************************************************************************> 52175410
[M::ha_hist_line]     1: ****************************************************************************************************> 45429496
[M::ha_hist_line]     1: ****************************************************************************************************> 41842772
[M::ha_hist_line]     1: ****************************************************************************************************> 39899872

Compared to other assemblies, this k-mer error count stays very high; it should drop quickly over the correction cycles.
And when I extract contig coverages from the GFA file, they are very low, while they should be around 10.

awk '/^S/{print $2"\t"$4"\t"$5}' hifiasm_0.19.8_no_HiC.bp.hap1.p_ctg.gfa \
| sed 's/LN:i://;s/rd:i://' | more
h1tg000001l 114415 6
h1tg000002l 1935040 3
h1tg000003l 485308 3
h1tg000004l 113763 0
h1tg000005l 54120 0
h1tg000006l 82359 0
h1tg000007l 3376377 2
h1tg000008l 505683 2
h1tg000009l 1826044 2
h1tg000010l 4045620 2
h1tg000011l 151854 1
h1tg000012l 172642 0
h1tg000013l 75530 0
h1tg000014l 82829 0
h1tg000015l 71160 0
h1tg000016l 944815 1
h1tg000017l 357347 3
h1tg000018l 207160 8
h1tg000019l 510563 5

Have you seen this before?
What could I change to improve correction or assembly?

CLR read model training

Are you intending to release a PacBio CLR correction model?
If not, do you have a procedure we could use to do it ourselves?

Cheers,

herro's principle and phasing

Hello,
Can you please describe how HERRO's error-correction steps work, and how you ensure preservation of phase between two variant sites within the same read? I see that, although N50 stats increase in your assemblies, the switch and Hamming error rates remain the same.
Other questions:
Will HERRO also work with R9 data? Is there a minimum ONT coverage that the software needs to perform error correction? Can I run HERRO without GPUs?
Thanks,

Dario
