
Comments (17)

fangli80 avatar fangli80 commented on August 17, 2024 1

Hello @SeAudet
Thanks for letting me know. NanoRepeat was tested on datasets of 50-200X coverage. If there are 10K-100K reads, speed could be an issue.

Usually I can get accurate estimates of repeat sizes from < 1000X coverage, so it's okay to sub-sample the dataset.
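As a rough illustration of sub-sampling, here is a minimal Python sketch that draws a fixed number of reads from a FASTQ file with reservoir sampling, so the whole file never has to be held in memory. It is not part of NanoRepeat; the helper names `read_fastq` and `reservoir_sample` are made up for this example, and it assumes plain 4-line FASTQ records (no wrapped sequence lines).

```python
import random
from itertools import islice


def read_fastq(handle):
    """Yield 4-line FASTQ records as tuples of lines (assumes unwrapped FASTQ)."""
    lines = iter(handle)
    while True:
        record = tuple(islice(lines, 4))
        if len(record) < 4:
            return  # end of input (or a truncated trailing record)
        yield record


def reservoir_sample(records, k, seed=0):
    """Return up to k records chosen uniformly at random in one pass."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    sample = []
    for i, record in enumerate(records):
        if i < k:
            sample.append(record)
        else:
            # Keep the new record with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = record
    return sample
```

A typical use would be to open the input FASTQ, pass `read_fastq(handle)` to `reservoir_sample` with the desired read count, and write the returned records to a new file before running NanoRepeat on it.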

I will work on improving speed for future versions.

Sincerely,
Li

from nanorepeat.

HLHsieh avatar HLHsieh commented on August 17, 2024 1

Hi Li,

Thank you for your continued support. I wanted to let you know that I have tried the latest version 1.5 of the tool, and I'm happy to report that I did not encounter the issue with the cgroup out-of-memory handler that I had experienced during some of my previous analyses. I must say, this tool is truly amazing and incredibly easy to access. Thank you for your assistance.

Many thanks,
Hsin


fangli80 avatar fangli80 commented on August 17, 2024

It might be that there are very large repeats in the datasets.
Can you dig out the last command that ran before it was killed? You can find it in the stderr output, for example:

[image: screenshot of stderr output]


HLHsieh avatar HLHsieh commented on August 17, 2024

Here it is:

[04/02/2023 18:44:37] NOTICE: Running command: /sw/spack/bio/pkgs/gcc-10.3.0/minimap2/2.14-jvscyilw/bin/minimap2 -c -t 12  -x map-ont  /gpfs/accounts/home/NanoRepeat/C9ORF72_p3_NanoSim_100x.NanoRepeat_temp_dir.chr9-27573494-27573708-GGCCCC/round1_ref.fasta /gpfs/accounts/home/NanoRepeat/C9ORF72_p3_NanoSim_100x.NanoRepeat_temp_dir.chr9-27573494-27573708-GGCCCC/core_sequences.fastq > /gpfs/accounts/home/NanoRepeat/C9ORF72_p3_NanoSim_100x.NanoRepeat_temp_dir.chr9-27573494-27573708-GGCCCC/round1.paf 2> /dev/null
[04/02/2023 18:46:38] NOTICE: Step 2 finished
[04/02/2023 18:46:38] NOTICE: Step 3: round 3 estimation
sh: line 1: 4093652 Killed                  /sw/spack/bio/pkgs/gcc-10.3.0/minimap2/2.14-jvscyilw/bin/minimap2 -x map-ont -N 100 -c --eqx -t 12 /gpfs/accounts/home/NanoRepeat/C9ORF72_p3_NanoSim_100x.NanoRepeat_temp_dir.chr9-27573494-27573708-GGCCCC/round3_ref.fasta /gpfs/accounts/home/NanoRepeat/C9ORF72_p3_NanoSim_100x.NanoRepeat_temp_dir.chr9-27573494-27573708-GGCCCC/core_sequences.fastq > /gpfs/accounts/home/NanoRepeat/C9ORF72_p3_NanoSim_100x.NanoRepeat_temp_dir.chr9-27573494-27573708-GGCCCC/round3.paf 2> /dev/null
[04/02/2023 18:47:07] ERROR: Failed to run command: /sw/spack/bio/pkgs/gcc-10.3.0/minimap2/2.14-jvscyilw/bin/minimap2  -x map-ont  -N 100 -c --eqx -t 12 /gpfs/accounts/home/NanoRepeat/C9ORF72_p3_NanoSim_100x.NanoRepeat_temp_dir.chr9-27573494-27573708-GGCCCC/round3_ref.fasta /gpfs/accounts/home/NanoRepeat/C9ORF72_p3_NanoSim_100x.NanoRepeat_temp_dir.chr9-27573494-27573708-GGCCCC/core_sequences.fastq > /gpfs/accounts/home/NanoRepeat/C9ORF72_p3_NanoSim_100x.NanoRepeat_temp_dir.chr9-27573494-27573708-GGCCCC/round3.paf 2> /dev/null
[04/02/2023 18:47:07] Return value is: 35072
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=50199088.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.

They all indeed have very large repeats. The largest one took 24 hours, but it finished. The other two encountered the above issue.

Best,
Hsin


SeAudet avatar SeAudet commented on August 17, 2024

Hello,

Just wanted to mention I encountered the same issue where round 3 estimation never ends, although it seemingly doesn't run out of memory according to our resource manager. That single step ran for over 6 days with 8 cores and 32GB of RAM (used only 17.5GB) before timing out. It did initially report the oom-kill event when running with only 16GB, but increasing the allocated memory seemingly fixed that issue.

From what I can gather, the nature of the repeat is not the issue; rather, the amount of available data seems to be bottlenecking the process. It ran in less than a day with around 10K reads, but for samples where over 100K reads are available (good quality reads with nice repeats), the processing is perhaps too slow, and it doesn't get faster with more cores/memory. I removed the 2> /dev/null to see if there was a hidden error, but the command line doesn't seem to produce one.

I'll probably just randomly subset my data into technical replicates for it to run (output is generally overall very nice from my tests!), but thought it was a good idea to mention I also got that problem! Thank you in advance for your time!
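For what it's worth, the random split into technical replicates could look something like the Python sketch below. It works on read IDs only and is an illustration, not something NanoRepeat provides; `split_into_replicates` is a hypothetical helper name.

```python
import random


def split_into_replicates(read_ids, n_replicates, seed=0):
    """Shuffle read IDs and deal them into n_replicates disjoint groups."""
    rng = random.Random(seed)  # fixed seed so the split can be reproduced
    shuffled = list(read_ids)
    rng.shuffle(shuffled)
    # Dealing with a stride keeps group sizes within one read of each other.
    return [shuffled[i::n_replicates] for i in range(n_replicates)]
```

Each group of IDs can then be used to write one FASTQ per replicate, so that every read lands in exactly one replicate.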

Sincerely,
Seb


HLHsieh avatar HLHsieh commented on August 17, 2024

Hi @fangli80 ,

I would like to provide some additional information about my analysis. Specifically, I would like to mention that my dataset was generated at a coverage of 100x.

Sincerely,
Hsin


fangli80 avatar fangli80 commented on August 17, 2024

Thanks for letting me know. So you are working on a 100X dataset with multiple repeat regions? May I ask how many repeat regions there are?

Thanks,
Li


HLHsieh avatar HLHsieh commented on August 17, 2024

Hi @fangli80
This is the BED file I used for this 100X dataset:

chr9    27573494        27573708        GGCCCC


fangli80 avatar fangli80 commented on August 17, 2024

@HLHsieh

It seems that the location of the GGCCCC repeat is not accurate. If I extract the region chr9:27573494-27573708 from hg38 or hg19, I got the following sequence:

hg38_chr9:27573495-27573708:
GGGCCCGCCCCCGGGCCCGCCCCGACCACGCCCCGGCCCCGGCCCCGGCCCCTAGCGCGCGACTCCTGAGTTCCAGAGCTTGCTACAGGCTGCGGTTGTTTCCCTCCTTGTTTTCTTCTGGTTAATCTTTATCAGGTCTTTTCTTGTTCACCCTCAGCGAGTACTGTGAGAGCAAGTAGTGGGGAGAGAGGGTGGGAAAAACAAAAACACACAC

hg19_chr9:27573495-27573708:
GCCCGCCCCCGGGCCCGCCCCGACCACGCCCCGGCCCCGGCCCCGGCCCCTAGCGCGCGACTCCTGAGTTCCAGAGCTTGCTACAGGCTGCGGTTGTTTCCCTCCTTGTTTTCTTCTGGTTAATCTTTATCAGGTCTTTTCTTGTTCACCCTCAGCGAGTACTGTGAGAGCAAGTAGTGGGGAGAGAGGGTGGGAAAAACAAAAACACACACCT

Only the first 28-30 bp is the repeat.
I checked the RepeatMasker annotation (from here). The repeat position is chr9:27573485-27573546.

The latest version of NanoRepeat will check whether the repeat region is correct, and it will give a warning message if the repeat region is not accurate.
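To make the idea of such a check concrete, here is a small Python sketch that looks for the longest uninterrupted run of the motif inside the extracted sequence and converts it back to BED coordinates. This is only an illustration and not NanoRepeat's actual implementation; `longest_tandem_run` is a hypothetical helper, the constant below holds just the beginning of the hg38 sequence quoted above, and it assumes the first base of that sequence sits at BED (0-based) start 27573494.

```python
import re


def longest_tandem_run(sequence, motif, bed_start):
    """Return (bed_start, bed_end, copies) of the longest perfect motif run, or None."""
    pattern = re.compile(f"(?:{re.escape(motif)})+")
    best = max(
        pattern.finditer(sequence),
        key=lambda m: m.end() - m.start(),
        default=None,
    )
    if best is None:
        return None
    copies = (best.end() - best.start()) // len(motif)
    # BED coordinates are 0-based and half-open, like Python slices.
    return bed_start + best.start(), bed_start + best.end(), copies


# Beginning of the hg38 sequence quoted above (assumed to start at BED position 27573494).
HG38_PREFIX = "GGGCCCGCCCCCGGGCCCGCCCCGACCACGCCCCGGCCCCGGCCCCGGCCCCTAGCGCGCG"

print(longest_tandem_run(HG38_PREFIX, "GGCCCC", 27573494))
# (27573528, 27573546, 3)
```

Under these assumptions the perfect GGCCCC run covers only 18 bp of the 214 bp region, and its end coordinate agrees with the RepeatMasker end position given above. A perfect-match search like this ignores interrupted or imperfect copies, so it is a lower bound on the repeat extent rather than a full annotation.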

You can use the following command to install the latest version of NanoRepeat:

git clone https://github.com/WGLab/NanoRepeat.git
cd NanoRepeat
pip install .

If you supply the correct region, NanoRepeat can finish repeat quantification in a few minutes.


HLHsieh avatar HLHsieh commented on August 17, 2024

Hi Li,

I have tried the latest version of the software you provided, but unfortunately, it did not work for me. I encountered an error message indicating an issue with running the command:

python /bin/NanoRepeat/src/NanoRepeat/nanoRepeat.py -i /DRD4_p2_stimulated/DRD4_p2_NanoSim_2x/DRD4_p2_NanoSim_2x.fasta -t fasta -r /Reference/Human/Genome/hg38/genome.fa -b /reference/myDefinedRepeat_NanoRepeat_chr11.bed -c 4 --samtools /sw/spack/bio/pkgs/gcc-10.3.0/samtools/1.13-fwwss5nm/bin/samtools --minimap2 /sw/spack/bio/pkgs/gcc-10.3.0/minimap2/2.14-jvscyilw/bin/minimap2 -o DRD4_p2_NanoSim_2x
[05/12/2023 14:47:49] NOTICE: Input file is: /DRD4_p2_stimulated/DRD4_p2_NanoSim_2x/DRD4_p2_NanoSim_2x.fasta
[05/12/2023 14:47:49] NOTICE: Input type is: fasta
[05/12/2023 14:47:49] NOTICE: Reference fasta file is: /Reference/Human/Genome/hg38/genome.fa
[05/12/2023 14:47:49] NOTICE: Output prefix is: /DRD4_p2_stimulated/DRD4_p2_NanoSim_2x/NanoRepeat_new/DRD4_p2_NanoSim_2x
[05/12/2023 14:47:49] NOTICE: Repeat region bed file is: /reference/myDefinedRepeat_NanoRepeat_chr11.bed
[05/12/2023 14:49:27] ERROR: Failed to run command: /sw/spack/bio/pkgs/gcc-10.3.0/samtools/1.13-fwwss5nm/bin/samtools view -hb -@ 4 /DRD4_p2_stimulated/DRD4_p2_NanoSim_2x/NanoRepeat_new/DRD4_p2_NanoSim_2x.minimap2.sam > /DRD4_p2_stimulated/DRD4_p2_NanoSim_2x/NanoRepeat_new/DRD4_p2_NanoSim_2x.minimap2.bam 2> /dev/null
[05/12/2023 14:49:27] Return value is: 256

However, the previous version of the software is still functional for me, so I will continue using that version. I wanted to bring this issue to your attention.

Additionally, I am still encountering the out-of-memory issue with the cgroup handler in several of my analyses, even after making the necessary corrections to the repeat bed file.

This particular analysis involves a 6 bp repeat unit with a repeat size of 3000; the coverage is 100x, with about 2400000 reads.

Many thanks,
Hsin


fangli80 avatar fangli80 commented on August 17, 2024

Hello Hsin,
Thanks for letting me know. It seems that the data is simulated. If it is not patient data, could you please email it to me so that I can test on my end?

By the way, why does it have 2400000 reads when the coverage is 100X? Is it because the 2400000 reads are from many different repeat regions?

Thanks,
Li


fangli80 avatar fangli80 commented on August 17, 2024

To support pip install, the new version has changed its installation method.

  1. If you want to install the latest version from GitHub, please run:
git clone https://github.com/WGLab/NanoRepeat.git
cd NanoRepeat
pip install .

If successful, nanoRepeat.py will be in a folder that is in the $PATH variable and you can directly run
nanoRepeat.py

Please don't run python ./NanoRepeat/src/NanoRepeat/nanoRepeat.py directly, because this is the source code and is no longer the installed path.

  2. If you want to install a specific version that was released (e.g. v1.4.0), you can use:
pip install NanoRepeat==1.4.0

Same as above, nanoRepeat.py will be in a folder that is in the $PATH variable (usually a /bin folder) and you can directly type
nanoRepeat.py without specifying the full path.
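If it is unclear whether the installed script is really the one being picked up, a quick check with Python's standard library can help. This is a small sketch, not part of NanoRepeat; `find_on_path` is a hypothetical helper that simply wraps `shutil.which`.

```python
import shutil


def find_on_path(name="nanoRepeat.py"):
    """Return the full path of an executable found via $PATH, or None if absent."""
    return shutil.which(name)


location = find_on_path()
if location is None:
    print("nanoRepeat.py is not on $PATH; the pip install may have failed")
else:
    print(f"nanoRepeat.py resolves to {location}")
```

If the printed path points into a source checkout rather than an environment's bin folder, the script being run is probably not the installed copy.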


fangli80 avatar fangli80 commented on August 17, 2024

Thanks for reporting bugs to me. Please feel free to let me know if there are other issues.

Cheers,
Li


HLHsieh avatar HLHsieh commented on August 17, 2024

Hi Li,

Unfortunately, I have encountered the same issue again during my analysis.

I ran the following command using NanoRepeat v1.5, along with minimap2 v2.24 and samtools v1.13:

nanoRepeat.py -i /C9ORF72_3_stimulated/C9ORF72_3_NanoSim_30x/C9ORF72_3_NanoSim_30x.fasta -t fasta -r /Reference/Human/Genome/hg38/genome.fa -b /reference/myDefinedRepeat_NanoRepeat_chr9.bed -c 8 --samtools /sw/spack/bio/pkgs/gcc-10.3.0/samtools/1.13-fwwss5nm/bin/samtools --minimap2 /minimap2-2.24_x64-linux/minimap2 -o C9ORF72_3_NanoSim_30x

Here are some messages that appeared before the error message:

[05/25/2023 01:19:26] NOTICE: Step 3: round 3 estimation
[05/25/2023 01:19:26] NOTICE: Running command: /minimap2-2.24_x64-linux/minimap2  -x map-ont  -f 0.0 -N 100 -c --eqx -t 8 /C9ORF72_3_stimulated/C9ORF72_3_NanoSim_30x/NanoRepeat/C9ORF72_3_NanoSim_30x.NanoRepeat_temp_dir.chr9-27573528-27573546-GGCCCC/round3_ref.fasta /C9ORF72_3_stimulated/C9ORF72_3_NanoSim_30x/NanoRepeat/C9ORF72_3_NanoSim_30x.NanoRepeat_temp_dir.chr9-27573528-27573546-GGCCCC/core_sequences.fastq > /C9ORF72_3_stimulated/C9ORF72_3_NanoSim_30x/NanoRepeat/C9ORF72_3_NanoSim_30x.NanoRepeat_temp_dir.chr9-27573528-27573546-GGCCCC/round3.paf
[05/25/2023 01:24:52] ERROR: Failed to run command: /minimap2-2.24_x64-linux/minimap2  -x map-ont  -f 0.0 -N 100 -c --eqx -t 8 /C9ORF72_3_stimulated/C9ORF72_3_NanoSim_30x/NanoRepeat/C9ORF72_3_NanoSim_30x.NanoRepeat_temp_dir.chr9-27573528-27573546-GGCCCC/round3_ref.fasta /C9ORF72_3_stimulated/C9ORF72_3_NanoSim_30x/NanoRepeat/C9ORF72_3_NanoSim_30x.NanoRepeat_temp_dir.chr9-27573528-27573546-GGCCCC/core_sequences.fastq > /C9ORF72_3_stimulated/C9ORF72_3_NanoSim_30x/NanoRepeat/C9ORF72_3_NanoSim_30x.NanoRepeat_temp_dir.chr9-27573528-27573546-GGCCCC/round3.paf
[05/25/2023 01:24:52] Return value is: 35072
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=53724533.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.

I would like to highlight that this error occurred during Step 3: round 3 estimation, and the return value was 35072. I'm curious to know if there are any specific reasons for this error.
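One possible reading of the return value, sketched in Python: assuming the logged number is the raw wait status that `os.system()` returns on Linux (an assumption, although both 35072 and the earlier 256 are multiples of 256, which fits), the high byte is the exit code of the shell that ran the command. The helper name `decode_wait_status` is made up for this example.

```python
def decode_wait_status(status):
    """Split a 16-bit Unix wait status into exit code and signal information."""
    term_signal = status & 0x7F      # non-zero if the process itself was signalled
    exit_code = (status >> 8) & 0xFF  # exit code when it exited normally
    # Shells conventionally report "child killed by signal N" as exit code 128 + N.
    shell_signal = exit_code - 128 if term_signal == 0 and exit_code > 128 else None
    return {
        "exit_code": exit_code,
        "term_signal": term_signal,
        "shell_reported_signal": shell_signal,
    }


print(decode_wait_status(35072))
# {'exit_code': 137, 'term_signal': 0, 'shell_reported_signal': 9}
print(decode_wait_status(256))
# {'exit_code': 1, 'term_signal': 0, 'shell_reported_signal': None}
```

Under that assumption, 35072 corresponds to exit code 137, that is 128 + 9, meaning the minimap2 process was terminated with SIGKILL. That matches the "Killed" line in the earlier log and the oom-kill event reported by slurmstepd, so the return value itself points to memory exhaustion rather than a minimap2 usage error.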

I would greatly appreciate your insights or suggestions regarding this issue.

Thank you for your assistance,
Hsin


fangli80 avatar fangli80 commented on August 17, 2024

Sorry for the late reply.
I noticed that the input data is C9ORF72_3_NanoSim_30x.fasta. Is it different from the data that you shared with me (C9ORF72_p2_NanoSim_30x.fasta)? If so, what is the difference?
Thanks,
Li


HLHsieh avatar HLHsieh commented on August 17, 2024

Hi Li,

Thank you for your response. I have been persistently working on these data analyses, and I finally achieved a successful analysis this morning. I would like to share some information with you.

For the C9ORF72_p2_NanoSim_30x.fasta dataset, the memory consumption is approximately 15-20 GB, and the analysis takes around 1 hour to complete. On the other hand, for the C9ORF72_3_NanoSim_30x.fasta dataset, the memory consumption is considerably higher at around 110-120 GB, and the analysis takes approximately 5 hours to finish. It is important to note that both datasets have the same sequencing depth.
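For anyone who wants to collect comparable numbers, the sketch below shows one way to record wall time and approximate peak memory for a single command from Python on Linux. It is an illustration only: `run_and_measure` is a hypothetical helper, `ru_maxrss` is reported in kilobytes on Linux, and `RUSAGE_CHILDREN` gives the high-water mark across all child processes waited for so far, so it is most meaningful when one command is measured per Python process.

```python
import resource
import subprocess
import time


def run_and_measure(cmd):
    """Run cmd (a list of arguments) and report time, peak child RSS and stderr."""
    start = time.perf_counter()
    # Capture stderr instead of discarding it, so hidden errors stay visible.
    proc = subprocess.run(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.PIPE, text=True
    )
    elapsed = time.perf_counter() - start
    peak_kib = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return {
        "returncode": proc.returncode,
        "seconds": elapsed,
        "peak_child_rss_kib": peak_kib,
        "stderr": proc.stderr,
    }
```

The argument list would be the minimap2 or nanoRepeat.py command line split into its parts; a job scheduler's own accounting remains the more authoritative source when it is available.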

If you are interested in investigating the reasons behind this discrepancy, I would be more than happy to share the data with you.

Thank you once again for your support.

Best regards,
Hsin


fangli80 avatar fangli80 commented on August 17, 2024

Hello Hsin,
It would be great if you could share the data with me.

Thanks!
Li

