Giter Site home page Giter Site logo

wglab / nanorepeat Goto Github PK

View Code? Open in Web Editor NEW
15.0 11.0 1.0 1.05 MB

NanoRepeat: fast and accurate analysis of Short Tandem Repeats (STRs) from Oxford Nanopore sequencing data

License: MIT License

Python 100.00%
bioinformatics genome-analysis nanopore-sequencing repeat-detection short-tandem-repeats genomics pacbio-sequencing sequencing

nanorepeat's Introduction

NanoRepeat: quantification of Short Tandem Repeats (STRs) from long-read sequencing data

PyPI version DOI

Table of Contents

Installation

Prerequisites:

  1. Python (version >= 3.8)
  2. Minimap2 (version >= 2.22)
  3. Samtools (version >= 1.13)

You may alreadly have minimap2 and samtools if you performed analysis of Oxford Nanopore sequencing data. You can use which minimap2 and which samtools to check the full path to the two executable files. Please note that minimap2 should be v2.22 or later.

Once you installed the above tools, you can use the following commands to install NanoRepeat (we recommend creating an new conda environment to avoid dependency issues):

conda create -n nanorepeat python=3.8
conda activate nanorepeat
git clone https://github.com/WGLab/NanoRepeat.git
cd NanoRepeat
pip install .

If you want to install a stable version from Python Package Index (PyPI):

conda create -n nanorepeat python=3.8
conda activate nanorepeat
pip install NanoRepeat

Notice: If you want to install NanoRepeat from a PyPI mirror, please check if the version in the mirror is update to date.

Usage

Regular use cases

NanoRepeat can quantify STRs from targeted sequencing or whole-genome sequencing data. We will demonstrate the usage of NanoRepeat using an example data set, which can be downloaded using the following commands.

wget https://github.com/WGLab/NanoRepeat/releases/download/v1.3/NanoRepeat_v1.3_example_data.tar.bz2
tar xjf NanoRepeat_v1.3_example_data.tar.bz2

After unzipping the file, you will see a NanoRepeat_v1.3_example_data folder and there are two subfolders: HG002 and HTT_amplicon. In this section, we will use the data under the HG002 folder.

$ ls -1  ./NanoRepeat_v1.3_example_data/HG002/ 
GRCh37_chr1.fasta
GRCh37_chr1.fasta.fai
HG002_GRCh37_example_regions.bed
hg002_Q20.20210805_3flowcells.hs37d5.example_regions.bam
hg002_Q20.20210805_3flowcells.hs37d5.example_regions.bam.bai

You can use the following command to run NanoRepeat:

nanoRepeat.py \
    -i path/to/NanoRepeat_v1.3_example_data/HG002/hg002_Q20.20210805_3flowcells.hs37d5.example_regions.bam \
    -t bam \
    -d ont_q20 \
    -r path/to/NanoRepeat_v1.3_example_data/HG002/GRCh37_chr1.fasta \
    -b path/to/NanoRepeat_v1.3_example_data/HG002/HG002_GRCh37_example_regions.bed \
    -c 4 \
    --samtools path/to/samtools \
    --minimap2 path/to/minimap2 \
    -o ./nanorepeat_output/HG002

-i specifies the input file, which can be in fasta, fastq or bam format. In this case our input file is hg002_Q20.20210805_3flowcells.hs37d5.example_regions.bam. It is a subset of an Oxford Nanopore whole-genome sequencing dataset of the NIST/GIAB HG002 (GM24385/NA24385) genome. The sequencing data was from the Oxford Nanopore Technologies Benchmark Datasets and reads from 15 example STR regions in chr1 were extracted. These regions were selected because they overlap with the HG002 SV benchmark set and are heterozygous (i.e., two alleles have different repeat sizes).

-t specifies the input file type. There are four valid values: bam, cram, fastq or fasta. In this case the input file is in a bam file.

-d specifies the data type. There are five valid values: ont_q20, ont_sup, ont, hifi, and clr. ont_q20 is for Oxford Nanopore sequencing with Q20+ chemistry (such as R10 flowcells). ont_sup is for Oxford Nanopore sequencing with R9 flowcells and basecalled in super accuracy mode. ont is for Oxford Nanopore sequencing with R9 flowcells and basecalled in fast mode or high accuracy mode. hifi is for PacBio HiFi/CCS reads. clr is for PacBio Continuous Long Reads (CLR) reads. Default value: ont.

-r specifies the reference genome file in FASTA format. In this case, GRCh37_chr1.fasta is chr1 of the GRCh37/hg19 reference genome. We used GRCh37 instead of GRCh38 because the HG002 SV benchmark set is based on GRCh37.

-b specifies the information of the tandem repeat regions that you are interested in. It is a tab-delimited text file in BED format. There are four required columns: chromosome, start_position, end_position, repeat_unit_sequence. In our case, HG002_GRCh37_example_regions.bed contains 15 STR regions in chr1 of GRCh37. The first 5 rows of the HG002_GRCh37_example_regions.bed are shown below.

1 4599903 4599930 TTTAG
1 7923034 7923187 TATTG
1 8321418 8321465 TTCC
1 14459872 14459935 AAAG
1 20934886 20934920 GTTTT

IMPORTANT NOTICE

  1. Please note that in BED format, all chromosome positions start from 0. start_position is self-inclusive but end_position is NOT self-inclusive. Tip: If you have 1-based positons, simply decrease the value of start_position by 1 and no changes for end_position

  2. NanoRepeat assumes that the seqeunce between start_position and end_position are all repeats of the motif specified in the fourth column. There should be neither non-repeat sequences nor other repeat motifs between start_position and end_position. If a region contains two consecutive repeats, you can specify them in two rows.

-c specifies the number of CPU cores for alignment.

-o specifies the output prefix. Please include the path to the output directory and prefix of output file names. In our case, the output prefix is ./nanorepeat_output/HG002, which means the output directory is ./nanorepeat_output/ and the prefix of output file names is HG002.

--samtools and --minimap2 specifies the path to the two software tools. The two arguments are optional if samtools and minimap2 and be found in your environment.

If you run NanoRepeat sucessfully, you will see 90 output files (six files per region). Output files of a single repeat region look like this:

HG002.1-7923034-7923187-TATTG.allele1.fastq
HG002.1-7923034-7923187-TATTG.allele2.fastq
HG002.1-7923034-7923187-TATTG.hist.png
HG002.1-7923034-7923187-TATTG.phased_reads.txt
HG002.1-7923034-7923187-TATTG.repeat_size.txt
HG002.1-7923034-7923187-TATTG.summary.txt

HG002.1-7923034-7923187-TATTG.allele1.fastq and HG002.1-7923034-7923187-TATTG.allele2.fastq are reads of each allele.

HG002.1-7923034-7923187-TATTG.hist.png is a histogram showing the repeat size distribution. Each allele has a different color.

HG002.1-7923034-7923187-TATTG.phased_reads.txt shows the phasing results. First 10 lines of the HG002.1-7923034-7923187-TATTG.phased_reads.txt are shown below.

$ head HG002.1-7923034-7923187-TATTG.phased_reads.txt
##RepeatRegion=1-7923034-7923187-TATTG
#Read_Name	Allele_ID	Phasing_Confidence	Repeat_Size
746edfa7-715f-4e97-913e-ef73ed97135f	1	HIGH	14.0
d6355053-0ed2-438e-8469-28cabeb2aedf	1	HIGH	17.0
513a749a-6ffc-47c4-a499-9f9222e93abf	1	HIGH	17.0
fc8dc377-8772-4dc0-922d-ad694deec8d7	1	HIGH	17.0
cd847c0e-9fbf-4abf-8f0a-ea938026ef41	1	HIGH	17.0
f53bc376-69b4-4118-87e1-59379c640408	1	HIGH	17.0
9b70cd2a-c1df-447a-a7aa-b5ab8046115e	1	HIGH	17.0
6a9b6f5b-d59d-4dde-9adb-8e6ac91cc6e4	1	HIGH	17.0

The columns of the *.phased_reads.txt file:

Column Description
1 Read_Name
2 Allele_ID
3 Phasing_Confidence (two values: HIGH or LOW)
4 Repeat_Size

HG002.1-7923034-7923187-TATTG.repeat_size.txt is the estimated repeat sizes of ALL reads. This file is similar to the *.phased_reads.txt file but it also includes reads that may be removed in the phasing process (e.g. reads considered as noisy reads or outliers)

$ head HG002.1-7923034-7923187-TATTG.repeat_size.txt
##Repeat_Region=1-7923034-7923187-TATTG
#Read_Name	Repeat_Size
746edfa7-715f-4e97-913e-ef73ed97135f	14.0
d6355053-0ed2-438e-8469-28cabeb2aedf	17.0
dadaf0a0-8797-47ca-a21b-259928edca7e	48.0
513a749a-6ffc-47c4-a499-9f9222e93abf	17.0
07f65d31-4023-4d86-beba-76fb88f2cf45	48.0
4e66c3d0-6f15-4ff7-a8a8-d5c95d57e73d	48.0
fc8dc377-8772-4dc0-922d-ad694deec8d7	17.0
cd847c0e-9fbf-4abf-8f0a-ea938026ef41	17.0

HG002.1-7923034-7923187-TATTG.summary.txt gives the quantification of the repeat size. It has the following information: 1) repeat region; 2) number of detected alleles; 3) repeat size of each allele; 4) number of reads of each allele; 5) number of removed reads.

$ cat HG002.1-7923034-7923187-TATTG.summary.txt
Repeat_Region=1-7923034-7923187-TATTG	Method=GMM	Num_Alleles=2	Num_Removed_Reads=0	Allele1_Num_Reads=33	Allele1_Repeat_Size=17	Allele2_Num_Reads=19	Allele2_Repeat_Size=48

Sometimes two STRs are next to each other. For example, in exon-1 of the human HTT gene, there are two adjacent STRs: CAG and CCG. The sequence structure is: (CAG)m-CAA-CAG-CCG-CCA-(CCG)n. NanoRepeat can jointly quantify the two STRs and provide phased results. In our experience, looking at both repeats help generate better quantification results.

We will demonstrate the joint quantification using the same example dataset (described in the above section). If you have not downloaded the dataset, you can execute following commands.

wget https://github.com/WGLab/NanoRepeat/releases/download/v1.3/NanoRepeat_v1.3_example_data.tar.bz2
tar xjf NanoRepeat_v1.3_example_data.tar.bz2

After unzipping the file, you will see a NanoRepeat_v1.2_example_data folder and there are two subfolders: HG002 and HTT_amplicon. In this section, we will use the data under the HTT_amplicon folder.

The input fastq file is here: ./NanoRepeat_v1.2_example_data/HTT_amplicon/HTT_amplicon.fastq.gz.

The reference fasta file is here: ./NanoRepeat_v1.2_example_data/HTT_amplicon/GRCh38_chr4.0_4Mb.fasta.

You can use the following command to run NanoRepeat-joint:

nanoRepeat-joint.py  \
    -i ./NanoRepeat_v1.3_example_data/HTT_amplicon/HTT_amplicon.fastq.gz \
    -r ./NanoRepeat_v1.3_example_data/HTT_amplicon/GRCh38_chr4.0_4Mb.fasta \
    -1 chr4:3074876:3074933:CAG:200      \
    -2 chr4:3074946:3074966:CCG:20       \
    -o ./joint_quantification_output/HTT \
    -c 4

-1 and -2 specify the two repeat regions. The format of -1 and -2 is chrom:start_position:end_position:repeat_unit:max_size. The start and end positions are 0-based (the first base on the chromosome is numbered 0). The start position is self-inclusive but the end position is non-inclusive, which is the same as the BED format. For example, a region of the first 100 bases of chr1 is denoted as chr1:0:100. max_size is the max repeat length that we consider. Please set max_size to be a reasonal number. If max_size is too large (e.g. well beyond the max possible number), the speed of joint quantification might be slow.

If you run NanoRepeat sucessfully, you will see the following files in the ./joint_quantification_output folder.

HTT.allele1.fastq
HTT.allele2.fastq
HTT.chr4-3074876-3074933-CAG.hist.png
HTT.chr4-3074946-3074966-CCG.hist.png
HTT.hist2d.png
HTT.phased_reads.txt
HTT.repeat_size.txt
HTT.scatter.png
HTT.summary.txt

HTT.allele1.fastq and HTT.allele2.fastq are the reads assigned to each allele.

HTT.chr4-3074876-3074933-CAG.hist.png is a histogram showing the repeat size distribution of the first repeat (chr4-3074876-3074933-CAG).

HTT.chr4-3074946-3074966-CCG.hist.png is a histogram showing the repeat size distribution of the second repeat (chr4-3074946-3074966-CCG).

HTT.hist2d.png is a two-dimensional histogram showing the joint distribution of the two repeats.

HTT.scatter.png is a scatter plot showing the joint distribution of the two repeats. The dotted lines indicates the 95% equi-probability surface of the Gaussian mixture models.

HTT.phased_reads.txt shows the phasing results. The first line is the path to the input FASTQ file. Lines 2-9 of the HTT.phased_reads.txt file are shown below (as a table).

#Read_Name Allele_ID Phasing_Confidence chr4-3074876-3074933-CAG.Repeat_Size chr4-3074946-3074966-CCG.Repeat_Size
ONT_read330 1 HIGH 13.5 8
ONT_read1284 1 HIGH 17 11.5
ONT_read579 1 HIGH 16 10
ONT_read838 1 HIGH 15.5 10
ONT_read520 1 LOW 25 13
ONT_read1066 1 HIGH 17.5 10
ONT_read1059 1 HIGH 16 10.5
ONT_read526 1 HIGH 17 10

The *summary.txt file gives the quantification of the repeat sizes. It has the following information:

  1. input file
  2. number of alleles
  3. number of reads for each allele
  4. quantification of repeat sizes of each allele

The content of HTT.summary.txt is shown below:

Input_FASTQ path/to/HTT_amplicon.fastq.gz
Method 2D-GMM
Num_Alleles 2
Num_Removed_Reads 0
Allele1_Num_Reads 733
Allele1_chr4-3074876-3074933-CAG.Repeat_Size 17
Allele1_chr4-3074946-3074966-CCG.Repeat_Size 10
Allele2_Num_Reads 856
Allele2_chr4-3074876-3074933-CAG.Repeat_Size 55
Allele2_chr4-3074946-3074966-CCG.Repeat_Size 7

Citation

If you use NanoRepeat, please cite:

Fang L, Monteys AM, Dürr A, Keiser M, Cheng C, Harapanahalli A, et al. Haplotyping SNPs for allele-specific gene editing of the expanded huntingtin allele using long-read sequencing. Human Genetics and Genomics Advances. 2023;4(1):100146. DOI: https://doi.org/10.1016/j.xhgg.2022.100146.

BibTeX format:

@article{FANG2023100146,
	title = {Haplotyping SNPs for allele-specific gene editing of the expanded huntingtin allele using long-read sequencing},
	journal = {Human Genetics and Genomics Advances},
	volume = {4},
	number = {1},
	pages = {100146},
	year = {2023},
	issn = {2666-2477},
	doi = {https://doi.org/10.1016/j.xhgg.2022.100146},
	url = {https://www.sciencedirect.com/science/article/pii/S266624772200063X},
	author = {Li Fang and Alex Mas Monteys and Alexandra Dürr and Megan Keiser and Congsheng Cheng and Akhil Harapanahalli and Pedro Gonzalez-Alegre and Beverly L. Davidson and Kai Wang},
	keywords = {Huntington’s disease, long-read sequencing, CRISPR, SNP, repeat detection}
}

Limitation

NanoRepeat can accuratly quantify simple repeats but cannot handle mixed repeats of different motifs (i.e. a mixture of GCCA and AAATT), but imperfect repeats of approximately the same motif are OK.

Contact Us

If you need any help from us, you are welcome to raise an issue at the issue page. You can also contact Dr. Li Fang ([email protected]) or Dr. Kai Wang ([email protected]).

nanorepeat's People

Contributors

fangli80 avatar kaichop avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Forkers

fangli80

nanorepeat's Issues

Issue about the cgroup out-of-memory handler and taking a long time

Hi there,

Thanks for this nice and great tool. I used it on my several sets of data. Most of them worked smoothly, but three sets of data were stopped by some reasons.

I executed the following command for these three sets:

python $script -i ${myseq}.fasta -t fasta -r $genome -b $predefined -c 12 --samtools $samtools_path --minimap2 $minimap2_path -o ${myseq}

Among two, I got "slurmstepd: error: Detected 1 oom-kill event(s) in StepId=50190480.batch. Some of your processes may have been killed by the cgroup out-of-memory handler." I have tried to request the maximum memory as I can, but it did not work out.

Here is the requested resource for each job

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=3
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=15GB
#SBATCH --time=24:00:00 

The other one took really long (24 hour) compared to other set of data (usually about 2 hours).

I would appreciate it if you could advise.

Many Thanks,
Hsin

Scalability of NanoRepeat

Hello, I'm trying to measure the scalability of NanoRepeat by using varying number of cores, but is constantly obtaining roughly the same runtime, regardless of how many cores I use.

Here is some information about my input data:

  • ONT (R9.4.1)
  • mean coverage 50x
  • 200 STR region

The command I used is: nanoRepeat.py -i _my_bam_-t bam -r _my_reference_ -b _my_STR_bed_ -c _num_cores_ -o _output_dir_ --ploidy 1

Could it be that my data is too small to observe the scalability of NanoRepeat?
Thanks for reading!

Best regards,
18parkky

Question about repeat location and issue about OSError

Hi there,

I am very interested in this tool.

I got the REPEAT_region.bed from the database of Tandem Repeat Finder and executed the following command.
python $script -i ${prefix}.fasta -t fasta -r $genome -b $predefined -c 4 --samtools $samtools_path --minimap2 $minimap2_path -o ${output_folder}/${prefix}

And then I got the following message (an excerpt here):

[03/17/2023 21:43:26] NOTICE: Input file is: ~/stimulated_test/C9ORF72_c1_simulated_reads.fasta
[03/17/2023 21:43:26] NOTICE: Input type is: fasta
[03/17/2023 21:43:26] NOTICE: Referece fasta file is: ~/Reference/Human/Genome/hg38/genome.fa
[03/17/2023 21:43:26] NOTICE: Output prefix is: ~/stimulated_test/C9ORF72_c1_simulated_reads_NanoRepeat/C9ORF72_c1_simulated_reads
[03/17/2023 21:43:26] NOTICE: Repeat region bed file is: ~/genome_TRF/chr9_hg38_NanoRepeat.bed
[03/17/2023 22:15:40] NOTICE: Reading repeat region file: ~/genome_TRF/chr9_hg38_NanoRepeat.bed
[03/17/2023 22:15:40] NOTICE: Reading reference fasta file: ~/Reference/Human/Genome/hg38/genome.fa
[03/17/2023 22:16:43] NOTICE: Quantifying repeat: 9-10000-10472-TAACCC
[main_samview] region "9:9990-10482" specifies an invalid region or unknown reference. Continue anyway.
[03/17/2023 22:16:43] WARNING! No reads were found in repeat region: 9-10000-10472-TAACCC
[03/17/2023 22:16:43] NOTICE: Quantifying repeat: 9-10976-11052-CCGGCGCAGGCGCAGAGAGGCGCGCCGCG
[main_samview] region "9:10966-11062" specifies an invalid region or unknown reference. Continue anyway.
[03/17/2023 22:16:43] WARNING! No reads were found in repeat region: 9-10976-11052-CCGGCGCAGGCGCAGAGAGGCGCGCCGCG
[03/17/2023 22:16:43] NOTICE: Quantifying repeat: 9-11165-11194-G
[main_samview] region "9:11155-11204" specifies an invalid region or unknown reference. Continue anyway.
[03/17/2023 22:16:43] WARNING! No reads were found in repeat region: 9-11165-11194-G
[03/17/2023 22:16:43] NOTICE: Quantifying repeat: 9-11338-11561-CGCCCCCTGCTGGCAGCTAGGGACACTGCAGGGCCCTCTTGCTCAAGGTATAGTGGTGGCA
[main_samview] region "9:11328-11571" specifies an invalid region or unknown reference. Continue anyway.

Am I wrong to use the bed file from Tandem Repeat Finder?

Here is my REPEAT_region.bed:
image

Besides, I got the error message as follows:

[03/17/2023 22:16:48] WARNING! No reads were found in repeat region: 9-106323-106481-TGTTCTTAATATTCAGAGAGGGAGACAATGATAATAATGTCAATATAACAGGGGACACACCAGCCCCGTGACGT
[03/17/2023 22:16:48] NOTICE: Quantifying repeat: 9-106460-108380-CCCTGTGATATTGTTTCTAATATCCAGAGGGGAGAAGATGATATTACTCCCAATATCAGAAGGGGTGTACACCCCTCCTGTGATATTGTTCCAATATCCAGGGGGGGAGAGGATGATATTACTCCCAATATCGCAGGAGGTGTACACTCCCCCTGTGATATTGTTCCTAATATCCAGGGGGGAAGAGGATGATATTACTCCCAATATCGCTGGGGGTGTACACCCCCCCTGTGATATTGTTCCTAATATCCACGGGGGAGAGAGAATGATATTACTCCCAATATCGCAGGGGGTGTACACA
Traceback (most recent call last):
  File ~/bin/NanoRepeat/nanoRepeat.py, line 185, in <module>
    main()
  File ~/bin/NanoRepeat/nanoRepeat.py, line 174, in main
    preprocess_fastq(input_args)
  File ~/bin/NanoRepeat/nanoRepeat.py, line 91, in preprocess_fastq
    nanoRepeat_bam.nanoRepeat_bam(input_args, in_bam_file)
  File ~/bin/NanoRepeat/nanoRepeat_bam.py, line 578, in nanoRepeat_bam
    quantify1repeat_from_bam(input_args, in_bam_file, ref_fasta_dict, repeat_region)
  File ~/bin/NanoRepeat/nanoRepeat_bam.py, line 523, in quantify1repeat_from_bam
    os.makedirs(temp_out_dir, exist_ok=True)
  File /sw/spack/bio/pkgs/gcc-10.3.0/python/3.9.7-rv5ybzg3/lib/python3.9/os.py, line 225, in makedirs
    mkdir(name, mode)
OSError: [Errno 36] File name too long: ~/stimulated_test/C9ORF72_c1_simulated_reads_NanoRepeat/C9ORF72_c1_simulated_reads.NanoRepeat_temp_dir.9-106460-108380-CCCTGTGATATTGTTTCTAATATCCAGAGGGGAGAAGATGATATTACTCCCAATATCAGAAGGGGTGTACACCCCTCCTGTGATATTGTTCCTAATATCCAGGGGGGGAGAGGATGATATTACTCCCAATATCGCAGGAGGTGTACACTCCCCCTGTGATATTGTTCCTAATATCCAGGGGGGAAGAGGATGATATTACTCCCAATATCGCTGGGGGTGTACACCCCCCCTGTGATATTGTTCCTAATATCCACGGGGGAGAGAGAATGATATTACTCCCAATATCGCAGGGGGTGTACACA

I guess it is because NanoRepeat is designed for STR. Any comments on it?

Thanks,
Hsin

NanoRepeat uses more cores than specified with -c

Hi, thanks for developing NanoRepeat!

I'm trying to measure the runtime of NanoRepeat when running with varying number of cores.

However, through Linux's top command, I noticed that NanoRepeat sometimes uses more cores than specified with the -c command.
For example, even though I set -c 16, NanoRepeat occasionally uses up to 26 cores. I'm assuming NanoRepeat does this during the alignment step with minimap2, where it uses as many processors as possible, and then shifts back to using less cores in other steps.

Do you know why NanoRepeat does this and any way to fix this?

Thanks,
18parkky

cite

if we use it, how to cite?

Support for haplotagged.bam?

I am currently working on comparing different STR software to see how well they make the calls using ONT's latest Q20+ data for HG002.
https://labs.epi2me.io/giab-2023.05/

I am using HG002 T2T assembly v0.7 as well as using IGV to establish a truth set.
https://github.com/marbl/HG002

I noticed that while NanoRepeat works quite well in most cases, it often just calls one STR allele when the two STR counts of a heterozygous call only differ by one or two repeats. For example, for the GIPC2 CCG repeat

$ more pass.cram.chr19-14496041-14496074-CCG.summary.txt
Summary_file=pass.cram.chr19-14496041-14496074-CCG.summary.txt Repeat_Region=chr19-14496041-14496074-CCG Method=GMM Num_Alleles=1 Num_Removed_Reads=0 Allele1_Num_Reads=22 Allele1_Repeat_Size=12
$ more pass.cram.chr19-14496041-14496074-CCG.repeat_size.txt
##Repeat_Region=chr19-14496041-14496074-CCG
#Read_Name Repeat_Size
5a621e4a-464b-4397-bb52-dd6868344ed8 12.0
32c058b8-463a-4963-98de-ed151215c897 12.0
5e640f40-af78-4c37-b75f-2a489e6d287d 10.0
07ba1eca-eed6-41b7-9657-041cf9478b30 10.0
09feeb1a-bacc-43a5-9246-51adbec0d4ad 12.0
f7e95eed-e085-480e-badf-2ff3dafb75c6 12.0
ce13d292-e06c-4216-88e2-01900709ffb4 12.0
f2c19bf6-8875-45c6-bfef-c1cc49c74e82 12.0
7993c22d-00f7-4da4-a39e-dd7836d214f4 10.0
36b5fa47-1ae5-4c95-9d7b-2f4209d2ce57 12.0
1e96d19d-0281-4225-9395-bf9a001d1f18 10.0
d2a355a4-e30e-4f83-96b6-731467528d6c 10.0
be856e49-b1dd-4b2a-bc05-d129a5dfbff9 10.0
622e9d4c-3ab6-4560-9396-65cb9bd73696 10.0
1daea057-60bb-449a-8027-26a8e536909b 10.0
8505aa3e-d5b3-4544-b98c-e492122ccd27 12.0
1c702495-d4f0-4f41-9622-fb9555950713 12.0
1655f006-34f8-43ed-8800-eef18aa9a22e 12.0
4dc8881f-bf77-4c2e-94c7-fcb6955872c5 12.0
b8c0f0d9-377e-4f31-9440-46a2926d4ed4 12.0
c0b6597d-e401-4c5d-8eec-a7e49a43dc39 11.0
8646a71c-6f96-49c3-8250-e20186acfbdf 10.0

While it can be seen that the call should be 10/12 but I suppose NanoRepeat plays it safe and call 12/12 most of the time.

I think it is possible to resolve quite many of these calls if the reads are haplotagged.

haplotagged bam can be created by running "whatshap haplotag". The idea is to use heterozygous sites to assign a read to one of the two possible haplotypes in a chromosome. whatshap add a HP tag to an aligned read to be either 1 or 2. In the GIPC2 example above, you can see clearly 10 and 12 are assigned to different haplotypes.

Therefore, I think it would be great if NanoRepeat can also support haplotagged bam such that it can make better calls in these situations.

Issue with repeat size

Hi @fangli80 ,
thank you for this tool, i think it's great and very useful.
I used it for quantifying CAG repeats in a sample. It outputs 4 different alleles (basically 20,30,50 and 100 repeats) but, if I look at sequences present in the allele3.fastq file (50 repeats) it seems to over-estimate the number of repeats.
I show you one of the sequences of the allele3.fastq file and the corresponding repeat size estimate (from the repeat_size.txt file).

@7818cc7f-dc3d-476b-beeb-38d6d52a40e0
CTTCAATTTCCCGAGTAAGCAGGCAGAGATCGCGGACGCTACCCCAGAGCAGGGCGTCATGCGCGAGAGAAAGCTTTGCACTTTGCGAACCAACGATAGGTGGGGGTGCGTGGAGGATGGAACACGGACGGCCCGGCTTGCTGCCTTTCCCAGGCCTGCAGGTTTGCCCATCCACGTCAGGGCCTCAGCCTGGCCGAAAAGAAATGGTCTGTGATCCCCCAGCAGCAGCAGCAGCAGCAGCGGCGGCGGCGGCGTTTCCCCGGCTACAAGGACCCTTCGAGCCCGTTCGCCGGCCGCGGACCCGGCCCCTCCCTCCCCCGGCCGCTGGGGGCGGGCCCGGATCACAGGACTGGAGCTGGGCGGAGACCCACGCTCGGAGCGGTTGTGAACTGGCAGGCGGTGGGCGCGGCTTCTGTGCCGTGCCCGGGCACTCAGTCTTCCAACGGGGCCCCGGAGTCGAAGACAGTTCTAGGGTTCAGGGAGCGCGGGCGGCTGCCTGGGCGGCGCCAGACTGCGGTGAGTTGGCCGGCGTGGGCCACCAACCCAATGCAGCCCAGGGCGGCGG

7818cc7f-dc3d-476b-beeb-38d6d52a40e0 50.5

As you can see, in the sequence there are about 20 repeats, while the number of estimated repeats is above 50.
Any ideas of why such a behaviour? Is there any sort of error rate parameter which count also motifs similar to CAG?
Thank you very much in advance!

Is it possible to output an vcf with clinical annotation?

For example, straglr can add annotation to an vcf output with pathogenic_min and then call whether it is full_mutation, pre_mutation or normal, e.g.

chr21 43776443 . G . PASS SVTYPE=STR;END=43776479;REF=5;RL=36;RU=GGGGCGC;REPID=CSTB;VARID=CSTB;STR_STATUS=pre_mutation;STR_NORMAL_MAX=3;STR_PATHOLOGIC_MIN=30;RankScore=1:20;HGNCId=2482;InheritanceMode=AR;DisplayRU=CCCCGCCCCGCG;SourceDisplay=GeneReviews_Internet_2019-11-07;Source=GeneReviews;SourceId=NBK535148;Disease=EPM1 GT:SO:CN:CI:AD_SP:AD_FL:AD_IR 1/1:SPANNING/SPANNING:4/4:4-4/4-4:12.5/12.5:0/0:0/0
chr14 23321472 . G , . PASS SVTYPE=STR;END=23321490;REF=6;RL=18;RU=GCG;REPID=PABPN1;VARID=PABPN1;STR_STATUS=full_mutation,normal;STR_NORMAL_MAX=10;STR_PATHOLOGIC_MIN=11;RankScore=1:30;HGNCId=8565;InheritanceMode=AD;DisplayRU=GCN;SourceDisplay=GeneReviews_Internet_2020-10-22;Source=GeneReviews;SourceId=NBK1126;Disease=OPMD GT:SO:CN:CI:AD_SP:AD_FL:AD_IR 1/2:SPANNING/SPANNING:16/6:16-16/5-6:1/31:0/0:0/0

Support for data from the R10 flowcells

Hi,

As the R9 flowcells are soon be discontinued by ONT, I was wondering would there be any support for the R10 flowcells in the future for NanoRepeat?

Many thanks!

Question about repeat size

Hi there,

I am currently analyzing some samples and came across this result from one of my analyses:

##RepeatRegion=chr11-639647-640306-CCCCGCGCCCGGCCTTCCCCGGGGTCCCTGCGGCCCCGACTGTGCGCC
#Read_Name	Allele_ID	Phasing_Confidence	Repeat_Size
DRD4-c2_620249_30430_R_14_29029_0	1	HIGH	0.0
DRD4-c2_634223_28127_F_7_27160_7	2	HIGH	9.0

I would like to confirm whether the "Repeat_Size" column refers to an estimated size or relative size compared to a reference. Additionally, I am curious as to why this algorithm reported a read without any repeat size. Based on these results, the variance appears to be somewhat large.

Any comments or suggestions would be greatly appreciated.

Best,
Hsin

Are there any plan to add support for repeat unit with IUPAC nucleotides?

Thanks for developing NanoRepeat. From my own testing, it seems to be doing much better than straglr.

However, I find that NanoRepeat doesn't support IUPAC nucleotides in repeat unit but straglr does. It would be great if IUPAC support is added.

If that's too much to ask, is it possible to support poly-alanine repeat which has a repeat unit of GCN?

It occurs in PHOX2B gene and is a well documented trinucleotide repeat disorder.
https://academic.oup.com/hmg/article/13/suppl_2/R235/619705

Thanks a lot in advance.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.