Giter Site home page Giter Site logo

jalwillcox / get-str-extensions Goto Github PK

View Code? Open in Web Editor NEW
0.0 1.0 0.0 140 KB

This script takes advantage of read pairing in bam/cram files to identify the presence of STR extensions.

Shell 100.00%
str simple-tandem-repeats fxn frataxin str-extension ngs-analysis wgs-analysis paired-end-reads paired-end-sequencing paired-end

get-str-extensions's Introduction

Find STR Extensions

Sequencing aligners frequently misalign reads containing simple tandem repeats (STRs) that extend significantly beyond the reference genome. This can make it difficult to identify the presence of long STR extensions.

This script takes advantage of read pairing in bam/cram files to identify the presence of STR extensions.

A few important notes:

  1. Paired-End Data This script is intended for paired-end data.
  2. Read Length Limitation This script can identify the presence of STRs that extend beyond the length of a single read, but cannot identify the exact STR length if it is greater than the read length.
  3. Defects and Changes in STR Sequence The repeats.txt output only identifies the lengths for sequences of pure STR repeat. If there are defects in the STR (e.g. "CTTCTTCTTTCTT") or if the repeat changes (e.g. "CTTCTTCCTCCT"), the STR is interrupted and the lengths may not be useful. However, the terminalSeq.txt file gives displays the 3' and 5' termination of the STR for individual reads, which should give some insight into the presence/nature of any deviations from pure repetition.
  4. Unique Flanking Sequence This script relies on the accuracy of reads mapped to the sequence adjacent to the STR. Misaligned reads adjacent to the STR can lead to inaccuracy.

Background

This script was developed to identify extensions in an STR near the Frataxin (FXN) gene at chr9:69037185-69037404 in the human genome (hg38). It should, however, be useable for other STRs as well.

Overview

Broadly, this script works in two steps:

  1. Identify STR Reads Samtools and Bedtools are used to extract all reads that map to the STR region along with their mates from bam/cram input
  2. Identify STR Lengths The distribution of STR lengths is determined

Required Programs

This script uses the following programs; alternate versions may also work:

Flags

Flag Argument Default Description
-b BAM/CRAM file (required) - BAM/CRAM input file
-g Fasta File (required) - The genome file that matches the alignment
-o string (required) - the basename for the output file
-r STR region (required) - The region containing the STR, including unique flanking sequence, e.g. chr9:69037185-69037404 for the Frataxin STR
-s string (required) - The repeat sequence in the STR, e.g. "CTT" for the Frataxin STR
-l read length (optional) 150 the number of bases in a single read
-t integer (optional) 1 the number of threads to use
-h no argument (optional) - print usage

Example Usage

Running get-STR-extensions.sh on the Frataxin STR:

  ./get-STR-extensions.sh -b sample1.bam -g ./Homo_sapiens_assembly38.fasta -s CTT -r chr9:69037185-69037404 -o sample1-CTT

Output

File Description
*fq A fastq file with all identified reads for the STR region
*_repeats.txt A table of repeat lengths and read counts by length
*_terminalSeq.txt a list of STRs by read with the flanking sequences

get-str-extensions's People

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.