Giter Site home page Giter Site logo

bam_parser's Introduction

BAM Parser Bash Project

This tools is part of the Bioinformatics course project (ZC-BMS320)

Project Description >> BAM parsing for assembly QC

Parse BAM files to find the reads with multiple mismatches, supplementary alignment, and/or discordant reads. Try to find out if these reads aggregate in specific loci along the genome which might mark regions of bad assembly.

The Pipeline goes as follow:

1. Downloading and Indexing the Reference Human Genome

You should first create a new directory for this pipeline. Preferably, name it BAM_Parser.

work_dir="$(pwd)"

mkdir $work_dir/genome-data && cd $work_dir/genome-data

# download ref.
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.28_GRCh38.p13/GCA_000001405.28_GRCh38.p13_genomic.fna.gz

# unzipping
gunzip GCA_000001405.28_GRCh38.p13_genomic.fna.gz
ref_fna="$work_dir/genome-data/GCA_000001405.28_GRCh38.p13_genomic.fna"

# hisat indexing
mkdir idx
hisat2-build $ref_fna idx/grch38
index="$work_dir/genome-data/idx/grch38"

cd $work_dir

2. Downloading the sample of Interest from NCBI-SRA

fastq-dump -I --split-files $sample

3. Mapping the sample to the indexed reference human genome

mkdir map
hisat2 -q --phred33 \
	-x $index \
	-1 ${sample}_1.fastq \
	-2 ${sample}_2.fastq \
	-S map/$sample.sam \
	--met-file map/map.met \
	> map/map.log

4. Converting SAM to BAM

samtools view -hbo $sample.bam map/$sample.sam 

5. Extracting Records from the BAM file for two Different Purposes

All the above steps were done for to create the BAM files where the real project comes in. The first purpose is to select the reads with intronic sequences, split them over the introns, merge them with all other mapped reads. The second purpose was to use the mapped reads to assess the quality control of the reference assembly (or the sample of interest) based on three different criteria: proper alignments of paired end reads, chimeric (supplementary) alignments, and the accumulation of multiple mismatches at specific loci of the refernce genome.

The following code was done for enviroenment prep. but each subtask later was done by a different individual as mentioned.

mkdir parsed
bam_file="$sample.bam"
output="$(basename -- $bam_file)"

a) Purpose 1: getting the records with intronic sequences (CIGAR) >> Ruwaa

mkdir introns
samtools view -h -F 4 $bam_file | awk '$6 !~ /N/ || $1 ~ /@/' | samtools view -b > introns/without_introns_$output
samtools view -h -F 4 $bam_file | awk '$6 ~ /N/ || $1 ~ /@/' | samtools view -b > introns/with_introns_$output


# split the recrods in the introns file to exons (without introns)
cd introns/
bam_file="with_introns_SRR2973277.bam"

# sorting
samtools view -H parsed/$bam_file > parsed/header.bam
samtools sort -o parsed/sorted_$bam_file parsed/$bam_file

# 1) using SplitNCigarReads
conda activate ngs
gatk SplitNCigarReads -R $fna_file -I parsed/sorted_$bam_file -O parsed/Split_N.bam

# 2) using my Python script
python SplitN_RU.py -b bam_file -o splitted_RU.bam


# counting lines and comparing (benchmarking)
samtools view parsed/sorted_$bam_file | wc -l
samtools view parsed/Split_N.bam | wc -l
samtools view parsed/splitted_RU.bam | wc -l

When using the SplitNCigarReads tools of GATK >> when viewd, some records still contan introns 'N' in the CIGAR string! A new python/bash script is needed for the splitting.

When using my Python script >>

b) Purpose 2: getting the records that made improper alignment (F) >> Amira

samtools view -h -F 2 $bam_file > parsed/improper_$output

c) Purpose 3: getting the records with supplementary alignment (F) >> Negm

samtools view -h -f 2048 $bam_file > parsed/supp_$output

d) Purpose 4: getting the records with multiple mismatch (Flags) >> Mayar

samtools view -h -F 4 $bam_file | #...

there are two options here:

  1. SamFixCigar (tool >> http://lindenb.github.io/jvarkit/)
  2. there is a tag named MD that can be used for that purpose!

I support the second option. See this tutorial >> http://zenfractal.com/2013/06/19/playing-with-matches/

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.