GBSprep: a pipeline for pre-processing Genotyping-by-Sequencing (GBS) Illumina reads

Tool for preprocessing Illumina reads generated from GBS library protocol

==================================================================================

This program could be used for preprocessing Illumina 1.8+ reads generated by Genotyping-by-Sequencing (GBS) library preparation protocol.

Step-by-Step pipeline:

The GBSpep program preprocess GBS reads in 4 consecutive steps:

Demultiplexing and barcodes removal: The demultiplexing step identify exact barcode match in the sequence and remove the barcode;
Clip chimeras and adapters: This second steps clip reads presenting the Restriction Enzyme (RE) site(s) (i.e. possible chimeras, partial digestion or sequencing errors) or the common adapter initial sequence;
Quality trimming: This step trim the reads based on a 5bp sliding window analysis, averaging the quality over the sliding window. When the average quality drop below a specified cut-off the reads are trimmed;
RE remnant site: This step check for the presence of the RE remnant site(s). Only the reads that start with the RE remnant site(s) will be kept;

Only the reads that pass all the 4 steps will be kept for further analysis.

This program is able to preprocess multiple files at once. Just be sure that samples have the same barcode in all the sequencing runs. In case of replicated samples (i.e. the same genotype with different barcodes) they will be merged in a single file if they have the same name in the barcode file.

==================================================================================

Download this repo

For downloading this folder to your computer first you need to install git.

Let's assume you want to clone this repo into a directory named proj, you will need to open the terminal and type:

mkdir proj
cd proj
git clone https://github.com/aariani/GBSprep

==================================================================================

Run the source code

If you are using Windows or MAC (and the binary code is not working) please refer to the GBSprep.py script for running this program.

On Windows you will probably need to install Cygwin (not tested!).

The GBSprep.py script contains all the codes in the source folder in a single file.

Prepare the environment

Create a directory for you project
```
 mkdir myGBSproject
```
Copy the GBSprep script to you project directory. If you have saved the script in the proj/GBSprep folder type:
```
 cp proj/GBSprep/GBSprep.py myGBSproject
```
Create a raw reads folder and copy your FASTQ files there (Important: this script works only with gz formatted fastq files)
```
 mkdir myGBSproject/fastq
 cp myfastqfolder/myfastqfiles.fastq.gz myGBSproject/fastq
```

Run the script inside the myGBSproject folder (Example with CviAII enzyme)

cd myGBSproject
python GBSprep.py -i fastq -o demultiplexed -bc myBarcodefile.txt -s CATG -SR ATG

The script will output the demultiplexed cleaned reads in fq format ready for downstream analysis.

In addition the script will output a Demultiplexing_stats.txt file with informations about the total number of reads (after filtering) for each genotype.

For all the option of the code from the command line type:

python GBSprep.py -h
usage: GBSprep [-h] [-i READS] [-o CLEAN_READS] [-bc BC_FILE] [-s RESITES]
           [-SR REREM] [-l MINLEN] [-q MINQ] [-gz] [-ad CONTAMINANT]
           [--remove-remnant-site]

Program for preprocessing GBS reads

optional arguments:
  -h, --help            show this help message and exit
  -i READS, --input-folder READS
                        The folder with your raw reads in gz compressed format
                        (REQUIRED)
  -o CLEAN_READS, --output-folder CLEAN_READS
                        The output folder for the final cleaned and
                        demultiplexed reads (REQUIRED)
  -bc BC_FILE, --barcode-file BC_FILE
                        The barcode file for your raw reads (REQUIRED)
  -s RESITES, --restriction_enzyme_site RESITES
                        The Restriction Enzyme (RE) recognition site
                        (REQUIRED). If your RE has ambiguous nucleotide you
                        should write all the possible sites separated by a
                        comma (ex. For ApeKI you should use -s GCAGC,GCTGC)
  -SR REREM, --site-remnant REREM
                        The RE remnant site after the digestion (REQUIRED).
                        Only reads with the remnant site after the barcode
                        will be kept. For RE with ambiguous nucleotide you
                        should write all the possible remnant sites separated
                        by a comma (as in the -s parameter)
  -l MINLEN, --min-length MINLEN
                        Minimum length of reads after quality trimming and
                        adapter/chimeras clipping (default: 30bp)
  -q MINQ, --min-qual MINQ
                        Mean minimum quality (in a sliding window of 5bp) for
                        trimming reads, assumed Sanger quality (Illumina 1.8+,
                        default: 20)
  -gz, --binary-output  Do you want to output the fastq file in a compressed
                        format (i.e. gz compressed)? This option save disk
                        space, but writing binary files requires a
                        considerable higher amount of time (Default: False)
  -ad CONTAMINANT, --adapter-contaminants CONTAMINANT
                        The initial sequence of the adapter contaminant
                        (default: AGATCGG)
  --remove-remnant-site
                        Do you want to remove the RE remnant site from the
                        final cleaned reads? (default: False)

See the [barcode section] (#barcode-file-specifications) for the format of the barcode file

==================================================================================

Barcode file specifications

The barcode files if formed by two columns, separated by a space, with the barcode sequence and the name of the genotype

barcode1 genotype1
barcode2 genotype2
barcode3 genotype3

sybalemus1 / gbsprep Goto Github PK

gbsprep's Introduction

GBSprep: a pipeline for pre-processing Genotyping-by-Sequencing (GBS) Illumina reads

Tool for preprocessing Illumina reads generated from GBS library protocol

Step-by-Step pipeline:

Download this repo

Run the source code

Prepare the environment

Barcode file specifications

gbsprep's People

Contributors

Watchers

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent