Giter Site home page Giter Site logo

gbsprep's Introduction

GBSprep: a pipeline for pre-processing Genotyping-by-Sequencing (GBS) Illumina reads

Tool for preprocessing Illumina reads generated from GBS library protocol

==================================================================================

This program could be used for preprocessing Illumina 1.8+ reads generated by Genotyping-by-Sequencing (GBS) library preparation protocol.

Step-by-Step pipeline:

The GBSpep program preprocess GBS reads in 4 consecutive steps:

  1. Demultiplexing and barcodes removal: The demultiplexing step identify exact barcode match in the sequence and remove the barcode;

  2. Clip chimeras and adapters: This second steps clip reads presenting the Restriction Enzyme (RE) site(s) (i.e. possible chimeras, partial digestion or sequencing errors) or the common adapter initial sequence;

  3. Quality trimming: This step trim the reads based on a 5bp sliding window analysis, averaging the quality over the sliding window. When the average quality drop below a specified cut-off the reads are trimmed;

  4. RE remnant site: This step check for the presence of the RE remnant site(s). Only the reads that start with the RE remnant site(s) will be kept;

Only the reads that pass all the 4 steps will be kept for further analysis.

This program is able to preprocess multiple files at once. Just be sure that samples have the same barcode in all the sequencing runs. In case of replicated samples (i.e. the same genotype with different barcodes) they will be merged in a single file if they have the same name in the barcode file.

==================================================================================

Download this repo

For downloading this folder to your computer first you need to install git.

Let's assume you want to clone this repo into a directory named proj, you will need to open the terminal and type:

mkdir proj
cd proj
git clone https://github.com/aariani/GBSprep

==================================================================================

Run the source code

If you are using Windows or MAC (and the binary code is not working) please refer to the GBSprep.py script for running this program.

On Windows you will probably need to install Cygwin (not tested!).

The GBSprep.py script contains all the codes in the source folder in a single file.

Prepare the environment

  1. Create a directory for you project

     mkdir myGBSproject
    
  2. Copy the GBSprep script to you project directory. If you have saved the script in the proj/GBSprep folder type:

     cp proj/GBSprep/GBSprep.py myGBSproject
    
  3. Create a raw reads folder and copy your FASTQ files there (Important: this script works only with gz formatted fastq files)

     mkdir myGBSproject/fastq
     cp myfastqfolder/myfastqfiles.fastq.gz myGBSproject/fastq
    

Run the script inside the myGBSproject folder (Example with CviAII enzyme)

cd myGBSproject
python GBSprep.py -i fastq -o demultiplexed -bc myBarcodefile.txt -s CATG -SR ATG

The script will output the demultiplexed cleaned reads in fq format ready for downstream analysis.

In addition the script will output a Demultiplexing_stats.txt file with informations about the total number of reads (after filtering) for each genotype.

For all the option of the code from the command line type:

python GBSprep.py -h
usage: GBSprep [-h] [-i READS] [-o CLEAN_READS] [-bc BC_FILE] [-s RESITES]
           [-SR REREM] [-l MINLEN] [-q MINQ] [-gz] [-ad CONTAMINANT]
           [--remove-remnant-site]

Program for preprocessing GBS reads

optional arguments:
  -h, --help            show this help message and exit
  -i READS, --input-folder READS
                        The folder with your raw reads in gz compressed format
                        (REQUIRED)
  -o CLEAN_READS, --output-folder CLEAN_READS
                        The output folder for the final cleaned and
                        demultiplexed reads (REQUIRED)
  -bc BC_FILE, --barcode-file BC_FILE
                        The barcode file for your raw reads (REQUIRED)
  -s RESITES, --restriction_enzyme_site RESITES
                        The Restriction Enzyme (RE) recognition site
                        (REQUIRED). If your RE has ambiguous nucleotide you
                        should write all the possible sites separated by a
                        comma (ex. For ApeKI you should use -s GCAGC,GCTGC)
  -SR REREM, --site-remnant REREM
                        The RE remnant site after the digestion (REQUIRED).
                        Only reads with the remnant site after the barcode
                        will be kept. For RE with ambiguous nucleotide you
                        should write all the possible remnant sites separated
                        by a comma (as in the -s parameter)
  -l MINLEN, --min-length MINLEN
                        Minimum length of reads after quality trimming and
                        adapter/chimeras clipping (default: 30bp)
  -q MINQ, --min-qual MINQ
                        Mean minimum quality (in a sliding window of 5bp) for
                        trimming reads, assumed Sanger quality (Illumina 1.8+,
                        default: 20)
  -gz, --binary-output  Do you want to output the fastq file in a compressed
                        format (i.e. gz compressed)? This option save disk
                        space, but writing binary files requires a
                        considerable higher amount of time (Default: False)
  -ad CONTAMINANT, --adapter-contaminants CONTAMINANT
                        The initial sequence of the adapter contaminant
                        (default: AGATCGG)
  --remove-remnant-site
                        Do you want to remove the RE remnant site from the
                        final cleaned reads? (default: False)

See the [barcode section] (#barcode-file-specifications) for the format of the barcode file

==================================================================================

Barcode file specifications

The barcode files if formed by two columns, separated by a space, with the barcode sequence and the name of the genotype

barcode1 genotype1
barcode2 genotype2
barcode3 genotype3

gbsprep's People

Contributors

aariani avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.