gbs-tools
=========

Tools for processing genotyping-by-sequencing (GBS) data.

checkfastqformat.py - Checks a fastq file for formatting issues. 
findNs.py - Reports on Ns in the first 12 nucleotides of reads in a fastq file. 
exploreabmiguous.py - For 12mers that are equally distant to more than one 
                      barcode/stickyend with a distance of 1 or 2, return the 
                      12mer, the distance, and the list of equally distant 
                      barcodes. 
findbarcode.py - Searches for a specific barcode anywhere within the reads of 
                 a fastq file.
makerandom12mers.py - Generates a dummy fasta file of random 12bp sequences.
                      Useful for checking whether my pipeline is returning 
                      better results than are expected just by chance in a 
                      large dataset.
collapsebarcodes.py - Removes a single nucleotide from the end of a set of 
                      12mer (or whatever-mer) sequences to see how the list 
                      of unique sequences collapses. Didn't end up being useful.
countsbysample.py - Returns a count of how many reads in a fastq file were 
                    assigned to each barcode.
compare12merquality.py - Compares 12mers to the closest matching barcode and 
                         reports the average quality of matching vs mismatched
                         bases.
compare12merquality-zd.py - Same, but for data with 14-16mers instead of 12mers.
categorizemismatch.py - Compares 12mers to the closest matching barcode and
                        categorizes the mismatches by type (A to C, C to G,
                        etc).
justthedata.py - Pulls out just the sequence line of each record in a fastq
                 file to make a smaller, more manageable file for scripts that
                 don't need the header or quality lines.
doublecheckdups.py - Double check that a list of unique starting sequences is 
                     really unique.
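
Most of the scripts above walk through a fastq file one record at a time and
look at the first 12 nucleotides of each read. The sketch below shows that
pattern in the spirit of findNs.py. It is an illustration, not the code from
this repository: the function names (read_fastq, n_positions,
count_reads_with_ns) are hypothetical, and it assumes plain four-line fastq
records with no wrapped sequence lines.

```python
import io
from collections import Counter


def read_fastq(handle):
    """Yield (header, sequence, quality) from a four-line-per-record fastq handle."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file (or a blank line) ends the iteration
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        quality = handle.readline().rstrip("\n")
        yield header, sequence, quality


def n_positions(sequence, prefix_length=12):
    """Return the 0-based positions of N within the first prefix_length bases."""
    return [i for i, base in enumerate(sequence[:prefix_length]) if base == "N"]


def count_reads_with_ns(handle, prefix_length=12):
    """Tally reads by how many Ns appear in their first prefix_length bases."""
    counts = Counter()
    for _, sequence, _ in read_fastq(handle):
        counts[len(n_positions(sequence, prefix_length))] += 1
    return counts


# Tiny in-memory example standing in for a real fastq file.
example = io.StringIO(
    "@read1\nACGTNACGTACGTTTT\n+\nIIIIIIIIIIIIIIII\n"
    "@read2\nACGTACGTACGTNNNN\n+\nIIIIIIIIIIIIIIII\n"
)
ns_per_read = count_reads_with_ns(example)
```

In this example read1 has one N inside its first 12 bases, while the Ns in
read2 fall after position 12 and are not counted.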

"Pipeline" used to clean up the T. intricatum GBS dataset (this process could
 be sped up drastically if some of these scripts were combined, but I left the
 steps in separate scripts for flexibility):

1) Make a list of unique starting sequences (potential barcode-stickyends)
   found in the fastq file. (It says 12mers, but the length is configurable.)
      findunique12mers.py
2) For each unique starting sequence, determine which of our 
   barcode-stickyends is the nearest match and the distance.
      sortbarcodes-wobble.py - for data with "wobble" bases in the cutsite 
                               (R, W, etc.)
      sortbarcodes-indels.py - if you don't have wobble bases and want to 
                               allow for indels when calculating distances
3) Where the nearest match is unambiguous and the distance is less than a
   specified value, replace the starting characters of each read in the fastq
   data with the corrected barcode-stickyend. Sort reads with ambiguous
   matches or larger distances into files to be explored separately.
      correctbarcodes.py
4) Trim the barcode-stickyends from the reads and store the sticky end, sample
   name, and barcode in the fastq description line. Trim the fastq quality 
   line to match.
      trimbarcodes.py
5) Trim any Illumina primer sequence from the 3' end of the reads and trim the
   quality line to match.
      Used "cutadapt".
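
Steps 1 through 4 can be sketched in a few functions. This is a simplified
illustration under stated assumptions, not the code in the scripts named
above: all function names are hypothetical, every barcode-stickyend is assumed
to have the same length, distance is plain Hamming distance (so neither wobble
bases nor indels are handled), and the description-line bookkeeping from
step 4 is left out.

```python
from collections import Counter


def unique_prefixes(sequences, length=12):
    """Step 1: count each distinct starting sequence of the given length."""
    return Counter(seq[:length] for seq in sequences)


def hamming(a, b):
    """Count mismatched positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def nearest_barcodes(prefix, barcodes):
    """Step 2: return (distance, [all barcodes tied at that distance])."""
    distances = {bc: hamming(prefix, bc) for bc in barcodes}
    best = min(distances.values())
    return best, sorted(bc for bc, d in distances.items() if d == best)


def correct_read(sequence, barcodes, max_distance=2):
    """Step 3: replace the read's prefix when the match is unambiguous and
    the distance is less than max_distance; otherwise flag the read."""
    length = len(barcodes[0])  # assumes equal-length barcode-stickyends
    distance, matches = nearest_barcodes(sequence[:length], barcodes)
    if len(matches) > 1:
        return sequence, "ambiguous"
    if distance >= max_distance:
        return sequence, "too_distant"
    return matches[0] + sequence[length:], "corrected"


def trim_read(sequence, quality, length=12):
    """Step 4: trim the barcode-stickyend and the matching quality characters."""
    return sequence[length:], quality[length:]


barcodes = ["ACGTACGTTGCA", "TTGCATGCTGCA"]  # made-up example barcodes
fixed, status = correct_read("ACGTACGATGCAGGGG", barcodes)
trimmed_seq, trimmed_qual = trim_read(fixed, "IIIIIIIIIIIIFFFF")
```

Here the read's first 12 bases differ from the first barcode at one position
and from the second at five, so the prefix is corrected to the first barcode
and then trimmed. A read whose prefix ties between two barcodes is returned
unchanged with the status "ambiguous", which corresponds to the reads set
aside for separate exploration in step 3.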

Contributors: aduffy70
