gbs-tools
=========

Tools for processing genotyping-by-sequencing data 

checkfastqformat.py - Checks a fastq file for formatting issues. 
findNs.py - Reports on Ns in the first 12 nucleotides of reads in a fastq file. 
exploreabmiguous.py - For 12mers that are equally distant to more than one 
                      barcode/stickyend with a distance of 1 or 2, return the 
                      12mer, the distance, and the list of equally distant 
                      barcodes. 
findbarcode.py - Searches for a specific barcode anywhere within the reads of 
                 a fastq file.
makerandom12mers.py - Generates a dummy fasta file of random 12bp sequences.
                      Useful for checking whether my pipeline is returning 
                      better results than are expected just by chance in a 
                      large dataset.
collapsebarcodes.py - Removes a single nucleotide from the end of a set of 
                      12mer (or whatever-mer) sequences to see how the list 
                      of unique sequences collapses. Didn't end up being useful.
countsbysample.py - Returns a count of how many reads in a fastq file were 
                    assigned to each barcode.
compare12merquality.py - Compares 12mers to the closest matching barcode and 
                         reports the average quality of matching vs mismatched
                         bases.
compare12merquality-zd.py - Same, but for data with 14-16mers instead of 12mers.
categorizemismatch.py - Compares 12mers to the closest matching barcode and
                        categorizes the mismatches by type (A to C, C to G,
                        etc).
justthedata.py - Pulls out just the data line from a fastq file to make a
                 smaller, more manageable file for scripts that don't need the
                 header or quality lines.
doublecheckdups.py - Double check that a list of unique starting sequences is 
                     really unique.
removeemptysequences.py - Returns just records with non-zero length sequences 
                          from a fastq file.
getbarcodecounts.py - Counts how many reads and bases are assigned to each
                      barcode in a fastq file.
splitfastq.py - Splits a fastq file into multiple files based on a list of
                barcodes and which file they belong in.
renameclusters.py - Renames reads in cd-hit fasta output with the name of
                    the cluster and the number of reads included in the 
                    cluster (instead of the name of the seed sequence).
plotclustercounts.py - Generates data for a cluster size histogram from the
                       output of renameclusters.py, including counts of reads
                       represented by each seed in a cluster.
plotclusterreadproportions.py - Generates data for a scatter plot of the 
                                proportion of reads in a cluster that have the 
                                most common read vs either cluster read count
                                or cluster unique read count.
makepseudogenome.py - Generates a pseudogenome (each cluster is a "chromosome") 
                      to assemble reads against.
pairwise_coverage.py - Uses a genotypes table to generate a table showing what
                       proportion of loci are shared between each two samples.

"Pipeline" used to clean up T. intricatum GBS dataset (This process could
 be sped up drastically if some of these scripts were combined, but I left the
 steps in separate scripts for flexibility):

1) Make a list of unique starting sequences (potential barcode-stickyends)
   found in the fastq file. (It says 12mers, but the length is configurable.)
      findunique12mers.py
      (or findunique12mersHTSeq.py)
2) For each unique starting sequence, determine which of our 
   barcode-stickyends is the nearest match and the distance.
      sortbarcodes-wobble.py - for data with "wobble" bases in the cutsite 
                               (R, W, etc.)
      sortbarcodes-indels.py - if you don't have wobble bases and want to 
                               allow for indels when calculating distances
3) Where the nearest match is unambiguous and the distance is less than a
   specified value, replace the starting characters of each read in the fastq
   data with the corrected barcode-stickyend. Sort reads with ambiguous matches
   or larger distances into files to be explored separately.
      correctbarcodes.py
4) Trim the barcode-stickyends from the reads and store the sticky end, sample
   name, and barcode in the fastq description line. Trim the fastq quality 
   line to match.
      trimbarcodes.py
5) Trim any Illumina primer sequence from the 3' end of the reads and trim the
   quality line to match.
      Used "cutadapt"
6) Remove any 0-length sequences from the fastq file (with the right cutadapt
   options this might not be necessary) because they cause problems for some 
   of the fastx tools.
      removeemptysequences.py
7) Get a baseline for sequence quality before quality trimming.
      Used "fastx_quality_stats" from fastx tools
8) Trim trailing low quality bases.
      Used "fastq_quality_trimmer" from fastx tools
9) Check quality after trimming and compare to baseline.
      Used "fastx_quality_stats"
10) Split C. intricatum, T. boschiana, and D. petersii reads into three
    separate files.
      splitfastq.py
11) Convert C. intricatum fastq file to fasta
      Used "fastq_to_fasta" from fastx tools
12) Sort C. intricatum fasta file by decreasing read length
      Used "sort.pl"
13) Move sample name & barcode to the start of the fasta descriptions so 
    they don't get cut off in the cluster output from cd-hit-454.
      movefastabarcode.py
14) Cluster at decreasing stringency starting with 100% identity
      Used "cd-hit-454"
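
The logic of steps 1 through 4 can be sketched roughly as follows. This is a simplified illustration, not the code in the scripts named above. It assumes fixed-length 12 bp barcode-stickyends and uses plain Hamming distance, with no wobble-base or indel handling. The barcode table, function names, and `max_distance` parameter are all hypothetical.

```python
from collections import Counter

# Hypothetical barcode-stickyends mapped to sample names (illustration only).
BARCODES = {
    "ACGTACGTCAGC": "sample1",
    "TTGCAAGGCAGC": "sample2",
}


def unique_prefixes(reads, length=12):
    """Step 1: count each distinct starting sequence of the given length."""
    return Counter(read[:length] for read in reads if len(read) >= length)


def hamming(a, b):
    """Number of mismatched positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def nearest_barcodes(prefix, barcodes):
    """Step 2: return (distance, barcodes tied at that minimum distance)."""
    distances = {bc: hamming(prefix, bc) for bc in barcodes}
    best = min(distances.values())
    return best, sorted(bc for bc, d in distances.items() if d == best)


def correct_and_trim(read, quality, barcodes, max_distance=2, length=12):
    """Steps 3-4: assign the read to a barcode and trim it off.

    Returns (sample, corrected barcode, trimmed read, trimmed quality), or
    None when the match is ambiguous or the distance is not below
    max_distance (such reads would be set aside for separate exploration).
    """
    if len(read) < length:
        return None
    distance, matches = nearest_barcodes(read[:length], barcodes)
    if len(matches) != 1 or distance >= max_distance:
        return None
    barcode = matches[0]
    return barcodes[barcode], barcode, read[length:], quality[length:]


if __name__ == "__main__":
    print(correct_and_trim("ACGTACGTCAGGTTTT", "IIIIIIIIIIIIFFFF", BARCODES))
```

Here a read whose first 12 bases differ from one barcode at a single position is assigned to that barcode and trimmed, while a read that ties between two barcodes or is too distant from all of them returns `None`.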

Contributors: aduffy70
