fastqident's Introduction

fastqident: Guess the quality encoding system used by FASTQ sequence files.

The FASTQ sequence format stores biological sequences, along with a sequencing quality value for each element in each sequence. Unfortunately, different sources encode the quality values in different ways. The encodings are sparsely documented, and telling them apart is confusing, especially because many files are technically valid in more than one encoding. In practice, however, it is generally possible to make a good guess as to which encoding was used, based on the observed range of ASCII characters in the quality strings. The purpose of this module is to do this guessing for you. It should get the answer right as long as the question isn't too hard.
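To make the idea of guessing from the observed character range concrete, here is a minimal sketch. It is not fastqident's actual algorithm: the function names, the thresholds, and the returned labels are all assumptions chosen for illustration. The thresholds rely on the commonly cited conventions that offset-33 encodings start at '!' (ASCII 33), Solexa scores at offset 64 can go as low as ';' (ASCII 59), and Phred scores at offset 64 start at '@' (ASCII 64).

```python
def observed_range(quality_strings):
    """Return the (lowest, highest) ASCII codes seen in the quality strings."""
    codes = [ord(ch) for qual in quality_strings for ch in qual]
    if not codes:
        raise ValueError("no quality characters to inspect")
    return min(codes), max(codes)


def guess_offset(quality_strings):
    """Guess an encoding family from the lowest ASCII code observed.

    The labels returned here are hypothetical; they are NOT the strings
    that fastqident itself returns.
    """
    lowest, _ = observed_range(quality_strings)
    if lowest < 33:
        raise ValueError("characters below '!' are not valid FASTQ qualities")
    if lowest < 59:
        # Only an offset of 33 can produce characters this low.
        return "offset-33"
    if lowest < 64:
        # Below '@', so not Phred at offset 64. This is a heuristic guess:
        # the same characters are also valid (high) scores at offset 33.
        return "offset-64-solexa"
    # Everything is at or above '@'. Most likely offset 64, although the
    # same characters are technically valid under offset 33 as well.
    return "offset-64"
```

The last branch shows why this is only a guess: `guess_offset(["IIII"])` returns `"offset-64"`, yet `I` is also a perfectly valid (and common) high score in an offset-33 file. A file that happens to contain only such characters cannot be classified with certainty from its character range alone.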

Installation

fastqident is distributed as a standard Python package. You can download the tarball here: https://github.com/DarwinAwardWinner/fastqident/tarball/master

Simply install it with your normal Python package installer, such as pip, easy_install, or setup.py. Something like this:

wget -O fastqident.tar.gz https://github.com/DarwinAwardWinner/fastqident/tarball/master
pip install fastqident.tar.gz

Usage

From the command line

This package includes a script by the same name that takes a list of fastq files on the command line and tries to identify the encoding of each. Usage:

$ fastqident read1.fastq read2.fastq
{'read1.fastq': 'illumina', 'read2.fastq': 'illumina'}

From python code

Import the constructor:

from fastqident import FastqQualityIdentifier

Create an identifier with default values:

id_default = FastqQualityIdentifier()

Create an identifier with custom options:

id_custom = FastqQualityIdentifier(max_quality=50, nnuc=5000, start=50, skip=50)

The identifier class supplies three methods: detect_encoding, detect_encoding_safe, and detect_encodings. Here is some example code using them:

# Identify a single fastq file (returns a string)
filename = "read1.fastq"
file_encoding = id_default.detect_encoding(filename)
print("%s has quality encoding %s" % (filename, file_encoding))

# Identify a list of files (returns a dict)
filenames = ("read1.fastq", "read2.fastq", "read3.fastq")
file_encodings = id_custom.detect_encodings(filenames)
# Iterate over (filename, encoding) pairs; iterating the dict directly yields only keys
for fname, fenc in file_encodings.items():
    print("%s has quality encoding %s" % (fname, fenc))

Note that if you just want the default settings, you can also import those same three methods as module-level functions:

from fastqident import detect_encoding, detect_encoding_safe, detect_encodings

filename = "read1.fastq"
file_encoding = detect_encoding(filename)
print("%s has quality encoding %s" % (filename, file_encoding))

Errata

Illumina's many versions

There are at least three different versions of "Illumina" fastq files that differ only slightly. I believe they are too similar to distinguish reliably based purely on the range of ASCII characters present in the quality strings, so this module makes no attempt to do so. It simply returns 'illumina', without trying to be any more specific than that.

Assumptions

fastqident assumes that the sequences near the start of the file provide a reasonable sample of the range of quality values in the whole file. If this assumption fails, fastqident may misidentify the encoding. You can increase -n, -i, and -s from their default values to take a larger sample across a greater fraction of the fastq file, at the cost of a longer run time.
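The trade-off described above can be illustrated with a small sampler. This is only a sketch under stated assumptions, not fastqident's implementation: the function name and the parameters `max_chars`, `first_record`, and `stride` are hypothetical and are not claimed to match the meaning of fastqident's own options. It also assumes plain four-line FASTQ records with no wrapped sequence or quality lines.

```python
from itertools import islice


def sample_quality_chars(lines, max_chars=2000, first_record=0, stride=1):
    """Return the set of quality characters seen in a sample of records.

    Hypothetical sampling scheme: ignore the first `first_record` records,
    then look at every `stride`-th record until at least `max_chars`
    quality characters have been collected or the input runs out.
    Assumes each record is exactly four lines long.
    """
    if stride < 1 or first_record < 0:
        raise ValueError("stride must be >= 1 and first_record >= 0")
    # The quality string is the fourth line of every four-line record.
    quality_lines = islice(lines, 3, None, 4)
    sampled = islice(quality_lines, first_record, None, stride)
    seen = set()
    count = 0
    for qual in sampled:
        qual = qual.rstrip("\r\n")
        seen.update(qual)
        count += len(qual)
        if count >= max_chars:
            break
    return seen
```

Because `lines` can be any iterable of strings, an open file object works as input, for example `sample_quality_chars(open("read1.fastq"), max_chars=5000, stride=50)`. A larger `max_chars` or `stride` spreads the sample over more of the file, which makes it more likely to include the extreme quality values, but more lines have to be read before an answer is available.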

Examples of misidentified files are welcome.

fastqident's People

Contributors

darwinawardwinner, granek
