Giter Site home page Giter Site logo

mineer's Introduction

minEER - A sensible sequence trimming algorithm

For a given quality annotated read, the minEER algorithm identifies the subsequence which minimizes the expected error rate (EER; the mean q-score) while maximizing the subsequence length according to user defined EER and minimum sequence length thresholds. This procedure mimics the manual exercise of choosing truncation positions by glancing at a quality profile distribution, but ensures consistent results and is fast (can run on 20,000 files in under 15s). The minEER algorithm offers an improvement to current heuristic methods (sliding windows, quality score drop offs, etc.) which can be sensitive to noise and miss the opportunity for a deterministic solution based on reasonable criteria. The algorithm itself can be seen here.

The minEER pipeline (see documentation with mineer -h after installing) operates on an entire set of files from a single project. Assuming that all reads from a given direction (forward or reverse for paired reads) share a "similar" quality profile, the minEER pipeline runs the algorithm on a subsample of reads and to determine global truncation positions (where to start and end each read). All reads are then truncated according to these global positions and reads that fail to meet the user defined EER and minimum sequence lenght thresholds.

Two critical parameters to the minEER pipeline are the minimum acceptable length (MAL, set with flag --mal, default=100) and the maximum acceptable error (MAE, set with flag --mae, default=.01). See the Algorithm section below for an example of how these parameters operate.

Install

Install with pip install mineer and then run mineer -h to view the input documentation and to test that installation worked properly.

Tutorial

After installing mineer, run the following:

# Download some fastq files to `sample_files/`
mineer-test-files
# Run the pipeline with default parameters (minimal acceptable error=.01)
mineer -i sample_files -f _1.fastq -r _2.fastq -o test_out -v sample_figs

Once you run the pipeline, a report of each step will appear as they execute. Files containing truncated reads will appear in the directory specified with -o. Providing the -v flag will produce visualizations like the following of quality profiles of untrimmed reads and the distribution of truncation positions identified by minEER: quality-profiles trunc-dist

If you just want to compute truncation positions without writing out truncated files (which can take a while), then you can run mineer without writing files by not providing an output directory (no - o argument):

mineer -i sample_files -f _1.fastq -r _2.fastq 

To produce visualizations without writing out truncated files, you can provide a directory with -v (again, without -o):

mineer -i sample_files -f _1.fastq -r _2.fastq -v sample_figs

Note that these examples use default parameters, which can be inspected with mineer -h.

Pipeline

Method:

  1. Ingest files and recognize pairs based on file names (using xopen)
  2. Run minEER on subset of reads
  3. Determine global truncation positions
  4. Truncate all reads to global positions and filter out read pairs that don't pass QC (currently requires longest length)
  5. Save truncated sequences
  6. Produce visualizations, if visualization output directory provided

Algorithm

For a given read, minEER works by finding the longest subsequence (with length => MAL) with an expected error rate <= MAE. The figure below illustrates the search space minEER explores to find the optimal subsequence with MAL=100 and MAE=1e-3. The expected error (EER) is calculated for each subsequence (at each start and end position) for all subsequences with length >= MAL (notice that all end positions are >= MAL) and then the longest subsequences with an expected error rate <= MAE is chosen (C shows all subsequence positions with expected error rate <= MAE in yellow). image

Contributing

Run tests with python -m unittest or pytest

Here is a longer list of SRRs to test on:

SRR9660346
SRR9660368
SRR9660375
SRR9660380
SRR9660372
SRR9660321
SRR9660322
SRR9660307
SRR9660387
SRR9660385

mineer's People

Contributors

michaelsilverstein avatar dependabot[bot] avatar

Stargazers

ChenXiao avatar  avatar Luis AF avatar

Watchers

James Cloos avatar  avatar

mineer's Issues

New arg parser

Arg parser should be simplified. Just accept directory

Pass read pair to read

Instead of having a bothpassing method in readpair and read, assign readpair object as property to read

Filter out truncated reads that don't pass qc?

Currently, all read pairs that contain a read that does not pass qc is filtered out, leading to stats like:

Reads that pass quality control after truncation:
Forward reads:      19732/29679
Reverse reads:       8116/29679
Read pairs:          6945/29679

Not only is it sad to lose > 20,000 reads, it is likely unnecessary considering many of these reads could be salvaged after merging. Probably a good idea to add a filterout=False argument to Project that behaves by either:

  1. Truncating all reads to global positions without filtering any
  2. Truncating all reads to global positions and keeping readpairs where at least one read passes

Considerations for pair overlap

Overlap is required for reads to merge. The truncation range for overlap can be determined by forward and reverse primer position. Ability to overlap can then be determined by truncation regions.

Create viz module

Functions for visualization report:

  • phred distributions
  • trunc position distributions

Speed

Takes ~6s to trim ~4000 reads. Def could be ways to speed up:

  • parallel processing?
  • numba?

Change subsample strategy

Currently subsampling is done before checking qc leading to imbalance in number of fwd vs rev seqs used for trunc pos call.

New strat for each direction:

  1. shuffle
  2. run mineer, reject read if fails
  3. repeat until nreads is reached

Loading large files takes a longish time

10 files with 327064 read pairs total took 1min45s to run - most of the time was spent on reading files. Files need to be read in order to be truncated, so ultimately all files need to be read in some way, but not all reads need to be ingested as SeqRecords.

For the same with 77067 reads, SeqIO takes 1.4s while reading each line as text only takes 63.8ms (20x speed up)! Should try to implement a work around where only reads to be run on mineer are ingested as SeqRecords.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.