This project forked from linsalrob/fastq-pair

Match up paired end fastq files quickly and efficiently.

Home Page: https://edwards.sdsu.edu/research/sorting-and-paring-fastq-files/

License: MIT License

FASTQ PAIR

Rewrite paired end fastq files to make sure that all reads have a mate and to separate out singletons.

This code does one thing: it takes two fastq files, and generates four fastq files. That's right, for free it doubles the number of fastq files that you have!!

Usually when you get paired end read files you have two files, with a /1 sequence in one and a /2 sequence in the other (or a /f and /r, or just two reads with the same ID). However, when working with files from a third party source (e.g. the SRA), there are often different numbers of reads in each file (because some reads fail QC). SPAdes, Bowtie 2 and other tools break because they demand that paired end files have the same number of reads.

This program solves that problem.

It rewrites the two files provided on the command line as two matching files, with the paired sequences in the same order in each. Any single reads that have no mate are placed in two separate files, one for each original file.

This code is designed to be fast and memory efficient, and works with large fastq files. It does not store the whole file in memory; it stores only the locations of the sequences in the first file provided.
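The idea of indexing one file in memory and then streaming the other can be illustrated with a short awk sketch. This is not how fastq_pair itself works: going by the description above, fastq_pair stores file locations, whereas this sketch keeps the IDs themselves in an awk array. The function name `list_common_ids` is hypothetical, and the sketch assumes four-line records whose mates share an ID once a trailing /1, /2, /f or /r is removed.

```shell
# list_common_ids FILE1 FILE2: print the read IDs that occur in both files.
# Assumptions: every record is exactly four lines, and mates share the same
# ID once a trailing /1, /2, /f or /r is stripped.
list_common_ids() {
    awk '
        FNR % 4 == 1 {                       # header line of each record
            id = $1                          # ID is the first field
            sub(/\/[12fr]$/, "", id)         # drop the mate suffix, if any
            if (NR == FNR) { seen[id] = 1 }  # first file: remember the ID
            else if (id in seen) { print id }  # second file: report a match
        }
    ' "$1" "$2"
}

# Hypothetical usage:
# list_common_ids file1.fastq file2.fastq | wc -l
```

Counting the lines of output gives a quick estimate of how many pairs to expect before running fastq_pair on large files.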

Speed and efficiency considerations

The most efficient way to use this code is to provide the smallest file first (though it works whichever way round you provide the files), and then to tune the -t parameter on the command line. The implementation is based on a hash table, and the size of that table is the setting with the biggest effect on how fast the code runs. If you set the hash table size too low, the buckets quickly fill up and lookups degrade toward a linear scan, i.e. O(n). On the other hand, if you set the table size too big, you waste a lot of memory, and it takes longer to initialize the data structures safely.

The optimal table size is basically somewhere around the number of sequences in your fastq files. You can quickly find out how many sequences there are in your fastq file:

wc -l fastq_filename

The number of sequences will be the number printed here, divided by 4.
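The division by 4 can be wrapped in a small helper. This is a minimal sketch: the function name `count_fastq_records` is hypothetical, and it assumes every record is exactly four lines (no wrapped sequence or quality lines).

```shell
# count_fastq_records FILE: print the number of sequences in FILE.
# Assumption: every record is exactly four lines.
count_fastq_records() {
    lines=$(wc -l < "$1" | tr -d ' ')   # some wc versions pad with spaces
    echo $(( lines / 4 ))
}

# Hypothetical usage: size the hash table from the first file.
# fastq_pair -t "$(count_fastq_records file1.fastq)" file1.fastq file2.fastq
```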

If you are not sure, you can run this code with the -p parameter. Before it prints out the matched pairs of sequences, it will print out the number of sequences in each "bucket" in the table. If the buckets typically hold more than about a dozen sequences, you need to increase the value you provide to -t. If most of the entries are zero, then you should decrease the value of -t.
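As a rough guide before running with -p, the mean number of sequences per bucket is simply the number of sequences divided by the table size. The helper below is a sketch of that arithmetic only. Its name, `avg_bucket_load`, is hypothetical, and actual bucket counts depend on how evenly the hash function spreads the IDs.

```shell
# avg_bucket_load NSEQ TABLE_SIZE: print the mean number of entries per bucket.
avg_bucket_load() {
    awk -v n="$1" -v t="$2" 'BEGIN { printf "%.2f\n", n / t }'
}

avg_bucket_load 250 100     # 250 sequences with -t 100  -> prints 2.50
avg_bucket_load 250 1000    # 250 sequences with -t 1000 -> prints 0.25
```

A mean well above a dozen suggests a larger -t. A mean far below 1 means most buckets will be empty.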

As an aside, this code is also really slow if none of your sequences are paired. You should most likely use this after taking a peek at your files and making sure there are at least some paired sequences in your files!

Installing fastq_pair

To install the code, clone the GitHub repository, then make a build directory inside it (if CMake 3 is installed as cmake rather than cmake3 on your system, use cmake instead):

mkdir build && cd build
cmake3 ..
make && sudo make install

There are more instructions on the installation page.

Running fastq_pair

fastq_pair takes two primary arguments: the names of the two fastq files that you want to pair up.

fastq_pair file1.fastq file2.fastq

You can also change the size of the hash table using the -t parameter:

fastq_pair -t 50021 file1.fastq file2.fastq

You can also print out the number of elements in each bucket using the -p parameter:

fastq_pair -p -t 100 file1.fastq file2.fastq

Testing fastq_pair

In the test directory there are two fastq files that you can use to test fastq_pair. There are 250 sequences in the left file and 75 sequences in the right file. Only 50 sequences are common between the two files.

You can test the code with:

fastq_pair -t 1000 test/left.fastq test/right.fastq

This will make four files in the test/ directory:

  • left.fastq.paired.fq
  • left.fastq.single.fq
  • right.fastq.paired.fq
  • right.fastq.single.fq

The paired files have 50 sequences each, and the two single files have 200 and 25 sequences (left and right respectively).
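One way to confirm that the two paired files really hold the same number of sequences is to compare their record counts. This is a sketch under the assumption of four-line records, and the function name `same_record_count` is hypothetical.

```shell
# same_record_count FILE1 FILE2: print the number of records in each file and
# succeed only if the two counts are equal.
# Assumption: every record is exactly four lines.
same_record_count() {
    n1=$(( $(wc -l < "$1") / 4 ))
    n2=$(( $(wc -l < "$2") / 4 ))
    echo "$1: $n1"
    echo "$2: $n2"
    [ "$n1" -eq "$n2" ]
}

# Hypothetical usage after running the test above:
# same_record_count test/left.fastq.paired.fq test/right.fastq.paired.fq
```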

A note about gzipped fastq files

Unfortunately fastq_pair doesn't work with gzipped files at the moment, because it relies heavily on random access of the file stream. That is complex with gzipped files, especially when the uncompressed file exceeds available memory (which is exactly the situation that fastq_pair was designed to handle).

Therefore, at this time, you need to uncompress gzipped files before using fastq_pair.
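Uncompressing can be done with standard gzip. The helper below is a minimal sketch: the function name `decompress_fastq` and the file names in the usage comment are hypothetical. It writes the uncompressed copy next to the original and leaves the .gz file in place.

```shell
# decompress_fastq FILE.gz: write the uncompressed data to FILE
# (the same path without the .gz suffix), keeping FILE.gz untouched.
decompress_fastq() {
    gzip -dc "$1" > "${1%.gz}"
}

# Hypothetical usage:
# decompress_fastq file1.fastq.gz
# decompress_fastq file2.fastq.gz
# fastq_pair file1.fastq file2.fastq
```

Make sure there is enough disk space for the uncompressed files, which can be several times larger than the gzipped ones.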

If you really need to use gzipped files, and can accept slightly worse performance, then we have some alternative approaches written in Python that you can try.

Citing fastq_pair

Please see the CITATION file for the current citation for fastq-pair.
