HARC

Note: an improved FASTQ compressor, SPRING, is also available.

HARC (HAsh-based Read Compressor) is a tool for compressing genomic reads in FASTQ format. It compresses only the read sequences, achieving near-optimal compression ratios and fast decompression. HARC supports up to 4.29 billion fixed-length reads with read lengths of at most 256, and needs around 50 bytes of RAM per read (for read length 100) during compression. It requires g++ with C++11 support, runs on Linux, and expects p7zip to be installed.
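
For example, a dataset of 100 million reads of length 100 would need roughly 100 × 10^6 × 50 B ≈ 5 GB of RAM during compression (a rough estimate from the per-read figure above, not a measured value).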

Installation

git clone https://github.com/shubhamchandak94/HARC.git
cd HARC
./install.sh
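
p7zip is a prerequisite (see above). On Debian/Ubuntu-like systems it can typically be installed with, for example,

sudo apt-get install p7zip-full

(the package name may differ on other distributions).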

Usage

Compression - compresses FASTQ reads. Output is written to a .harc file:
./harc -c FASTQ_file [-p] [-t num_threads] [-q]

-p = Preserve the order of reads (compression ratio is 2-4x worse if order is preserved)

-t num_threads - default 8

-q = Write quality values and read IDs to .quality and .id files, respectively.
Quality values and read IDs are reordered to match the reordered reads if -p is not specified.

Decompression - decompresses reads. Output is written to a .dna.d file:
./harc -d HARC_file [-p] [-t num_threads] [-m max_memory]

-p = Restore the reads in their original order (slower). Only applicable if -p was used during compression.

-t num_threads - default 8

-m max_memory - Controls the memory-time tradeoff for decompression with -p.
Specify the maximum memory in GB (minimum 3 GB for 8 threads), e.g. -m 10 for a 10 GB limit.
Default: 7 GB (note: less than 3 GB is needed if -p is not specified)

Help (prints this usage information):
./harc -h

Example Usage of HARC

To compress file.fastq in the HARC home directory without preserving order, using the default 8 threads, run

./harc -c file.fastq

The compressed file is located in the same directory and is named file.harc.

To decompress file.harc to file.dna.d containing the reads (not in the original order) using 4 threads,

./harc -d file.harc -t 4

For compressing file.fastq while preserving its order, run

./harc -c file.fastq -p

Now, file.harc also contains the order information, and it can be decompressed with or without the -p flag. If the -p flag is not used, the reads will be decompressed faster but not in the original order. To decompress and restore the original order, run

./harc -d file.harc -p

Decompression with the -p flag also allows the user to set the maximum memory used during the order restoration step with the -m flag, e.g., to use at most 10 GB of memory for that step while decompressing, run

./harc -d file.harc -p -m 10

Note, however, that the BSC decompression step uses 350 MB/thread (both with and without -p), irrespective of the -m switch; this component can be controlled by reducing the number of threads. Also, the maximum number of threads used by HARC during decompression is limited by the number of threads used during compression.
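
For example, decompressing with -t 4 means the BSC step alone uses roughly 4 × 350 MB ≈ 1.4 GB, in addition to the memory governed by -m (a rough estimate based on the per-thread figure above).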

To preserve the quality values and read identifiers, use the -q switch during compression, e.g.,

./harc -c file.fastq -q

This writes the quality values to file.quality and the read identifiers to file.id in the same directory as file.fastq. Since the command was run without the -p flag, the quality values and the IDs are reordered to match the new order of the reads. The quality value compressor QVZ (available at https://github.com/mikelhernaez/qvz) can be used to compress file.quality directly.
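
If the original order matters, the -p and -q flags can be combined; the .quality and .id files then retain the original order (this follows from the flag descriptions above):

./harc -c file.fastq -p -q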

Downloading datasets

Genomic sequencing reads
wget -b ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR554/SRR554369/SRR554369_1.fastq.gz
wget -b ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR554/SRR554369/SRR554369_2.fastq.gz
gunzip SRR554369_1.fastq.gz SRR554369_2.fastq.gz
cat SRR554369_1.fastq SRR554369_2.fastq > SRR554369.fastq
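
As an optional sanity check (assuming a standard FASTQ file with 4 lines per record), the number of reads in the concatenated file is the line count divided by 4:

echo $(( $(wc -l < SRR554369.fastq) / 4 ))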

For some datasets (e.g. SRR327342 and SRR870667), the two fastq files may have reads of different lengths. To use HARC on such datasets, compress the two fastq files separately.
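
For example (a sketch, assuming file names following the naming pattern shown above):

./harc -c SRR327342_1.fastq
./harc -c SRR327342_2.fastq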

Metagenomics data
wget -b http://public.genomics.org.cn/BGI/gutmeta/High_quality_reads/MH0001/081026/MH0001_081026_clean.1.fq.gz
wget -b http://public.genomics.org.cn/BGI/gutmeta/High_quality_reads/MH0001/081026/MH0001_081026_clean.2.fq.gz
gunzip MH0001_081026_clean.1.fq.gz MH0001_081026_clean.2.fq.gz
cat MH0001_081026_clean.1.fq MH0001_081026_clean.2.fq > MH0001_081026_clean.fq
Human genome (hg19 - for generating simulated reads)
wget -b ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz

Inside this file, the different chromosomes are demarcated by their FASTA headers.
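
One way to obtain a single-chromosome FASTA such as the chrom22.fasta used below (a sketch, not part of HARC; it assumes samtools is installed, the reference has been decompressed, and the chromosome is named 22 in this file):

gunzip human_g1k_v37.fasta.gz
samtools faidx human_g1k_v37.fasta 22 > PATH/chrom22.fasta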

Generating reads using gen_fastq (from orcom repo)
Error-free reads (without reverse complementation) - 35M reads of length 100 from chrom 22
cd util/gen_fastq_noRC
make
./gen_fastq_noRC 35000000 100 PATH/chrom22.fasta PATH/chrom22_reads.fastq
Reads with a 1% uniform substitution rate (each substituted base is equally likely to become any of the 4 alternatives, e.g. A can become C, T, G or N) (without reverse complementation) - 35M reads of length 100 from chrom 22
./gen_fastq_noRC 35000000 100 PATH/chrom22.fasta PATH/chrom22_reads.fastq -e
Typical fastq format
@seq id
read
+
quality score
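
A concrete (made-up) example record, shortened for display (reads in the datasets above have length 100):

@read1
AGCTTAGCTAGGCTAGCTAC
+
IIIIIHIIIIHHIIIIIIHI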

Computing noise entropy

The directory util/ contains quality_counts.cpp and noise_entropy.py, which can be used to compute the noise entropy upper bound using the method described in the Supplementary Data (https://github.com/shubhamchandak94/HARC/blob/master/supplementary-data.pdf). To use these,

  1. Write the quality values (every fourth line of the FASTQ file) to a separate file, e.g., by using
sed -n '4~4p' file.fastq > file.quality
  2. Modify the read length in the header of quality_counts.cpp and compile it by running
g++ util/quality_counts.cpp -O3 -march=native -std=c++11 -o util/quality_counts.out
  3. Generate the quality counts by running
./util/quality_counts.out file.quality file.quality_counts
  4. Update the read length and input file in the header of noise_entropy.py and run
python util/noise_entropy.py
     to get the noise entropies for models of different orders.

Other compressors (for evaluation)

Installing & Running orcom (boost should be installed)
git clone https://github.com/lrog/orcom.git
cd orcom
make boost
cd bin
./orcom_bin e -iPATH/SRR065390.fastq -oPATH/SRR065390.bin
./orcom_pack e -iPATH/SRR065390.bin -oPATH/SRR065390.orcom
Getting orcom ordered file
./orcom_pack d -iPATH/SRR065390.orcom -oPATH/SRR065390.dna

The dna file is of the form:

read1
read2
..
Installing and running Leon (seq-only mode)
git clone https://github.com/GATB/leon.git
cd leon
sh INSTALL
./leon -file PATH_TO_FASTQ -c -seq-only -nb-cores NUM_THR
Leon Decompression (order preserved)
./leon -file PATH_TO_LEON -d -nb-cores NUM_THR

The fasta.d file is of the form:

>1
read1
>2
read2
..


harc's Issues

encoder.cpp not compiling

encoder.cpp does not compile and throws errors such as:
src/cpp/noisy/encoder.cpp:229:14: error: ‘array’ is not a member of ‘std’

HARC or Spring

I have FASTA files (not FASTQ) to compress. Between HARC and SPRING, which compressor should I be using? Is it at all possible to compress FASTA files with these tools?

Thanks.

Issue when running HARC in different datasets

Hello developers,
We installed HARC successfully on CentOS 7, but when running it to compress, it threw an error. We tested it on several datasets of our own (small and medium sized) and got the problem written below. Then we downloaded one of the files suggested by the authors in the supplementary data and got the exact same error. In all of the tests we tried both options: preserving and not preserving the original order.

Any help on what we should do?
Thank you very much

*** Preprocessing ***
SRR554369_1.fastq
Read length: 100
Total number of reads: 1657871
Total number of reads without N: 1633521
Preprocessing Done!
src/reorder.cpp: In function ‘void reorder(std::bitset<200>, bbhashdict)’:
src/reorder.cpp:685:42: error: ‘accumulate’ is not a member of ‘std’
std::cout << "Reordering done, "<<
std::accumulate(unmatched,unmatched+num_thr,0) <<" were unmatched\n";

7zip not found

Hello,
I installed HARC successfully on CentOS 7, but when trying to compress, it throws this error:
/harc: line 104: 7z: command not found
I have installed p7zip successfully, but it looks like the way to execute it on CentOS is through the command "7za".

Any idea about what I should do?
Thank you very much
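
(Not an official fix.) One possible workaround on systems where p7zip provides only a 7za binary is to create a 7z symlink pointing to it, e.g.:

sudo ln -s "$(command -v 7za)" /usr/local/bin/7z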
