alexpreynolds / sample

Performs memory-efficient reservoir sampling on very large input files delimited by newlines

Home Page: http://alexpreynolds.github.io/sample

License: MIT License

C 96.42% Makefile 3.58%
reservoir-sampling sampling c bed genomics bioinformatics

sample's Introduction

sample

This tool performs reservoir sampling (Vitter, "Random sampling with a reservoir"; cf. http://dx.doi.org/10.1145/3147.3165 and also: http://en.wikipedia.org/wiki/Reservoir_sampling) on very large text files that are delimited by newline characters. Sampling can be done with or without replacement. The approach used in this application reduces the typical memory usage issue with reservoir sampling by storing a pool of byte offsets to the start of each line, instead of the line elements themselves, thus allowing much larger sample sizes.
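
As a rough illustration of the offset-pool idea described above, the following C sketch applies Vitter's Algorithm R to line-start offsets in an in-memory buffer. The names `reservoir_sample_offsets`, `next_u64`, and `uniform_below`, and the splitmix64 generator, are assumptions made for this example. The tool itself uses a Mersenne Twister and reads from a file, so this is not the author's implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in PRNG (splitmix64); NOT the Mersenne Twister that sample uses. */
static uint64_t next_u64(uint64_t *state)
{
    uint64_t z = (*state += 0x9E3779B97F4A7C15ULL);
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
    return z ^ (z >> 31);
}

/* Uniform integer in [0, n), n > 0; rejection avoids modulo bias. */
static uint64_t uniform_below(uint64_t *state, uint64_t n)
{
    assert(n > 0);
    uint64_t limit = UINT64_MAX - (UINT64_MAX % n); /* multiple of n */
    uint64_t r;
    do {
        r = next_u64(state);
    } while (r >= limit);
    return r % n;
}

/*
 * Algorithm R over line-start offsets. Scans buf[0..len), treating '\n' as
 * the line terminator, and keeps at most k offsets in reservoir[].
 * Returns the number of offsets stored: min(k, number of lines).
 */
size_t reservoir_sample_offsets(const char *buf, size_t len, size_t k,
                                uint64_t seed, uint64_t *reservoir)
{
    uint64_t state = seed;
    size_t seen = 0;   /* lines seen before the current one */
    size_t start = 0;  /* byte offset where the current line starts */

    while (start < len) {
        if (seen < k) {
            reservoir[seen] = start;            /* fill the pool first */
        } else {
            /* keep the new line with probability k / (seen + 1) */
            uint64_t j = uniform_below(&state, (uint64_t)seen + 1);
            if (j < k)
                reservoir[j] = start;
        }
        seen++;

        const char *nl = memchr(buf + start, '\n', len - start);
        if (nl == NULL)
            break;                              /* last line, no newline */
        start = (size_t)(nl - buf) + 1;
    }
    return seen < k ? seen : k;
}
```

Only 8-byte offsets are held in the pool, so memory grows with the sample size k rather than with the length of the sampled lines. Passing the same seed yields the same offsets for the same input.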

In its current form, this application offers a few advantages over common shuf-based approaches:

  • On small k, it performs roughly 2.25-2.75x faster than shuf in informal tests on OS X and Linux hosts.
  • It uses much less memory than the usual reservoir sampling approach that stores a pool of sampled elements; instead, sample stores the start positions of sampled lines (8 bytes per element).
  • Using less memory gives sample an advantage over shuf for whole-genome scale files, helping avoid shuf: memory exhausted errors. For instance, a 2 GB allocation would allow a sample size up to ~268M random elements (sampling without replacement).
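
The ~268M figure above follows from dividing the pool size by 8 bytes per stored offset, assuming "2 GB" is read as 2^31 bytes. A minimal C sketch of that arithmetic, with the hypothetical helper name `max_pool_elements`:

```c
#include <assert.h>
#include <stdint.h>

/* Number of fixed-size offsets that fit in a pool of pool_bytes bytes. */
uint64_t max_pool_elements(uint64_t pool_bytes, uint64_t bytes_per_offset)
{
    assert(bytes_per_offset > 0);
    return pool_bytes / bytes_per_offset;
}
```

With `pool_bytes = 2147483648` (2^31) and `bytes_per_offset = 8`, the result is 268435456 (2^28), i.e. roughly 268M elements.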

The sample tool stores a pool of line positions and makes two passes through the input file. One pass generates the sample of random positions, using a Mersenne Twister to generate uniformly random values, while the second pass uses those positions to print the sample to standard output. To minimize the expense of this second pass, we use mmap routines to gain random access to data in the regular input file on both passes.
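
The second pass described above can be sketched with POSIX `mmap` as follows. The function name `print_lines_at_offsets` and its error handling are assumptions for illustration only; the tool's actual routines may be organized differently.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Map a regular file read-only and write the line that starts at each
 * sampled offset to `out`. Returns 0 on success, -1 on error.
 */
int print_lines_at_offsets(const char *path, const uint64_t *offsets,
                           size_t k, FILE *out)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size <= 0) {
        close(fd);
        return -1;          /* mmap cannot map an empty file */
    }
    size_t len = (size_t)st.st_size;

    char *data = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);              /* the mapping remains valid after close() */
    if (data == MAP_FAILED)
        return -1;

    for (size_t i = 0; i < k; i++) {
        assert(offsets[i] < len);
        const char *line = data + offsets[i];
        size_t rest = len - (size_t)offsets[i];
        const char *nl = memchr(line, '\n', rest);
        /* include the newline if present; otherwise print to end of file */
        size_t n = nl ? (size_t)(nl - line) + 1 : rest;
        fwrite(line, 1, n, out);
    }

    munmap(data, len);
    return 0;
}
```

Because the mapping gives direct random access to file bytes, jumping to each stored offset needs no explicit seek-and-read call per line. This is also why the input must be a regular file rather than a stream.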

The benefit that mmap provided was significant. For comparison, we also added a --cstdio option to test the performance of standard C I/O routines (fseek(), etc.). Predictably, this performed worse than the mmap-based approach in all tests; its timings were about identical to those of gshuf on OS X, and it still averaged a 1.5x improvement over shuf under Linux.

The sample tool can be used to sample from any text file delimited by newline characters (BED, SAM, VCF, etc.).

Additionally, the sample tool can be used with the --lines-per-offset option to sample multiples of lines from a text file. This can be useful for sampling from FASTA or FASTQ files, each with records that are formatted in two- or four-line groupings.
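
To illustrate what grouping lines per offset means, the C sketch below records one byte offset per record of N lines in an in-memory buffer. The function name `record_offsets` and its signature are hypothetical, and the code is only an approximation of the idea, not the tool's own logic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Record the start offset of every group of `lines_per_offset` lines in
 * buf[0..len). At most max_offsets values are stored; the return value is
 * the total number of groups found, which may exceed max_offsets.
 */
size_t record_offsets(const char *buf, size_t len, size_t lines_per_offset,
                      uint64_t *offsets, size_t max_offsets)
{
    assert(lines_per_offset > 0);
    size_t count = 0;  /* groups found so far */
    size_t line = 0;   /* index of the current line */
    size_t start = 0;  /* byte offset of the current line */

    while (start < len) {
        if (line % lines_per_offset == 0) {
            if (count < max_offsets)
                offsets[count] = start;
            count++;
        }
        line++;

        const char *nl = memchr(buf + start, '\n', len - start);
        if (nl == NULL)
            break;
        start = (size_t)(nl - buf) + 1;
    }
    return count;
}
```

For a FASTQ buffer, calling this with `lines_per_offset` set to 4 yields one offset per read, so a sampled offset always points at a header line and the four-line record stays intact.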

One can use the --rng-seed option to reproducibly sample the same lines from a particular file across runs. This can be useful for testing sample distributions, or for sampling paired-end reads in conjunction with --lines-per-offset.

By adding the --preserve-order option, the output sample preserves the input order. For example, when sampling from an input BED file that has been sorted by BEDOPS sort-bed (which applies a lexicographical sort on chromosome names and a numerical sort on start and stop coordinates), the sample will have the same ordering, with a relatively small O(k log k) penalty for a sample of size k.
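
One way to picture order preservation is that sorting the k sampled byte offsets in ascending order restores input order, since earlier lines have smaller offsets. The sketch below uses the C standard library's `qsort`; the names `compare_offsets` and `sort_offsets` are assumptions for this example, and the C standard does not itself guarantee O(k log k) for `qsort`, although typical implementations achieve it.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Three-way comparison; subtraction is avoided because it could overflow. */
static int compare_offsets(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a;
    uint64_t y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Sort sampled offsets ascending so output follows input order. */
void sort_offsets(uint64_t *offsets, size_t k)
{
    assert(offsets != NULL || k == 0);
    if (k > 1)
        qsort(offsets, k, sizeof offsets[0], compare_offsets);
}
```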

By default, sample performs sampling without replacement — a sampled element will not be resampled. Using --sample-with-replacement changes this behavior accordingly.

By omitting the sample size parameter, the sample tool can shuffle the entire file. This tool can be used to shuffle files that shuf has memory issues with; however, sample is currently slower than shuf at shuffling whole files that shuf can handle. We recommend using shuf to shuffle an entire file where possible, or specifying --sample-size as the line count, if known ahead of time (e.g., from wc -l or similar).
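
For intuition, shuffling a whole file can be viewed as permuting the full list of line-start offsets and then printing lines in that order. The sketch below uses a Fisher-Yates shuffle with a splitmix64 stand-in generator. Both choices, and the names `shuffle_offsets`, `next_u64`, and `uniform_below`, are assumptions for this example and may not match how sample (which uses a Mersenne Twister) performs its shuffle.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in PRNG (splitmix64); NOT the Mersenne Twister that sample uses. */
static uint64_t next_u64(uint64_t *state)
{
    uint64_t z = (*state += 0x9E3779B97F4A7C15ULL);
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
    return z ^ (z >> 31);
}

/* Uniform integer in [0, n), n > 0; rejection avoids modulo bias. */
static uint64_t uniform_below(uint64_t *state, uint64_t n)
{
    assert(n > 0);
    uint64_t limit = UINT64_MAX - (UINT64_MAX % n);
    uint64_t r;
    do {
        r = next_u64(state);
    } while (r >= limit);
    return r % n;
}

/* Fisher-Yates shuffle of n offsets, in place. */
void shuffle_offsets(uint64_t *offsets, size_t n, uint64_t seed)
{
    uint64_t state = seed;
    for (size_t i = n; i > 1; i--) {
        /* pick j in [0, i) and swap it into the last unshuffled slot */
        size_t j = (size_t)uniform_below(&state, (uint64_t)i);
        uint64_t tmp = offsets[i - 1];
        offsets[i - 1] = offsets[j];
        offsets[j] = tmp;
    }
}
```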

One downside at this time is that sample does not process a standard input stream; the input to sample must be a regular file. In contrast, the shuf tool can process a standard input stream.

sample's Issues

Info - zipped files

Hi Alex,
Does sample work with zipped files? Large files are usually compressed.
The command is producing a binary file for me, and if I try <(zcat file.fq.gz) it gives me an error.
What command should I use in this case?
Thank you in advance for the reply

For sample loop...

Hi Alex, hope you're fine.

I hope you can give me some light (I am a beginner in the bioinformatics field and I am stuck in an analysis step). I have some fastq files that I need to downsample and I wanted to use sample for it. I was trying to do a for loop in bash. My fastq files have the format:

xxYYY_R1_paired.fastq.gz & xxYYY_R2_paired.fastq.gz (forward & reverse paired-end reads, respectively), where xx is an individual number and YYY is the population the sample came from.

The loop I tried is:
for i in ls *_paired.fastq.gz ; do echo sample -k 10000000 -l 4 -s 1000 $i > sub$i ; done
but the command only gives nearly empty files (57 bytes in size) which do not appear to be gz or fastq files (the FastQC program does not recognize the format and does not open the files).

Do you know what could be wrong with the command ? Have you any advice to loop sample for many paired-end fastq files ?

Thanks in advance and continue to have a nice week,

Julián OLVERA

Segmentation fault

I get a Segmentation fault (core dumped) when I run sample on a fastq file: sample --sample-size 100 --lines-per-offset 4 input.fastq

This is what I get during compilation:

$ make
mkdir -p objects && cc -Wall -Wextra -pedantic -std=c99 -D__STDC_CONSTANT_MACROS -D_FILE_OFFSET_BITS=64 -D_LARGEFILE64_SOURCE=1 -O3 -c src/sample-library/mt19937.c -o objects/mt19937.o -iquote/****/sample/include
ar rcs ****sample/sample-library.a objects/mt19937.o
cc -Wall -Wextra -pedantic -std=c99 -D__STDC_CONSTANT_MACROS -D_FILE_OFFSET_BITS=64 -D_LARGEFILE64_SOURCE=1 -O3 -c src/bin/sample.c -o objects/sample.o -iquote/****/sample/include
src/bin/sample.c: In function ‘print_sorted_offset_reservoir_sample_via_cstdio’:
src/bin/sample.c:602:18: warning: ignoring return value of ‘fgets’, declared with attribute warn_unused_result [-Wunused-result]
             fgets(temp_line, LINE_LENGTH_VALUE + 1, in_file_ptr);
                  ^
src/bin/sample.c: In function ‘print_unsorted_offset_reservoir_sample_via_cstdio’:
src/bin/sample.c:642:18: warning: ignoring return value of ‘fgets’, declared with attribute warn_unused_result [-Wunused-result]
             fgets(temp_line, LINE_LENGTH_VALUE + 1, in_file_ptr);
                  ^
cc -Wall -Wextra -pedantic -std=c99 -D__STDC_CONSTANT_MACROS -D_FILE_OFFSET_BITS=64 -D_LARGEFILE64_SOURCE=1 -O3 objects/sample.o -o sample /****/sample/sample-library.a
