Fast and Accurate MEthylation Aligner for large mammalian genomes. Carries out alignment and methylation calling for CpGs of Whole Genome Bisulfite Sequencing (WGBS) reads in one go without the need of intermediate alignment or buffer files. The algorithm is working on the full alphabet (A,C,G,T), resolving the asymmetric mapping problem* correctly. Reference genomes are expected to be .fasta files and reads in .fastq (or .fastq.gz) format.
The code is written in C++ and parallelized using OpenMP and is licensed under GPL3.
*In WGBS experiments, unmethylated Cytosines are converted to Thymines in Reads, thus it is necessary to allow Read Cytosines to map to Reference Thymines, but not vice versa. This is termed asymmetric mapping problem and is commonly tackled by working only on the reduced alphabet (A,T,G), which results in false matchings of reference Ts to read Cs.
This tool is still under development. We will update the news section once we published the algorithm and a stable version of the code is available.
The following steps explain how to install and setup FAME .
FAME is dependent on the following libraries, which are all shipped with this repository. So no additional work is necessary here.
- ntHash - a fast library for genomic rolling hash functions.
- sparsehash - an efficient hash map implementation from Google.
- hopscotch-map - an efficient hash map implementation by Tessil.
- gzstream - a C++ stream interface for working with gzip files.
To compile the program, a recent version of the GNU Compiler Collection (gcc) or LLVM Clang (clang) is required. The gzstream library is dependent on zlib, which is usually installed on Linux systems.
To retrieve the code base, just clone this repository:
git clone https://github.com/FischerJo/Metal
To install FAME on your machine, just use the shipped Makefile by typing
make
in the top level directory of the cloned repository.
FAME, as most other tools, uses a sohpisticated index structure of the reference genome to accelerate alignment computation. The tool has two separate functions, the construction of the index and the alignment of the reads. After construction of the index, it is stored in an efficient binary format and can thus be reused for different experiments.
To construct an index with the parameters specified in CONST.h
(see 2.A):
./FAME --genome /Path/To/Genome/genome.fasta --store_index /Path/To/produced_index
where genome.fasta
is a genome stored in fasta format and produced_index is the name of the output file holding the index after execution. For a full set of options see 2.A and 2.B.
To align reads to an index call:
./FAME -r /Path/To/reads.fastq --load_index /Path/To/produced_index -o /Path/To/output
with /Path/To/output
being the output file path for the CpG report.
For a full set of options we refer to section 2.B.
All parameters are set in the file CONST.h
.
If you change any of those parameters, make sure to rebuild the program:
make clean
make
An index is dependet on the parameters, that is, if you call the program with an index that was built with different parameters, it will throw an error.
Here is a list of the external parameters:
Parameter | Definition | Recommended value | Location (line number) |
---|---|---|---|
READLEN | (Maximum) Length of reads queried to the index | 100 | 35 |
CORENUM | Number of threads spawned by the program. Should be number of free cores on the system. | 16 | 38 |
MINPDIST | Minimum distance between a read pair in paired end mode. Measured from end to first read to beginning of second read. | 20 | 41 |
MAXPDIST | Maximum distance between a read pair in paired end mode. Measured from end to first read to beginning of second read. | 400 | 42 |
CHROMNUM | Number of chromosomes of reference organism. | 24 | 45 |
Here is a list of some important internal parameters, we strongly recommend NOT to change them:
Parameter | Definition | Recommended value | Location (line number) |
---|---|---|---|
SEED | The gapped q-gram (aka seed) to use as array of bits | [see below] | 57 |
SEEDBITS | The gapped q-gram as bitstring | 0b11011110111111011111111111111101 | 58 |
CHUNKSIZE | Number of reads (or read pairs) read to buffer. | 300000 | 69 |
KMERLEN | k, the length of a k-mer for the index. This is a very sensitive parameter. Must be the length of the seed. | 32 | 73 |
QTHRESH | Minimum number of k-mer matches required for match verification | 5 | 80 |
WINLEN | Window length for the index data structure. | 2048 | 90 |
MISCOUNT | Number of errors considered for k-mer filters. | 2 | 94 |
ADDMIS | Number of errors additionally (to MISCOUNT) allowed in alignment | 4 | 97 |
KMERCUTOFF | Controls hash collisions in index. Low value means more lossy but faster filter. Not considered if --no_loss flag is set during index construction. |
1500 | 100 |
KMERDIST | Controls pruning after matching a read to a Window. Minimum distance of count of window to prune and count of matched window. | 10 | 104 |
SKIPMOD | Hash only every SKIPMODth k-mer of reference. | 2 | 107 |
The following is a description of all arguments accepted by FAME. Examples on how to use FAME in the command line are given in 2.D.
Flag | Argument | Description |
---|---|---|
--genome | Filepath | Forces the tool to build an index for the specified .fasta reference file |
--gzip_reads | None | Treats the read files passed to -r or -r1 and -r2 as gzipped files. |
-h | None | Lists all available options with a description. |
--help | None | see -h |
--human_opt | None | Uses optimizations for human reference genomes to prune away unlocalized contigs etc |
--load_index | Filepath | Loads the index constructed before. NOTE: Parameters in CONST.h must be the same. |
--no_loss | None | Builds the index with lossless filter (NOT RECOMMENDED). |
-o | Filepath | Base name for output file. Contains CpG methylation levels after processing. |
--out_basename | Filepath | see -o |
-r | Filepath | Forces the tool to query the specified single end read .fastq file to a loaded index. |
-r1 | Filepath | Path to file with first reads of a paired read set. Read format must be .fastq. |
-r2 | Filepath | Path to file with second reads of a paired read set. Read format must be .fastq. |
--store_index | Filepath | Writes output of index construction to filepath (~38GB for human genome). NOTE: Directory must exist. |
--unord_reads | None | Disable optimization to find stranding of reads. |
The alignment of a read set to an index produces an output file (see -o
command line argument) that contains
methylation levels of all CpGs in the reference genome.
The file is a tab-separated value file (.tsv
) with the following structure:
Chromosome Position #Meth_Cs_fwd #Unmeth_Cs_fwd #Meth_Cs_rev #Unmeth_Cs_rev
where #Meth_Cs_fwd means number of methylated Cytosines on forward strand CpG and so on. The position is a (zero based) count of the bases of a chromosome, indicating the position of the C of a CpG. The forward strand is the strand provided in the reference file, the reverse complement strand the strand not provided in the reference file.
Suppose you want to match paired reads to a human genome and save it to the file with prefix pairedResults. The reads are in gzip format.
First build the index:
./FAME --genome /Path/To/Genome/genome.fasta --store_index /Path/To/produced_index --human_opt
Align the reads:
./FAME --load_index /Path/To/produced_index -r1 /Path/To/r1.fastq.gz -r2 /Path/To/r2.fastq.gz --gzip_reads -o pairedResults
A file called pairedResults_cpg.tsv is generated containing the CpGs with corresponding methylation counts.
We appreciate any feedback to our tool. Feel free to contact us by mailing to fischer 'at' mpi-inf.mpg.de.
If you found a bug, please contact us with a description of how you called the tool and which data you used.
- Jonas Fischer - Main Contributor - Max-Planck-Institute for Informatics, Saarbruecken, Germany
- Marcel H. Schulz Max-Planck-Institute for Informatics, Saarbruecken, Germany
This project is licensed under the GPL3 - see the LICENSE_GPL_3_0 file for details.
- Our paper: COMING SOON
- ntHash: ntHash: recursive nucleotide hashing. Mohamadi H, Chu J, Vandervalk BP, Birol I. Bioinformatics 2016