Software that generates files of randomised content from letter sets and then generates FASTA read files from them. It was originally created to test de Bruijn graph construction from FASTA reads and to allow the results to be compared against the original genome data.
The program can be used in the following cases:
- Create a synthetic genome.
- Create a simulated sequencer reads file (FASTA) from an existing genome file.
- Create both the synthetic genome file and its simulated sequencer reads file in one go.
The software is in the alpha stage of development.
You will need both the cmake and make tools.
- Clone the repo into the folder where you want the genomeMaker directory to reside.
- From inside the genomeMaker/ directory, run cmake CMakeLists.txt.
- Then run make.
The genomeMaker application should be in the genomeMaker/build/ directory.
Help with flag descriptions and examples:
./genomeMaker
Standard usage:
./genomeMaker -<option> <argument>
-g -genome Name of the genome file to create.
-s -size Size of the genome in bytes.
-t -type Type of letter set for genome creation (DNA, RNA). [DEFAULT='DNA']
To create a synthetic genome file of 100,000,000 bytes (100MB) with the RNA letter set:
./genomeMaker -g genome_file -s 100000000 -t rna
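The idea behind genome creation can be sketched in a few lines of Python. This is an illustration only, not genomeMaker's implementation: the letter sets (ACGT for DNA, ACGU for RNA), the function name make_genome and the seeding are all assumptions made for the example.

```python
import random

# Assumed letter sets; genomeMaker's actual sets may differ.
LETTER_SETS = {"DNA": "ACGT", "RNA": "ACGU"}

def make_genome(size, letter_type="DNA", seed=None):
    """Return `size` letters drawn uniformly at random from the chosen set."""
    letters = LETTER_SETS[letter_type.upper()]
    rng = random.Random(seed)
    return "".join(rng.choice(letters) for _ in range(size))

# A small 1,000-letter RNA genome; the seed makes the run repeatable.
genome = make_genome(1000, "rna", seed=1)
```

A fixed seed is handy when the generated genome has to be reproduced later for comparison.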
-f -fasta Name of the FASTA file to create.
-l -length Character length of each read. [DEFAULT='260']
-d -depth Depth of reads.
-e -error Error rate of the simulated sequencer (0 <= x <= 1). [DEFAULT='0']
The error rate is applied to the number of reads expected for a genome. For example, if the error rate is set to 0.01 (1%) and the expected number of reads is 75,000, an error is injected into approximately 750 reads.
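The arithmetic in the example above can be checked with a short Python sketch. The function name is hypothetical and the rounding is an assumption; the text only states that the count is approximate.

```python
def expected_error_reads(total_reads, error_rate):
    """Approximate number of reads that receive an injected error."""
    if not 0 <= error_rate <= 1:
        raise ValueError("error rate must be between 0 and 1")
    # Assumption: the fractional result is rounded to the nearest whole read.
    return round(total_reads * error_rate)

print(expected_error_reads(75_000, 0.01))  # 750
```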
Currently the error injection is very basic: when a read is tagged for error injection, a randomly chosen position within the read is replaced with a different character, which is taken from another randomly chosen position within the same read.
So let's say that we have a read flagged for an error { AACCTT }, with positions counted from 0:
- a random position for the error is chosen (2),
- a random position for a replacement is chosen (4) and will be accepted if it is a different character ('C' != 'T'). Otherwise it tries again. The resulting read written to the FASTA file is { AATCTT }.
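The injection steps described above can be sketched in Python as follows. This is a minimal illustration, not the program's own code: the function names are hypothetical, and leaving a read with only one distinct letter unchanged is an assumption, since the text does not say what happens in that case.

```python
import random

def replace_from(read, error_pos, source_pos):
    """Overwrite the character at error_pos with the one found at source_pos."""
    return read[:error_pos] + read[source_pos] + read[error_pos + 1:]

def inject_error(read, rng=random):
    """Inject one substitution, taking the new character from the read itself."""
    if len(set(read)) < 2:
        # Assumption: no different character exists, so the read is left as-is.
        return read
    error_pos = rng.randrange(len(read))
    while True:
        source_pos = rng.randrange(len(read))
        if read[source_pos] != read[error_pos]:  # accept only a different character
            return replace_from(read, error_pos, source_pos)

# The worked example: error at position 2, replacement taken from position 4.
print(replace_from("AACCTT", 2, 4))  # AATCTT

# A random injection with a seeded generator for repeatability.
mutated = inject_error("AACCTT", random.Random(7))
```

Because the replacement always comes from the same read, the injected character is guaranteed to belong to the letter set already in use.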
To create a sequencer file named "my_reads.fasta" with the default read length of 260, error rate of 0.01, depth of 200 and based on a genome file called "genome.genome":
./genomeMaker -g genome -f my_reads -d 200 -e 0.01
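To make the roles of the read length and depth options more concrete, here is a rough Python sketch of read simulation. It rests on several assumptions that are not confirmed by this document: the read count follows the usual coverage relation (depth x genome size / read length), start positions are drawn uniformly at random, and the FASTA headers use a hypothetical ">read_N" naming scheme. genomeMaker may do any of these differently.

```python
import random

def simulate_reads(genome, read_length=260, depth=200, seed=None):
    """Sample fixed-length reads at uniformly random positions (assumed strategy)."""
    if read_length > len(genome):
        raise ValueError("read length exceeds genome size")
    rng = random.Random(seed)
    # Assumed coverage relation: reads = depth * genome size / read length.
    read_count = depth * len(genome) // read_length
    last_start = len(genome) - read_length
    reads = []
    for _ in range(read_count):
        start = rng.randint(0, last_start)  # inclusive on both ends
        reads.append(genome[start:start + read_length])
    return reads

def to_fasta(reads):
    """Format reads as FASTA records with hypothetical '>read_N' headers."""
    return "".join(f">read_{i}\n{read}\n" for i, read in enumerate(reads))

genome = "ACGT" * 250  # 1,000-letter stand-in genome
reads = simulate_reads(genome, read_length=10, depth=5, seed=42)
fasta_text = to_fasta(reads)
```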
-p -pipeline Create both genome and sequencer files.
To create a complete set of files composed of:
- a genome file called "my_genome.genome"
- size of 100,000 bytes
- a sequencer file "my_genome.fasta"
- read length of 10 characters and
- depth of 5
./genomeMaker -p my_genome -s 100000 -l 10 -d 5
GenomeMaker comes with a logger that, by default, outputs to both the screen and a log file.
To avoid that, make sure that the output specified in the log_config.cfg file contains only:
OUTPUT=<log,FILE_OVERWRITE,TERMINAL,MSG>
If you run into issues with the software you will need to change that last argument from MSG to TRACE in order to produce a comprehensive log during your next execution of genomeMaker.
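The change can be made by hand in a text editor, or scripted. The Python helper below is a hypothetical convenience, not part of genomeMaker; it relies only on the OUTPUT=<...> line format shown above and swaps the last argument.

```python
import re

def set_log_level(config_text, level="TRACE"):
    """Replace the last argument of each OUTPUT=<...> line with `level`."""
    return re.sub(
        r"^(OUTPUT=<.*,)[^,>]+>",
        lambda match: f"{match.group(1)}{level}>",
        config_text,
        flags=re.MULTILINE,
    )

line = "OUTPUT=<log,FILE_OVERWRITE,TERMINAL,MSG>"
print(set_log_level(line))  # OUTPUT=<log,FILE_OVERWRITE,TERMINAL,TRACE>
```

To apply it, read the contents of log_config.cfg, pass them through set_log_level and write the result back.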
The software is bundled with the components required from the EADlib library.
- A version of GCC with C++14 support (developed with GCC 6.2.1)
- CMake 3.5
- [Optional] Google Test libraries (gTest & gMock) to run the unit tests
- Linux: See above.
- Mac: Homebrew might help. Untested.
- Windows: Nope. Good luck.
This software is released under the GNU General Public License, version 2.
Please reference this software when it is used in a project and/or research.