
genomemaker's Introduction

Table of Contents

  1. Description
  2. Compiling
  3. How to use
    1. Creating a genome file
    2. Creating a set of FASTA reads
    3. Creating a genome and its reads in one go
  4. Logger
  5. Platforms Supported
  6. License

Description

Software that generates files of randomised content from letter sets and then generates FASTA read files from them. It was originally created to test de Bruijn graph construction from FASTA reads, with the ability to compare the results against the original genome data.

The program can be used in the following cases:

  1. Create a synthetic genome.
  2. Create a simulated sequencer reads file (FASTA) from an existing genome file.
  3. Create both the synthetic genome file and its simulated sequencer reads file in one go.

Notes

The software is in the alpha stage of development.

Compiling

You will need both cmake and make tools.

  • Clone the repo into the folder where you want the genomeMaker directory to reside.
  • From inside the genomeMaker/ directory run cmake CMakeLists.txt.
  • Then run make.

The genomeMaker application should be in the genomeMaker/build/ directory.

How to use

CLI

Help with flag descriptions and examples:

./genomeMaker

Standard usage:

./genomeMaker -<option> <argument>

Creating a genome file

Flags
  -g	-genome	Name of the genome file to create.
  -s	-size	Size of the genome in bytes.
  -t	-type	Type of letter set for genome creation (DNA, RNA).	[DEFAULT='DNA']
Example

To create a synthetic genome file of 100,000,000 bytes (100MB) with the RNA letter set:

./genomeMaker -g genome_file -s 100000000 -t rna
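As a rough illustration of what a synthetic genome amounts to, the sketch below draws characters uniformly at random from a letter set. It is not genomeMaker's own code: the contents of the two letter sets (ACGT and ACGU), the uniform sampling, and the names `LETTER_SETS` and `make_genome` are all assumptions made for the example.

```python
import random

# Assumed letter sets; the README only names them "DNA" and "RNA".
LETTER_SETS = {"DNA": "ACGT", "RNA": "ACGU"}


def make_genome(size, letter_type="DNA", seed=None):
    """Return a string of `size` letters drawn uniformly from the chosen set."""
    rng = random.Random(seed)
    letters = LETTER_SETS[letter_type.upper()]
    return "".join(rng.choice(letters) for _ in range(size))


# Small stand-in for: ./genomeMaker -g genome_file -s 100000000 -t rna
genome = make_genome(1000, "rna", seed=1)
print(len(genome))
```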

Creating a set of FASTA reads

Flags
  -f	-fasta	Name of the FASTA file to create.
  -l	-length	Character length of each read.	[DEFAULT='260']
  -d	-depth	Depth of reads.
  -e	-error	Error rate of the simulated sequencer (0 <= x <= 1).	[DEFAULT='0']

The error rate is applied to the number of reads expected for a genome. For example, if the error rate is set to 0.01 (1%) and 75,000 reads are expected, then an error is injected into approximately 750 reads.
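The arithmetic behind that figure can be checked with a few lines. The function name `expected_error_reads` is made up for this sketch, and rounding to the nearest whole read is an assumption; the tool itself may decide per read at random, which is why the count is only approximate.

```python
def expected_error_reads(expected_reads, error_rate):
    """Approximate number of reads that receive an injected error."""
    if not 0 <= error_rate <= 1:
        raise ValueError("error rate must satisfy 0 <= x <= 1")
    return round(expected_reads * error_rate)


print(expected_error_reads(75_000, 0.01))  # 750
```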

Currently the error injection is very basic: when a read is flagged for an error, a position within the read is chosen at random and its character is replaced with a different character taken from another randomly chosen position in the same read.

So let's say that we have a read flagged for an error, { AACCTT }:

  • a random position for the error is chosen (2, counting from 0),
  • a random position for a replacement is chosen (4) and is accepted if it holds a different character ('C' != 'T'); otherwise another position is tried.

The resulting read written to the FASTA file is { AATCTT }.
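The procedure just described can be sketched as follows. This is an illustration written from the description above, not the project's source: the function names `replace_at` and `inject_error` are hypothetical, and returning the read unchanged when all its characters are identical is an assumption added so the retry loop always terminates.

```python
import random


def replace_at(read, error_pos, source_pos):
    """Overwrite the character at error_pos with the one at source_pos."""
    return read[:error_pos] + read[source_pos] + read[error_pos + 1:]


def inject_error(read, rng=random):
    """Inject one substitution error using a character from the same read."""
    if len(set(read)) < 2:
        return read  # every character is identical, nothing different to pick
    error_pos = rng.randrange(len(read))
    source_pos = rng.randrange(len(read))
    while read[source_pos] == read[error_pos]:
        source_pos = rng.randrange(len(read))  # same character, try again
    return replace_at(read, error_pos, source_pos)


# The worked example: error position 2 ('C'), replacement position 4 ('T').
print(replace_at("AACCTT", 2, 4))  # AATCTT
```

Because the replacement comes from the read itself, this scheme can never introduce a character that the read does not already contain.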
Example

To create a sequencer file named "my_reads.fasta" with the default read length of 260, an error rate of 0.01 and a depth of 200, based on a genome file called "genome.genome":

./genomeMaker -g genome -f my_reads -d 200 -e 0.01
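To make the relationship between depth, read length and genome size concrete, here is a minimal sketch of read sampling. It assumes the usual coverage relation (number of reads = depth × genome size / read length) and uniformly random start positions; the README does not confirm that genomeMaker works this way, and the names `sample_reads` and `to_fasta` as well as the `>read_N` header format are invented for the example.

```python
import random


def sample_reads(genome, read_length=260, depth=1, seed=None):
    """Cut enough fixed-length reads from random positions to reach `depth`."""
    if read_length > len(genome):
        raise ValueError("read length exceeds genome size")
    rng = random.Random(seed)
    n_reads = depth * len(genome) // read_length  # assumed coverage relation
    last_start = len(genome) - read_length
    starts = (rng.randint(0, last_start) for _ in range(n_reads))
    return [genome[s:s + read_length] for s in starts]


def to_fasta(reads):
    """Format reads as FASTA records: a '>' header line, then the sequence."""
    return "".join(f">read_{i}\n{read}\n" for i, read in enumerate(reads))


genome = "ACGT" * 250  # 1,000-character stand-in genome
reads = sample_reads(genome, read_length=10, depth=5, seed=0)
print(len(reads))  # 500
```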

Creating a genome and its reads in one go

Flag

  -p	-pipeline	Create both genome and sequencer files.

Example

To create a complete set of files composed of:

  • a genome file called "my_genome.genome"
    • size of 100,000 bytes
  • a sequencer file "my_genome.fasta"
    • read length of 10 characters and
    • depth of 5
./genomeMaker -p my_genome -s 100000 -l 10 -d 5

Logger

genomeMaker comes with a logger that, by default, outputs to both the screen and a file.

To avoid that just make sure that the output specified in the log_config.cfg file only contains:

OUTPUT=<log,FILE_OVERWRITE,TERMINAL,MSG>

If you run into issues with the software you will need to change that last argument from MSG to TRACE in order to produce a comprehensive log during your next execution of genomeMaker.

Platforms Supported

The software is bundled with the components required from the EADlib library.

Linux ツ

  • Need a version of GCC with C++14 support (made with GCC 6.2.1)
  • CMake 3.5
  • [Optional] Google Tests libraries (gTest & gMock) to run the unit tests

Mac OSX

  • See above.
  • Homebrew might help. Untested.

Windows

  • Nope. Good luck.

License

This software is released under the GNU General Public License version 2.

Please reference the software when it is used in a project and/or research.

Contributors

  • an7ar35


genomemaker's Issues

possible regex bug

Not tested for all options, but -p does not allow underscores:

Invalid value 'my_genome': File name must be composed of only letter/numbers with no extension.
[20/11/2016, 16:09:36] Cian Murphy: Value 'my_genome' for Option '-p' is not valid.

'genome' works fine though.
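The behaviour reported here is what a letters-and-digits-only validation pattern would produce. The two patterns below are hypothetical and are not taken from genomeMaker's source; they only show how such a check rejects "my_genome" while accepting "genome", and how adding an underscore to the character class would change that.

```python
import re

# Hypothetical patterns, written to match the wording of the error message.
STRICT = re.compile(r"[A-Za-z0-9]+")    # letters and digits only
RELAXED = re.compile(r"[A-Za-z0-9_]+")  # same, plus underscore

for name in ("genome", "my_genome"):
    print(name, bool(STRICT.fullmatch(name)), bool(RELAXED.fullmatch(name)))
```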
