matthew-mosior / sigma-to-mosaic

Sigma to Mosaic File Format Converter.

License: MIT License

Shell 28.20% Haskell 71.80%
shell-script metagenomics format-converter mosaic-community-challenge haskell

Sigma-to-Mosaic: File Format Converter from sigma_out.gvector.txt to mosaic.txt

Introduction

File format converters serve a wide range of data conversion needs, from low-level instructions to OS-level file formats. This converter, implemented in two languages (Shell and Haskell), transforms the output of SigmaW (Strain-level Inference of Genomes from Metagenomic Analysis), sigma_out.gvector.txt, into the format accepted by the MOSAIC Community Challenge: Strains.

Shell Implementation

Setting up the Reference Genome Directory

To get useful output from this shell script, you must first set up your reference genome directory correctly.
Your reference genome directory should have the following structure:

[database directory] - [genome directory] - [fasta file]

To create this required reference genome directory, use the shell script GCFrefgenomedirectory.sh. This shell script will correctly set up your reference genome directory, assuming you have downloaded GCF (RefSeq assembly) sequences.

This shell script will change your initial reference genome directory setup of [database directory] - [fasta file] to the required reference genome directory setup: [database directory] - [genome directory] - [fasta file].

Provide the path to the directory that contains the initial [database directory] as a command line argument as shown in the following example:

sh GCFrefgenomedirectory.sh /usr/home/ncbi/ncbi-genomes-2018-02-17
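
The restructuring described above can be pictured with a minimal sketch. This is not the content of GCFrefgenomedirectory.sh. It assumes that the FASTA files sit directly inside the database directory and end in ".fna", and the function name nest_fasta_files and the demo file names are hypothetical.

```shell
#!/bin/sh
# Sketch only: NOT the contents of GCFrefgenomedirectory.sh.
# Assumption: FASTA files sit directly in the database directory and end in ".fna".

# Move each FASTA file into a directory named after the file (extension dropped).
nest_fasta_files() {
    db_dir=$1
    for fasta in "$db_dir"/*.fna; do
        [ -e "$fasta" ] || continue      # no matches: the glob stays literal, skip it
        genome_dir=${fasta%.fna}         # db/GCF_x.fna -> db/GCF_x
        mkdir -p "$genome_dir"
        mv "$fasta" "$genome_dir"/
    done
}

# Demo on a throwaway directory with two empty placeholder files.
demo_db=$(mktemp -d)
touch "$demo_db/GCF_example_A.fna" "$demo_db/GCF_example_B.fna"
nest_fasta_files "$demo_db"
find "$demo_db" -type f | sort
```

After the call, each placeholder file sits in its own [genome directory] under the [database directory], which is the layout the converter expects.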

Usage

The script is easy to use: it takes one or more sigma_out.gvector.txt files as command-line arguments. Keep in mind that the Strains community challenge has four datasets:

Simulated_Low_Complexity
Simulated_Medium_Complexity
Simulated_High_Complexity
RealData (Mouse fecal samples)

Each of these datasets contains four sets of paired-end sequencing reads, so in reality:

Simulated_Low_Complexity
sim_low_S1_PE1.fq
sim_low_S1_PE2.fq
sim_low_S2_PE1.fq
sim_low_S2_PE2.fq
sim_low_S3_PE1.fq
sim_low_S3_PE2.fq
sim_low_S4_PE1.fq
sim_low_S4_PE2.fq

Simulated_Medium_Complexity
...

Simulated_High_Complexity
...

RealData (Mouse fecal samples)
...

Since each pair of read files (e.g. sim_low_S1_PE1.fq and sim_low_S1_PE2.fq) is run through SigmaW together and yields a single sigma_out.gvector.txt file, each dataset produces four such files. Run the script once per dataset, as follows:

sh SigmatoMosaic.sh sigma1_out.gvector.txt sigma2_out.gvector.txt sigma3_out.gvector.txt sigma4_out.gvector.txt
*Since all output files are named sigma_out.gvector.txt, you'll need to rename them so that they are all unique, as shown above.
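
One way to do the renaming is sketched below. It assumes a hypothetical layout in which each SigmaW run was executed in its own directory (S1 to S4) and left its sigma_out.gvector.txt there. The directory names and the function name collect_gvectors are illustrative only.

```shell
#!/bin/sh
# Assumption (hypothetical layout): each SigmaW run was executed in its own
# directory, S1 ... S4, and left a sigma_out.gvector.txt there.

# Copy the four outputs into one directory under unique, numbered names.
collect_gvectors() {
    runs_root=$1
    dest=$2
    n=1
    for sample in S1 S2 S3 S4; do
        cp "$runs_root/$sample/sigma_out.gvector.txt" "$dest/sigma${n}_out.gvector.txt"
        n=$((n + 1))
    done
}

# Demo with placeholder files instead of real SigmaW output.
runs=$(mktemp -d)
out=$(mktemp -d)
for sample in S1 S2 S3 S4; do
    mkdir "$runs/$sample"
    echo "placeholder for $sample" > "$runs/$sample/sigma_out.gvector.txt"
done
collect_gvectors "$runs" "$out"
ls "$out"
```

Copying rather than moving keeps the original outputs in place in case a run needs to be checked again.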

If you have sigma_out.gvector.txt files with many identified organisms (lines that start with "*"), it may be wise to do the following:

nohup sh SigmatoMosaic.sh sigma1_out.gvector.txt sigma2_out.gvector.txt sigma3_out.gvector.txt sigma4_out.gvector.txt &

Here nohup makes the script ignore the hangup signal, so it keeps running after you log out, and & runs it in the background, so you can continue to work in the current terminal session.
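
To judge whether a file has many identified organisms, the lines that start with "*" can be counted first. The sketch below uses grep for this. The function name count_organisms is hypothetical, and the demo file is made up: its fields after the leading character are placeholders, not real SigmaW output.

```shell
#!/bin/sh
# Count the lines that begin with "*" in a gvector file.
count_organisms() {
    grep -c '^\*' "$1"
}

# Demo on a made-up file; the text after the first character is a placeholder.
sample=$(mktemp)
printf '%s\n' '+ header placeholder' '* placeholder 1' '* placeholder 2' > "$sample"
count_organisms "$sample"
```

How large a count justifies a background run depends on your machine. The README gives no threshold.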

Running SigmatoMosaic.sh will output a single file, mosaic.txt.

Please see example files sigma_out.gvector.txt and mosaic.txt (examples of input and output).

Update to Roadmap (05/31/2018)

SigmatoMosaic.sh now has the following features:
- Detection of incorrectly formatted input files.
- Placeholder zeros when an organism wasn't identified in a given file, so that each relative abundance maps to a specific sigma_out.gvector.txt input file.
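
The idea behind the placeholder zeros can be illustrated with a small awk sketch. This is not the logic of SigmatoMosaic.sh. It assumes a simplified, hypothetical two-column input ("name abundance") rather than the real sigma_out.gvector.txt layout, and it assumes that no input file is empty. The function name zero_fill and the organism names are made up.

```shell
#!/bin/sh
# Sketch of zero-filling across input files; NOT the logic of SigmatoMosaic.sh.
# Assumption: simplified two-column input "name abundance" in non-empty files.

# Print one row per name with one abundance column per input file,
# writing 0 where a name does not occur in a file.
zero_fill() {
    awk '
        FNR == 1 { nfiles++ }                       # a new input file starts
        { value[$1, nfiles] = $2; seen[$1] = 1 }    # remember abundance per file
        END {
            for (name in seen) {
                line = name
                for (i = 1; i <= nfiles; i++)
                    line = line "\t" (((name, i) in value) ? value[name, i] : 0)
                print line
            }
        }
    ' "$@" | sort
}

# Demo: orgA is missing from the second file, so its second column becomes 0.
tmp=$(mktemp -d)
printf 'orgA 0.7\norgB 0.3\n' > "$tmp/s1.txt"
printf 'orgB 1.0\n' > "$tmp/s2.txt"
zero_fill "$tmp/s1.txt" "$tmp/s2.txt"
```

Because every row has the same number of columns, the position of a value identifies the input file it came from.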

Haskell Implementation

Setting up the Reference Genome Directory

To get useful output from this Haskell script, you must first set up your reference genome directory correctly.
Your reference genome directory should have the following structure:

[database directory] - [genome directory] - [fasta file]

To create this required reference genome directory, use the shell script GCFrefgenomedirectory.sh. This shell script will correctly set up your reference genome directory, assuming you have downloaded GCF (RefSeq assembly) sequences.

This shell script will change your initial reference genome directory setup of [database directory] - [fasta file] to the required reference genome directory setup: [database directory] - [genome directory] - [fasta file].

Provide the path to the directory that contains the initial [database directory] as a command line argument as shown in the following example:

sh GCFrefgenomedirectory.sh /usr/home/ncbi/ncbi-genomes-2018-02-17

Usage

The script is easy to use: it takes one or more sigma_out.gvector.txt files as command-line arguments. Keep in mind that the Strains community challenge has four datasets:

Simulated_Low_Complexity
Simulated_Medium_Complexity
Simulated_High_Complexity
RealData (Mouse fecal samples)

Each of these datasets contains four sets of paired-end sequencing reads, so in reality:

Simulated_Low_Complexity
sim_low_S1_PE1.fq
sim_low_S1_PE2.fq
sim_low_S2_PE1.fq
sim_low_S2_PE2.fq
sim_low_S3_PE1.fq
sim_low_S3_PE2.fq
sim_low_S4_PE1.fq
sim_low_S4_PE2.fq

Simulated_Medium_Complexity
...

Simulated_High_Complexity
...

RealData (Mouse fecal samples)
...

Since each pair of read files (e.g. sim_low_S1_PE1.fq and sim_low_S1_PE2.fq) is run through SigmaW together and yields a single sigma_out.gvector.txt file, each dataset produces four such files. Run the script once per dataset, as follows:

runghc SigmatoMosaic.hs sigma1_out.gvector.txt sigma2_out.gvector.txt sigma3_out.gvector.txt sigma4_out.gvector.txt
*Since all output files are named sigma_out.gvector.txt, you'll need to rename them so that they are all unique, as shown above.

If you have sigma_out.gvector.txt files with many identified organisms (lines that start with "*"), it may be wise to do the following:

nohup runghc SigmatoMosaic.hs sigma1_out.gvector.txt sigma2_out.gvector.txt sigma3_out.gvector.txt sigma4_out.gvector.txt &

Here nohup makes the script ignore the hangup signal, so it keeps running after you log out, and & runs it in the background, so you can continue to work in the current terminal session.

Running SigmatoMosaic.hs will output a single file, mosaic.txt.

Please see example files sigma_out.gvector.txt and mosaic.txt (examples of input and output).

For best performance, compile SigmatoMosaic.hs with GHC rather than interpreting it with runghc.
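
A minimal sketch of the compile step follows. It assumes that GHC is installed, that SigmatoMosaic.hs is in the given directory, and that any Haskell packages it imports are already available. The wrapper function build_converter is hypothetical. The README does not prescribe particular compiler flags, so the -O2 optimization flag is a suggestion.

```shell
#!/bin/sh
# Assumptions: GHC is installed and SigmatoMosaic.hs is in the given directory.
# -O2 asks GHC for its higher optimization level.

build_converter() {
    src_dir=${1:-.}
    if ! command -v ghc >/dev/null 2>&1; then
        echo "ghc not found on PATH" >&2
        return 1
    fi
    if [ ! -f "$src_dir/SigmatoMosaic.hs" ]; then
        echo "SigmatoMosaic.hs not found in $src_dir" >&2
        return 1
    fi
    # GHC names the executable after the source file: SigmatoMosaic
    (cd "$src_dir" && ghc -O2 SigmatoMosaic.hs)
}

# Typical use, from the repository checkout:
#   build_converter . && ./SigmatoMosaic sigma1_out.gvector.txt sigma2_out.gvector.txt sigma3_out.gvector.txt sigma4_out.gvector.txt
```

The compiled executable takes the same arguments as the runghc invocation shown earlier.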

Credits

Shell implementation and documentation added April 2018.

Haskell implementation and documentation added August 2018.

Author : Matthew Mosior
