dib-MMETSP

Output files available for download:

Transcriptome assemblies (fasta): DOI

Annotations (gff): DOI

Table of one annotation name (best = sorted by e-value < 1e-05) by transcript ID (.csv): DOI

Peptide translations (fasta): DOI

Expression quantification (salmon output): DOI

All files combined: DOI

Pipeline scripts: DOI

Citation:

Johnson, Lisa K., Alexander, Harriet, & Brown, C. Titus. (2018). MMETSP re-assemblies [Data set]. Zenodo. https://doi.org/10.5281/zenodo.740440

MMETSP pipeline

This repository contains the pipeline code used to generate re-assemblies of the Marine Microbial Eukaryote Transcriptome Sequencing Project (MMETSP). The code was originally hosted at https://github.com/ljcohen/MMETSP

This pipeline was built to automate the Eel Pond khmer protocols over a large-scale RNA-seq data set. The data set is from the Marine Microbial Eukaryote Transcriptome Sequencing Project (MMETSP), which contains 678 cultured samples of 306 pelagic and endosymbiotic marine eukaryotic species representing more than 40 phyla (Keeling et al. 2014).

The input file is SraRunInfo.csv, a metadata spreadsheet downloaded from NCBI SRA that contains the URL and sample ID information. The scripts were designed for the high-performance computing cluster at Michigan State University's iCER and are launched in parallel through the Portable Batch System (PBS) scheduler. They use SraRunInfo.csv to download and extract the data, run quality control (QC), trim reads, apply digital normalization (diginorm), and then assemble with Trinity. If you want to use these scripts, be aware that they will need modifications specific to your system.
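As a minimal sketch of the first step, the metadata spreadsheet can be read with Python's csv module. The column names `Run`, `download_path`, `SampleName`, and `ScientificName` are assumptions based on the usual layout of an NCBI SraRunInfo.csv export, and `read_run_info` is a hypothetical helper, not a function from this repository. The example rows are made up.

```python
import csv
import io


def read_run_info(handle):
    """Map each run accession to its download URL and sample metadata.

    Column names are assumed, not taken from the pipeline scripts.
    """
    runs = {}
    for row in csv.DictReader(handle):
        accession = (row.get("Run") or "").strip()
        # Skip rows without an accession and repeated header rows.
        if not accession or accession == "Run":
            continue
        runs[accession] = {
            "url": (row.get("download_path") or "").strip(),
            "sample": (row.get("SampleName") or "").strip(),
            "species": (row.get("ScientificName") or "").strip(),
        }
    return runs


# Made-up rows standing in for a real SraRunInfo.csv file.
EXAMPLE = """Run,download_path,SampleName,ScientificName
SRR0000001,https://example.org/SRR0000001,MMETSP0001,Example species one
SRR0000002,https://example.org/SRR0000002,MMETSP0002,Example species two
"""

runs = read_run_info(io.StringIO(EXAMPLE))
for accession, info in sorted(runs.items()):
    print(accession, info["sample"], info["url"])
```

With a real file, `open("SraRunInfo.csv", newline="")` would replace the `io.StringIO` handle.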

The main pipeline scripts in this repository:

  • getdata.py: downloads data from NCBI and organizes it into individual directories for each sample/accession ID
  • trim_qc.py: trims reads for quality and interleaves reads
  • diginorm_mmetsp.py: runs normalize-by-median and filter-abund from khmer, renames, and combines orphans
  • assembly.py: runs the Trinity de novo transcriptome assembly software

Annotation and expression counts (run separately):

Additional scripts (run separately):

Usage:

  1. Clone this repo:

git clone https://github.com/dib-lab/dib-MMETSP.git

  2. Edit dibMMETSP_configuration.py with absolute path names specific to your system. The file SraRunInfo.csv was obtained from NCBI for NCBI BioProject accession PRJNA231566. This code could be used with SraRunInfo.csv input from any collection of SRA records from NCBI or ENA.

  3. Run the main Python script:

python main.py
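Because the configuration step asks for absolute path names, a quick check before launching can catch relative paths early. The sketch below is only an illustration: the setting names `run_info_csv` and `data_dir`, the placeholder paths, and the helper `relative_entries` are all hypothetical, and the actual variable names are whatever dibMMETSP_configuration.py defines.

```python
import os

# Hypothetical settings; the real names live in dibMMETSP_configuration.py.
CONFIG = {
    "run_info_csv": "/path/to/SraRunInfo.csv",
    "data_dir": "/path/to/mmetsp/",
}


def relative_entries(config):
    """Return the names of settings whose value is not an absolute path."""
    return sorted(
        name for name, path in config.items() if not os.path.isabs(path)
    )


bad = relative_entries(CONFIG)
if bad:
    raise SystemExit("Use absolute paths for: " + ", ".join(bad))
print("All configured paths are absolute.")
```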

References

Keeling et al. 2014: http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1001889

Supporting information with methods description: http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1001889#s6

Preliminary assembly protocol run by NCGR: https://github.com/ncgr/rbpa

MMETSP website: http://marinemicroeukaryotes.org/

iMicrobe project with data and combined assembly downloads: ftp://ftp.imicrobe.us/projects/104/

Blog posts: https://monsterbashseq.wordpress.com/2016/09/13/mmetsp-re-assemblies/

http://ivory.idyll.org/blog/2016-mmetsp-a-first-look.html

dib-mmetsp's Issues

Is there a way to extract the assembly sequence based on your annotation file?

Dear ljcohen,

I'm impressed by this work! So happy that I found it.
I mostly focus on the polyketide synthases of dinoflagellates, and I also found a ketosynthase gene in your annotation files.
However, the row names in the annotation are different from the header IDs inside each assembly file. Is there any way to find the connection between the row names and the header IDs in the fasta file?
I would appreciate your reply!

Best

Shaolin

Hard coded Paths

bash-4.1$ python main.py
Directory created: /work/databases/bio/mmetsp_new/mmetsp/
Traceback (most recent call last):
  File "main.py", line 2, in <module>
    import getdata as data
  File "/work/databases/bio/mmetsp_new/dib-MMETSP/getdata.py", line 149, in <module>
    for datafile in datafiles:
NameError: name 'datafiles' is not defined
