Bionode Example with Dat and Gasket

This is a basic example of using Dat, Gasket and Bionode to do reproducible bioinformatics.

Install

On your command line, run the following:

git clone git@github.com:bionode/bionode-example-dat-gasket.git
cd bionode-example-dat-gasket
npm install

Run

There are 4 pipelines defined in package.json that you can run in order.

Create a new empty dat repo

npm run init

Get all Eukaryota genomes metadata from NCBI into Dat

npm run fetch

What is happening?

The previous command will run a pipeline named fetch-eukaryota-genomes-metadata that is stored inside the package.json file of this git repository.

The following is a description of that pipeline:

// Initialize a Dat repository (stored locally in the .dat folder)
"dat init --no-prompt",
// Query the NCBI API for the metadata and return it in Newline Delimited JSON format
"bionode ncbi search genome eukaryota",
// Store the data in the dat repository
"dat import --json"

You can inspect the stored data by running dat listen (then open localhost:6461 in your browser) or dat cat | head from the folder that contains the .dat folder.

Search NCBI for our raw data

npm run search

What is happening?

Like the previous command, this runs another gasket pipeline stored in the package.json file.

Here's the description of that pipeline:

// Output all the data in the .dat repository
"dat cat",
// Collect only the JSON object that matches Guillardia
"grep Guillardia",
// Extract the value of the assemblyid property
"tool-stream extractProperty assemblyid",
// Download that genome assembly, which serves as the reference genome
// http://en.wikipedia.org/wiki/Genome_project#Genome_assembly
// http://en.wikipedia.org/wiki/Reference_genome
"bionode ncbi download assembly",
// Wait for the download to complete
"tool-stream collectMatch status completed",
// Get the NCBI unique ID for this assembly again
"tool-stream extractProperty uid",
// Query the NCBI API to get the UID of the project related to this assembly
"bionode ncbi link assembly bioproject",
// Extract the UID of the project
"tool-stream extractProperty destUID",
// Query NCBI to get the UIDs of the related Sequence Read Archive (SRA) files
// http://www.ncbi.nlm.nih.gov/books/NBK47539/
"bionode ncbi link bioproject sra",
// Extract the SRA UIDs
"tool-stream extractProperty destUID",
// **Cheating** To have this example run in minutes rather than hours or days,
// we download only the smallest dataset, using its unique ID, which
// we figured out in advance. In a normal situation, the following command
// would not be here.
"grep 35526"

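The first text-processing steps of this pipeline work because each record occupies exactly one line. The sketch below imitates them with standard shell tools only. The sample records, the organism_name field and the extract_property helper are hypothetical stand-ins: tool-stream parses the JSON properly, while this sed expression only handles flat objects with string values.

```shell
# Hypothetical two-record NDJSON sample; real NCBI records carry many more fields
sample_ndjson() {
  printf '%s\n' \
    '{"uid":"1","organism_name":"Guillardia theta","assemblyid":"1001"}' \
    '{"uid":"2","organism_name":"Homo sapiens","assemblyid":"1002"}'
}

# Rough stand-in for "tool-stream extractProperty <name>" on flat, one-line objects
extract_property() {
  sed -n 's/.*"'"$1"'":"\([^"]*\)".*/\1/p'
}

# Keep the matching record, then print only its assemblyid value
sample_ndjson | grep Guillardia | extract_property assemblyid
```

Because grep matches anywhere on the line, any record that mentions Guillardia in any field is kept.
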
Align the Guillardia theta genomic sequences

// Download the SRA file
"bionode ncbi download sra",
// Extract a FASTQ file containing the DNA sequences from the SRA file
"bionode sra fastq-dump",
// Get the path of the FASTQ file
"tool-stream extractProperty destFile",
// Map the DNA sequences to the reference genome
// http://en.wikipedia.org/wiki/Sequence_alignment
// http://en.wikipedia.org/wiki/List_of_sequence_alignment_software#Short-Read_Sequence_Alignment
"bionode bwa mem **/*fna.gz",
// Wait for the previous step to finish and produce a Sequence Alignment/Map (SAM) file
"tool-stream collectMatch status finished",
// Get the path of the SAM file
"tool-stream extractProperty sam",
// Convert the SAM file to a binary format (BAM)
"bionode sam"

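The status-gating step can be pictured as a filter over progress events. This is only a sketch of the behaviour the comments above describe, not tool-stream's implementation: the event objects, the progress field, the file path and both helper functions are assumptions made for illustration, and the matching assumes JSON without extra whitespace.

```shell
# Hypothetical progress events, one JSON object per line
sample_events() {
  printf '%s\n' \
    '{"status":"running","progress":"40"}' \
    '{"status":"finished","sam":"example/reads.sam"}'
}

# Let through only the events whose status equals the given value
keep_status() {
  grep -F "\"status\":\"$1\""
}

# Print the string value of one property from flat, one-line objects
extract_property() {
  sed -n 's/.*"'"$1"'":"\([^"]*\)".*/\1/p'
}

# Nothing reaches the last step until a "finished" event arrives
sample_events | keep_status finished | extract_property sam
```
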
Once you have a BAM file, you can view which [scaffold](http://en.wikipedia.org/wiki/Contig#Sequence_contigs) of the reference genome has the most [reads](http://www.k.u-tokyo.ac.jp/pros-e/person/shinichi_morishita/genome-assembly.jpg) mapped to it:
```bash
samtools idxstats 35526/SRR070675.bam | sort -k3 -n
```
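
samtools idxstats prints one tab-separated row per reference sequence: name, length, mapped reads and unmapped reads. The following is a minimal sketch of picking the top scaffold from output of that shape. The rows are made up for illustration and are not results from this dataset, and top_scaffold is a hypothetical helper.

```shell
# Hypothetical idxstats-style rows: name, length, mapped reads, unmapped reads
sample_idxstats() {
  printf '%s\t%s\t%s\t%s\n' \
    scaffold_A 1000 12 0 \
    scaffold_B 2500 340 1 \
    scaffold_C 800 57 0 \
    '*' 0 0 9
}

# Sort numerically on the mapped-read column and keep the name of the last row
top_scaffold() {
  sort -t "$(printf '\t')" -k3,3n | tail -n 1 | cut -f1
}

sample_idxstats | top_scaffold
```

With real data, the same helper could be fed from samtools idxstats directly in place of sample_idxstats.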

Then, you can look for variants (SNPs) in that scaffold (JH993052.1):

gunzip 503988/GCA_000315625.1_Guith1_genomic.fna.gz
samtools mpileup -uf 503988/GCA_000315625.1_Guith1_genomic.fna 35526/SRR070675.bam | bcftools view -v snps - | grep JH993052.1
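
grep treats the dot in JH993052.1 as a wildcard and matches anywhere on the line, so similarly named scaffolds or header lines could slip through. One possible tightening, sketched here on hypothetical VCF-style rows, is to compare the first column exactly with awk. The variants_on helper and all sample values are assumptions for illustration.

```shell
# Hypothetical VCF-style rows (tab-separated): CHROM, POS, ID, REF, ALT
sample_variants() {
  printf '%s\t%s\t%s\t%s\t%s\n' \
    JH993052.1 100 . A G \
    JH993052X1 200 . C T \
    JH993052.10 300 . G A \
    JH993052.1 400 . T C
}

# Keep only the rows whose first column is exactly the requested scaffold
variants_on() {
  awk -F '\t' -v chrom="$1" '$1 == chrom'
}

# Prints the positions 100 and 400; grep JH993052.1 would keep all four rows
sample_variants | variants_on JH993052.1 | cut -f2
```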

There are several variants. We can start by looking at the last one with the Text Alignment Viewer (tview):

samtools tview -p JH993052.1:31185 35526/SRR070675.bam 503988/GCA_000315625.1_Guith1_genomic.fna
# Press ? for help
