biocyberman / variantdb_challenge

This project is forked from steven-n-hart/variantdb_challenge

Finding a scalable alternative to the VCF File for genomics analysis

Home Page: http://steven-n-hart.github.io/VariantDB_Challenge

License: MIT License

Perl 90.74% JavaScript 5.99% Python 2.28% Shell 0.98%

variantdb_challenge's Introduction

VariantDB_Challenge

Follow me on Twitter (@Variant_Chall) to receive status updates on the project.

This repo is designed to serve as a testing framework for different database schemas, architectures, and query formulations within the context of mining genomics data. For some reason, we in the genomics community (with a few outliers) are stuck in a file-based modus operandi. Most people store variants in either a VCF or a gVCF file. However, there are numerous reasons why this is a bad long-term solution:

  1. VCF files need to be filtered in many different ways depending on the study question
  • For example, if I wanted to exclude variants that change the protein sequence, I would write a filter program on the command line and then write the subsetted VCF into yet another file. This is a huge burden on data storage, and it makes data provenance hard to track.
  2. It's just not a scalable strategy
  • I have a hard enough time keeping track of where all my files from this year are, let alone last year's. Instead, many of us have to keep separate files that track where all of our other files are located.
  3. VCF files do not store metadata
  • VCF files are OK for storing genetic variants, but they lack information about the samples they represent. For example, simple metadata such as which samples are cases or controls, or how old the patient was, is not available in the current specification.
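To make the contrast concrete, here is a minimal sketch of the database alternative, using Python's built-in `sqlite3` module. It stores a few made-up variants next to sample metadata and answers the filtering question from point 1 with a query rather than another subsetted file. The table names, column names, effect labels, and data are all hypothetical and are not taken from this repo's schemas.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with hypothetical variant and sample tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE samples (
            sample_id TEXT PRIMARY KEY,
            status    TEXT,      -- 'case' or 'control'
            age       INTEGER
        );
        CREATE TABLE variants (
            variant_id INTEGER PRIMARY KEY,
            chrom TEXT, pos INTEGER, ref TEXT, alt TEXT,
            effect TEXT          -- e.g. 'missense', 'synonymous'
        );
        CREATE TABLE genotypes (
            sample_id  TEXT REFERENCES samples(sample_id),
            variant_id INTEGER REFERENCES variants(variant_id),
            genotype   TEXT      -- VCF-style call such as '0/1'
        );
    """)
    # Made-up data, for illustration only.
    conn.executemany("INSERT INTO samples VALUES (?, ?, ?)", [
        ("S1", "case", 54),
        ("S2", "control", 61),
    ])
    conn.executemany("INSERT INTO variants VALUES (?, ?, ?, ?, ?, ?)", [
        (1, "chr1", 1000, "A", "G", "synonymous"),
        (2, "chr1", 2000, "C", "T", "missense"),
        (3, "chr2", 500, "G", "A", "intronic"),
    ])
    conn.executemany("INSERT INTO genotypes VALUES (?, ?, ?)", [
        ("S1", 1, "0/1"),
        ("S1", 2, "1/1"),
        ("S1", 3, "0/0"),
        ("S2", 1, "0/0"),
        ("S2", 3, "0/1"),
    ])
    conn.commit()
    return conn


def non_protein_changing_in_cases(conn):
    """Variants carried by cases that do not change the protein sequence."""
    return conn.execute("""
        SELECT DISTINCT v.chrom, v.pos, v.ref, v.alt
        FROM variants v
        JOIN genotypes g ON g.variant_id = v.variant_id
        JOIN samples   s ON s.sample_id  = g.sample_id
        WHERE s.status = 'case'
          AND g.genotype != '0/0'
          AND v.effect NOT IN ('missense', 'nonsense', 'frameshift')
        ORDER BY v.chrom, v.pos
    """).fetchall()


if __name__ == "__main__":
    conn = build_demo_db()
    print(non_protein_changing_in_cases(conn))
```

The filter and the case/control metadata live in one query, so no intermediate file has to be written or tracked. Whether a single-file engine like SQLite holds up at genomic scale is exactly the kind of question this challenge is meant to test.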

It's pretty clear that we need to move to a database if we want to stay on top of the data growth that is to be expected. However, given the scale and nature of genomics data, this isn't a simple solution. Those who aren't in this field may ask,

Why not just throw all your data into a MySQL database? Won't that work?

Technically yes, but pragmatically no. The key limitation is that the RDBMS approach scales up rather than scaling out for this particular use case. Genomics data is somewhat heterogeneous, constantly evolving and growing, and needs to leverage distributed computing and storage. Finding a solution that takes advantage of these key concepts is what this project is all about!
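As a rough picture of what "scaling out" means here, the sketch below spreads variant records across several hypothetical storage nodes by hashing each variant's position, so that adding nodes adds capacity. This is only one possible partitioning scheme, chosen for illustration. The function names and the choice of key are assumptions, not something this project prescribes.

```python
import hashlib
from collections import defaultdict


def shard_for(chrom, pos, n_shards):
    """Map a variant position to a shard index in [0, n_shards)."""
    key = f"{chrom}:{pos}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    return int(digest, 16) % n_shards


def partition(variants, n_shards):
    """Group (chrom, pos, ref, alt) tuples by the shard that would store them."""
    shards = defaultdict(list)
    for chrom, pos, ref, alt in variants:
        shards[shard_for(chrom, pos, n_shards)].append((chrom, pos, ref, alt))
    return shards


if __name__ == "__main__":
    # Made-up variants, for illustration only.
    demo = [
        ("chr1", 1000, "A", "G"),
        ("chr1", 2000, "C", "T"),
        ("chr2", 500, "G", "A"),
        ("chrX", 12345, "T", "C"),
    ]
    for shard_id, records in sorted(partition(demo, 3).items()):
        print(shard_id, records)
```

A single relational server grows by buying a bigger machine (scaling up). A layout like this grows by raising the shard count and adding machines, at the cost of queries that may need to touch several shards.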

The future of genomics depends on our ability to leverage distributed storage and computing, so what database architecture is best? Let's find out by addressing the following challenges:

The purpose of this challenge is not ideological but empirical. We are looking for experts in various architectures to submit solutions that address these challenges. If you think you have a better solution than those that currently exist, please follow the instructions to submit your own response.

What's in it for you?

The plan is to submit a peer-reviewed manuscript showing the results of these challenges. Those who participate will be listed as co-authors. The winner of the challenge will be the first author. Companies are more than welcome to participate.

If you are a fan of open science, Hadoop, Spark, and other Big Data projects, then this challenge is for you!

See the Wiki and Rules pages for more details.

Current Progress

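| Technology | Submitter   | Schema | Import | Query | Cost Projection |
|------------|-------------|--------|--------|-------|-----------------|
| SQLite     | ?           |        |        |       |                 |
| ArangoDB   | StevenNHart |        |        |       |                 |
| MongoDB    | StevenNHart |        |        |       |                 |
| MySQL      | raymond301  |        |        |       |                 |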

Contributors

steven-n-hart, geertvandeweyer, raymond301, neunhoef
