
determination-of-all-control-datasets-in-large-public-repositories

Putative GEO DataSet prediction

Murat Ozturk, Shuhei Sugai, Sergiusz Wesolowski, Luiz Irber


The original task?

Determination of all 'control' samples in public repositories of genomic data.


How we abandoned SRA and moved to GEO

SRA Issues:

  1. No clear SRA metadata format guidelines.
  2. Notoriously non-standard metadata as a result.
  3. Control-informative fields (treatment, disease, tumor, affection_status) are especially sparse.

Switching to a new task on GEO

Automate the process of aggregating samples into comparable DataSets.

A DataSet represents a curated collection of biologically and statistically comparable GEO Samples and forms the basis of GEO's suite of data display and analysis tools.


Motivation

DataSet curation is manual and backlogged.

NGS experiments should be comparable

  • Platforms are machine-specific, but results should not be
  • We group by Series_type, which is a categorical variable
    • Focus on type "expression profiling by high throughput sequencing"

Data source

  • "Apis mellifera"[porgn:__txid7460]
  • Expression profiling by high throughput sequencing
  • 735 Samples (70 Series)
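
The search above maps onto an NCBI E-utilities `esearch` request against the GEO DataSets database (`db=gds`). A minimal sketch, assuming the standard `[porgn]` and `[DataSet Type]` query fields; it only builds the request URL and makes no network call:

```python
from urllib.parse import urlencode

# E-utilities esearch endpoint for the GEO DataSets database (db=gds).
BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

params = {
    "db": "gds",
    # Species restricted by taxonomy ID, plus the experiment type from the text.
    "term": ('"Apis mellifera"[porgn:__txid7460] AND '
             '"expression profiling by high throughput sequencing"[DataSet Type]'),
    "retmax": 1000,
}

# Build the request URL; fetching it would return matching GDS/GSE IDs as XML.
url = BASE + "?" + urlencode(params)
print(url)
```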

Automatic curation: Samples -> DataSets

  • Same Platform
  • Same organism
  • Same experiment type
  • Same calibration (hopefully)
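
The rules above amount to a group-by over sample metadata. A minimal sketch with invented sample records, where the field names `platform`, `organism`, and `series_type` are illustrative stand-ins for the corresponding GEO metadata fields:

```python
from collections import defaultdict

# Toy sample records; IDs and field values are invented for illustration.
samples = [
    {"id": "GSM1", "platform": "GPL16222", "organism": "Apis mellifera",
     "series_type": "Expression profiling by high throughput sequencing"},
    {"id": "GSM2", "platform": "GPL16222", "organism": "Apis mellifera",
     "series_type": "Expression profiling by high throughput sequencing"},
    {"id": "GSM3", "platform": "GPL13231", "organism": "Apis mellifera",
     "series_type": "Expression profiling by high throughput sequencing"},
]

def group_into_datasets(samples):
    """Group samples that share platform, organism, and experiment type."""
    groups = defaultdict(list)
    for s in samples:
        key = (s["platform"], s["organism"], s["series_type"])
        groups[key].append(s["id"])
    return dict(groups)

datasets = group_into_datasets(samples)
for key, members in datasets.items():
    print(key[0], members)
```

Calibration is the one criterion this cannot capture, since it is rarely recorded as a structured field.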

Practical problems

Even though the search specified the full species name, the samples we found span 9 different tax IDs.
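
A quick consistency check surfaces this: count the distinct taxonomy IDs across the returned samples. The IDs below are invented for illustration, except 7460 (Apis mellifera, from the search query):

```python
from collections import Counter

# Hypothetical per-sample taxonomy IDs; 7460 is Apis mellifera, the others
# stand in for the unexpected subspecies/strain tax IDs the search returned.
sample_taxids = [7460, 7460, 1232346, 7460, 1232346, 552467]

taxid_counts = Counter(sample_taxids)
print(f"{len(taxid_counts)} distinct tax IDs:", taxid_counts.most_common())
```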

Dead End (again):

  • None of the Apis mellifera samples had structured expression data.
  • It turns out that GEO does not have a standard for submitting high throughput sequencing data.
  • Raw data is deposited to SRA, and any processed data is included as 'supplementary data' in free-form formats.

We seem to have independently discovered what is in the GEO FAQ:

Processed sequence data files: GEO hosts processed sequence data files, which are linked at the bottom of Sample and/or Series records as supplementary files. Requirements for processed data files are not yet fully standardized and will depend on the nature of the study, but data typically include genome tracks or expression counts.


Not many genes in common

```
 ('CM000063FS008065472', 3),
 ('CH877218FS000002636', 3),
 ('CM000059FS001816901', 3),
 ('CM000063FS006342682', 3),
```
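
These pairs appear to be (feature ID, number of samples containing it). Given per-sample feature sets, such counts, and the set of features shared by every sample, fall out of a `Counter` and a set intersection. The records below are invented for illustration, using the same ID style as above:

```python
from collections import Counter

# Invented per-sample feature-ID sets, mimicking the IDs shown above.
sample_features = {
    "GSM_a": {"CM000063FS008065472", "CH877218FS000002636",
              "CM000059FS001816901"},
    "GSM_b": {"CM000063FS008065472", "CH877218FS000002636"},
    "GSM_c": {"CM000063FS008065472", "CH877218FS000002636",
              "CM000059FS001816901"},
}

# How many samples contain each feature?
occurrence = Counter(f for feats in sample_features.values() for f in feats)

# Features present in every sample -- the usable common core for comparison.
common = set.intersection(*sample_features.values())
print(sorted(occurrence.items()))
print("in all samples:", sorted(common))
```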

Conclusion

Our struggles indicate that the problem is real and that current data repositories are far from perfectly structured. To realize the full potential of such resources, there must be a reliable way of searching and grouping samples based on their metadata.

  1. Enforce stricter rules on sample submission format, with reliable metadata.
  2. Build a metadata search engine capable of identifying potential mislabeling and inconsistencies in submissions.
  3. Finally, build a sample assembler that allows users to find the "closest"-matching control samples for their experiments.
