ensembl-genes

Extract the Ensembl gene catalog to simple tables

This repository extracts a catalog of genes from the Ensembl database for multiple species including human, rat, and mouse. It is ideal for situations where you want to represent genes via stable Ensembl identifiers. Data is extracted by a series of SQL queries, followed by additional transformations in Python using Pandas. Tables are exported in TSV and Parquet formats to output branches that correspond to the Ensembl version.

Motivation

NCBI publishes the Homo_sapiens.gene_info.gz dataset of human genes with one row per gene. It includes useful metadata like the gene symbol, synonyms, and chromosome. However, we weren't able to find a comparable dataset for Ensembl genes (please let us know if one exists). Therefore, we combined several SQL queries, guided by Biostars answers (for example, to retrieve symbols, alternative sequence allele groups, and chromosomes) and by Open Targets pipelines, to extract simplified tabular datasets.

Note that the Ensembl core schema consists of many tables. We may have made mistakes and would appreciate any feedback or contributions. Please use GitHub Issues to get in touch.

Usage

Ensembl stores gene information in databases where each database corresponds to a specific combination of species, release, and genome assembly. Each supported core database receives a corresponding output branch in this repository. For example, see the output/homo_sapiens_core_104_38 branch for datasets generated from Ensembl release 104 of the human genome using the GRCh38 assembly.
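
To make the naming convention concrete, here is a minimal sketch that splits a core database name into its species, release, and assembly parts and derives the matching output branch. The pattern `<species>_core_<release>_<assembly>` is an assumption inferred from the single example above, and the names `CoreDatabase`, `parse_core_database`, and `output_branch` are hypothetical helpers, not part of this repository.

```python
import re
from typing import NamedTuple


class CoreDatabase(NamedTuple):
    """Components of an Ensembl core database name (hypothetical helper type)."""

    species: str
    release: int
    assembly: str


# Assumed pattern, inferred from "homo_sapiens_core_104_38":
# <species>_core_<release>_<assembly>
_CORE_DB_PATTERN = re.compile(
    r"^(?P<species>[a-z0-9_]+)_core_(?P<release>\d+)_(?P<assembly>\w+)$"
)


def parse_core_database(name: str) -> CoreDatabase:
    """Split a core database name into species, release, and assembly."""
    match = _CORE_DB_PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a recognized core database name: {name!r}")
    return CoreDatabase(
        species=match["species"],
        release=int(match["release"]),
        assembly=match["assembly"],
    )


def output_branch(name: str) -> str:
    """Return the output branch for a database, following the example above."""
    return f"output/{name}"


db = parse_core_database("homo_sapiens_core_104_38")
print(db.species, db.release, db.assembly)
print(output_branch("homo_sapiens_core_104_38"))
```

The assembly component is kept as a string because it is only known here as the trailing token of the name (38 for GRCh38 in the example); other species may use different numbering.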

If you'd like to download all files for a specific gene catalog, you can use a command like the following (replacing homo_sapiens_core_104_38 with the desired database, see all current databases here):

# clone the relevant output branch to a local directory
git clone --branch=output/homo_sapiens_core_104_38 --depth=1 https://github.com/related-sciences/ensembl-genes.git
# optionally uninitialize git from the data directory
cd ensembl-genes && rm -rf .git
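
Once a branch is cloned, the TSV tables can be read with any tab-aware reader. The sketch below uses only the Python standard library; the file name `genes.tsv` and the column name `ensembl_gene_id` in the commented usage are placeholders, so check the cloned directory for the actual file and column names. `read_tsv` and `index_by` are hypothetical helpers written for this example.

```python
from __future__ import annotations

import csv
from pathlib import Path


def read_tsv(path: Path) -> list[dict[str, str]]:
    """Read a tab-separated file into a list of rows keyed by the header line."""
    with path.open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


def index_by(rows: list[dict[str, str]], key: str) -> dict[str, dict[str, str]]:
    """Map each row's value in column `key` to the row; fail on duplicate values."""
    index: dict[str, dict[str, str]] = {}
    for row in rows:
        value = row[key]
        if value in index:
            raise ValueError(f"duplicate value in column {key!r}: {value!r}")
        index[value] = row
    return index


# Usage (placeholder file and column names, adjust to the cloned branch):
# rows = read_tsv(Path("ensembl-genes") / "genes.tsv")
# genes = index_by(rows, "ensembl_gene_id")
```

All values are returned as strings. If you prefer typed columns, `pandas.read_csv(path, sep="\t")` reads the same files into a DataFrame.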

Maintainers can create exports for new Ensembl releases by running the export workflow (which is a workflow_dispatch GitHub Action). In addition, CI checks for a new Ensembl release every week, as reported by Bioversions, and runs an export for each species-specific database that does not already have one.

Development

# Install the environment
poetry install --no-root

# Update the lock file
poetry update

# Export datasets to output (change 104 to desired release)
poetry run ensembl_genes datasets --species=human --release=104

# Export notebooks to output (change 104 to desired release)
poetry run ensembl_genes notebooks --species=human --release=104

# Run tests
pytest

# Set up the git pre-commit hooks.
# `git commit` will now trigger automatic checks including linting.
pre-commit install

# Run all pre-commit checks (CI will also run this).
pre-commit run --all

License

This repository is released under the Apache License 2.0 (see LICENSE.md). Furthermore, output datasets are also released under the CC0 1.0 Universal (CC0 1.0) Public Domain Dedication.

Please familiarize yourself with the Ensembl data disclaimer:

Ensembl imposes no restrictions on access to, or use of, the data provided and the software used to analyse and present it. Ensembl data generated by members of the project are available without restriction. …

Some of the data and software included in the distribution may be subject to third-party constraints. Users of the data and software are solely responsible for establishing the nature of and complying with any such restrictions.

The European Molecular Biology Laboratory's European Bioinformatics Institute (EMBL-EBI) provides this data and software in good faith, but make no warranty, express or implied, nor assume any legal liability or responsibility for any purpose for which they are used.

Readings

Here is a list of relevant works that help explain aspects of Ensembl genes:

  1. Accessing alternate sequences in human
    Bronwen Aken
    Ensembl Blog (2011-05-20)

  2. Patches and Haplotypes in the Human Genome
    Ensembl Training
    YouTube (2012-01-21)

  3. Ensembl insights: Annotating readthrough transcription in Ensembl
    Erin Haskell
    Ensembl Blog (2019-02-11)

  4. The Ensembl gene annotation system
    Bronwen L Aken, Sarah Ayling, Daniel Barrell, Laura Clarke, Valery Curwen, Susan Fairley, Julio Fernandez Banet, Konstantinos Billis, Carlos García Girón, Thibaut Hourlier, … Stephen MJ Searle
    Database (2016-06-23)
    DOI: 10.1093/database/baw093 · PMID: 27337980 · PMCID: PMC4919035
