
bed_annotation's Introduction

BED Annotation


A tool that assigns gene names to regions in a BED file based on Ensembl genomic features overlap.

Requirements

Python 3.6, 3.7, 3.8, 3.9, 3.10.

Installation

pip install bed_annotation

Usage

bed_annotation INPUT.bed -g hg19 -o OUTPUT.bed

The script checks each BED region against the Ensembl genomic features database and writes a BED file in a standardized format, with the gene symbol, exon rank and strand in columns 4-6:

INPUT.bed:

chr1    69090   70008
chr1    367658  368597

OUTPUT.bed:

chr1    69090   70008   OR4F5   1       +
chr1    367658  368597  OR4F29  1       +

Available genomes (to provide with -g): GRCh37, hg19, hg38.
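
For downstream scripts, the 6-column output can be read with a few lines of Python. This is a minimal sketch and not part of bed_annotation: the AnnotatedRegion type and the parse_annotated_bed helper are hypothetical names, and the column order (gene, exon rank, strand) is taken from the example output above.

```python
from typing import Iterable, Iterator, NamedTuple


class AnnotatedRegion(NamedTuple):
    chrom: str
    start: int
    end: int
    gene: str
    exon: str  # kept as text, since the value may not always be a number
    strand: str


def parse_annotated_bed(lines: Iterable[str]) -> Iterator[AnnotatedRegion]:
    """Yield one record per data line of a 6-column annotated BED."""
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and comment/header lines.
        if not line.strip() or line.startswith("#"):
            continue
        # Split on any whitespace so tab- and space-separated files both work.
        fields = line.split()
        if len(fields) < 6:
            raise ValueError(f"expected at least 6 columns, got {len(fields)}: {line!r}")
        chrom, start, end, gene, exon, strand = fields[:6]
        yield AnnotatedRegion(chrom, int(start), int(end), gene, exon, strand)


example = [
    "chr1\t69090\t70008\tOR4F5\t1\t+",
    "chr1\t367658\t368597\tOR4F29\t1\t+",
]
regions = list(parse_annotated_bed(example))
print([r.gene for r in regions])
```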

Transcripts order

The priority for choosing transcripts for annotation is as follows:

  • Overlap % with transcript
  • Overlap % with CDS
  • Overlap % with exons
  • Biotype (protein_coding > others > *RNA > *_decay > sense_* > antisense > translated_* > transcribed_*)
  • TSL (1 > NA > others > 2 > 3 > 4 > 5)
  • Presence of a HUGO gene symbol
  • Is cancer canonical
  • Transcript size
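
One way to read the list above is as a lexicographic sort key: candidates are compared on the first criterion, and later criteria only break ties. The sketch below illustrates that idea for a subset of the criteria. It is not the tool's actual code: the Candidate fields, the simplified biotype rule (protein_coding ahead of everything else) and the assumption that a larger overlap is preferred are all illustrative.

```python
from dataclasses import dataclass

# Rank of each TSL value, lower is better: 1 > NA > others > 2 > 3 > 4 > 5.
TSL_RANK = {"1": 0, "NA": 1, "2": 3, "3": 4, "4": 5, "5": 6}
TSL_OTHER = 2  # any value not listed above sits between NA and 2


@dataclass
class Candidate:
    transcript_id: str
    tx_overlap: float    # overlap % with the transcript
    cds_overlap: float   # overlap % with CDS
    exon_overlap: float  # overlap % with exons
    biotype: str
    tsl: str
    has_hugo: bool


def priority_key(c: Candidate) -> tuple:
    """Smaller tuples sort first, so 'better' values map to smaller numbers."""
    return (
        -c.tx_overlap,    # assumption: larger overlap is preferred
        -c.cds_overlap,
        -c.exon_overlap,
        0 if c.biotype == "protein_coding" else 1,  # simplified biotype rule
        TSL_RANK.get(c.tsl, TSL_OTHER),
        0 if c.has_hugo else 1,
    )


# Hypothetical candidates for a single region.
candidates = [
    Candidate("tx_a", 100.0, 0.0, 100.0, "lincRNA", "1", True),
    Candidate("tx_b", 100.0, 99.7, 100.0, "protein_coding", "NA", True),
    Candidate("tx_c", 100.0, 99.7, 100.0, "protein_coding", "1", True),
]
ranked = sorted(candidates, key=priority_key)
print([c.transcript_id for c in ranked])
```

Because the key is a tuple, TSL only matters here for tx_b and tx_c, which tie on all three overlap values and on biotype.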

Extended annotation

Use the --extended option to report extra columns with details on features, biotype, overlapping transcripts and overlap sizes:

bed_annotation INPUT.bed -g hg19 -o OUTPUT.bed --extended

OUTPUT.bed:

## Tx_overlap_%: part of region overlapping with transcripts
## Exon_overlaps_%: part of region overlapping with exons
## CDS_overlaps_%: part of region overlapping with protein coding regions
#Chrom  Start   End     Gene    Exon  Strand  Feature Biotype         Ensembl_ID      TSL HUGO    Tx_overlap_% Exon_overlaps_% CDS_overlaps_% Ori_Fields
chr1    69090   70008   OR4F5   1     +       capture protein_coding  ENST00000335137 NA  OR4F5   100.0        100.0           99.7
chr1    367658  368597  OR4F29  1     +       capture protein_coding  ENST00000426406 NA  OR4F29  100.0        100.0           99.7
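
The extended output can be loaded into dictionaries keyed by column name. The sketch below assumes the file is tab-separated, that lines starting with "##" are column descriptions, and that the single-"#" line is the header, as in the example above; read_extended_bed is a hypothetical helper name.

```python
import io
from typing import Dict, Iterable, List


def read_extended_bed(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Parse --extended output into a list of dicts keyed by column name."""
    header: List[str] = []
    rows: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # blank line or column description
        if line.startswith("#"):
            # The single-'#' line carries the column names.
            header = line.lstrip("#").split("\t")
            continue
        values = line.split("\t")
        # zip() stops at the shorter sequence, so absent trailing columns
        # (e.g. an empty Ori_Fields) are simply left out of the dict.
        rows.append(dict(zip(header, values)))
    return rows


text = (
    "## Tx_overlap_%: part of region overlapping with transcripts\n"
    "#Chrom\tStart\tEnd\tGene\tExon\tStrand\tFeature\tBiotype\tEnsembl_ID\tTSL\tHUGO"
    "\tTx_overlap_%\tExon_overlaps_%\tCDS_overlaps_%\n"
    "chr1\t69090\t70008\tOR4F5\t1\t+\tcapture\tprotein_coding\tENST00000335137\tNA\tOR4F5"
    "\t100.0\t100.0\t99.7\n"
)
rows = read_extended_bed(io.StringIO(text))
print(rows[0]["Gene"], rows[0]["Ensembl_ID"], float(rows[0]["CDS_overlaps_%"]))
```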

Ambiguous annotations

Regions may overlap multiple genes. The --ambiguities option controls how the script resolves such ambiguities:

  • --ambiguities all -- report all reliable overlaps (ordered by the priority described above)
  • --ambiguities all_ask -- stop execution and ask user which annotation to pick
  • --ambiguities best_all (default) -- find the best overlap, and if there are several equally good, report all (in terms of the "priority" above)
  • --ambiguities best_ask -- find the best overlap, and if there are several equally good, ask user
  • --ambiguities best_one -- find the best overlap, and if there are several equally good, report any of them

Note that the first 4 options might output multiple lines per region, e.g.:

bed_annotation INPUT.bed -g hg19 -o OUTPUT.bed --extended --ambiguities best_all

OUTPUT.bed:

## Tx_overlap_%: part of region overlapping with transcripts
## Exon_overlaps_%: part of region overlapping with exons
## CDS_overlaps_%: part of region overlapping with protein coding regions
#Chrom  Start   End     Gene    Exon    Strand  Feature Biotype Ensembl_ID      TSL     HUGO    Tx_overlap_%    Exon_overlaps_% CDS_overlaps_%
chr1    69090   70008   OR4F5   1       +       capture protein_coding  ENST00000335137 NA      OR4F5   100.0   100.0   100.0
chr1    367658  368597  OR4F29  1       +       capture protein_coding  ENST00000426406 NA      OR4F29  100.0   100.0   100.0
chr1    367658  368597  OR4F29  1       +       capture protein_coding  ENST00000412321 NA      OR4F29  100.0   100.0   100.0
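
The difference between the non-interactive modes can be sketched as follows. This is an illustration of the selection logic only, under the assumption that each overlap has already been reduced to a single comparable score; the resolve function and the score values are hypothetical, the reliability filter applied by "all" is not modelled, and the interactive all_ask and best_ask modes are left out.

```python
from typing import List, Sequence, Tuple

# (label, score) pairs; a higher score means a better overlap in this sketch.
Overlap = Tuple[str, float]


def resolve(overlaps: Sequence[Overlap], mode: str = "best_all") -> List[Overlap]:
    """Pick the overlaps to report for one region (non-interactive modes only)."""
    if not overlaps:
        return []
    # sorted() is stable, so equally scored overlaps keep their input order.
    ranked = sorted(overlaps, key=lambda o: o[1], reverse=True)
    if mode == "all":
        return ranked
    best_score = ranked[0][1]
    ties = [o for o in ranked if o[1] == best_score]
    if mode == "best_all":
        return ties
    if mode == "best_one":
        return ties[:1]
    raise ValueError(f"unsupported mode in this sketch: {mode}")


hits = [("ENST00000426406", 100.0), ("ENST00000412321", 100.0), ("tx_low", 40.0)]
print(resolve(hits, "all"))
print(resolve(hits, "best_all"))
print(resolve(hits, "best_one"))
```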

Other options

  • --coding-only: take only the features of type protein_coding for annotation
  • --high-confidence: annotate with only high confidence regions (TSL is 1 or NA, with HUGO symbol, total overlap size > 50%)
  • --canonical: use only canonical transcripts to annotate (which for the most part means the longest transcript, by SnpEff definition)
  • --short: add only the 4th "Gene" column (outputs a 4-col BED file instead of 6-col)
  • --output-features: good for debugging. Under each BED file region, also output the Ensembl features that were used to annotate it
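
The three --high-confidence conditions can be expressed as a simple predicate over one row of extended output. This is a sketch under stated assumptions, not the tool's implementation: it assumes rows are dicts keyed by the extended column names, that "total overlap size" corresponds to the Tx_overlap_% column, and that a missing HUGO symbol appears as an empty value or ".".

```python
from typing import Dict


def is_high_confidence(row: Dict[str, str]) -> bool:
    """Apply the three --high-confidence conditions to one extended-output row."""
    tsl_ok = row.get("TSL") in ("1", "NA")
    # Treat an empty value or "." as "no HUGO symbol" (an assumption).
    has_hugo = row.get("HUGO", "") not in ("", ".")
    # Assumes "total overlap" corresponds to the Tx_overlap_% column.
    overlap_ok = float(row.get("Tx_overlap_%", 0) or 0) > 50.0
    return tsl_ok and has_hugo and overlap_ok


good = {"TSL": "NA", "HUGO": "OR4F5", "Tx_overlap_%": "100.0"}
weak_tsl = {"TSL": "3", "HUGO": "OR4F5", "Tx_overlap_%": "100.0"}
borderline = {"TSL": "1", "HUGO": "OR4F5", "Tx_overlap_%": "50.0"}
print(is_high_confidence(good), is_high_confidence(weak_tsl), is_high_confidence(borderline))
```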

bed_annotation's People

Contributors

almiheenko, vladsavelyev

bed_annotation's Issues

conda install Python version incompatibility

Hello,

Thanks for the great package; I would love to use it. I'm having some issues during install that I believe are all related to a Python version incompatibility between dependencies (ngsutils) and bed_annotation.

TL;DR It appears that bed_annotation and versionpy require Python 3.7 or 3.8, while the dependency ngsutils requires Python 2.7. Does anyone have a workaround for this? Would it be possible to update bed_annotation to use another tool?

  1. On Apple M1, installing with the basic conda install -c vladsaveliev bed_annotation syntax produces:
`PackagesNotFoundError: The following packages are not available from current channels:

  - bed_annotation

Current channels:

  - https://conda.anaconda.org/vladsaveliev/osx-64
  - https://conda.anaconda.org/conda-forge/osx-64

To search for alternate channels that may provide the conda package you're
looking for, navigate to

    https://anaconda.org

and use the search bar at the top of the page.`

Solution: This can be solved by specifying conda install -c vladsaveliev/linux-64 bed_annotation.

  2. Running conda install -c vladsaveliev/linux-64 bed_annotation produces:
`UnsatisfiableError: The following specifications were found
to be incompatible with the existing python installation in your environment:

Specifications:

  - bed_annotation -> python[version='>=3.7,<3.8.0a0|>=3.8,<3.9.0a0']

Your python: python=3.10`

Solution: Easily fixed by creating a conda environment with Python version 3.7 or 3.8.

  3. Installing within a conda environment with Python 3.7 or 3.8 using conda install -c vladsaveliev/linux-64 bed_annotation results in the very ambiguous error:
UnsatisfiableError: The following specifications were found to be incompatible with each other:

Output in format: Requested package -> Available versions

  4. My latest attempt was to create a conda environment with Python 3.8 and all of bed_annotation's dependencies installed, with the hope that I could then clone the bed_annotation repository to my computer and put the bed_annotation folder in my Python path. This revealed the dependency incompatibility issue. I found that ngsutils requires Python 2.7:
`UnsatisfiableError: The following specifications were found
to be incompatible with the existing python installation in your environment:

Specifications:

  - ngsutils -> python[version='2.7.*|>=2.7,<2.8.0a0']`

From what I can tell, this might be what is causing the ambiguous error above. Seems like ngsutils is no longer in development, but they may have an updated set of tools available. Thanks for any guidance.

Annotate based on refGene.*.txt.gz

Currently, annotation data has to be initialised in 2 steps:

  1. Make RefSeq_knownGene.*.txt using UCSC browser,
  2. Run generate_refseq_data.py that reads RefSeq_knownGene.*.txt and generates all_features.*.bed.
    After that, annotate_bed.py can use all_features.*.bed to annotate BED files on request.

I want to avoid that initialization step, and store the original RefSeq file that can be directly downloaded from UCSC (http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/refGene.txt.gz, and same for hg38). This file can be more easily supported and updated (RefSeq has a new release every day), and I can integrate annotation into BCBio and use their cool system to update reference data.

So I want annotate_bed.py to be able to work directly from refGene.*.txt.gz. The files are already downloaded, and I created function get_refseq_gene(genome) in __init__.py that returns the path.

Since the file is gzipped, it can probably be tabix-indexed and used more efficiently for annotation. Currently annotate_bed.py uses all_features.*.bed, which is sorted so that bedtools intersect can work fast, but that won't be as easy with refGene.*.txt.gz.
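
As a possible starting point for reading refGene.*.txt.gz directly, the sketch below streams the gzipped table and yields one record per transcript. It assumes the standard UCSC refGene column layout (bin, name, chrom, strand, txStart, txEnd, cdsStart, cdsEnd, exonCount, exonStarts, exonEnds, score, name2, ...). RefGeneTx and iter_refgene are hypothetical names, the demo row is synthetic, and indexing (for example with tabix) is left out.

```python
import gzip
import io
from typing import IO, Iterator, List, NamedTuple, Tuple


class RefGeneTx(NamedTuple):
    transcript: str
    chrom: str
    strand: str
    tx_start: int
    tx_end: int
    cds_start: int
    cds_end: int
    exons: List[Tuple[int, int]]
    gene: str


def iter_refgene(handle: IO[str]) -> Iterator[RefGeneTx]:
    """Yield transcripts from an open refGene table (text mode)."""
    for line in handle:
        f = line.rstrip("\n").split("\t")
        if len(f) < 13:
            continue  # skip malformed or truncated lines
        # exonStarts/exonEnds are comma-separated lists with a trailing comma.
        starts = [int(x) for x in f[9].split(",") if x]
        ends = [int(x) for x in f[10].split(",") if x]
        yield RefGeneTx(
            transcript=f[1], chrom=f[2], strand=f[3],
            tx_start=int(f[4]), tx_end=int(f[5]),
            cds_start=int(f[6]), cds_end=int(f[7]),
            exons=list(zip(starts, ends)), gene=f[12],
        )


# Synthetic one-line table, gzipped in memory to mimic a refGene.txt.gz file.
row = "585\tNM_TEST\tchr1\t+\t100\t500\t150\t450\t2\t100,300,\t200,500,\t0\tGENE1\tcmpl\tcmpl\t0,1,\n"
buf = io.BytesIO(gzip.compress(row.encode()))
with gzip.open(buf, "rt") as handle:
    records = list(iter_refgene(handle))
print(records[0].gene, records[0].exons)
```

For a real file, passing a path to gzip.open(path, "rt") in place of the in-memory buffer gives the same kind of text handle.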
