cthreepo's Introduction

cthreepo

A Python script to interconvert seq-ids in GFF3, GTF, BED and other files.


Quick start for the impatient

  1. Install using conda
conda install -c bioconda cthreepo 
  2. Execute as follows:
## convert seq-ids in <input.gff3> from RefSeq format (NC_000001.11)
## to UCSC format (chr1) using the Human GRCh38 mapping dictionary
cthreepo -i <input.gff3> -if rs -it uc -f gff3 -m h38 -o <output.gff3>

Introduction

NCBI RefSeq, UCSC and Ensembl use different identifiers for chromosomes in annotation and other files such as GFF3 and GTF. Users who combine files downloaded from different sources in a single pipeline may run into errors caused by mismatched seq-ids. This script converts seq-ids from one style to another so that the files are compatible with each other.
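To make the idea concrete, here is a minimal sketch of what a seq-id conversion amounts to for a single GFF3 record. It is not cthreepo's actual code: the function name `convert_gff3_line`, the two-entry `REFSEQ_TO_UCSC` dictionary, and the choice to leave unknown ids unchanged are all assumptions made for illustration.

```python
# Toy mapping for illustration only; a real run uses a full mapfile.
REFSEQ_TO_UCSC = {"NC_000001.11": "chr1", "NC_000002.12": "chr2"}


def convert_gff3_line(line, mapping):
    """Return the line with its first column (the seq-id) translated."""
    if line.startswith("#") or not line.strip():
        return line  # comments, directives and blank lines pass through
    newline = "\n" if line.endswith("\n") else ""
    fields = line.rstrip("\n").split("\t")
    # Assumption of this sketch: ids missing from the mapping stay as they are.
    fields[0] = mapping.get(fields[0], fields[0])
    return "\t".join(fields) + newline


# A made-up feature record with a RefSeq-style seq-id in column 1.
record = "NC_000001.11\texample\tgene\t100\t200\t.\t+\t.\tID=gene1\n"
print(convert_gff3_line(record, REFSEQ_TO_UCSC), end="")
```

Only the first column changes; coordinates and attributes are untouched, which is why the same approach carries over to other tab-delimited formats.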

Installation and Usage

Python 3 is required for this script to work. Once that requirement is satisfied, install cthreepo using one of the methods below:

Install using conda

conda install -c bioconda cthreepo 

Install using pip

pip install cthreepo

Install from this repository

First, download/clone the repository. Then run:

python3 setup.py install

Usage

## help
cthreepo --help 

## usage
## convert seq-ids in <input.gff3> from RefSeq format (NC_000001.11)
## to UCSC format (chr1) using the Human GRCh38 mapping dictionary
cthreepo \
    --infile <input.gff3> \
    --id_from rs \
    --id_to uc \
    --format gff3 \
    --mapfile h38 \
    --outfile <output.gff3>
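
cthreepo is a command-line program, so its options belong in a shell command rather than inside a Python source file. If you want to drive it from Python, one possible approach is to build the argument list and hand it to `subprocess`. The sketch below uses only the long options shown above; the helper names `build_cthreepo_command` and `run_cthreepo` are hypothetical and not part of the package.

```python
import subprocess


def build_cthreepo_command(infile, outfile, id_from="rs", id_to="uc",
                           fmt="gff3", mapfile="h38"):
    """Assemble the argument list for the cthreepo command-line tool."""
    return [
        "cthreepo",
        "--infile", infile,
        "--id_from", id_from,
        "--id_to", id_to,
        "--format", fmt,
        "--mapfile", mapfile,
        "--outfile", outfile,
    ]


def run_cthreepo(infile, outfile, **options):
    """Run cthreepo; raises CalledProcessError on a non-zero exit status."""
    subprocess.run(build_cthreepo_command(infile, outfile, **options), check=True)


# Show the command that would be executed, without running it.
print(" ".join(build_cthreepo_command("input.gff3", "output.gff3")))
```

Passing the arguments as a list avoids shell quoting problems with file paths that contain spaces.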

File formats supported

  1. GFF3 (default)
  2. GTF
  3. BedGraph
  4. BED
  5. SAM
  6. VCF
  7. WIG
  8. TSV

Mapping files

cthreepo needs a mapfile that it uses to figure out how seq-ids map from one style to the other.

  • Use the built-in shortcuts -- h38, h37, m38 and m37 for GRCh38/hg38, GRCh37/hg19, GRCm38/mm10 and MGSCv37/mm9 respectively. I try to keep these files up-to-date, but if they don't work as expected, I suggest using the latest file by following one of the two options described below.
  • Provide an NCBI assembly accession using the -a parameter. A complete, valid accession.version such as GCF_000001405.39 should be provided.
  • Provide an NCBI assembly report file. For a given assembly it can be downloaded from the NCBI Assembly website. If the 'Download' button is used, this file is called 'Assembly structure report'. On the NCBI Genomes FTP site, these files have the suffix assembly_report.txt.
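
As a rough illustration of how an assembly report can serve as a mapfile, the sketch below builds a lookup table from two of its columns. It assumes the usual layout of `assembly_report.txt` files: tab-separated rows, a commented header line beginning with `# Sequence-Name`, column names such as `RefSeq-Accn` and `UCSC-style-name`, and `na` for missing values. The function `load_mapping` is hypothetical and does not reflect how cthreepo reads these files internally.

```python
def load_mapping(report_lines, id_from, id_to):
    """Build {id_from value: id_to value} from assembly report lines."""
    header = None
    mapping = {}
    for raw in report_lines:
        line = raw.rstrip("\r\n")
        if line.startswith("#"):
            # The column names sit on the last commented line of the preamble.
            if line.startswith("# Sequence-Name"):
                header = line[2:].split("\t")
            continue
        if not line or header is None:
            continue
        row = dict(zip(header, line.split("\t")))
        src, dst = row.get(id_from, "na"), row.get(id_to, "na")
        if src != "na" and dst != "na":  # skip sequences lacking either id
            mapping[src] = dst
    return mapping


# A one-row excerpt in the assumed layout (other preamble lines omitted).
SAMPLE_REPORT = (
    "# Sequence-Name\tSequence-Role\tAssigned-Molecule\t"
    "Assigned-Molecule-Location/Type\tGenBank-Accn\tRelationship\t"
    "RefSeq-Accn\tAssembly-Unit\tSequence-Length\tUCSC-style-name\n"
    "1\tassembled-molecule\t1\tChromosome\tCM000663.2\t=\t"
    "NC_000001.11\tPrimary Assembly\t248956422\tchr1\n"
)

refseq_to_ucsc = load_mapping(
    SAMPLE_REPORT.splitlines(keepends=True), "RefSeq-Accn", "UCSC-style-name"
)
print(refseq_to_ucsc)
```

Looking columns up by name rather than by position keeps the sketch readable and lets the same function produce any pairing of id styles present in the report.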

cthreepo's People

Contributors

marksantcroos, vkkodali

cthreepo's Issues

syntax error

Hi,
Thanks for creating this very helpful script. I'm having trouble running it, and was wondering if you could help me pinpoint the issue with my script.

Here's the error I'm encountering:

+ python cthreepo.py
File "cthreepo.py", line 11
--id_from rs
^
SyntaxError: invalid syntax

Here's my Python script:

gpfs1/home/e/m/embueno/myscripts/cthreepo/ \
--infile /gpfs1/home/e/m/embueno/myscripts/cthreepo/test_files/lepdecOGSv12GCF0005003251.gff3 \
--id_from rs \
--id_to genbank \
--format gff3 \
--mapfile /gpfs1/home/e/m/embueno/myscripts/cthreepo/cthreepo/mapfiles/GCF_000500325.1_Ldec_2.0_assembly_report.txt \
--outfile lepdecOGSv12GCF0005003251GB.gff3

Thank you in advance.

E

packaging?

Hi @vkkodali,

Thanks for making and publishing this. I support the idea that not everyone rolls their own!

Would it be an idea to turn it into a PyPI package, or do you think that is overkill?

Cheers,

Mark

sequence-region lines are dropped

Ensembl GFF3 has a bunch of sequence region headers in the format ##sequence-region 1 1 248956422 that get dropped after seq-id conversion.
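
One way such directives could be carried through a conversion is to translate the seq-id they contain instead of discarding the line. The sketch below is only an illustration of that idea and is not the project's fix; the function name `convert_directive` and the one-entry mapping are assumptions.

```python
def convert_directive(line, mapping):
    """Translate the seq-id in a '##sequence-region' line; return others as-is."""
    parts = line.split()
    if len(parts) == 4 and parts[0] == "##sequence-region":
        # Layout: ##sequence-region <seq-id> <start> <end>
        parts[1] = mapping.get(parts[1], parts[1])
        newline = "\n" if line.endswith("\n") else ""
        return " ".join(parts) + newline
    return line


# Hypothetical Ensembl-to-UCSC mapping with a single entry.
ENSEMBL_TO_UCSC = {"1": "chr1"}
print(convert_directive("##sequence-region 1 1 248956422\n", ENSEMBL_TO_UCSC), end="")
```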
