Giter Site home page Giter Site logo

cog's Introduction

COG - Scripts for setting up NCBI's COG function database

When annotating a microbial metagenome or metatranscriptome, organism annotations are often not enough - sometimes, functional annotations are also desired. One useful database for functional annotation of genomes is the Clusters of Orthologous Groups (COG), run by NCBI (https://www.ncbi.nlm.nih.gov/COG/).

These COG files can be downloaded directly from NCBI's FTP site (ftp://ftp.ncbi.nlm.nih.gov/pub/COG/COG2014/data/), but they aren't immediately structured in a searchable form if using a high-speed aligner like DIAMOND. This repo contains a few quick scripts to get a workable merged database file for use as a DIAMOND reference.

Instructions

Setting up the COG database for DIAMOND

To convert the downloaded COG files from NCBI's FTP archive into a single FASTA-structured file that can be loaded into DIAMOND as a database, run the following command:

python merger.py prot2003-2014.fa cog2003-2014.csv cognames2003-2014.tab

This will output a single file called "merged_cogs.fa", that contains all of the COG information in FASTA format:

>gi|103485499|ref|YP_615060.1| chromosomal replication initiation protein [Sphingopyxis alaskensis RB2256] | COG0593 | L
MSGDAAALWPRVAEGLRRDLGARTFDHWLKPVRFADYCALSGVVTLETASRFSANWINERFGDRLELAWRQQLPAVRSVS
VRGGVAATERAATLASVPLPTFDAPAAPAANPALLGFDPRLSFDRFVVARSNILAANAARRMAMVERPQFNPLYLCSGTG
QGKTHLLQAIAQDYAAAHPTATIILMSAEKFMLEFVGAMRGGDMMAFKARLRAADLLLLDDLQFVIGKNSTQEELLHTID
DLMTAGKRLVVTADRPPAMLDGVEARLLSRLSGGLVADIEAPEDDLRERIIRQRLAAMPMVEVPDDVIAWLVKHFTRNIR
ELEGALNKLLAYAALTGARIDLMLAEDRLAENVRSARPRITIDEIQRAVCAHYRLDRSDMSSKRRVRAVARPRQVAMYLA
KELTPRSYPEIGRRFGGRDHSTVIHAVRTVEALRVADSELDAEIAAIRRSLNS
>gi|103485500|ref|YP_615061.1| glutathione S-transferase-like protein [Sphingopyxis alaskensis RB2256] | COG0625 | O
MKLFIGNKAYSSWSLRGWLAARHSGLPFEEVTVPLYNEEWNQRREGDEFAPSGGKVPILWDGADIVVWDSLAIIDYLNEK
TGGTRGYWPDDMAARAMARSMAAEMHSSFAALRREHSMNIRRIYPAAELTPEVQADVIRILQIWAEARARFGGEGDYLFG
DWSAADMMFAPVVTRFITYSIPLPRFALPYAQAVISHPHMQEWIGGAQAEDWVIEKFEGPVEG
>gi|103485501|ref|YP_615062.1| ribosome biogenesis GTP-binding protein YsxC [Sphingopyxis alaskensis RB2256] | COG0218 | D
MSEIELEPGADPERAERARKLFSGPIAFLKSAPALQHLPVPSVPEIAFAGRSNVGKSSLLNALTNRNGLARTSVTPGRTQ
ELNYFDVGEPPVFRLVDMPGYGFAKAPKDVVRKWRFLINDYLRGRQVLKRTLVLIDSRHGIKDVDRDVLEMLDTAAVSYR
LVLTKADKIKASALADVHAATEAEARKHPAAHPEVIATSSEKGMGIAELRTAVLEAVEL

Note that, for each header line (indicated by a > symbol), this merged file now contains the COG ID and the COG category, separated by vertical lines ( | ) from the other entries.

This file can be converted to DIAMOND format easily:

diamond makedb --in merged_cogs.fa -d COG_diamond.dmnd

Annotating sequences against this COG database

Once this database is set up in DIAMOND (.dmnd) format, it can be searched against like any other:

diamond blastx --db COG_diamond.dmnd -q test_file.fastq -a test.cogs

This produces a DIAMOND outfile that must be converted to a viewable format:

diamond view --daa test.cogs.daa -o test.cogs -f tab

Analyzing the results of a DIAMOND search against the COG database

After searching a set of FASTQ or FASTA sequences against the COG database using DIAMOND, it is possible to analyze these results, condensing down the receipt of all matches to get a summary of the most abundantly matched categories.

This can be accomplished using the included Python script, "DIAMOND_COG_analysis_counter.py":

python DIAMOND_COG_analysis_counter.py -I test.cogs -O result.cogs -D merged_cogs.fa

This script produces an outfile, "result.cogs", which contains a summary of the abundance of each COG hit from the DIAMOND annotation step. This output file contains 3 columns:

2.16863838445	2878	 COG3328  | X
1.42340441564	1889	 COG3335  | X
1.32846055309	1763	 COG3436  | X
1.02629794288	1362	 COG2801  | X
0.799487604551	1061	 COG3415  | X
0.785170672896	1042	 COG0366  | G
0.68344510587	907	 COG0480  | J
0.646522492653	858	 COG3385  | X
0.616381583905	818	 COG0488  | R
0.59000828875	783	 COG0085  | K

The first column is the percentage abundance for that particular COG. The second column contains the number of raw hits for that particular COG. The third column contains both the COG ID and the COG category, represented by one or more letters.

This file can be used for other analyses, including comparisons between different annotations or figure generation. One example of figure generation is in the "COG_analysis.R" script included in this repo.

This script runs on the result.cogs outfile:

Rscript COG_analysis.R -I result.cogs -O graph.pdf -T 20

The -O option designates the name of the created image, saved as a PDF vector file. The -T option indicates the cutoff for how many top COG categories are included. The default cutoff is at 20, but fewer or more levels may be included.

The result is a PDF containing a pie chart, along with a legend depicting the top COG categories.

Pie chart

A guide for each letter code can be found here: http://www.sbg.bio.ic.ac.uk/~phunkee/html/old/COG_functions.html

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.