ORTHOLOG SET CONSTRUCTION

Orthology prediction is critical for many genomic studies, particularly comparative genomics and phylogenomics. A variety of tools exist for identifying orthologous gene groups in nucleotide sequence data, but as the number of samples increases, the computational time required can become prohibitive. An alternative approach is to define a set of orthologous groups, say from a subset of samples or from an online repository such as OrthoDB, and then use more computationally tractable methods to match the remaining samples to those orthologous groups.

One excellent tool for orthology prediction against an existing catalog is Orthograph (Petersen et al. 2017). Orthograph takes an existing set of orthologous genes and builds an ortholog set, or catalog, from it. It then uses a graph-based approach to identify putative orthologs in input DNA contigs, such as those from gene calling programs, transcriptomes, or target capture. Here, I present a few scripts I have written that enable rapid creation of orthologous gene sets ready for import into Orthograph.

For now, only tutorials involving gene sets from OrthoDB are presented. However, I have previously used gene sets from OMA, OrthoFinder2, and others in a similar fashion; any method that outputs an orthologous gene set as a single fasta file per orthogroup should be readily adaptable to the scripts here.

CITATION

We have a manuscript in preparation where we demonstrate the utility of these catalogs in evaluating de novo genome sequencing data from museum specimens. More details to come soon.

INSTALLATION

Place the shell scripts and Python scripts in the same directory. You can run them from anywhere or add them to your PATH, as long as the shell and Python scripts always stay together in that directory.

QUICK START GUIDE FOR DOWNLOADING FROM ORTHODB

For a more detailed tutorial on the usage of these scripts, see below.

Run the script ortho_dl.sh to download the orthologs for the target group of interest from OrthoDB:

sh ortho_dl.sh coleoptera 7041 0.8 0.8

Next, parse the fasta files to adjust headers as necessary:

sh ortho_process.sh coleoptera

Once processed, add your new orthologs to Orthograph, or use them as a reference catalog in your own favorite software tools. Remember to create appropriate catalogs of ortholog files.

TUTORIAL FOR DOWNLOADING A SET OF ARACHNIDA ORTHOLOGS FROM ORTHODB

This tutorial generates a set of files and directories suitable for phylogenomic analysis, or for the creation of an ortholog catalog usable by Orthograph. It assumes you have already downloaded and installed Orthograph.

This guide was written for OrthoDB version 10.1, and will hopefully be updated with additional OrthoDB versions as needed.

IDENTIFYING TAXONOMIC ID FROM ORTHODB

This tutorial walks through an example of downloading a set of orthologs for Arachnida from OrthoDB. The first step is to identify the taxonomic level in OrthoDB, which offers a wide range of taxonomic levels to choose from. To see what levels are available, click Advanced on the OrthoDB site. This will display the list of species and taxonomic groupings available. Ortholog sets are available for taxonomic groupings of genomes that have a box around their name, as in this example:

Now for the steps themselves:

Navigate to OrthoDB.org. Refresh the page to clear previous searches from the cache. Click Advanced to bring up the species/taxonomy selection screen, as shown above.
Select Arachnida by clicking the empty box next to the ID. The number next to Arachnida indicates there are 10 genomes present in this taxonomic category. You may need to expand Eukaryota, Metazoa, and Arthropoda before you can click Arachnida, depending on your previous browsing.

Once selected, the level Arachnida will receive a checkmark, as will all of its parent taxonomic levels, as shown above. Click the Submit button.

This page contains the set of orthologs associated with Arachnida based on the default settings of the OrthoDB search. The taxonomic ID is actually part of the URL of this page and is visible in a web browser. We will revisit that momentarily. With no filtering, there will be a total of 16575 orthogroups. Click the back button to return to the previous OrthoDB page, and click Advanced. Arachnida should still be selected; if it is not, re-select it.
Under Phyloprofile, change the first drop down box to present in all species; I will sometimes refer to this value as the universality of the orthologs. Change the second drop down box to single copy in >90% of species. Click submit.
The orthologs displayed are those that are present in all Arachnida genomes on OrthoDB, and single copy in more than 90% of those genomes. The search result should list 2342 orthogroups. For your own analyses, carefully consider universality and single copy thresholds based on your research question and methods. Note that the number shown in the web preview will differ slightly from the number retrieved from online databases due to slight differences in inclusions, a minor issue that OrthoDB is aware of and plans to fix in the future.
Note the level and species ID in the URL bar, as below:

Thus, our taxonomic ID for Arachnida is 6854.
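If you would rather pull the ID out of a copied URL than read it by eye, a small shell helper can do it. This is a minimal sketch: the function name extract_level and the example URL are hypothetical, and it assumes the address carries a level=<number> query parameter, as the query string used in ortho_dl.sh does.

```shell
#!/bin/sh
# Hypothetical helper: print the numeric value of the level= parameter in a URL.
# Assumes the URL contains "level=<digits>" preceded by "?" or "&".
# Prints nothing if no such parameter is found.
extract_level() {
    printf '%s\n' "$1" | sed -n 's/.*[?&]level=\([0-9][0-9]*\).*/\1/p'
}

# Example with an illustrative (unverified) URL:
# extract_level "https://www.orthodb.org/?level=6854&species=6854"
```

The printed value is what you would pass as the second argument to ortho_dl.sh.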

DOWNLOADING ORTHOGROUPS FOR THAT LEVEL

Given a taxonomic ID and thresholds for universality and single copy status, the script ortho_dl.sh will download the orthogroups as unaligned fasta files, one per orthogroup, and store them within a subdirectory. These fasta files are suitable for alignment and analysis, but will contain OrthoDB header information instead of identifiable species epithets. Later outputs of this pipeline will provide fasta files for alignment with recognizable species epithets.

While in the directory where you want your single copy orthologs to be downloaded and processed, run the ortho_dl.sh script as below:

sh ortho_dl.sh arachnida 6854 1 0.9

This command will retrieve the orthogroup identifiers associated with taxonomic ID 6854 that are present in all genomes at that level (universality of 1) and single copy in at least 90% of them (single copy threshold of 0.9). The first argument, the prefix name, is an arbitrary identifier the user sets that will be used in this and subsequent steps to identify the particular ortholog catalog under construction. In this case, the prefix name arachnida is used to identify this particular set of downloads, and a subdirectory called arachnida_orthologs will be made to store the results of this script. See the header information in ortho_dl.sh for additional information.

This can take a while, depending on the number of orthogroups and number of reference species, and requires an active internet connection.
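For reference, the search request behind this step can be assembled from the three numeric arguments. The sketch below is illustrative only: build_search_url and valid_threshold are hypothetical helper names that are not part of the pipeline, the URL pattern is copied from ortho_dl.sh (and may change as OrthoDB changes), and the assumption that thresholds are fractions between 0 and 1 comes from the example values used in this guide.

```shell
#!/bin/sh
# Hypothetical helper: assemble the search URL from level, universality and
# single copy threshold. The URL pattern is copied from ortho_dl.sh.
build_search_url() {
    printf 'https://data.orthodb.org/current/search?query=&level=%s&species=%s&universal=%s&singlecopy=%s&take=100000\n' \
        "$1" "$1" "$2" "$3"
}

# Hypothetical check: succeed only if the argument is a number from 0 to 1
# (an assumption based on the values used in this guide, e.g. 0.8, 0.9, 1).
valid_threshold() {
    awk -v x="$1" 'BEGIN { exit !(x ~ /^(0(\.[0-9]+)?|1(\.0+)?)$/) }'
}

# Example:
# valid_threshold 0.9 && build_search_url 6854 1 0.9
```

Checking the thresholds before sending the request is a cheap way to catch swapped or mistyped arguments before a long download starts.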

Once the script completes, let's take a look at the contents of one of the fasta files:

head arachnida_orthologs/2at6854.fasta

You may notice that the fasta headers begin with a code like 6945_0:003757. This identifies the species and gene ID in the OrthoDB database, but is not intelligible to us. We'll translate it a bit later, based on the other header information and a translation file. For now, list just the headers:

grep ">" arachnida_orthologs/2at6854.fasta

Note that multiple lines begin with 114398_0! This means that orthogroup 2at6854 contains two genes originating from the species with ID 114398_0. As such, this file must be processed to ensure that the duplicated genes are removed, leaving only putative single copy orthologs to build our orthogroups from. It is important to keep in mind that with a single copy threshold below 1, some of the orthogroups in your analysis may have evidence of duplication events. Alternatively, it is possible that some of the duplicated genes are the result of bioinformatics-related circumstances such as misassemblies. Regardless of their origin, the processing script below will remove duplicated genes from fasta files.
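To see which species are duplicated in a given orthogroup without scanning the headers by eye, the species IDs can be counted directly. This is a minimal sketch, not part of the pipeline: duplicated_species is a hypothetical name, and it assumes every header starts with the species ID followed by a colon, as in the 6945_0:003757 example.

```shell
#!/bin/sh
# Hypothetical helper: list species IDs that occur more than once in a fasta file.
# Assumes headers look like ">6945_0:003757 ...", i.e. the species ID is
# everything between ">" and the first colon.
duplicated_species() {
    grep '^>' "$1" | sed 's/^>//' | cut -d: -f1 | sort | uniq -d
}

# Example:
# duplicated_species arachnida_orthologs/2at6854.fasta
```

An empty result means every species contributes at most one gene to that orthogroup.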

PROCESSING ORTHOGROUPS

Next we process our orthogroups to prepare them for Orthograph. This involves:

Fixing headers to be easier to read (and import to Orthograph!)
Removing any species from an orthogroup with evidence of a duplication event
Updating the species IDs to match the actual species IDs, rather than OrthoDB's IDs
Making a single fasta file per species containing all single copy orthologs for that species, with headers corresponding to gene names and Orthograph IDs
Making a TSV containing the connections between orthogroups, species, and gene names in the fasta file headers - a Cluster of Orthologous Groups (COG) file needed by Orthograph - and finally
Ensuring the input files we will add to Orthograph have the same number of entries in the COG file as in the fasta file for each species.

This is all accomplished by the ortho_process script, as demonstrated below:

sh ortho_process.sh arachnida

How did the script do? Well, we can check it out. Inside the arachnida_orthologs directory there should be one .faa and one .log file for each reference. This particular version of OrthoDB has 10 Arachnida references.

ls arachnida_orthologs/*.faa
ls arachnida_orthologs/*.log
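Beyond listing the files, you can confirm that each species has matching entry counts in its fasta and COG outputs. The sketch below is a hedged illustration rather than part of the pipeline: check_counts is a hypothetical name, and it assumes each .faa file has a .log file with the same basename containing exactly one newline-terminated line per gene.

```shell
#!/bin/sh
# Hypothetical check: compare the number of fasta records in each .faa file
# with the number of lines in the matching .log file.
# Assumes one .log line per gene and identical basenames (assumptions, not
# documented behavior of ortho_process.sh).
check_counts() {
    status=0
    for faa in "$1"/*.faa; do
        [ -e "$faa" ] || continue
        log="${faa%.faa}.log"
        if [ ! -f "$log" ]; then
            echo "$(basename "$faa"): no matching .log file"
            status=1
            continue
        fi
        n_faa=$(grep -c '^>' "$faa" || true)
        n_log=$(wc -l < "$log" | tr -d ' ')
        if [ "$n_faa" -eq "$n_log" ]; then
            echo "$(basename "$faa" .faa): $n_faa entries, OK"
        else
            echo "$(basename "$faa" .faa): $n_faa fasta records but $n_log COG lines"
            status=1
        fi
    done
    return "$status"
}

# Example:
# check_counts arachnida_orthologs
```

The function returns a non-zero status if any species is mismatched, so it can gate later steps in a script.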

To create a single COG file covering all taxa, concatenate the per-species .log files:

cat arachnida_orthologs/*.log > tax_for_orthograph.cog

And use the resulting tax_for_orthograph.cog file for importing gene and ortholog associations to Orthograph.

CREATING AN ORTHOGRAPH ORTHOLOG DATABASE

The ortholog database is ready to be created in Orthograph! As Orthograph has its own installation instructions, head on over to https://github.com/mptrsen/Orthograph for details on how to install and run Orthograph, and to configure your own ortholog database from the input files we've generated here!

Inquiry about orthoset_construction error!

Hello. This is Woo Jun Bang, and I'm trying to use this program to build an Orthograph reference data set!

I got an error when I ran 'sh ortho_dl.sh culicidae 7157 0.9 0.9' as a test.

The output was as follows:

sh ortho_dl.sh culicidae 7157 0.9 0.9
/Users/woojunbang/program/orthoset_construction
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1201k    0 1201k    0     0   213k      0 --:--:--  0:00:05 --:--:--  253k
Bad RequestBad RequestBad RequestBad Request

(This error message keeps repeating...)

I get the same message for 'sh ortho_dl.sh coloeoptera 7041 0.8 0.8'.

So I checked the URL format against the site "https://www.ezlab.org/orthodb_userguide.html#standalone-orthologer-software".

However, I could not find any difference from the URLs in your "ortho_dl.sh" script for OrthoDB v10.

I then changed only line 35 of "ortho_dl.sh" to try to fix this, and got the error below. The script as I am running it follows the error output:

/Users/woojunbang/program/orthoset_construction
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1417k    0 1417k    0     0   220k      0 --:--:--  0:00:06 --:--:--  290k
ortho_dl.sh: line 35: unexpected EOF while looking for matching `''
ortho_dl.sh: line 39: syntax error: unexpected end of file

#!/bin/bash
SCRIPTPATH="$( cd "$(dirname "$0")" >/dev/null 2>&1 ; pwd -P )"
echo $SCRIPTPATH
#This script is part of the ortholog construction pipeline written by J. Soghigian.
#It was last edited on 2023-02-28
#This script will download all the fasta files belonging to a taxonomic level that meets user-specified thresholds of inclusion for universality and single-copy nature.
#Usage:
#sh ortho_dl.sh prefix_name level universality level_of_single_copy
#Prefix name refers to the prefix used in downloading and construction of folders, and can be the name of the taxonomic group of interest, e.g. Diptera. Level refers to the internal taxonomic identifier used by OrthoDB, e.g. 7147 for Diptera. Universality is the presence of the ortholog across the genomes on OrthoDB at that taxonomic level, and level of single copy refers to orthologs that are single copy in at least that percentage of genomes. In other words,
#sh ortho_dl.sh diptera 7147 0.9 0.9
#Will include only orthologs found in 90% of the Diptera genomes, and of those, we include only orthologs that are single copy in at least 90% of genomes. It is important to note that this WILL download duplicated genes (we take care of them later). The set will have the prefix diptera.

#First, we will define some variables. We will start with the ortholog database prefix. This is the ortholog set name we might use later for e.g. Orthograph, but the exact name is arbitrary.
ogprefix=$1

#We will now use curl to download a list of fasta file IDs for a given taxonomic level (level=7147) and species/set of species (7147). Consider adjusting the universal/single copy settings as desired - here, universality (presence in genomes) is set to 0.9, and threshold for single copy is also set to 0.9. This means that of all the genomes at this taxonomic level, we include only orthologs found in 90% of the genomes, and of those, we include only orthologs that are single copy in at least 90% of genomes. It is important to note that this WILL download duplicated genes (we take care of them later).
#note if you are targeting a set of orthologs >10k, you'll need to adjust the limit we set as well.
level=$2
uni=$3
sc=$4

curl -o ${ogprefix}.uni0.9single0.9.fasta "https://data.orthodb.org/current/search?query=&level=${level}&species=${level}&universal=${uni}&singlecopy=${sc}&take=100000"
#We will now process this file so that it can be fed into a loop. This will allow us to download each fasta file individually for each ortholog.

cat ${ogprefix}.uni0.9single0.9.fasta | awk -F"[" '{print $2}' | awk -F"]" '{print $1}' | sed 's/"//g' | perl -pe 's/, /\n/g' > ${ogprefix}.listoffasta

#This is now a file that contains OrthoDB IDs for orthologs at a given taxonomic level. E.g., 10359at7203 is Orthogroup 10359 at taxonomic level 7203. This list corresponds to the specifications we used in the curl command above; e.g., the orthogroups contained herein are present in 90% of genomes at that taxonomic level and 90% single copy in those genomes. So with this identifier, we can now download this orthogroup as a fasta file.

#to begin we create a folder to store these orthologs

mkdir ${ogprefix}_orthologs

#now we loop over the aforementioned list of fasta file and download each orthogroup's fasta file. Note that this URL may change as orthoDB changes their URLs. Consult orthoDB for more information.

for line in cat ${ogprefix}.listoffasta; do curl 'https://data.orthodb.org/current/fasta?id='${line}' -o ${ogprefix}_orthologs/${line}.fasta'; sleep 2; done

rm ${ogprefix}.listoffasta
rm ${ogprefix}.uni0.9single0.9.fasta

How can I resolve this issue?

By the way, your phylogenomics paper has been incredibly insightful, and it has helped me greatly!

Thank you.
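A note on the download loop as it is displayed in the script above: the single quotes place the URL, the -o option, and the output path inside one argument with the variables unexpanded, and without a command substitution around cat the loop iterates over the words "cat" and the file name instead of the IDs. The sketch below shows one way to quote such a loop so that curl receives the URL and the output path as separate arguments. It is not the author's fix and addresses shell quoting only; the cause of the "Bad Request" responses is not established here. The names download_orthogroups, the optional fetch argument, and SLEEP_SECONDS are hypothetical, and the URL is the one from the posted script, which OrthoDB may have changed.

```shell
#!/bin/sh
# Sketch only: download one fasta file per orthogroup ID listed in a file.
# download_orthogroups, the fetch argument and SLEEP_SECONDS are hypothetical.
download_orthogroups() {
    list_file=$1
    out_dir=$2
    fetch=${3:-curl}    # command used to fetch each file; defaults to curl
    mkdir -p "$out_dir"
    # Read one ID per line; also handle a last line without a trailing newline.
    while IFS= read -r id || [ -n "$id" ]; do
        [ -n "$id" ] || continue
        # Double quotes expand the variables and keep the URL as one argument;
        # -o and the output path are passed as separate arguments.
        "$fetch" "https://data.orthodb.org/current/fasta?id=${id}" -o "${out_dir}/${id}.fasta"
        sleep "${SLEEP_SECONDS:-2}"
    done < "$list_file"
}

# Example:
# download_orthogroups culicidae.listoffasta culicidae_orthologs
```

Reading the list with while read avoids word splitting on the file contents, and the pause between requests from the posted script is kept.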
