R - COMPLETE

Pipeline for extracting localization elements/motifs using a comparitive approach. Library can be installed and tested on R(>=4.1). For a list of genes, the pipeline downloads full transcript sequences for all organisms (or selected organisms) from ENSEMBL or NCBI, Formats the headers and stores them according to the clusters (optionally taken from OrthoDB or clustered otherwise). Sequences without cluster information are clustered based on sequence identity/sequence coverage of Reciprocal Bidirectional BLAST Hits (RBH) of CDS regions. Next step would be to stitch the transcript regions into full length transcripts and align them. Pipeline is format agnostic(to my knowledge - TMK), packaged with a multi-threaded BLAST framework, to BLAST direclty from R (QuickBLAST - available a precompiled library) coupled with Arrow IPC, can load and convert BLAST formats (internally - to GRanges) and sports functions for performing Reciprocal Bidirectional Hits. The multithreaded BLAST framework can handle many-many BLAST Hits across organisms. The package is interfaced with bash, R transforms the data & calls the shots and bash handles files & BLASTing. Check requirements and installation instuctions before proceeding.

Ironically this repo is incomplete but the functionality in it works. Under Construction Indefinitely. Documentation can be found within the package, Play around with the functions for the rest

Installation (on R - Linux or RStudio Docker with WSL in Windows) :

Download and install QuickBLAST binaries

BiocManager::install(c("remotes))
remotes::install_local("QuickBLAST_1.0_R_x86_64-pc-linux-gnu.tar.gz",build = F, quiet = T)

sudo apt-get update && sudo apt-get install curl bzip2 parallel liblmdb-dev ncbi-blast+ samtools bedtools libz-dev liblzma-dev libbz2-dev libclang-dev gffread curl lsof libboost-dev libparquet-dev

BiocManager::install(c("Rhtslib", "devtools", "BiocManager", "Biostrings", "biomaRt", "S4Vectors", "IRanges", "rtracklayer", "GenomicRanges", "BiocGenerics"))
devtools::install_github("https://github.com/vizkidd/R-COMPLETE/")

REQUIRES :

Linux with BASH ($SHELL must be set or /bin/bash must exist) (export SHELL="/bin/bash")
Config Files
Lot of space in genomes_path, fasta_path and annos_path path locations (in parameters file)
GNU parallel (in $PATH - BASH functions)
GffRead
Samtools (in $PATH - BASH functions)
Bedtools (in $PATH - BASH functions)
ncbi-blast+ (Compile from .src.tar.gz with ./configure && make all_r && sudo make install) (or sudo alien -i ncbi-blast-X.XX.X+-2.src.rpm) (or download binaries) (Docs & Compilation)
- Check if you have the binaries for blastdb_path and makeblastdb
Zlib - (Compile from sources) (or) (sudo apt install libz-dev or yum install zlib-devel)
LZMA SDK - (sudo apt-get install liblzma-dev or yum install xz-devel)
BZLIB - (sudo apt-get install libbz2-dev libclang-dev or yum install bzip2-devel.x86_64)

OrthoDB : (Optional)

OrthoDB (ODB) Flat Files (>= v10.1) (Pipeline is tested with ODB v10.1)
- odb10v1_species.tab.gz - Ortho DB organism ids based on NCBI taxonomy ids (mostly species level)
- odb10v1_genes.tab.gz -Ortho DB genes with some info
- odb10v1_OG2genes.tab.gz - OGs to genes correspondence
  (OR)
- odb10v1_OGgenes_fixed.tab.gz - Merged & Transformed ODB file (Done within pipeline - Only once)
- odb10v1_OGgenes_fixed_user.tab.gz - Merged & Transformed ODB file BASED on user gene list (Done within pipeline - For different gene sets)

NOTE : Set orthodb_path_prefix in parameters file if you use OrthoDB files

Tools - (Paths in parameters file)

MACSE (Path to the .jar)
MAFFT (Compile from sources with extensions because mafft-qinsi is required)
TRANSAT (Download preferred tarball and check INSTALL file)
RNADECODER (Compiled program is in the bin/ of the repo) (Give path to the folder containing the binary)
FastTree

Files (Config)

Parameters
User Data (Optional)
Reference Organisms (COMPLETE_env$org.meta has the list of organisms available)

Run Example :

To run the example, from the context of your current working directory,

Download OrthoDB(ODB) files (optional) and Tools
Check config files
Provide paths and options in the parameters file
- NOTE : Default parameters file (parameters.txt) is at fs::path_package("COMPLETE","pkg_data","parameters.txt")

params_list <- COMPLETE::load_params(fs::path_package("COMPLETE","pkg_data","parameters.txt"))
gene_list = fs::path_package("COMPLETE","pkg_data","genelist.txt")
user_data = fs::path_package("COMPLETE","pkg_data", "user_data.txt")
COMPLETE::EXTRACT_DATA(params_list = params_list, gene_list = gene_list, user_data = user_data, only.user.data = F )
COMPLETE::FIND_TRANSCRIPT_ORTHOLOGS(params_list = params_list, gene_list = gene_list, blast_program = "tblastx", group.mode=COMPLETE_env$FORMAT_ID_INDEX$CLUSTERS, run.mode="both", verbose=F, seed=123)

NOTE : First run will take some time due to conversion of ODB file structure (if OrthoDB is used)

Documentation :

?COMPLETE_PIPELINE_DESIGN (in R docs)

PARAMETERS

The pipeline takes a single parameter file. This design was chosen,

To expose as many options as possible to the end-user
The pipeline uses BASH to BLAST and handle files (significantly faster than R) and the parameter file is shared between R and BASH.
Parameters tagged as output in comment column are outputs from COMPLETE

* Delimited by '=='
* Inputs and Ouputs are specified in the comments
* The file is of the format [param_id==value==comment] where param_id and value columns are CASE-SENSITIVE (hard to check and convert param types in BASH). 
* A default/example file is in fs::path_package("COMPLETE","pkg_data","parameters.txt")

EXTRACT_DATA() - genomes_path, annos_path,

USER DATA : (Optional)

* Columns Org, genome, gtf
* Can accept empty or '-' in genome and/or gtf column. If empty or '-', the genome/gtf is looked up in ENSEMBL or NCBI DBs 
* A default/example file is in fs::path_package("COMPLETE","pkg_data","user_data.txt")

COMPLETE.format.ids

Order of FASTA ID labels are stored in COMPLETE_env$FORMAT_ID_INDEX
Sequences are labelled with the following long ID format of R-COMPLETE (specific to this pipeline and referred to as COMPLETE.format.ids) (seqID_delimiter & transcripID_delimiter set in parameters, :: & || respectively in this context)
COMPLETE.format.ids are indexed(internally) with COMPLETE::index_BLAST_tables() for compatibility

>$transcript_id $transcripID_delimiter $transcript_region ($strand) $seqID_delimiter $seqID_delimiter $org_name $gene_name $seqID_delimiter $ortho_cluster
>SOME_TRANSCRIPT||cds(+)::SOMEORG::RANDOMGENE::ORTHOLOG_CLUSTERS
>ENSDART00000193157||cds(+)::danio_rerio::sulf1::18335at7898,51668at7742,360590at33208

BLAST Functions

COMPLETE::one2one() - one2one BLAST between a list of files
COMPLETE::all2all() - all2all BLAST between a list of files
COMPLETE::run_QuickBLAST*() - Run QuickBLAST - Multithreaded Streaming of FASTA files
COMPLETE::run_BLAST() - Call blast suite of programs from cmdline
COMPLETE::GetQuickBLASTInstance() - Exposed QuickBLAST Object
QuickBLAST

remotes::install_local("QuickBLAST_1.0_R_x86_64-pc-linux-gnu.tar.gz", build=F)
tblastx_ptr <- QuickBLAST::CreateNewBLASTInstance(seq_info = list(0,0,F), program = "tblastx", options = list("evalue"=1e-05, "pident"=0.75, "qcovhsp_perc"=0.75))
blastn_ptr <- QuickBLAST::CreateNewBLASTInstance(seq_info = list(0,0,F), program = "blastn", options = "")
 QuickBLAST::BLAST2Files(ptr=tblastx_ptr, query="ungrouped.cds", subject="ungrouped.cds", outFile="out.tmp", seq_limit=1000, show_progress=T,return_values=F, num_threads=8)
QuickBLAST::BLAST2Seqs(ptr=blastn_ptr, query="AAAAAAAAAAAATTTTTTTTTTTTTTTTTTTTTGGGGGGGGGGGGGGGGGGGGGGGGCCCCCCCCCCCCCCCCCCCC", subject="TTTTTTTTTTTTGGGGGGGGGGGGGGGG")

QuickBLAST Options

Same as BLAST but DB & OUTPUT Format are not available. List of available options can be checked with `COMPLETE::GetAvailableBLASTOptions()` (Emplty elements from the list are removed and BLAST defaults are set on the c++ side). Inputs and Outputs are provided as parameters and sequence specification(strand, sequence type) can be provided during QuickBLAST object creation with `COMPLETE::GetQuickBLASTInstance()` (or use the BLAST2*() functions in R). Enums used by QuickBLAST in C++ are not exposed in R and only integers are used, check `COMPLETE::GetQuickBLASTEnums()`.

FLOW

EXTRACT_DATA() - Extracts the transcript regions for Protein Coding Transcripts (provided in parameters, pipeline requires cds,5utr,3utr) from BIOMART and/or User provided genomes & GTFs. This functions uses biomaRt/biomartr for extracting data from BIOMART and BASH function extract_transcript_regions() for user provided data. Extraction priority/flow : User Data > biomaRt > biomartr
- ODB Files are merged and transformed with BASH function merge_OG2genes_OrthoDB()
- Orthologous genes are found for genes which are not present in the organism with BASH function check_OrthoDB()
- Flank lengths are calculated from GTF data for missing UTRs (with variance correction, check ?calculate_gtf_stats)
- FASTA Nucleotide Sequences for given TRANSCRIPT_REGIONS are fetched from BIOMART/Genome

FIND_TRANSCRIPT_ORTHOLOGS() - Finds transcript-level orthologs based on minimum coverage and/or maximum sequence identity (check ?extract_transcript_orthologs). Has a grouping mode(group.mode) and run mode(run.mode), to group transcript orthologs at the level of organisms, genes or Ortholog Clusters, sequences are grouped into any level of COMPLETE_env$FORMAT_ID_INDEX (Default - COMPLETE_env$FORMAT_ID_INDEX$CLUSTERS) and select transcript orthologs. Gene level grouping has more tight orthology and fewer transcript orthologs. Ortholog Cluster level grouping is a level higher than Genes (Because an Ortholog Cluster can have more than one gene) and have highest number of transcript orthologs with a lot of dissimilarity. Different run modes determine how transcript-level orthologs are selected by their HSP coverages after grouping. After grouping, non-overlapping BLAST hits which maximize coverage for each transcript are chosen with WISARD. Transcripts which do not have bi-directional hits are discarded with COMPLETE::RBH(). Finally, HSP coverage is calculated with COMPLETE::calculate_HSP_coverage() and transcripts are processed according to run.mode option which can be one of,
- "coverage_distance" - Hits are filtered based on distance between bi-directional minimum HSP coverages (coverage_distance <= min_coverage_filter). This option selects more BLAST hits and should be used when the coverage values are very low (and the BLAST Hits/sequences are distant). coverage_distance = 1 - (2 * aligned_length) / (query_length + subject_length). (coverage_distance >= min_coverage_filter)
- "coverage_filter" - Filters Hits based on minimum coverage of HSPs from either direction. Use this option when the coverage values are high (and the BLAST Hits/sequences are closely related). (cov_q >= min_coverage_filter && cov_s >= min_coverage_filter)
- "both" - Uses both "coverage_distance" and "coverage_filter" and is very strict. (Default) . (coverage_distance >= min_coverage_filter && cov_q >= min_coverage_filter && cov_s >= min_coverage_filter)
- "no_filter" - Only calculates HSP coverages and does not filter any Hits .
NOTE : ONLY USE THIS FUNCTION WHEN RUNNING THE PIPELINE OF R-COMPLETE. Use other helper function to work with custom BLAST files not generated by this R package.

vizkidd / r-complete Goto Github PK

r-complete's Introduction

ToC

R - COMPLETE

Installation (on R - Linux or RStudio Docker with WSL in Windows) :

REQUIRES :

OrthoDB : (Optional)

Tools - (Paths in parameters file)

Files (Config)

Run Example :

Documentation :

PARAMETERS

USER DATA : (Optional)

COMPLETE.format.ids

BLAST Functions

QuickBLAST Options

FLOW

r-complete's People

Contributors

Stargazers

Watchers

Recommend Projects

Recommend Topics

Recommend Org