drostlab / orthologr Goto Github PK

Genome wide orthology inference and dNdS estimation

Home Page: https://drostlab.github.io/orthologr/

License: GNU General Public License v2.0

R 73.72% Perl 9.73% C++ 16.02% C 0.30% Shell 0.19% TeX 0.04%

blast-searches orthology-inference comparative-genomics-studies genomics sequence-alignments blast comparative dnds-estimation codon-alignment r

orthologr's Introduction

orthologr

Comparative Genomics with R

Motivation

The comparative method is a powerful approach in genomics research. Based on our knowledge about the phylogenetic relationships between species, we can study the evolution, diversification, and constraints of biological processes by comparing genomes, genes, and other genomic loci across species. The orthologr package aims to provide a framework to perform large scale comparative genomics studies with R. Orthologr aims to be as easy to use as possible - from genomic data retrieval to orthology inference and dNdS estimation between several genomes.

In combination with the R package biomartr, users can retrieve genomes, proteomes, or coding sequences for several species and use them as input for orthology inference and dN/dS estimation with orthologr. The advantage of using biomartr in combination with orthologr is that users can join the new wave of research that promotes and facilitates computational reproducibility in genomics studies and solve the issue of comparing genomes with different genome assembly qualities (also referred to as genome version crisis).

You can find a detailed list of all orthologr functions here: https://drostlab.github.io/orthologr/reference/index.html

Citation

Please cite the following paper in which I introduce orthologr when using this package for your own research. This will allow me to continue working on this software tool and will motivate me to extend its functionality and usability in the next years. Many thanks in advance :)

Drost et al. 2015. Evidence for Active Maintenance of Phylotranscriptomic Hourglass Patterns in Animal and Plant Embryogenesis. Mol. Biol. Evol. 32 (5): 1221-1231. doi:10.1093/molbev/msv012

Short package description

In detail, orthologr allows users to perform orthology inference and dN/dS estimation between two genomes or between several genomes. The following methods to infer orthologous relationships between genes of entire genomes are available in this package:

The most useful implementation in orthologr is the ability to compute synonymous versus non-synonymous substitution rates (dN/dS) for all orthologous genes between two entire genomes. Available dN/dS estimation methods are:

"NG": Nei, M. and Gojobori, T. (1986)
"LWL": Li, W.H., et al. (1985)
"LPB": Li, W.H. (1993) and Pamilo, P. and Bianchi, N.O. (1993)
"MLWL" (Modified LWL), MLPB (Modified LPB): Tzeng, Y.H., et al. (2004)
"YN": Yang, Z. and Nielsen, R. (2000)
"MYN" (Modified YN): Zhang, Z., et al. (2006)
"GMYN": Wang, D.P., et al. Biology Direct. (2009)
"GY": Goldman, N. and Yang, Z. (1994)
"MS": (Model Selection): based on a set of candidate models, Posada, D. (2003)
"MA" (Model Averaging): based on a set of candidate models, Posada, D. (2003)
"ALL": All models toghether

Please find more details here.

Install `orthologr`

# Install Bioconductor
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install()

# Install package dependencies
BiocManager::install(c(
        "Biostrings",
        "GenomicRanges",
        "GenomicFeatures",
        "Rsamtools",
        "rtracklayer"
))

# install CRAN dependencies
install.packages(c("doParallel", "foreach", "ape", "Rdpack", "benchmarkme", "devtools"))

# install BLAST dependency metablastr from GitHub
devtools::install_github("drostlab/metablastr")

# install DIAMOND dependency rdiamond from GitHub
devtools::install_github("drostlab/rdiamond")

# install orthologr from GitHub
devtools::install_github("drostlab/orthologr")

Use Cases

Learn orthologr by reading these tutorials:

Example

Small example with internal dataset

library(orthologr)

# Detect orthologous genes between a query species and a subject species
# and compute the synonymous versus non-synonymous substitution rates (dN/dS)
# following this paradigm:
# 1) reciprocal best hit for orthology inference (RBH)
# 2) Needleman-Wunsch for pairwise amino acid alignments
# 3) pal2nal for codon alignments
# 4) Comeron for dNdS estimation
# 5) multi-core processing 'comp_cores = 1'
dNdS(query_file      = system.file('seqs/ortho_thal_cds.fasta', package = 'orthologr'),
     subject_file    = system.file('seqs/ortho_lyra_cds.fasta', package = 'orthologr'),
     delete_corrupt_cds = TRUE, # coding sequences that cannot be divided by 3 (triplets) will be removed
     ortho_detection = "RBH", # perform DIAMOND best reciprocal hit orthology inference
     aa_aln_type     = "pairwise", # perform pairwise global alignments of AA seqs 
     aa_aln_tool     = "NW", # using Needleman-Wunsch
     codon_aln_tool  = "pal2nal", # perform codon alignments using the tool Pal2Nal
     dnds_est.method = "Comeron", # use Comeron's method for dN/dS inference
     comp_cores      = 1) # number of compute cores

# A tibble: 20 x 24
   query_id subject_id      dN    dS   dNdS perc_identity num_ident_match… alig_length
   <chr>    <chr>        <dbl> <dbl>  <dbl>         <dbl>            <int>       <int>
 1 AT1G010… 333554|PA… 0.106   0.254 0.420           74.0              347         469
 2 AT1G010… 470181|PA… 0.0402  0.104 0.388           91.1              224         246
 3 AT1G010… 470180|PA… 0.0150  0.126 0.118           95.5              343         359
 4 AT1G010… 333551|PA… 0.0135  0.116 0.116           92.0             1812        1970
 5 AT1G010… 909874|PA… 0       0.175 0              100                213         213
 6 AT1G010… 470177|PA… 0.0449  0.113 0.397           89.5              580         648
 7 AT1G010… 918864|PA… 0.0183  0.106 0.173           95.1              348         366
 8 AT1G010… 909871|PA… 0.0340  0.106 0.322           90.3              271         300
 9 AT1G010… 470171|PA… 0.00910 0.218 0.0417          96.8              420         434
10 AT1G011… 333544|PA… 0.0325  0.122 0.266           93.6              494         528
11 AT1G011… 918858|PA… 0.00307 0.133 0.0232          99.2              525         529
12 AT1G011… 470161|PA… 0.00567 0.131 0.0432          98.5              446         453
13 AT1G011… 918855|PA… 0.13    0.203 0.641           72.6              207         285
14 AT1G011… 918854|PA… 0.105   0.280 0.373           84.9              152         179
15 AT1G011… 311317|PA… 0       0.306 0               85.6               83          97
16 AT1G011… 909860|PA… 0.0297  0.176 0.168           92.6              287         310
17 AT1G011… 311315|PA… 0.0287  0.162 0.177           94.2              502         533
18 AT1G012… 470156|PA… 0.0190  0.168 0.114           95.8              228         238
19 AT1G012… 311313|PA… 0.0207  0.154 0.134           95.3              102         107
20 AT1G012… 470155|PA… 0.0157  0.153 0.102           96.7             1021        1056
# … with 16 more variables: mismatches <int>, gap_openings <int>, n_gaps <int>,
#   pos_match <int>, ppos <dbl>, q_start <int>, q_end <int>, q_len <int>, qcov <int>,
#   qcovhsp <int>, s_start <int>, s_end <dbl>, s_len <dbl>, evalue <dbl>, bit_score <dbl>,
#   score_raw <dbl>

When running your own query file, please specify query_file = "path/to/your/cds.fasta instead of system.file(..., package = "orthologr"). The command system.file(..., package = "orthologr") merely references the path to the example file stored in the orthologr package itself.

Example: Computing dN/dS values for all orthologous genes between two genomes

First, users can retrieve all coding sequences from entire genomes using the biomartr package (see details here).

install.packages("biomartr")
library(biomartr)
# download all coding sequences for Mus musculus
Mmusculus_file <- biomartr::getCDS(organism = "Mus musculus", path = getwd())
# download all coding sequences for Homo sapiens
Hsapiens_file <- biomartr::getCDS(organism = "Homo sapiens", path = getwd())
# compute dN/dS values for Homo sapiens versus Mus musculus
Hs_vs_Mm_dNdS <- 
  dNdS(query_file      = Hsapiens_file,
       subject_file    = Mmusculus_file,
       delete_corrupt_cds = FALSE,
       ortho_detection = "RBH", 
       aa_aln_type     = "pairwise",
       aa_aln_tool     = "NW", 
       codon_aln_tool  = "pal2nal", 
       dnds_est.method = "Comeron", 
       comp_cores      = 1 )
# store result in Excel readable csv file
install.packages("readr")
readr::write_excel_csv(Hs_vs_Mm_dNdS, "Hs_vs_Mm_dNdS.csv")

Users can find the corresponding map at https://github.com/drostlab/dNdS_database.

This way, users can compute dN/dS values for any pairwise genome comparison.

On Windows Systems

In some cases (when working with WINDOWS machines), the installation via devtools will not work properly. In this case users can try the follwing steps:

# On Windows, this won't work - see ?build_github_devtools
install_github("drostlab/orthologr", build_vignettes = TRUE, dependencies = TRUE)

# When working with Windows, first users need to install the
# R package: rtools -> install.packages("rtools")

# Afterwards users can install devtools -> install.packages("devtools")
# and then they can run:

devtools::install_github("drostlab/orthologr", build_vignettes = TRUE, dependencies = TRUE)

# and then call it from the library
library("orthologr", lib.loc = "C:/Program Files/R/R-3.1.1/library")

Troubleshooting on Windows Machines

Install orthologr on a Win 8 laptop: solution ( Thanks to Andres Romanowski )

Interfaces implemented in `orthologr`:

Perform BLAST searches with R

blast(): Perform a BLAST+ search
blast_best(): Perform a BLAST+ best hit search
blast_rec(): Perform a BLAST+ best reciprocal hit (BRH) search

Perform Pairwise and Multiple Sequence Alignements with R

multi_aln(): Compute Multiple Sequence Alignments based on the clustalw, t_coffee, muscle, clustalo, and mafft programs.
pairwise_aln(): Compute Pairwise Alignments
codon_aln(): Compute a Codon Alignment

Perform Orthology Inference with R

orthologs(): Main Orthology Inference Function

Perform Population Genomics with R

dNdS(): Compute dNdS values for two organisms
substitutionrate(): Internal function for dNdS computations

Read and Write CDS, Genomes, and Proteomes

read.cds(): Read the CDS of a given organism
read.genome(): Read the genome of a given organism
read.proteome(): Read the proteome of a given organism
write.proteome(): Save a proteome in fasta format

Discussions and Bug Reports

I would be very happy to learn more about potential improvements of the concepts and functions provided in this package.

Furthermore, in case you find some bugs, need additional (more flexible) functionality of parts of this package, or want to contribute to this project please let me know:

https://github.com/drostlab/orthologr/issues

Licenses

The orthologr package includes source code that has been published under following licenses:

gestimator

All files included in `orthologr` that were taken from gestimator are 
also part of libsequence.

libsequence is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 2 of the License, or
(at your option) any later version.

libsequence is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
long with libsequence.  If not, see <http://www.gnu.org/licenses/>.


Modified by Sarah Scharfenberg and Hajk-Georg Drost (2014) to work 
in orthologr without using external libraries from libsequence.

All changes are also free under the terms of GNU General Public License
version 2 of the License, or any later version.

KaKs_Calculator

In orthologr the file parseFastaIntoAXT.pl is stored in /inst/KaKs_Calc_parser.

The parseFastaIntoAXT.pl script is freely available under GNU GPL v3 
Licence and included in the KaKs_Calculator project that can be found at 
https://code.google.com/p/kaks-calculator/

orthologr's People

Contributors

Stargazers

Watchers

orthologr's Issues

Missing sequences even with delete_corrupt_cds = FALSE

I have a file with 251 fasta sequences and when I run blast_rec using the 251 sequences as query_file, the message printed in the console says:

Building a new DB, current time: 08/26/2022 19:18:53
New DB name:   C:\Users\irene\AppData\Local\Temp\RtmpuSmo3J\_blast_db\blastdb_VIRULENCE_GENES_sorted.fasta_protein.fasta
New DB title:  blastdb_VIRULENCE_GENES_sorted.fasta_protein.fasta
Sequence type: Protein
Deleted existing Protein BLAST database named C:\Users\irene\AppData\Local\Temp\RtmpuSmo3J\_blast_db\blastdb_VIRULENCE_GENES_sorted.fasta_protein.fasta
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 245 sequences in 0.0125512 seconds.

therefore, only 245 sequences were blasted, what happened to the remaining 6? I am losing information through the process...

Error in ability to compile

Hi,

Thank you so much in advance. I was hoping to use this packge for my work, but I am having some trouble trying to compile it.
In terms of specs, I am working on a Windows computer with R 3.5.2 in R studio.

if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("Biostrings", version = "3.8")
install.packages("rtools")
install.packages("devtools")
install.packages("pkgbuild")
library(pkgbuild)
find_rtools(debug = TRUE)
devtools::install_github("HajkD/orthologr", build_vignettes = TRUE, dependencies = TRUE)

Error in i.p(...) :
(converted from warning) installation of package ‘C:/Users/DANIEL~1.STE/AppData/Local/Temp/RtmpW6Olzn/file301f4228420e9/orthologr_0.2.0.9000.tar.gz’ had non-zero exit status

I know Windows machines have problems and tried to correct for this by setting the path of my Rtools in the system settings. But this didn't seem to help. I think there seems to be some problem in the compilation of the binaries, but I'm new so I may have missed something.

Thanks again!
Dani

How can I output blast_rec() results to file?

Hi Hajk,

I ran your example command:

# save the BLAST output file to the current working directory
blast_rec(query_file = system.file('seqs/ortho_thal_aa.fasta', package = 'orthologr'),
subject_file = system.file('seqs/ortho_lyra_aa.fasta', package = 'orthologr'),
seq_type = "protein",
save.output = getwd())

I found the output files blastresult_ortho_lyra_aa.fasta.csv and blastresult_ortho_thal_aa.fasta.csv in my working directory. But I'm unable to find a .csv file with the full results of the following output that is printed to the console:

# A tibble: 20 x 21
# Groups: query_id [20]
query_id subject_id perc_identity num_ident_match… alig_length mismatches gap_openings n_gaps

1 AT1G010… 333554|PA… 74.0 347 469 80 8 42
2 AT1G010… 470181|PA… 91.1 224 246 22 0 0
3 AT1G010… 470180|PA… 95.5 343 359 12 2 4
20 AT1G012… 470155|PA… 96.7 1021 1056 35 0 0
# … with 13 more variables: pos_match , ppos , q_start , q_end , q_len ,
# qcov , qcovhsp , s_start , s_end , s_len , evalue ,
# bit_score , score_raw

Note: I excised rows 4 to 19 for brevity.

How can I best access the full table of these results? For example, is there a way to output them to a .csv file?

Thanks in Advance,

Paul

caught segfault, address cause 'invalid permissions'

CDS file could not be read properly, Please make sure the file contains only CDS sequences and is in fasta format.

I have check the filed is in CDS sequence.

BolC010000010-mRNA-1
ATGAAAAGCGGGATAATGTCGTTAATCTTTGCGCTCTATTTAACCGTTATGGTTGTTACAGCGAATAGCG
ATTCCAACTACTACGGGACGACCAAGCCGGTTCATCTGAAGGAGGAGAAAGTGACACGTCTCCACTTCTA
CCTTCATGATATCCTCAGTGGCAGAAACCCCAGCGCCGTCAGAATCGCACACGCCAATCTCACCGGCGGC
GCAGATTCGGCAGTAGGATTCGGAAGCCTGTTCGCGATGGACGATCCGCTCACCGTCGGACCCGGGAAAG
ATTCGAAGGAGATAGGAAACGGGCGTGGAATGTACGTATCCGGGAGCAAAGACATAAGAAAGTTCACGAT
CGTGATGTACGCGGATTTGGCGTTCACGACGGGGAAATTCAACGGAAGCTCGATAAGCGTGTTCTCGAGG
AATCCGGTGGCGGAAGAGGCCGGAGAGAGAGAGATTGGGATCGTGGGAGGACGTGGTAAATTCAGAATGG
CTAGAGGCTTCGTCAATATCAAAACGCATCAGATCGACATGAAGACAGGCGATGCTGTTCTCCGTTACGA
TGCTACTGTTTACCATTATTAA
BolC010000020-mRNA-1
ATGAGGATCTTACCCGAGCCTCTAGGTTCGGTTCCATGCCTCATTCTCCTCGTGTCGTCGGTTCTATTTT
CAGCGACTCTATCCCTCGCTCGTGTCGTCGAAGTTGTTGGTTACGCCGAGAGCAAAATCAAAAATCCCCA
TGCATTCACAGGACTCCGAGTGACGATTGAATGTAAGGGGGAGAAAGGGCATTTTGTTACAAAAGGTTCT
GGAAACATTGATGAGGAAGGAAAGTTCGGTCTGAAAGTTCTTCCTCATGACATTATCTCTGACGACGGAG
CCTTAAAGGAGGAGTGTTATGCCCAGCTTCACAGTGCGGCGGGAGCACCTTGTCCGGCTCACGACGGCCT
TGAGTCAAACAAGATCGTGTTTCTATACACATCCGGAGACAAACACGTTTTGGGCCTCAAACAAAACCTG

this is my CDS file.
is there something that is wrong with my file?

my commands is here;

dNdS(query_file = '/home/data1/bai/Aw/parents/Bra.cds', subject_file = '/home/data1/bai/Aw/parents/BOL.cds', delete_corrupt_cds = TRUE ortho_detection = "RBH", aa_aln_type = "pairwise", aa_aln_tool = "NW", codon_aln_tool = "pal2nal", dnds_est.method = "Comeron", comp_cores = 1 )

Problem with input files in pal2nal

Hi, we ran the program as follows:

library(orthologr)
dNdS(query_file = "/home/rohit/mario/V_vulnificus/orthologr/test1_genesv2.fasta",
 subject_file = "/home/rohit/mario/V_vulnificus/orthologr/test2_genesv2.fasta",
 seq_type = "cds",
 format = "fasta",
 ortho_detection = "RBH",
 eval = "1E-5",
 aa_aln_type = "multiple",
 aa_aln_tool = "muscle",
 codon_aln_tool = "pal2nal",
 dnds_est.method = "YN",
 comp_cores = 30,
 quiet = FALSE,
 clean_folders = FALSE)

However, we found this message when the program tries to use pal2nal: "Can't open /tmp/RtmpFcPl9s/_alignment/multi_aln/q3922_muscle.aln at /home/rohit/josemanuel/programs/R-3.4.0/library/orthologr/pal2nal/pal2nal.v14/pal2nal.pl line 365." Searching on the temporal files, we've seen that the program is calling to all the alignment files as q3922_aa.fasta_q3922_muscle.aln, instead of naming them according to the previuous name (q3922_muscle.aln). We've manually changed the name of the file as it should be, and we've run again the program. After this modification, pal2nal recognises the files giving us at the end the folling file:

"3893" "VV2_1642" "VVMO6_03409" 0.00492299 0.0714542 0.0688972 "YN"
"3894" "VV2_1643" "VVMO6_03410" 0.012254 0.0422009 0.290372 "YN"
"3895" "VV2_1644" "VVMO6_03411" 0.00118025 0.0344713 0.0342386 "YN"
"3896" "VV2_1645" "VVMO6_03412" NA 0.0383651 0 "YN"
"3897" "VV2_1646" "VVMO6_03413" 0.000974747 0.0588005 0.0165772 "YN"
"3898" "VV2_1647" "VVMO6_03414" NA 0.0235932 0 "YN"
"3899" "VV2_1648" "VVMO6_03415" NA 0.0331962 0 "YN"
"3900" "VV2_1649" "VVMO6_03416" 0.00118043 0.0173107 0.0681908 "YN"
"3901" "VV2_1650" "VVMO6_03417" 0.00458339 0.0794807 0.0576667 "YN"
"3902" "VV2_1651" "VVMO6_03418" 0.00266173 0.0492652 0.0540286 "YN"
"3903" "VV2_1652" "VVMO6_03419" 0.00821594 0.098678 0.0832601 "YN"
"3904" "VV2_1653" "VVMO6_03420" NA 0.0265769 0 "YN"

Can you please give us a solution to this error, in order to run it in one single step.

Many thanks in advance.

Problem calling `dNdS` on windows

Hi Hajk,

I am experiencing a problem with dNdS function on windows 10.
When I run the example:

res <- dNdS(query_file      = system.file('seqs/ortho_thal_cds.fasta', package = 'orthologr'),
            subject_file    = system.file('seqs/ortho_lyra_cds.fasta', package = 'orthologr'),
            delete_corrupt_cds = TRUE, 
            ortho_detection = "RBH",
            aa_aln_type     = "multiple", 
            aa_aln_tool     = "clustalw", 
            codon_aln_tool  = "pal2nal", 
            dnds_est.method = "Li", 
            comp_cores      = 1 )

The output ends with an error:

Starting orthology inference (RBH) and dNdS estimation (Li) using the follwing parameters:
query = 'ortho_thal_cds.fasta'
subject = 'ortho_lyra_cds.fasta'
seq_type = 'cds'
e-value: 1E-5
aa_aln_type = 'multiple'
aa_aln_tool = 'clustalw'
comp_cores = '1'


Starting Orthology Inference ...
Running blastp: 2.9.0+ ...


Building a new DB, current time: 07/09/2019 15:53:28
New DB name:   C:\Users\missu\AppData\Local\Temp\RtmpmIiYdX\_blast_db\blastdb_ortho_lyra_cds.fasta_protein.fasta
New DB title:  blastdb_ortho_lyra_cds.fasta_protein.fasta
Sequence type: Protein
Deleted existing Protein BLAST database named C:\Users\missu\AppData\Local\Temp\RtmpmIiYdX\_blast_db\blastdb_ortho_lyra_cds.fasta_protein.fasta
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 20 sequences in 0.0024623 seconds.
Running blastp: 2.9.0+ ...


Building a new DB, current time: 07/09/2019 15:53:29
New DB name:   C:\Users\missu\AppData\Local\Temp\RtmpmIiYdX\_blast_db\blastdb_ortho_thal_cds.fasta_protein.fasta
New DB title:  blastdb_ortho_thal_cds.fasta_protein.fasta
Sequence type: Protein
Deleted existing Protein BLAST database named C:\Users\missu\AppData\Local\Temp\RtmpmIiYdX\_blast_db\blastdb_ortho_thal_cds.fasta_protein.fasta
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 20 sequences in 0.0021401 seconds.
Orthology Inference Completed.
Starting dN/dS Estimation ...



 CLUSTAL 2.1 Multiple Sequence Alignments



Error in seqinr::read.alignment(file = file, format = format) : 
  File C:\Users\missu\AppData\Local\Temp\RtmpmIiYdX/_alignment/codon_aln/q1_pal2nal.aln is not readable

This alignment is not readable since it is not present at the mentioned address.

I trust this is a problem with orthologr:::codon_aln. When I run the example

codon_aln <- codon_aln(file_aln = system.file('seqs/aa_seqs.aln', package = 'orthologr'),
                       file_nuc = system.file('seqs/dna_seqs.fasta', package = 'orthologr'), 
                       format   = "clustal", 
                       tool     = "pal2nal", 
                       get_aln  = TRUE)

the output is

[1] "Codon Alignment successfully written in C:\\Users\\missu\\AppData\\Local\\Temp\\RtmpmIiYdX/_alignment/codon_aln/pal2nal.aln."
Error in value[[3L]](cond) : Something went wront with Pal2Nal.pl .
C:\Users\missu\AppData\Local\Temp\RtmpmIiYdX/_alignment/codon_aln/pal2nal.aln could not be read properly.

and again there is no file at the described address.

I verified in cmd that this call works

perl C:\Users\missu\Documents\R\win-library\3.5\orthologr\pal2nal\pal2nal.v14\pal2nal.pl

output:

pal2nal.pl  (v14)

Usage:  pal2nal.pl  pep.aln  nuc.fasta  [nuc.fasta...]  [options]


    pep.aln:    protein alignment either in CLUSTAL or FASTA format

    nuc.fasta:  DNA sequences (single multi-fasta or separated files)

    Options:  -h            Show help

              -output (clustal|paml|fasta|codon)
                            Output format; default = clustal

              -blockonly    Show only user specified blocks
                            '#' under CLUSTAL alignment (see example)

              -nogap        remove columns with gaps and inframe stop codons

              -nomismatch   remove mismatched codons (mismatch between
                            pep and cDNA) from the output

              -codontable  N
                    1  Universal code (default)
                    2  Vertebrate mitochondrial code
                    3  Yeast mitochondrial code
                    4  Mold, Protozoan, and Coelenterate Mitochondrial code
                       and Mycoplasma/Spiroplasma code
                    5  Invertebrate mitochondrial
                    6  Ciliate, Dasycladacean and Hexamita nuclear code
                    9  Echinoderm and Flatworm mitochondrial code
                   10  Euplotid nuclear code
                   11  Bacterial, archaeal and plant plastid code
                   12  Alternative yeast nuclear code
                   13  Ascidian mitochondrial code
                   14  Alternative flatworm mitochondrial code
                   15  Blepharisma nuclear code
                   16  Chlorophycean mitochondrial code
                   21  Trematode mitochondrial code
                   22  Scenedesmus obliquus mitochondrial code
                   23  Thraustochytrium mitochondrial code


              -html         HTML output (only for the web server)

              -nostderr     No STDERR messages (only for the web server)


    - sequence order in pep.aln and nuc.fasta should be the same.

    - IDs in pep.aln are used in the output.

Additionally I have pal2nal.pl in my PATH. So running pal2nal.pl in cmd also outputs as above.
I will attempt to nail the problem down further when I get more free time.

Thanks!

Kind regards,

Milan

unable to install on R4.2

Hi, I can't install megablastr or orthologr on my version of R (4.2.2). I used devtools::install_github("HajkD/metablastr") and
devtools::install_github("HajkD/orthologR") and got the following errors:
"Downloading GitHub repo HajkD/metablastr@HEAD
Skipping 17 packages not available: ggridges, ggsci, ggplot2, DBI, RSQLite, rtracklayer, fs, IRanges, readr, stringr, tibble, scales, GenomicRanges, GenomicFeatures, Biostrings, dplyr, seqinr
── R CMD build ───────────────────────────────────────────────────────────────────────────────────────────────────
✔ checking for file ‘/private/var/folders/vv/_4lv9dx12j7d3x_j30cs3qdh0000gq/T/RtmpVOKJjs/remotes1dc72978a0c/drostlab-metablastr-0c632b4/DESCRIPTION’ (480ms)
─ preparing ‘metablastr’:
✔ checking DESCRIPTION meta-information ...
─ checking for LF line-endings in source and make files and shell scripts
─ checking for empty or unneeded directories
Omitted ‘LazyData’ from DESCRIPTION
─ building ‘metablastr_0.3.1.tar.gz’

ERROR: dependencies ‘seqinr’, ‘ggridges’ are not available for package ‘metablastr’

removing ‘/Library/Frameworks/R.framework/Versions/4.2/Resources/library/metablastr’
Warning message:
In i.p(...) :
installation of package ‘/var/folders/vv/_4lv9dx12j7d3x_j30cs3qdh0000gq/T//RtmpVOKJjs/file1dc3ae8549a/metablastr_0.3.1.tar.gz’ had non-zero exit status"

"devtools::install_github("HajkD/orthologR")
Downloading GitHub repo HajkD/orthologR@HEAD
Skipping 20 packages not available: Rcpp, Rdpack, scales, ggplot2, ape, readr, tibble, rtracklayer, dtplyr, DBI, IRanges, stringr, RSQLite, Biostrings, foreach, doParallel, dplyr, fs, seqinr, data.table
Downloading GitHub repo HajkD/metablastr@HEAD
Skipping 17 packages not available: ggridges, ggsci, ggplot2, DBI, RSQLite, rtracklayer, fs, IRanges, readr, stringr, tibble, scales, GenomicRanges, GenomicFeatures, Biostrings, dplyr, seqinr
── R CMD build ───────────────────────────────────────────────────────────────────────────────────────────────────
✔ checking for file ‘/private/var/folders/vv/_4lv9dx12j7d3x_j30cs3qdh0000gq/T/RtmpVOKJjs/remotes1dc42fee7c6/drostlab-metablastr-0c632b4/DESCRIPTION’ (385ms)
─ preparing ‘metablastr’:
✔ checking DESCRIPTION meta-information ...
─ checking for LF line-endings in source and make files and shell scripts
─ checking for empty or unneeded directories
Omitted ‘LazyData’ from DESCRIPTION
─ building ‘metablastr_0.3.1.tar.gz’

ERROR: dependencies ‘seqinr’, ‘ggridges’ are not available for package ‘metablastr’

removing ‘/Library/Frameworks/R.framework/Versions/4.2/Resources/library/metablastr’
Downloading GitHub repo HajkD/metablastr@HEAD
Skipping metablastr, it is already being installed
── R CMD build ───────────────────────────────────────────────────────────────────────────────────────────────────
✔ checking for file ‘/private/var/folders/vv/_4lv9dx12j7d3x_j30cs3qdh0000gq/T/RtmpVOKJjs/remotes1dc420101fb/drostlab-orthologr-9c1a50e/DESCRIPTION’ ...
─ preparing ‘orthologr’:
✔ checking DESCRIPTION meta-information ...
─ cleaning src
Warning: Rd macro package 'Rdpack' is not installed.
Warning: /private/var/folders/vv/_4lv9dx12j7d3x_j30cs3qdh0000gq/T/RtmpuYcI8H/Rbuild81575160d39d/orthologr/man/map_generator_lnc.Rd:120: unknown macro '\insertRef'
─ checking for LF line-endings in source and make files and shell scripts
─ checking for empty or unneeded directories
─ building ‘orthologr_0.4.0.tar.gz’

ERROR: dependencies ‘seqinr’, ‘metablastr’, ‘doParallel’, ‘foreach’, ‘Rdpack’ are not available for package ‘orthologr’

removing ‘/Library/Frameworks/R.framework/Versions/4.2/Resources/library/orthologr’
Warning messages:
1: In i.p(...) :
installation of package ‘/var/folders/vv/_4lv9dx12j7d3x_j30cs3qdh0000gq/T//RtmpVOKJjs/file1dc6d229a86/metablastr_0.3.1.tar.gz’ had non-zero exit status
2: In i.p(...) :
installation of package ‘/var/folders/vv/_4lv9dx12j7d3x_j30cs3qdh0000gq/T//RtmpVOKJjs/file1dc764c7b21/orthologr_0.4.0.tar.gz’ had non-zero exit status"

Problem running codon_aln with PAL2NAL

Hi folks,

I am having a problem trying to use the function codon_aln. I run the command as follows:

  Test <- codon_aln(file_aln = system.file('./data/nuc_OG0009506.fasta'), file_nuc = system.file('./data/msa_OG0009506.fasta'), format = "fasta", tool = "pal2nal", get_aln = TRUE)

And I get the error:

Can't open  at /Library/Frameworks/R.framework/Versions/3.6/Resources/library/orthologr/pal2nal/pal2nal.v14/pal2nal.pl line 365.
[1] "Codon Alignment successfully written in /var/folders/qh/ymwbdpcx1s163pxcmfp7mv5h0000gn/T//RtmpDSOlRf/_alignment/codon_aln/pal2nal.aln."
Error in value[[3L]](cond) : Something went wront with Pal2Nal.pl .
/var/folders/qh/ymwbdpcx1s163pxcmfp7mv5h0000gn/T//RtmpDSOlRf/_alignment/codon_aln/pal2nal.aln could not be read properly.

This seems to be the same error message reported in this issue: #5. However, I have been unable to solve it myself.

I'm using R version 3.6.2 (2019-12-12).

Thank you so much- I really appreciate the work you've done putting this package together!

Docker image request

I would like to integrate this tool in Galaxy, as part of the myTAI pipeline, and it would be great if you could provide a docker image, since it would simplify quite a lot the implementation of the tool in Galaxy.

Thanks a lot.

From orthologs to multiple align

Hello,
I have id'ed orthologs between my two finch species assembles using

res <- orthologs("GCF_003957565.2_bTaeGut1.4.pri_protein.faa",
          "GCF_005870125.1_lonStrDom2_protein.faa",
          seq_type="protein",
          ortho_detection = "RBH",
          comp_cores =2,
          clean_folders = FALSE)

readr::write_csv(res, "proteinRBH.csv", col_names = TRUE)

and retrieved my ~17.000+ orthologues

query_id        subject_id      perc_identity  num_ident_matches  alig_length  mismatches  gap_openings  n_gaps  pos_match  ppos   q_start  q_end  q_len  qcov  qcovhsp  s_start  s_end  s_len  evalue                   bit_score  score_raw
NP_001041718.1  XP_021405980.1  98.601         141                143          2           0             0       142        99.3   1        143    143    100   100      1        143    143    1.4599999999999998e-95   272        695
NP_001041719.1  XP_021385112.1  100            193                193          0           0             0       193        100    1        193    193    100   100      1        193    193    2.3399999999999997e-145  402

....

Now I wanna produce an alignment for those guys. Does orthologr has a function to retrieve the sequences for each pair of orthologues from the parent fastas and order than one after the other or I should get it other way?

>NP_001041718.1
aaseqhere
>XP_021405980.1
aaseqhere

for example

Thank you!

clustalw in multi_aln

Hi Hajk,

I installed clustalw with "conda install -c biobuilds clustalw" on my Linux system.

And the binary is called "clustalw2".

So I can not call multi_aln.

best,
Claus

Package release request

would be possible also to provide a package release? It is required for creating the conda package.

Thanks a lot!

error back translating AA alignment to nucleotide pal2nal

Hi,

Thanks for the useful program. I am getting an error that I am struggling to debug. The cmd that I use to run is perl pal2nal.pl all_nt.nhmmer.unaln.gt1300bp.frame1table0_mafft-aligned_test.faa all_nt.nhmmer.unaln.gt1300bp.fa 2> log. In trying to back translate my amino acid alignment to a nucleotide alignment, I get the following error, can you please help?:

#---  ERROR: inconsistency between the following pep and nuc seqs  ---#
>MH218720_5355-6993_1
MMMASKDAPQSADGASGAGQLVPEVNTADPLPMEPVAGPTTAVATAGQVNMIDPWIVNNF
VQSPQGEFTISPNNTPGDILFDLQLGPHLNPFLSHLSQMYNGWVGNMRVRILLAGNAFSA
GKIIVCCVPPGFTSSSLTIAQATLFPHVIADVRTLEPIEMPLEDVRNVLYHTNDNQPTMR
LVCMLYTPLRTGGGSGNSDSFVVAGRVLTAPSSDFSFLFLVPPTIEQKTRAFTVPNIPLQ
TLSNSRFPSLIQGMILSPDASQVVQFQNGRCLIDGQLLGTTPATSGQLFRVRGKINQGAR
TLNLTEVDGKPFMAFDSPAPVGFPDFGKCDWHMRISKTPNNTSSGDPMRSVSVQTNVQGF
VPHLGSIQFDEVFNHPTGDYIGTIEWISQPSTPPGTDIDLWEIPDYGSSLSQAANLAPPV
FPPGFGEALVYFVSAFPGPNNRSAPNDVPCLLPQEYITHFVSEQAPTMGDAALLHYVDPD
TNRNLGEFKLYPGGYLTCVPNGVGAGPQQLPLNGVFLFVSWVSRFYQLKPVGTASTARGR
LGVRRIX
>MH218720/5355-6993
ATGATGATGGCGTCTAAGGACGCCCCTCAAAGCGCTGATGGCGCAAGCGGCGCAGGTCAA
CTGGTGCCGGAGGTTAATACAGCTGACCCCTTACCCATGGAACCTGTGGCTGGGCCAACA
ACAGCCGTAGCCACTGCTGGGCAAGTTAATATGATTGATCCCTGGATTGTTAATAATTTT
GTCCAGTCACCTCAAGGTGAGTTCACAATCTCTCCTAACAATACCCCCGGTGATATTTTG
TTTGATTTACAATTAGGTCCACATCTAAACCCTTTCTTGTCACATTTGTCCCAAATGTAT
AATGGCTGGGTTGGGAACATGAGAGTCAGAATTCTCCTTGCTGGGAATGCATTCTCAGCT
GGAAAGATTATAGTTTGTTGTGTCCCCCCTGGCTTTACATCTTCTTCTCTCACCATAGCT
CAGGCCACATTGTTTCCCCATGTAATTGCTGATGTGAGAACCCTTGAGCCAATAGAAATG
CCCCTCGAGGATGTACGCAATGTCCTCTATCACACCAATGATAATCAACCAACAATGCGG
TTGGTGTGTATGCTATACACGCCGCTCCGCACTGGTGGGGGGTCTGGTAATTCTGATTCC
TTTGTAGTTGCTGGCAGGGTTCTCACAGCCCCTAGTAGCGACTTTAGTTTCTTGTTCCTT
GTCCCGCCTACCATAGAGCAGAAGACTCGGGCTTTCACTGTGCCTAATATCCCCTTGCAA
ACCTTGTCCAATTCTAGGTTTCCTTCCCTCATCCAGGGGATGATTCTGTCCCCCGATGCA
TCTCAAGTGGTCCAATTCCAAAATGGGCGCTGCCTTATAGATGGTCAACTCCTAGGCACT
ACACCCGCTACATCAGGACAGCTGTTCAGAGTAAGAGGAAAGATAAATCAGGGAGCCCGc
acaCTTAACCTCACAGAGGTGGATGGTAAACCATTCATGGCATTTGATTCCCCTGCACCT
GTGGGGTTCCCCGATTTTGGAAAATGTGATTGGCATATGAGAATCAGCAAAACCCCAAAC
AACACAAGtTCAGGTGACCCCATGcgcAGTGTCAGCGTGCAAACCAATGTGCAGGGTTTT
GTGCCACACCTGGGAAGTATACAATTTGATGAAGTGTTTaaccatcCCACAGGTGACTAC
ATTGGCACCATTGAATGGATTTCCCAGCCATCTACACCCCCTGGAACAGATATTGATCTG
TGGGAGATCCCCGATTATGGATCATCCCTTTCCCAAGCAGCTAATCTGGCCCCCCCAGTA
TTCCCCCCTGGATTTGGTGAGGCCCTTGTGTACTTTGTTTCTGCTTTCCCGGGCCCCAAT
AACCGCTCAGCCCCGAATGATGTACCCTGTCTTCTCCCTCAAGAGTACATAACCCACTTT
GTCAGTGAACAAGCCCCAACGATGGGTGACGCAGCCTTACTGCATTATGTCGACCCTGAT
ACCAACAGGAACCTTGGGGAGTTCAAGCTATACCCTGGAGGTTACCTCACCTGTGTACCA
AATGGGGTAGGTGCCGGGCCTCAACAGCTTCCTCTTAATGGTGTTTTTCTCTTTGTTTCT
TGGGTGTCTCGTTTTTATCAGCTTAAGCCTGTGGGAACAGCCAGTACGGCAAGAGGTAGG
CTTGGAGTgcgCCGTATAT


        ###-----   Check the following TBLASTN output:           -----###
        ###-----       your pep -> 'Query'                       -----###
        ###-----       your nuc -> 'Sbjct'                       -----###

Query= MH218720_5355-6993_1
         (547 letters)



>MH218720/5355-6993
          Length = 1639

 Score = 1120 bits (2897), Expect = 0.0
 Identities = 546/546 (100%), Positives = 546/546 (100%)
 Frame = +1

Query: 1   MMMASKDAPQSADGASGAGQLVPEVNTADPLPMEPVAGPTTAVATAGQVNMIDPWIVNNF 60
           MMMASKDAPQSADGASGAGQLVPEVNTADPLPMEPVAGPTTAVATAGQVNMIDPWIVNNF
Sbjct: 1   MMMASKDAPQSADGASGAGQLVPEVNTADPLPMEPVAGPTTAVATAGQVNMIDPWIVNNF 180

Query: 61  VQSPQGEFTISPNNTPGDILFDLQLGPHLNPFLSHLSQMYNGWVGNMRVRILLAGNAFSA 120
           VQSPQGEFTISPNNTPGDILFDLQLGPHLNPFLSHLSQMYNGWVGNMRVRILLAGNAFSA
Sbjct: 181 VQSPQGEFTISPNNTPGDILFDLQLGPHLNPFLSHLSQMYNGWVGNMRVRILLAGNAFSA 360

Query: 121 GKIIVCCVPPGFTSSSLTIAQATLFPHVIADVRTLEPIEMPLEDVRNVLYHTNDNQPTMR 180
           GKIIVCCVPPGFTSSSLTIAQATLFPHVIADVRTLEPIEMPLEDVRNVLYHTNDNQPTMR
Sbjct: 361 GKIIVCCVPPGFTSSSLTIAQATLFPHVIADVRTLEPIEMPLEDVRNVLYHTNDNQPTMR 540

Query: 181 LVCMLYTPLRTGGGSGNSDSFVVAGRVLTAPSSDFSFLFLVPPTIEQKTRAFTVPNIPLQ 240
           LVCMLYTPLRTGGGSGNSDSFVVAGRVLTAPSSDFSFLFLVPPTIEQKTRAFTVPNIPLQ
Sbjct: 541 LVCMLYTPLRTGGGSGNSDSFVVAGRVLTAPSSDFSFLFLVPPTIEQKTRAFTVPNIPLQ 720

Query: 241 TLSNSRFPSLIQGMILSPDASQVVQFQNGRCLIDGQLLGTTPATSGQLFRVRGKINQGAR 300
           TLSNSRFPSLIQGMILSPDASQVVQFQNGRCLIDGQLLGTTPATSGQLFRVRGKINQGAR
Sbjct: 721 TLSNSRFPSLIQGMILSPDASQVVQFQNGRCLIDGQLLGTTPATSGQLFRVRGKINQGAR 900

Query: 301 TLNLTEVDGKPFMAFDSPAPVGFPDFGKCDWHMRISKTPNNTSSGDPMRSVSVQTNVQGF 360
           TLNLTEVDGKPFMAFDSPAPVGFPDFGKCDWHMRISKTPNNTSSGDPMRSVSVQTNVQGF
Sbjct: 901 TLNLTEVDGKPFMAFDSPAPVGFPDFGKCDWHMRISKTPNNTSSGDPMRSVSVQTNVQGF 1080

Query: 361 VPHLGSIQFDEVFNHPTGDYIGTIEWISQPSTPPGTDIDLWEIPDYGSSLSQAANLAPPV 420
           VPHLGSIQFDEVFNHPTGDYIGTIEWISQPSTPPGTDIDLWEIPDYGSSLSQAANLAPPV
Sbjct: 1081VPHLGSIQFDEVFNHPTGDYIGTIEWISQPSTPPGTDIDLWEIPDYGSSLSQAANLAPPV 1260

Query: 421 FPPGFGEALVYFVSAFPGPNNRSAPNDVPCLLPQEYITHFVSEQAPTMGDAALLHYVDPD 480
           FPPGFGEALVYFVSAFPGPNNRSAPNDVPCLLPQEYITHFVSEQAPTMGDAALLHYVDPD
Sbjct: 1261FPPGFGEALVYFVSAFPGPNNRSAPNDVPCLLPQEYITHFVSEQAPTMGDAALLHYVDPD 1440

Query: 481 TNRNLGEFKLYPGGYLTCVPNGVGAGPQQLPLNGVFLFVSWVSRFYQLKPVGTASTARGR 540
           TNRNLGEFKLYPGGYLTCVPNGVGAGPQQLPLNGVFLFVSWVSRFYQLKPVGTASTARGR
Sbjct: 1441TNRNLGEFKLYPGGYLTCVPNGVGAGPQQLPLNGVFLFVSWVSRFYQLKPVGTASTARGR 1620

Query: 541 LGVRRI 546
           LGVRRI
Sbjct: 1621LGVRRI 1638



 Score = 16.2 bits (30), Expect = 1.2
 Identities = 9/33 (27%), Positives = 14/33 (42%)
 Frame = -3

Query: 193 GGSGNSDSFVVAGRVLTAPSSDFSFLFLVPPTI 225
           GG+ N     + G V T P++        PP +
Sbjct: 668 GGTRNKKLKSLLGAVRTLPATTKESELPDPPPV 570



 Score = 15.4 bits (28), Expect = 2.1
 Identities = 6/20 (30%), Positives = 8/20 (40%)
 Frame = -2

Query: 482 NRNLGEFKLYPGGYLTCVPN 501
           N  +    L   GY  C P+
Sbjct: 168 NPGINHINLPSSGYGCCWPS 109


Lambda     K      H
   0.319    0.137    0.424 

Gapped
Lambda     K      H
   0.267   0.0410    0.140 


Matrix: BLOSUM62
Gap Penalties: Existence: 11, Extension: 1
Number of Sequences: 1
Number of Hits to DB: 2018
Number of extensions: 41
Number of successful extensions: 6
Number of sequences better than 10.0: 1
Number of HSP's gapped: 6
Number of HSP's successfully gapped: 6
Length of query: 547
Length of database: 546
Length adjustment: 36
Effective length of query: 511
Effective length of database: 510
Effective search space:   260610
Effective search space used:   260610
Neighboring words threshold: 13
Window for multiple hits: 40
X1: 16 ( 7.4 bits)
X2: 38 (14.6 bits)
X3: 64 (24.7 bits)
S1: 27 (15.3 bits)
S2: 27 (15.0 bits)

I'm assuming this error is because there are multiple hits in bl2seq?

Thanks,

Mark

Clarification on pre-processing data and data types for dNdS estimation

Hello Dr. Hajk-Georg Drost!

For starters thank you for creating this package! It looks incredibly helpful for people who are getting started in phylogenetics—like myself—and I am excited to dive in! Specifically, I am interested in using the dNdS() function but have some questions about the best data type to use.

I have annotated Drosophila assemblies for three closely related taxa from NCBI—here is an example for D. teissieri. From my reading it seems like I would need to filter the _protein.faa.gz file for the longest isoform using the retrieve_longest_isoforms() function to avoid potential duplications and subsequent underestimation of the number of 1:1 orthologs—is this correct or does the dNdS() function do this natively? Additionally, if I have annotated assemblies is it better to use a proteome file versus the nucleotide equivalent _cds_from_genomic.fna.gz file for the dNdS() function?

Thanks in advance and I look forward to diving into this package this weekend!

-Dave

ERROR: number of input seqs differ (aa: 1; nuc: 2)!!

Hello, I'm running dNdS() on the cds of 2 species containing 13486 orthologous pairs, but only 1754 genes get the calculations done for. The rest runs into this error.

ERROR: number of input seqs differ (aa: 1; nuc: 2)!!

I'm running the program as follow:

rm(list=ls())
library(orthologr);
getwd();
workingDir = "/users/mfariasv/data/mfariasv/aligned_newBFV2/dNdS/"
setwd(workingDir);
args = commandArgs(trailingOnly=TRUE)
query = args[1]
subject = args[2]
res <- dNdS( query , subject ,
                 ortho_detection = "RBH",
                 seq_type = "cds",
                 aa_aln_type     = "multiple",
                 aa_aln_tool     = "clustalo",
                 codon_aln_tool  = "pal2nal",
                 dnds_est.method = "YN",
                 comp_cores      = 1,
                 store_locally = TRUE)
write.csv(res, gsub(".fa","ZF.dNdS", basename(args[2])))

The program runs:

Starting orthology inference (RBH) and dNdS estimation (YN) using the follwing parameters:
query = 'ZFcdsorth.fa'
subject = 'BFcdsorth.fa'
seq_type = 'cds'
e-value: 1E-5
aa_aln_type = 'multiple'
aa_aln_tool = 'clustalo'
comp_cores = '1'


Creating folder 'orthologr_alignment_files' to store alignment files ...
Starting Orthology Inference ...
Running blastp: 2.9.0+ ...
There seem to be 6 coding sequences in your input dataset which cannot be properly divided in base triplets, because their sequence length cannot be divided by 3.
A fasta file storing all corrupted coding sequences for inspection was generated and stored at '/gpfs/data/ehuertas/mfariasv/aligned_newBFV2/dNdS/ZFcdsorth.fa_corrupted_cds
_seqs.fasta'.


You chose option 'delete_corrupt_cds = FALSE', thus corrupted coding sequences were retained for subsequent analyses.
The following modifications were made to the CDS sequences that were not divisible by 3:
- If the sequence had 1 residue nucleotide then the last nucleotide of the sequence was removed.
- If the sequence had 2 residue nucleotides then the last two nucleotides of the sequence were removed.
If after consulting the file 'ZFcdsorth.fa_corrupted_cds_seqs.fasta' you wish to remove all corrupted coding sequences please specify the argument 'delete_corrupt_cds = TRU
E'.
All corrupted CDS were trimmed.


Building a new DB, current time: 01/24/2022 19:25:22
New DB name:   /tmp/RtmpUFtcuE/_blast_db/blastdb_BFcdsorth.fa_protein.fasta
New DB title:  blastdb_BFcdsorth.fa_protein.fasta
Sequence type: Protein
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 13486 sequences in 0.380335 seconds.
Running blastp: 2.9.0+ ...
There seem to be 6 coding sequences in your input dataset which cannot be properly divided in base triplets, because their sequence length cannot be divided by 3.
A fasta file storing all corrupted coding sequences for inspection was generated and stored at '/gpfs/data/ehuertas/mfariasv/aligned_newBFV2/dNdS/ZFcdsorth.fa_corrupted_cds
_seqs.fasta'.


You chose option 'delete_corrupt_cds = FALSE', thus corrupted coding sequences were retained for subsequent analyses.
The following modifications were made to the CDS sequences that were not divisible by 3:
- If the sequence had 1 residue nucleotide then the last nucleotide of the sequence was removed.
- If the sequence had 2 residue nucleotides then the last two nucleotides of the sequence were removed.
If after consulting the file 'ZFcdsorth.fa_corrupted_cds_seqs.fasta' you wish to remove all corrupted coding sequences please specify the argument 'delete_corrupt_cds = TRU
E'.
All corrupted CDS were trimmed.


Building a new DB, current time: 01/24/2022 20:21:27
New DB name:   /tmp/RtmpUFtcuE/_blast_db/blastdb_ZFcdsorth.fa_protein.fasta
New DB title:  blastdb_ZFcdsorth.fa_protein.fasta
Sequence type: Protein
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 13486 sequences in 0.404176 seconds.
There seem to be 6 coding sequences in your input dataset which cannot be properly divided in base triplets, because their sequence length cannot be divided by 3.
A fasta file storing all corrupted coding sequences for inspection was generated and stored at '/gpfs/data/ehuertas/mfariasv/aligned_newBFV2/dNdS/ZFcdsorth.fa_corrupted_cds
_seqs.fasta'.


You chose option 'delete_corrupt_cds = FALSE', thus corrupted coding sequences were retained for subsequent analyses.
The following modifications were made to the CDS sequences that were not divisible by 3:
- If the sequence had 1 residue nucleotide then the last nucleotide of the sequence was removed.
- If the sequence had 2 residue nucleotides then the last two nucleotides of the sequence were removed.
If after consulting the file 'ZFcdsorth.fa_corrupted_cds_seqs.fasta' you wish to remove all corrupted coding sequences please specify the argument 'delete_corrupt_cds = TRU
E'.
All corrupted CDS were trimmed.
Orthology Inference Completed.
Starting dN/dS Estimation ...

ERROR: number of input seqs differ (aa: 1;  nuc: 2)!!

   aa  'A1CF'
   nuc 'A1CF A1CF'
*****************************************************************
Function: Parse fasta file with aligned pairwise sequences into AXT file
Reference: Zhang Z, Li J, Zhao XQ, Wang J, Wong GK, Yu J: KaKs Calculator: Calculating Ka and Ks through model selection and model averaging. Genomics Proteomics Bioinforma
tics 2006 , 4:259-263.
Web Link: Documentation, example and updates at <http://code.google.com/p/kaks-calculator>
*****************************************************************

I noticed that all the orthologous pairs for which the error DOES NOT have different names

[mfariasv@login005 dNdS]$ head BFcdsorthZF.dNdS
"","query_id","subject_id","dN","dS","dNdS","method","perc_identity","num_ident_matches","alig_length","mismatches","gap_openings","n_gaps","pos_match","ppos","q_start","q_end","q_len","qcov","qcovhsp","s_start","s_end","s_len","evalue","bit_score","score_raw"
"1","ABCF2","LOC110475106",0.000859565,0.0378507,0.0227094,"YN",99.801,501,502,1,0,0,501,99.8,52,553,553,100,91,123,624,624,0,1051,2719

[mfariasv@login005 dNdS]$ sed -n -e '/ABCF2/,/>/ p' ZFcdsorth.fa
>ABCF2
ATGCCCTCCGACCTGGCCAAAAAGAAGGCGGCCAAGAAGAAGGAGGCGGCCAAGGCCCGG
CAGCGGCCGCGCCGGGTCCCGGACGAGAACGGTGATGCCGGGACGGAGCCGCAGGAAGTC
CGGTCCCCGGAGGCCAACGGGACGGTGCTGCCAGGGAAATCCATGCTTTTGTCAGCTATT
GGGAAGCGAGAAGTGCCTATCCCAGAGCACATTGACATCTATCACCTGACCCGAGAGATG
CCTCCCAGTGACAAGACCCCTCTGCAGTGTGTGATGGAAGTGGATACAGAGAGGGCCATG
TTGGAGCGAGAAGCGGAACGTTTAGCTCATGAAGATGCGGAATGTGAGAAACTCCTGGAG
TTATATGAACGCCTGGAGGAGCTGGATGCTGATAAGGCAGAAGCACGAGCCTCACGTATC
CTTCACGGCTTGGGGTTCACACCGGCCATGCAGAGGAAGAAGCTGAAGGACTTCAGTGGT
GGCTGGCGAATGAGGGTGGCCCTTGCCAGAGCGCTCTTCATTCGGCCTTTCATGCTGCTG
CTTGATGAGCCCACAAACCACCTTGACCTGGATGCCTGTGTGTGGTTGGAGGAAGAGCTG
AAAACGTTCAAGCGGATTCTTGTGCTGATATCCCACTCCCAGGACTTCCTGAATGGCGTC
TGCACCAACATCATCCACATGCACAACCGCAAACTTAAGTACTACACGGGAAATTATGAT
CAGTATGTAAAGACTCGCTTAGAACTAGAAGAAAATCAAATGAAGCGATTCCACTGGGAG
CAAGATCAGATTGCTCATATGAAGAATTACATTGCACGATTTGGCCATGGTAGTGCGAAG
CTGGCCAGGCAAGCTCAGAGCAAGGAGAAGACCCTTCAAAAAATGATGGCTTCTGGCTTG
ACAGAGAGAGTTGTGAATGATAAGACTTTATCATTCTACTTTCCACCCTGTGGGAAAATT
CCCCCTCCTGTCATCATGGTGCAGAATGTCAGCTTCAGATACACCAAGGATGGGCCATGG
ATCTATAATAACCTGGAGTTTGGGATTGATCTGGATACTCGTGTAGCTCTTGTTGGACCC
AATGGAGCTGGAAAGTCAACACTGCTGAAACTGCTCACAGGAGAGCTGCTGCCCACAGAT
GGGATGATTCGCAAGCACTCACATGTGAAGATCGGTAGATACCACCAGCACTTGCAAGAG
CAGTTGGACTTAGACCTCTCACCATTGGAGTACATGCTGAAATGCTACCCAGAGATCAAG
GAGAAGGAGGAGATGAGGAAAATCATTGGCAGATACGGTTTGACAGGGAAGCAGCAGGTG
AGCCCCATCAGGAACCTCTCTGATGGGCAGAAGTGCCGTGTGTGCTTTGCCTGGCTGGCC
TGGCAGAACCCTCACATGCTCTTCCTGGACGAGCCCACCAACCACCTGGACATAGAAACC
ATAGATGCACTGGCAGATGCTATCAATGAGTTCGAGGGAGGAATGATGCTTGTCAGCCAT
GACTTCAGACTCATCCAACAGGTTGCACAGGAAATCTGGGTCTGTGAGAAGCAGACAATC
GCCAAGTGGCAAGGGGACATCCTTGCCTACAAGGAGCATCTCAAGTCGAAGCTGGTGGAT
GAGGACCCGCAGCTCACCAAACGGACCCACAATGTGTGA
>ABCG1
[mfariasv@login005 dNdS]$ sed -n -e '/LOC110475106/,/>/ p' BFcdsorth.fa
>LOC110475106
ATGCCCTCCGACCTGGCCAAGAAGAAGGCGGCCAAGAAGAAGGAGGCGGCCAAGGCCCGG
CAGCGGCCGCGCCGGGTCCCGGACGAGAACGGTGATGCCGGGACGGAGCCGCAGGAAGTC
CGGTCCCCGGAGGCCAACGGGACGGTGCTACCAGAGGTGGATGCTCTTACAAAGGAGCTG
GAGGATTTTGAGTTAAAGAAAGCTGCTGCCCGAGCCGTGACAGGAGTGCTGGCCTCCCAC
CCCAACAGCACTGATGTGCATATCATCAACCTCTCACTGACCTTTCATGGCCAGGAGCTG
CTGAGTGACACCAAACTGGAGCTGAACTCTGGGAGACGCTATGGCCTGATTGGACTCAAT
GGGATTGGGAAATCCATGCTTTTGTCAGCTATTGGGAAGCGAGAAGTGCCCATCCCAGAG
CACATTGACATCTATCACCTGACCCGAGAGATGCCTCCCAGTGACAAGACCCCTCTGCAG
TGTGTGATGGAAGTGGATACAGAGAGGGCCATGTTGGAGCGAGAAGCGGAACGTTTAGCT
CATGAAGATGCGGAATGTGAGAAACTCCTGGAGTTATATGAACGCCTGGAGGAGCTGGAT
GCTGATAAGGCAGAAGCACGAGCCTCACGTATCCTTCATGGCTTGGGGTTCACGCCGGCC
ATGCAGAGGAAGAAGCTGAAGGACTTCAGTGGTGGCTGGCGAATGAGGGTGGCCCTTGCC
AGAGCGCTCTTCATTCGGCCTTTCATGCTGCTGCTCGATGAGCCCACAAACCACCTTGAC
CTGGATGCCTGTGTGTGGTTGGAGGAAGAGCTGAAAACGTTCAAGCGGATTCTTGTGCTG
ATATCCCACTCCCAGGACTTCCTGAATGGTGTCTGCACCAACATCATCCACATGCACAAC
CGCAAACTTAAGTACTACACGGGAAATTATGATCAGTATGTAAAGACACGCTTAGAACTA
GAAGAAAATCAAATGAAGCGATTCCACTGGGAGCAAGATCAGATTGCTCATATGAAGAAT
TACATTGCACGATTTGGCCATGGTAGTGCGAAGCTGGCCAGGCAAGCTCAGAGCAAGGAG
AAGACCCTTCAAAAAATGATGGCTTCTGGCTTGACAGAGAGAGTTGTGAATGATAAGACT
TTATCATTCTACTTTCCACCCTGTGGGAAAATTCCCCCTCCTGTCATCATGGTGCAGAAT
GTCAGCTTCAGATACACCAAGGATGGGCCATGGATCTATAATAACCTGGAGTTTGGGATT
GACCTGGATACTCGTGTAGCTCTTGTTGGACCCAATGGAGCTGGAAAGTCAACCCTGCTG
AAACTGCTCACAGGAGAGCTGCTGCCCACAGATGGGATGATTCGCAAGCACTCGCATGTG
AAGATCGGTAGATACCACCAGCACTTGCAAGAGCAGTTGGACTTAGACCTCTCACCATTA
GAGTACATGCTGAAATGCTACCCAGAGATCAAGGAGAAGGAGGAGATGAGGAAAATCATT
GGCAGATACGGTTTGACAGGGAAGCAGCAGGTGAGTCCCATCAGGAACCTCTCTGATGGA
CAGAAGTGCCGTGTGTGCTTTGCCTGGCTGGCCTGGCAGAACCCTCACATGCTCTTCCTG
GATGAGCCCACCAACCACCTGGACATAGAAACTATAGATGCACTGGCAGATGCTATCAAT
GAGTTTGAGGGAGGAATGATGCTTGTCAGCCATGACTTCAGACTCATCCAACAGGTTGCA
CAGGAAATCTGGGTCTGTGAGAAGCAGACAATCACCAAGTGGCAAGGGGACATCCTTGCC
TACAAGGAGCATCTCAAGTCGAAGCTGGTGGATGAGGACCCGCAGCTCACCAAACGGACC
CACAACGTGTGA
>LOC110475116

While all genes for which the error happens and dNdS is not calculated have the same names in both species:

For example for ABCG1


[mfariasv@login005 dNdS]$ sed -n -e '/ABCG1/,/>/ p' ZFcdsorth.fa
>ABCG1
ATGGCATGTCTGATGGCGGCTTTCTCCCTGGGCAGCGCTTTGGGTGGCAGCAGTTCTGGT
TGCACCATGGCCGAGCCAAAGTCTGTGTGTGTTTCTGTGGACGAGGTGGTCTCCAATGGC
ACAGACACCCAGGACATCCGACTCATCAATGGACACTTAAAAAAAGTGGACAATGCTCTG
ACAGAAGCCCACAGGTTCTCCTACCTGCCCCGCAGGCCAGCTGTGAACATTGAGTTTAAA
GAACTGTCCTACTCTATCCAGGAAGGGCCATGGTGGAGAAAGAAAGGTTATAAGACCCTT
TTGAAAGGAATTTCAGGGAAATTCAGCAGTGGAGAACTCGTTGCAATTATGGGACCTTCA
GGAGCTGGGAAGTCAACGCTTATGAATATTCTGGCAGGATACAGAGAGACGGGGATGAAA
GGAGAAATCCTCATCAACGGGCAGCCCCGCGACCTGCGCTCCTTCCGCAAGGTCTCCTGC
TACATCATGCAGGATGACATGCTCCTTCCTCACCTCACTGTCCAGGAAGCTATGATGGTA
TCTGCTCATCTGAAACTTCAAGAGAAAGATGAAGGGAGGAGAGAAATGGTGAAGGAAATC
CTGACAGCCCTTGGTTTGCTGGCGTGTGCCAACACCAGGACTGGGAGTCTCTCAGGAGGC
CAGAGGAAGCGCCTCGCCATCGCTCTGGAGCTGGTGAACAACCCTCCTGTCATGTTCTTC
GATGAACCAACCAGTGGCTTGGACAGTGCATCATGTTTCCAGGTGGTCTCTCTGATGAAG
GCTTTGGCCCAGGGTGGCAGATCCATCATCTGCACGATTCACCAGCCCAGTGCAAAACTG
TTTGAGCTCTTTGACCAGCTCTATGTTCTAAGTCAAGGTCAGTGCATTTACCGTGGGAAG
GTGACAAACCTCGTCCCTTACTTGAGAGATTTGGGGTTGAATTGTCCAACCTACCACAAC
CCAGCAGATTTTGTGATGGAAGTGGCCTCGGGTGAGTACGGGGACCAGAACAGCCGCCTG
GTCAGGGCTGTGAGGGAGAGGATTTGTGACACAGACTACAAGAGAGACGTGGTTGGGGAG
CACGAGCTGAACCCCTTCCTCTGGCACCGGCCCTCTGAAGAGGACTCATCCTCCACAGAA
GGCTGCCACAGCTTCTCTGCCAGCTGCCTAACCCAGTTCTGCATCCTCTTCAAAAGAACT
TTCCTCACCATCATGAGAGACTCGGTCCTGACACACTTGAGGATCACCTCACACATTGGC
ATTGGGCTGCTCATTGGATTGCTCTACTTGGGCATTGGCAATGAAGCCAAGAAAGTCCTC
AGCAACTCGGGGTTCCTCTTCTTCTCCATGTTGTTCCTCATGTTTGCTGCGCTCATGCCG
ACCGTCCTCACCTTTCCCCTTGAGATGGGAGTGTTTCTCAGAGAGCACCTGAACTACTGG
TACAGCCTAAAAGCCTATTACCTCGCCAAAACCATGGCTGATGTTCCTTTCCAGATCATG
TTCCCTGTGGCTTACTGCAGCATCGTGTACTGGATGACTTCCCAGCCCTCCGACGCGCTC
CGCTTCGTCCTCTTCGCAGCCCTGGGGACCATGACATCCCTGGTGGCTCAGTCACTGGGC
CTGCTCATAGGTGCAGCCTCCACATCCCTCCAGGTGGCAACTTTTGTGGGCCCAGTTACT
GCCATCCCAGTCCTCCTGTTCTCTGGGTTTTTTGTCAGCTTTGACACCATCCCAACATAC
CTCCAGTGGATGTCCTACATTTCCTATGTCAGATATGGGTTCGAAGGAGTCATCCTCTCC
ATCTACGGACTGGATCGAGAAGATCTGCATTGTGACAAAGATGACACCTGCCACTTCCAA
AAATCAGAGGCCATCCTGAAAGAACTGGATGTAGAAAATGCCAAACTTTACCTGGACTTC
ATTGTTCTTGGGATTTTCTTCTTCTCTCTGCGCCTGATTGCCTATTTTGTCCTCAGATAC
AAAATCCGAGCGGAGAGGTAA
>ABCG2
[mfariasv@login005 dNdS]$ sed -n -e '/ABCG1/,/>/ p' BFcdsorth.fa
>ABCG1
ATGGCATGTCTGATGGCGGCTTTCTCCCTGGGCAGCGCTTCGGGTGGCAGCAGTTCTGGT
TGCACCATGGCCGAGCCAAAGTCTGTGTGTGTTTCTGTGGACGAGGTGGTCTCCAATGGC
ACAGACACCCAGGACATCCGACTCATCAATGGACACTTAAAAAAAGTGGACAATGCTCTG
ACAGAAGCTCACAGGTTCTCCTACCTGCCCCGCAGGCCAGCTGTGAACATTGAGTTTAAA
GAACTCTCCTACTCTATCCAGGAAGGGCCATGGTGGAGAAAGAAAGGTTATAAAACCCTT
TTGAAAGGAATTTCAGGGAAGTTCAGCAGTGGAGAGCTCGTTGCAATTATGGGACCTTCA
GGAGCTGGGAAGTCAACGCTTATGAATATTCTGGCAGGATACAGAGAGACGGGGATGAAA
GGAGAAATCCTCATCAACGGGCAGCCCCGCGACCTGCGCTCCTTCCGCAAGGTCTCCTGC
TACATCATGCAGGATGACATGCTCCTTCCTCACCTCACTGTCCAGGAAGCTATGATGGTA
TCTGCTCATCTGAAACTTCAAGAGAAAGATGAAGGGAGGAGAGAAATGGTGAAGGAAATC
CTGACAGCCCTTGGTTTGCTGGCCTGTGCCAACACCAGGACTGGGAGCCTCTCAGGAGGC
CAGAGGAAGCGCCTCGCCATCGCTCTGGAGCTGGTGAACAACCCTCCTGTCATGTTCTTC
GATGAACCAACCAGTGGCTTGGACAGTGCATCATGTTTTCAGGTGGTCTCTCTGATGAAG
GCTTTGGCCCAGGGTGGCAGATCCATCATCTGCACAATTCACCAGCCCAGTGCAAAACTG
TTTGAGCTCTTTGACCAGCTCTATGTTCTAAGTCAAGGTCAGTGCATTTACCGTGGGAAG
GTGACAAACCTTGTCCCTTACTTGAGAGATTTGGGGTTGAATTGTCCAACCTACCACAAC
CCAGCAGATTTTGTAATGGAAGTGGCCTCGGGTGAGTACGGGGACCAGAACAGCCGCCTG
GTCAGGGCTGTGAGAGAGAGGATTTGTGACACAGACTACAAGAGAGACGTGGCTGGGGAG
CACGAGCTGAACCCCTTCCTCTGGCACCGGCCCTCTGAAGAGGATTCCTCCTCCACAGAA
GGATGCCACAGCTTCTCTGCCAGCTGCCTAACCCAGTTCTGCATCCTCTTCAAAAGAACT
TTCCTCACCATCATGAGGGACTCGGTCCTGACACACTTGAGGATCACCTCACACATTGGC
ATTGGGCTGCTCATTGGACTGCTCTACTTGGGCATTGGCAATGAAGCCAAGAAAGTCCTC
AGCAACTCAGGGTTCCTCTTCTTCTCCATGTTGTTCCTCATGTTTGCTGCACTCATGCCG
ACCGTCCTCACCTTTCCCCTTGAGATGGGAGTGTTTCTCAGAGAGCATCTGAACTACTGG
TACAGCCTGAAAGCCTATTACCTCGCCAAAACCATGGCTGATGTTCCTTTTCAGATCATG
TTCCCTGTGGCTTACTGCAGCATCGTGTACTGGATGACTTCCCAGCCCTCCGACGCGCTC
CGCTTCGTCCTCTTCGCAGCCCTGGGGACCATGACATCCCTGGTGGCTCAGTCACTGGGC
CTGCTCATAGGTGCAGCCTCCACATCCCTCCAGGTGGCAACTTTTGTGGGCCCAGTTACT
GCCATCCCAGTCCTCCTGTTCTCTGGGTTTTTTGTCAGCTTTGACACCATCCCAACATAC
CTCCAGTGGATGTCCTACATTTCCTATGTCAGATACGGGTTCGAAGGAGTCATCCTCTCC
ATCTACGGACTGGATCGAGAAGATCTGCATTGTGACAAAGATGACACCTGCCACTTCCAA
AAATCAGAGGCCATCCTGAAAGAACTGGATGTAGAAAATGCCAAACTCTACCTGGACTTC
ATCGTTCTTGGGATTTTCTTCTTCTCTCTGCGCCTGATTGCCTATTTTGTCCTCAGATAC
AAAATCCGAGCGGAGAGGTAA
>ABCG2

But the sequences are indeed different:

diff ABCG1_BF ABCG1_ZF
2c2
< ATGGCATGTCTGATGGCGGCTTTCTCCCTGGGCAGCGCTTCGGGTGGCAGCAGTTCTGGT
---
> ATGGCATGTCTGATGGCGGCTTTCTCCCTGGGCAGCGCTTTGGGTGGCAGCAGTTCTGGT
5,7c5,7
< ACAGAAGCTCACAGGTTCTCCTACCTGCCCCGCAGGCCAGCTGTGAACATTGAGTTTAAA
< GAACTCTCCTACTCTATCCAGGAAGGGCCATGGTGGAGAAAGAAAGGTTATAAAACCCTT
< TTGAAAGGAATTTCAGGGAAGTTCAGCAGTGGAGAGCTCGTTGCAATTATGGGACCTTCA
---
> ACAGAAGCCCACAGGTTCTCCTACCTGCCCCGCAGGCCAGCTGTGAACATTGAGTTTAAA
> GAACTGTCCTACTCTATCCAGGAAGGGCCATGGTGGAGAAAGAAAGGTTATAAGACCCTT
> TTGAAAGGAATTTCAGGGAAATTCAGCAGTGGAGAACTCGTTGCAATTATGGGACCTTCA
12c12
< CTGACAGCCCTTGGTTTGCTGGCCTGTGCCAACACCAGGACTGGGAGCCTCTCAGGAGGC
---
> CTGACAGCCCTTGGTTTGCTGGCGTGTGCCAACACCAGGACTGGGAGTCTCTCAGGAGGC
14,15c14,15
< GATGAACCAACCAGTGGCTTGGACAGTGCATCATGTTTTCAGGTGGTCTCTCTGATGAAG
< GCTTTGGCCCAGGGTGGCAGATCCATCATCTGCACAATTCACCAGCCCAGTGCAAAACTG
---
> GATGAACCAACCAGTGGCTTGGACAGTGCATCATGTTTCCAGGTGGTCTCTCTGATGAAG
> GCTTTGGCCCAGGGTGGCAGATCCATCATCTGCACGATTCACCAGCCCAGTGCAAAACTG
17,26c17,26
< GTGACAAACCTTGTCCCTTACTTGAGAGATTTGGGGTTGAATTGTCCAACCTACCACAAC
< CCAGCAGATTTTGTAATGGAAGTGGCCTCGGGTGAGTACGGGGACCAGAACAGCCGCCTG
< GTCAGGGCTGTGAGAGAGAGGATTTGTGACACAGACTACAAGAGAGACGTGGCTGGGGAG
< CACGAGCTGAACCCCTTCCTCTGGCACCGGCCCTCTGAAGAGGATTCCTCCTCCACAGAA
< GGATGCCACAGCTTCTCTGCCAGCTGCCTAACCCAGTTCTGCATCCTCTTCAAAAGAACT
< TTCCTCACCATCATGAGGGACTCGGTCCTGACACACTTGAGGATCACCTCACACATTGGC
< ATTGGGCTGCTCATTGGACTGCTCTACTTGGGCATTGGCAATGAAGCCAAGAAAGTCCTC
< AGCAACTCAGGGTTCCTCTTCTTCTCCATGTTGTTCCTCATGTTTGCTGCACTCATGCCG
< ACCGTCCTCACCTTTCCCCTTGAGATGGGAGTGTTTCTCAGAGAGCATCTGAACTACTGG
< TACAGCCTGAAAGCCTATTACCTCGCCAAAACCATGGCTGATGTTCCTTTTCAGATCATG
---
> GTGACAAACCTCGTCCCTTACTTGAGAGATTTGGGGTTGAATTGTCCAACCTACCACAAC
> CCAGCAGATTTTGTGATGGAAGTGGCCTCGGGTGAGTACGGGGACCAGAACAGCCGCCTG
> GTCAGGGCTGTGAGGGAGAGGATTTGTGACACAGACTACAAGAGAGACGTGGTTGGGGAG
> CACGAGCTGAACCCCTTCCTCTGGCACCGGCCCTCTGAAGAGGACTCATCCTCCACAGAA
> GGCTGCCACAGCTTCTCTGCCAGCTGCCTAACCCAGTTCTGCATCCTCTTCAAAAGAACT
> TTCCTCACCATCATGAGAGACTCGGTCCTGACACACTTGAGGATCACCTCACACATTGGC
> ATTGGGCTGCTCATTGGATTGCTCTACTTGGGCATTGGCAATGAAGCCAAGAAAGTCCTC
> AGCAACTCGGGGTTCCTCTTCTTCTCCATGTTGTTCCTCATGTTTGCTGCGCTCATGCCG
> ACCGTCCTCACCTTTCCCCTTGAGATGGGAGTGTTTCTCAGAGAGCACCTGAACTACTGG
> TACAGCCTAAAAGCCTATTACCTCGCCAAAACCATGGCTGATGTTCCTTTCCAGATCATG
31c31
< CTCCAGTGGATGTCCTACATTTCCTATGTCAGATACGGGTTCGAAGGAGTCATCCTCTCC
---
> CTCCAGTGGATGTCCTACATTTCCTATGTCAGATATGGGTTCGAAGGAGTCATCCTCTCC
33,34c33,34
< AAATCAGAGGCCATCCTGAAAGAACTGGATGTAGAAAATGCCAAACTCTACCTGGACTTC
< ATCGTTCTTGGGATTTTCTTCTTCTCTCTGCGCCTGATTGCCTATTTTGTCCTCAGATAC
---
> AAATCAGAGGCCATCCTGAAAGAACTGGATGTAGAAAATGCCAAACTTTACCTGGACTTC
> ATTGTTCTTGGGATTTTCTTCTTCTCTCTGCGCCTGATTGCCTATTTTGTCCTCAGATAC

Issue running blastp in dnds

Hello,

Thanks in advance for your help! I'm a newbie to this kind of work, so I apologize if I'm making a simple error. I am attempting to use orthologr for dnds calculation with the following input:
ES114_H905_dnds <- dNdS(
query_file = "/Users/jennifercalawa/Desktop/dnds/H905_CDS.fasta",
subject_file = "/Users/jennifercalawa/Desktop/dnds/ES114_CDS.fasta",
seq_type = "cds",
format = "fasta",
ortho_detection = "RBH",
eval = "1E-5",
aa_aln_type = "multiple",
aa_aln_tool = "clustalw",
codon_aln_tool = "pal2nal",
dnds_est.method = "MYN",
comp_cores = 1,
quiet = TRUE,
clean_folders = FALSE,
print_citation = TRUE
)

The output I get is this:
Starting orthology inference (RBH) and dNdS estimation (MYN) using the follwing parameters:
query = 'H905_CDS.fasta'
subject = 'ES114_CDS.fasta'
seq_type = 'cds'
e-value: 1E-5
aa_aln_type = 'multiple'
aa_aln_tool = 'clustalw'
comp_cores = '1'

Starting Orthology Inference ...
Running blastp: 2.9.0+ ...
Error in lapply(list(...), as.character) :
argument is missing, with no default
In addition: Warning message:
In seqinr::s2c(sequence) : Wrong argument type in s2c(), NA returned

I'm really not sure what the error is saying - my best guess is it's an incompatibility with the BLAST or seqinr version? If so, I'm not sure what to do. I'm on a Mac , if that's important.

Thanks again for any help you can give!

How to use Orthofinder2 as ortho_detection?

Hi!

Thanks for developing orthologr.

I'm trying to use orthologr to calculate dNdS. According to the tutorial, ortho_detection can be set as "Orthofinder2". However, when I called function this way, an error was thrown: Error: Please choose a orthology detection method that is supported by this function. and I checked the codes that generated the error, it is:

if (!is.ortho_detection_method(ortho_detection)) 
    stop("Please choose a orthology detection method that is supported by this function.", 
      call. = FALSE)

And the is.ortho_detection_method() is:

is.ortho_detection_method
function (method = NULL) 
{
    provided <- c("BH", "RBH", "PO", "DELTA")
    if (is.null(method)) {
        print("The followig methods are provided: ")
        print(provided)
        return(FALSE)
    }
    return(is.element(method, provided))
}

It shows that "Orthofinder2" was not provided.

Does it mean that "Orthofinder2" was not supported?

Here's my session info:

> sessionInfo()
R version 4.3.0 (2023-04-21)
Platform: x86_64-apple-darwin20 (64-bit)
Running under: macOS 14.2.1

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] orthologr_0.4.2   data.table_1.15.0

loaded via a namespace (and not attached):
 [1] GenomeInfoDb_1.36.4     codetools_0.2-19        zlibbioc_1.46.0         xfun_0.41               iterators_1.0.14       
 [6] GenomeInfoDbData_1.2.10 knitr_1.45              XVector_0.40.0          BiocGenerics_0.46.0     RCurl_1.98-1.14        
[11] Biostrings_2.68.1       stats4_4.3.0            Rdpack_2.6              bitops_1.0-7            foreach_1.5.2          
[16] IRanges_2.34.1          compiler_4.3.0          rbibutils_2.2.16        rstudioapi_0.15.0       tools_4.3.0            
[21] Rcpp_1.0.12             S4Vectors_0.38.2        crayon_1.5.2

Please check the correct path to Comeron... the interface call did not work properly

Hi Dr. Drost, @HajkD
The example works fine, and by use of the built-in data I could get a data frame of Divergence Stratigraphy.

test <- divergence_stratigraphy(
  query_file      = system.file('seqs/ortho_thal_cds.fasta', package = 'orthologr'),  
  subject_file    = system.file('seqs/ortho_lyra_cds.fasta', package = 'orthologr'),
  eval            = "1E-5", 
  ortho_detection = "RBH",
  comp_cores      = 30, 
  quiet           = TRUE, 
  clean_folders   = TRUE )

After blastp ran, it shows

Orthology Inference Completed.
Starting dN/dS Estimation ...
dN/dS Estimation Completed.
Initial input contains 20 rows.
Filtering done. New output table contains 20 rows.
Divergence Stratigraphy completed successfully.

But dN/dS Estimation of my datasets failed.

Xt_vs_Xl_DM <- divergence_stratigraphy(
  query_file      = "/mnt/data4/disk/YXJ/EmbroGenesis/Ref/Xenopus_tropicalis.9.0/Xtropicalis.v9_Xenbase.cds.fa",
  subject_file    = "/mnt/data4/disk/YXJ/EmbroGenesis/Ref/Xenopus_tropicalis.9.0/XENLA_9.2_Xenbase.cds.fa",
  eval            = "1E-5", 
  ortho_detection = "RBH",
  comp_cores      = 30, 
  quiet           = TRUE, 
  clean_folders   = TRUE )

After blastp ran, it shows

Starting dN/dS Estimation ...
Error in { : 
  task 1 failed - "Please check the correct path to Comeron... the interface call did not work properly."
Calls: divergence_stratigraphy ... nrow -> dNdS -> compute_dnds -> %dopar% -> <Anonymous>
Execution halted

I don't know how to figure it out. I use the Xenbase genome data.
I try to set environment in the Studio, the KaKs_Calculator is the "~/usr/bin".

Sys.setenv(PATH=paste(Sys.getenv("PATH"), "~/usr/bin:~/usr/local/src/ncbi-blast-2.11.0+/bin", sep=":"))
system("KaKs_Calculator -h")
****************************************************************************************
  Program: KaKs_Calculator Toolbox
  Version: 2.0, June. 2009
  Description: A toolbox based on integrating gamma methods and sliding window strategy.
****************************************************************************************

The Error still remained.
Many thanks in advance for your help!

h(simpleErrpr(msg, call)) error

Hi! I just first wanted to say thank you so much for developing this package. It is quite handy and overall greta resource for so many. I've been using the blasts comparison to confirm some previous classification for grouping a protein family. So far it has worked wonderfully. Though for one fasta file, I ran into the follow:

Rhodo_T1_T2 <- as.data.frame(orthologr::blast(query_file = '~/Documents/Mining_MAMPs/Mining_Known_MAMPs/Analyses/Catagorizing_CSPs/CSP_types/Typing_fasta_files/Rhodococcus_Type1.fasta', subject_file = '~/Documents/Mining_MAMPs/Mining_Known_MAMPs/Analyses/Catagorizing_CSPs/CSP_types/Typing_fasta_files/Rhodococcus_Type2.fasta', seq_type = 'protein'))

Running blastp: 2.11.0+ ...

Building a new DB, current time: 06/16/2023 17:03:59
New DB name: /tmp/RtmpvS42vk/_blast_db/blastdb_Rhodococcus_Type2.fasta_protein.fasta
New DB title: blastdb_Rhodococcus_Type2.fasta_protein.fasta
Sequence type: Protein
Deleted existing Protein BLAST database named /tmp/RtmpvS42vk/_blast_db/blastdb_Rhodococcus_Type2.fasta_protein.fasta
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 5 sequences in 0.000640154 seconds.

Error in h(simpleError(msg, call)) :
error in evaluating the argument 'x' in selecting a method for function 'as.data.frame': File blastresult_Rhodococcus_Type1.fasta.csv could not be read correctly. Please check the correct path to blastresult_Rhodococcus_Type1.fasta.csv or whether BLAST did write the resulting hit table correctly.

Type1 fast file is fine when comparing against itself or the other two types. Type 2 errors against Type1 and Type3 but runs ok comparing against itself. I've tried reuploading the fasta files but that did nothing. Any ideas? Thanks in advance!

Control output value of orthologs function

Hi @HajkD ,

I'm using the orthologs function with ortho_detection = "RBH" to detect orthologs for a query_file containing 8 protein fasta sequences in multiple subject_files.

For easier downstream data parsing, I would be very interested to set the orthologs function in a way to output results for any query_id including those queries that did not give any hits (i.e. fill result table with NA).

I would appreciate any help how to achieve this.

Many thanks in advance!

Jan

Ensembl gff does not work

I was trying to get the longest isoform from ensemble genomesbut the function is exiting with an error

Error: Gene_ID or Peptide_IDs between the gtf file and the headers of the fasta file did not match! Please make sure that the 'transcript_id' column in the gtf file matches the IDs given in the headers of the fasta file.

I have a feeling that the function only works for features from refseq.
Is there an easy way to fix this?

Multiple cores

Dear HajkD,
I am trying to perform divergence stratigraphy using orthologr package. The problem I am facing is that it is not using the number of cores I defined using comp_cores. Could you suggest possible reasons and remedies?

PS.lscpu shows 64 CPUs..

Thanks
Vinay

File input issue

Hello Dr. HajkD.
I am amazed at your program! Thank you so much for sharing your work!

I am an college student and totally new to bioinfomatics So I may commit mistakes during my command.
But I can not run your command properly even in your sample commands. (homo sapiens one)

I want to calculate genome wide dn/ds between two fish strains.

I took my fasta.files from ensemble. HNI strain fromhttps://asia.ensembl.org/Oryzias_latipes_hni/Info/Index
and HdrR strain from https://asia.ensembl.org/Oryzias_latipes/Info/Index

And this is my command.(almost copied your sample command)

`library(biomartr)
dNdS(query_file = system.file('Oryzias_latipes.ASM223467v1.cds.all.fa', package = 'orthologr'),

 subject_file    = system.file('Oryzias_latipes_hni.ASM223471v1.cds.all.fa', package = 'orthologr'),

```
 delete_corrupt_cds = TRUE, 
```
```
 ortho_detection = "RBH",
```
```
 aa_aln_type     = "pairwise", 
```
```
 aa_aln_tool     = "NW",
```
```
 codon_aln_tool  = "pal2nal",
```
```
 dnds_est.method = "Comeron",
```
```
 comp_cores      = 1 )
```

value[3L] でエラー: File could not be read properly.
Please make sure that contains only CDS sequences and is in fasta format.

This error message happen in your all sample commands on orthologr. So I wonder what is the problem.

I am using R3.5.1 version. my OS is windows 10

I would appreciate very much if you enlighten me on this problem.

Problem calling `blast` on windows

Hi again,

Another issue I am experiencing is with the blast function on Windows 10. When I try to run a simple call like:

res <- blast(query_file = system.file('seqs/ortho_thal_cds.fasta', package = 'orthologr'),
             subject_file = system.file('seqs/ortho_lyra_cds.fasta', package = 'orthologr'),
             save.output = getwd())

I get the following error:

Error: Too many positional arguments (1), the offending value: qseqid
Error:  (CArgException::eSynopsis) Too many positional arguments (1), the offending value: qseqid
Error in value[[3L]](cond) : 
  File blastresult_ortho_thal_cds.fasta.csv could not be read correctly. Please check the correct path to blastresult_ortho_thal_cds.fasta.csv or whether BLAST did write the resulting hit table correctly.

Blast is in my PATH and

system("blastp -help")

returns what it should

I trust this is related to: https://stackoverflow.com/questions/20086291/blastdbcmd-too-many-positional-arguments-1-the-offending-value-f and http://www.metagenomics.wiki/tools/blast/error-too-many-positional-arguments

I could not find the offending - in the source code.

Thanks

failed with dnds_est.method ="YN"

Hi Hajk,
I have installed the package, and succeeded with the dnds_est.method ="Comeron" and "Li", But failed with the rests like "YN" or "MYN", the code are following:
`dNdS(query_file = system.file('seqs/ortho_thal_cds.fasta', package = 'orthologr'),
subject_file = system.file('seqs/ortho_lyra_cds.fasta', package = 'orthologr'),
ortho_detection = "RBH",
aa_aln_type = "pairwise",
aa_aln_tool = "NW",
codon_aln_tool = "pal2nal",
dnds_est.method = "YN",
comp_cores = 1 ,
kaks_calc_path = "/home/lhy/software/KaKs_Calculator2.0/src/")
Building a new DB, current time: 02/24/2017 05:52:24
New DB name: /tmp/RtmpaZIBhK/_blast_db/blastdb_ortho_lyra_cds.fasta_protein.fasta
New DB title: blastdb_ortho_lyra_cds.fasta_protein.fasta
Sequence type: Protein
Deleted existing Protein BLAST database named /tmp/RtmpaZIBhK/_blast_db/blastdb_ortho_lyra_cds.fasta_protein.fasta
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 20 sequences in 0.00255609 seconds.

Building a new DB, current time: 02/24/2017 05:52:24
New DB name: /tmp/RtmpaZIBhK/_blast_db/blastdb_ortho_thal_cds.fasta_protein.fasta
New DB title: blastdb_ortho_thal_cds.fasta_protein.fasta
Sequence type: Protein
Deleted existing Protein BLAST database named /tmp/RtmpaZIBhK/_blast_db/blastdb_ortho_thal_cds.fasta_protein.fasta
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 20 sequences in 0.00225496 seconds.
[1] "File successfully written to /tmp/RtmpaZIBhK/_alignment/pairwise_aln/q1_NW_AA.aln"
[1] "Codon Alignment successfully written in /tmp/RtmpaZIBhK/_alignment/codon_aln/q1_pal2nal.aln."

Function: Parse fasta file with aligned pairwise sequences into AXT file
Reference: Zhang Z, Li J, Zhao XQ, Wang J, Wong GK, Yu J: KaKs Calculator: Calculating Ka and Ks through model selection and model averaging. Genomics Proteomics Bioinformatics 2006 , 4:259-263.
Web Link: Documentation, example and updates at http://code.google.com/p/kaks-calculator

Input file = /tmp/RtmpaZIBhK/_alignment/codon_aln/q1_pal2nal.aln
Opening the input file...
Reading sequences..................................................
Formatting AXT sequences...
Outputing AXT file: /tmp/RtmpaZIBhK/_calculation/q1_pal2nal.aln.axt

Mission accomplished.
sh: 1: KaKs_Calculator: not found
Error in value[3L] : Something went wront with KaKs_Calculator.
/tmp/RtmpaZIBhK/_calculation/q1_pal2nal.aln.axt.kaks could not be read properly.
Besides: Warning message:
In file(file, "rt") :
can not open '/tmp/RtmpaZIBhK/_calculation/q1_pal2nal.aln.axt.kaks': no such file`

If I use the KaKs_Calculator to calculate the da/ds of q1_pal2nal.aln.axt using "-c YN" or "MYN", I can get the result correctly.
The version of KaKs_Calculator is 2.0.
Thanks!

Running time

Hi there,
I am running divergence_stratigraphy in r, I am wondering how long does it normally take? It has been a few hours and still hasn't finished..

Thank you
Best
CW

input files

I followed the vignette examples and everything worked perfectly. When I tried to use my fasta sequences I have got always the follow error:
"Error in value[3L] : File could not be read properly.
Please make sure that contains only CDS sequences and is in fasta format."

My files are in the follow format:

001
ATGCACCATGGCACTGGCCCCCAGAACGTCCAGCATCAGCTGCAGAGGTCCAGGGCCTG
002
CTGGCAGCGAGGGTGAGGAGCAGCCGGCCCACCCCAACCCACCCCCGTCCCCCGCAGCT

Any advice?

sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.2 (El Capitan)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base

other attached packages:
[1] Biostrings_2.38.3 XVector_0.10.0 IRanges_2.4.6 S4Vectors_0.8.7 BiocGenerics_0.16.1 orthologr_0.0.1
[7] data.table_1.9.6 seqinr_3.1-3 ade4_1.7-3 dplyr_0.4.3 plyr_1.8.3

loaded via a namespace (and not attached):
[1] Rcpp_0.12.3 codetools_0.2-14 foreach_1.4.3 assertthat_0.1 chron_2.3-47 R6_2.1.1 DBI_0.3.1
[8] magrittr_1.5 zlibbioc_1.16.0 iterators_1.0.8 tools_3.2.2

Installation error

Installation of orthologr keeps failing right at the end.

I entered:

if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install()
# Install package dependencies
BiocManager::install(c("Biostrings", "GenomicRanges", "GenomicFeatures", "Rsamtools", "rtracklayer"))
# install orthologr from GitHub
devtools::install_github("HajkD/metablastr")

Each of these commands worked with no errors or warnings. However, I could not successfully install orthologr using:

# install orthologr from GitHub
devtools::install_github("HajkD/orthologr")

Here is the output:

The downloaded source packages are in ‘/tmp/RtmpzBSnhp/downloaded_packages’
✔ checking for file ‘/tmp/RtmpzBSnhp/remotes18a86db25059/drostlab-orthologr-c4f4b36/DESCRIPTION’ ...
─ preparing ‘orthologr’:
✔ checking DESCRIPTION meta-information ...
─ cleaning src
─ installing the package to process help pages
-----------------------------------
─ installing *source* package ‘orthologr’ ...
** libs
g++ -std=gnu++11 -I"/exports/applications/apps/SL7/R/3.5.3/lib64/R/include" -DNDEBUG -I../src -DRCPP_DEFAULT_INCLUDE_CALL=false -DCOMPILING_ORTHOLOGR -DBOOST_NO_INT64_T -DBOOST_NO_INTEGRAL_INT64_T -DBOOST_NO_LONG_LONG -DRCPP_USING_UTF8_ERROR_STRING -DRCPP_USE_UNWIND_PROTECT -Xclang -fopenmp -I"/exports/eddie/scratch/tregan/R/Rcpp/include" -I/usr/local/include -fpic -g -O2 -c RcppExports.cpp -o RcppExports.o
g++: error: unrecognized command line option ‘-Xclang’
make: *** [RcppExports.o] Error 1
ERROR: compilation failed for package ‘orthologr’

─ removing ‘/tmp/RtmpnBxfT7/Rinst2aa561cd54004/orthologr’
-----------------------------------
ERROR: package installation failed
Error in processx::run(bin, args = real_cmdargs, stdout_line_callback = real_callback(stdout), : System command error

I've tried wiping my R directory, reinstalling BiocManager from the start but it's the same issue each time.
Any ideas what might be wrong?

Many thanks,
Tim

Missing headings blast_rec

When using blast_rec, headings in blastresult{query}.fasta_ and blastresult{subject}.fasta_ are not included. I suppose they are the same as output from blast_rec. Right?

How to set PATH to OrthoFinder2 correctly?

Hi @drostlab,

I'm trying to set up orthologr for detection of orthologs in multiple species.

I've installed orthofinder into my PATH using conda install orthofinder.
Calling orthofinder -h from terminal shows OrthoFinder version 2.4.0 Copyright (C) 2014 David Emms suggesting that orthofinder is installed correctly.

Yet, I still run into the following issue when running the example code:

# specify species names
orgs <- c("Arabidopsis lyrata", 
          "Capsella rubella", "Solanum lycopersicum")
# download proteome files for all species          
biomartr::getProteomeSet(db = "refseq", organisms = orgs, path = "of_proteomes")
# download annotation files for all species          
biomartr::getGFFSet(db = "refseq", organisms = orgs, path = "of_gff")
# select longest splice variant per gene locus
retrieve_longest_isoforms_all(proteome_folder = "of_proteomes", 
                              annotation_folder = "of_gff",
                              annotation_format = "gff", 
                              output_folder = "of_proteomes_longest_sv")

> orthofinder2(proteome_folder = "of_proteomes_longest_sv", comp_cores = 4)
sh: orthofinder: command not found
Error: It seems like you don't have OrthoFinder2 installed locally on your machine or the PATH variable to the OrthoFinder2 program is not set correctly.

How should I setup orthofinder correctly?

Many thanks in advance for your help!

Dependency question

Hi,
would be technically possible to replace the metablastr dependency by rdiamond? I'm currently working in the orthologr conda recipe.

Best regards.

Problem calling `multi_aln` on windows

Hi,

First of all I'd like to thank you for the effort you put in this package.

I am trying to setup orthologr and one of the issues is with multi_aln function. When I try to run a simple call like:

res <- multi_aln(file = system.file('seqs/aa_seqs.fasta', package = 'orthologr'),
                 tool = "clustalw", 
                 get_aln = TRUE)

I get an error

Error: It seems like you don't have ClustalW installed locally on your machine or the PATH variable to the ClustalW program is not set correctly.

however clustalw2.exe is in my PATH and

system("clustalw2 -help")

returns what it should.

I have checked the source for multi_aln and I trust the offending part is the internal function

orthologr:::is_installed_clustalw

since it checks

system("clustalw -help", intern = TRUE)

regardless of operating system.

I think a simple solution would be to change this line to

system(paste0(call_clustalw, " -help"))

Thanks.

EDIT: Suggestion. Perhaps it would be more user friendly to add the package msa: https://bioconductor.org/packages/release/bioc/html/msa.html as a dependency to your package and remove some external dependencies

Can not download orthologr

Dear Contributor,

  I am trying to download orthologr, but always have the the following errors. Could you take a look or give me some suggestions? thanks. I am using R 3.6 in Wiondows 7 enterprise.   yu

devtools::install_github("HajkD/orthologr", build_vignettes = TRUE, dependencies = TRUE)
Downloading GitHub repo HajkD/orthologr@master
Skipping 4 packages not available: biomartr, Biostrings, IRanges, rtracklayer
Downloading GitHub repo HajkD/metablastr@master
Skipping 7 packages not available: Biostrings, GenomicFeatures, GenomicRanges, Rsamtools, biomartr, IRanges, rtracklayer
Skipping 2 packages ahead of CRAN: DBI, BH
Installing 20 packages: dplyr, doParallel, RPostgreSQL, GenomicFeatures, GenomicRanges, Rsamtools, scales, RSQLite, biomartr, ggridges, rtracklayer, rlang, bit64, blob, digest, plyr, vctrs, nlme, Matrix, bit
Error: Failed to install 'orthologr' from GitHub:
Failed to install 'metablastr' from GitHub:
(converted from warning) packages ‘GenomicFeatures’, ‘GenomicRanges’, ‘Rsamtools’, ‘biomartr’, ‘rtracklayer’ are not available (for R version 3.6.0)

"path" option for Blast calls is non-functional

Trying out the orthologr package on a system where I cannot do root, either pw or sudo, etc.

Installed blast and others to ~/bin

In the docs, it purports to support this via the path option in the call to blast(), but when I run in the debugger, I see that the option does not seem to be passed to the is_installed_blast() call.

So, I cannot run Blast -- need to have that functionality of the path option.

Integrate blastx and tblastx options

Hi,
Would it be possible to integrate an option to run blastx and tblastx through orthologr ?
Thanks,
Best.

Unable to install in Windows

Dear HajkD,
I tried to install orthologr in a Win 8 laptop, running R Studio 0.98.1091 + R version 3.1.2 (2014-10-31) -- "Pumpkin Helmet", Platform: x86_64-w64-mingw32/x64 (64-bit), and Bioconductor version 3.0 (BiocInstaller 1.16.5).

When I tried to install orthologr, I got the following message:

devtools::install_github("HajkD/orthologr", build_vignettes = TRUE, dependencies = TRUE)
Downloading github repo HajkD/orthologr@master
Installing orthologr
Skipping 2 packages not available: Biostrings, IRanges
"C:/PROGRA~~1/R/R-31~~1.2/bin/x64/R" --no-site-file --no-environ --no-save
--no-restore CMD INSTALL
"C:/Users/Andres/AppData/Local/Temp/RtmpyKKyw9/devtools1620ecb385b/HajkD-orthologr-0c8658a"
--library="d:/Users/Andres/Documents/R/win-library/3.1" --install-tests

installing source package 'orthologr' ...
** libs

*** arch - i386
Warning: running command 'make -f "C:/PROGRA~~1/R/R-31~~1.2/etc/i386/Makeconf" -f >"C:/PROGRA~~1/R/R-31~~1.2/share/make/winshlib.mk" SHLIB_LDFLAGS='$(SHLIB_CXXLDFLAGS)' >SHLIB_LD='$(SHLIB_CXXLD)' SHLIB="orthologr.dll" OBJECTS="RcppExports.o gestimator.o"' had >status 127
ERROR: compilation failed for package 'orthologr'

removing 'd:/Users/Andres/Documents/R/win-library/3.1/orthologr'
Error: Command failed (1)

Could you help me?

Thank you very much in advance,
Andres

File blastresult_AllColdShockProteins.fasta.csv could not be read correctly.

Hi, Thanks for the nice package.

I usually use the blast function giving two Genome file of 2 organism in fasta format and run the command. Here is the command:

blast(query_file = "/path/to/Bdivergens1802A_Genome.fasta",
subject_file = "/path/to/BmicrotiATCC30222_Genome.fasta",
delete_corrupt_cds = T, seq_type = "protein",
format = "fasta", blast_algorithm = "blastp")

So far I was able to get the output without any issue.
What I am trying to do now is find the orthologs of couple of protein sequences in a single species. I have the proteins sequences in fasta format as well as the subject file. here is my command:

orthlg <- orthologs(
query_file = "/path/to/AllColdShockProteins.fasta",
subject_files = "/path/to/BmicrotiATCC30222_Genome.fasta",
seq_type = "protein", format = "fasta", eval = 0.0001)

I am not sure if i am doing something wrong!! I am getting the following error:

Error in value[3L] :
File blastresult_AllColdShockProteins.fasta.csv could not be read correctly. Please check the correct path to blastresult_AllColdShockProteins.fasta.csv or whether BLAST did write the resulting hit table correctly.

Any Idea on how to fix it is so appreciated.
Thanks

Installation error

I was trying to install orthologr this is what I did:
install.packages("devtools")
Warning messages:
1: In install.packages("devtools") :
installation of package ‘openssl’ had non-zero exit status
2: In install.packages("devtools") :
installation of package ‘git2r’ had non-zero exit status
3: In install.packages("devtools") :
installation of package ‘httr’ had non-zero exit status
4: In install.packages("devtools") :
installation of package ‘devtools’ had non-zero exit status

source("http://bioconductor.org/biocLite.R")
Bioconductor version 3.2 (BiocInstaller 1.20.3), ?biocLite for help
A new version of Bioconductor is available after installing the most recent
version of R; see http://bioconductor.org/install
biocLite("HajkD/orthologr")
BioC_mirror: https://bioconductor.org
Using Bioconductor 3.2 (BiocInstaller 1.20.3), R 3.2.3 (2015-12-10).
Error: github installation of ‘HajkD/orthologr’ only available after
biocLite("devtools")

lots of CDS lost during orthology inference

Hi there,

I am trying to compute dNdS values between the entire CDS of two genomes (from two subspecies). Each CDS file contain about 30000 sequences. When I do the analysis (using the "dNdS" function), the total amount of CDS pairs that get analyze though is dramatically reduced to nearly 5 to 6 thousand. I have read the manual, but I am not entirely sure whether there is a way inside the dNdS function to use customized settings for the blast search (e.g. reducing alignment length, etc). I would be very grateful if you could perhaps indicate to me a way to change specific blast settings or filters inside the dNdS function to increase the number of CDS I get analysed.

Many thanks!

Oscar

Confirm that use of BLAST's `-max_target_seqs` is intentional

Hi there,

This is a semi-automated message from a fellow bioinformatician. Through a GitHub search, I found that the following source files make use of BLAST's -max_target_seqs parameter:

R/blast.R

Based on the recently published report, Misunderstood parameter of NCBI BLAST impacts the correctness of bioinformatics workflows, there is a strong chance that this parameter is misused in your repository.

If the use of this parameter was intentional, please feel free to ignore and close this issue but I would highly recommend to add a comment to your source code to notify others about this use case. If this is a duplicate issue, please accept my apologies for the redundancy as this simple automation is not smart enough to identify such issues.

Thank you!
-- Arman (armish/blast-patrol)

drostlab / orthologr Goto Github PK

orthologr's Introduction

orthologr

Comparative Genomics with R

Motivation

Citation

Short package description

Install orthologr

Use Cases

Example

Small example with internal dataset

Example: Computing dN/dS values for all orthologous genes between two genomes

On Windows Systems

Troubleshooting on Windows Machines

Interfaces implemented in orthologr:

Perform BLAST searches with R

Perform Pairwise and Multiple Sequence Alignements with R

Perform Orthology Inference with R

Perform Population Genomics with R

Read and Write CDS, Genomes, and Proteomes

Discussions and Bug Reports

Licenses

gestimator

KaKs_Calculator

orthologr's People

Contributors

Stargazers

Watchers

Forkers

orthologr's Issues

Recommend Projects

Recommend Topics

Recommend Org

Install `orthologr`

Interfaces implemented in `orthologr`: