drostlab / metablastr Goto Github PK

Seamless Integration of BLAST Sequence Searches in R

Home Page: https://drostlab.github.io/metablastr/

License: GNU General Public License v2.0

R 100.00%

blast-searches blast blastn blasting sequence-diversity biological-sequences biomartr species nucleotide blast-hits blast-search protein

metablastr's Introduction

metablastr

Seamless Integration of BLAST Sequence Searches in R

Motivation

With the rapid expansion of available sequences in biological databases, the landscape of modern life science research is being transformed. Currently, around several hundred thousand genomic sequences from a diverse array of species in the tree of life are freely accessible to the public. The R package biomartr enables automated access and retrieval of this vast data, paving the way to delve into the rich tapestry of sequence diversity, uncovering new insights into evolvability, variation, and the emergence of diseases.

The biomartr package streamlines the retrieval of a massive amount of biological sequence data in a standardized and reproducible manner. Complementing it, the metablastr package is tailored to facilitate large-scale sequence searches, also in a standardized and reproducible approach.

In synergy, biomartr and metablastr provide researchers with a comprehensive toolset, allowing them to efficiently gather thousands of biological sequences (genomes, proteomes, annotations, etc.) and conduct extensive sequence comparisons using the gold standard sequence search engine BLAST. This facilitates the extraction of novel patterns highlighting similarities and divergences among vast sets of species.

It's worth noting that the go-to instrument for large-scale sequence searches is BLAST (Basic Local Alignment Search Tool). It is purposefully built to identify regions of sequence similarity between a given query and subject sequences or sequence databases.

Building on these advancements, we have recently introduced DIAMOND2, a groundbreaking software solution designed to accelerate BLAST searches by an factor of up to 10,000x. To offer researchers even more flexibility and integration, we provide rdiamond, a dedicated interface package that allows programmatic handling of DIAMOND2 sequence searches directly through R. This not only streamlines the sequence search process but also ensures that researchers can access and utilize the power of DIAMOND2 within a familiar R environment.

Short package description

The metablastr package harnesses the power of BLAST by providing interface functions between R and the standalone (command line tool) version of this program.

Together, the metablastr package may enable a new level of data-driven genomics research by providing the computational tools and data science standards needed to perform reproducible research at scale.

Citation

The metablastr package is still under development and not formally published yet. However, we did develop parts of metablastr for this Methods chapter which you can cite until a metablastr specific manuscript is prepared.

M Benoit, HG Drost. A Predictive Approach to Infer the Activity and Natural Variation of Retrotransposon Families in Plants. In: Cho J. (eds) Plant Transposable Elements. Methods in Molecular Biology, vol 2250. Humana, New York, NY (2021).

Install `metablastr`

For Linux Users:

Please install the libpq-dev library on you linux machine by typing into the terminal:

sudo apt-get install libpq-dev

For all systems install `metablastr` by typing

# install BiocManager if required
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install()

# install package dependencies
BiocManager::install(
  c(
    "Biostrings",
    "GenomicFeatures",
    "GenomicRanges",
    "Rsamtools",
    "IRanges",
    "rtracklayer")
)

# install.packages("devtools")
# install the current version of metablastr on your system
devtools::install_github("drostlab/metablastr", build_vignettes = TRUE, dependencies = TRUE)

Please follow the Installation Vignette to install all standalone sequence search tools.

Quick start

library(metablastr)
# run blastn (nucleotide to nucleotide search) between example query and subject sequences
blast_test <- blast_nucleotide_to_nucleotide(
                 query   = system.file('seqs/qry_nn.fa', package = 'metablastr'),
                 subject = system.file('seqs/sbj_nn.fa', package = 'metablastr'),
                 output.path = tempdir(),
                 db.import  = FALSE)
                 
# look at BLAST results
blast_test

   query_id     subject_id perc_identity num_ident_match… alig_length mismatches
   <chr>        <chr>              <dbl>            <int>       <int>      <int>
 1 333554|PACi… AT1G01010…          84.2              640         760         63
 2 333554|PACi… AT1G01010…          84.0              536         638         90
 3 333554|PACi… AT1G01010…          78.6               44          56         12
 4 470181|PACi… AT1G01020…          94.7              699         738         39
 5 470180|PACi… AT1G01030…          95.2             1025        1077         40
 6 333551|PACi… AT1G01040…          96.0             3627        3779        125
 7 333551|PACi… AT1G01040…          95.5             1860        1948         82
 8 909874|PACi… AT1G01050…          96.6              617         639         22
 9 470177|PACi… AT1G01060…          92.8             1804        1944        110
10 918864|PACi… AT1G01070…          95.3             1046        1098         40
# ... with 13 more rows, and 15 more variables: gap_openings <int>,
#   n_gaps <int>, pos_match <int>, ppos <dbl>, q_start <int>, q_end <int>,
#   q_len <int>, qcov <int>, qcovhsp <int>, s_start <int>, s_end <dbl>,
#   s_len <dbl>, evalue <dbl>, bit_score <chr>, score_raw <int>

Discussions and Bug Reports

I would be very happy to learn more about potential improvements of the concepts and functions provided in this package.

Furthermore, in case you find some bugs, need additional (more flexible) functionality of parts of this package, or want to contribute to this project please let me know:

https://github.com/drostlab/metablastr/issues

Interfaces implemented in `metablastr`:

Perform BLAST searches

blast_protein_to_protein(): Perform Protein to Protein BLAST Searches (BLASTP)
blast_nucleotide_to_nucleotide(): Perform Nucleotide to Nucleotide BLAST Searches (BLASTN)
blast_nucleotide_to_protein(): Perform Nucleotide to Protein BLAST Searches (BLASTX)
blast_protein_to_nucleotide(): Perform Protein to Nucleotide BLAST Searches (TBLASTN)
blast_best_hit(): Retrieve only the best BLAST hit for each query
blast_best_reciprocal_hit(): Retrieve only the best reciprocal BLAST hit for each query
blast_rpsblast: Perform Reverse PSI-BLAST searches (rpsblast)
read_blast(): Import BLAST output into R session (in memory) or via PostgresSQL database connection.

BLAST against common NCBI databases

blast_protein_to_nr_database(): Perform Protein to Protein BLAST Searches against the NCBI non-redundant database
blast_nt(): Perform Nucleotide to Nucleotide BLAST Searches against the NCBI non-redundant database
blast_est(): Perform Nucleotide to Nucleotide BLAST Searches against the NCBI expressed sequence tags database
blast_pdb_protein():
blast_pdb_nucleotide():
blast_swissprot():
blast_delta():
blast_refseq_rna():
blast_refseq_gene():
blast_refseq_protein():

BLAST against a set of organisms

blast_nucleotide_to_genomes(): Perfrom BLAST Searches Against a Set of Genomes
blast_protein_to_proteomes(): Perfrom BLAST Searches Against a Set of Proteomes
detect_homologs_cds_to_cds(): Perform CDS to CDS BLAST Searches against a set of CDS files
detect_homologs_proteome_to_proteome(): Perform Proteome to Proteome BLAST Searches against a set of Proteomes
extract_hit_seqs_from_genomes(): Extract sequences of BLAST hits in respective genomes and store it as 'fasta' file(s)
extract_random_seqs_from_genome(): Extract random loci from a genome of interest
sample_chromosome_intervals(): Helper function to sample random intervals of length 'interval_width' from chromosomes

Analyze BLAST Report

filter_blast_:

Navigation functions

list_outformats(): List available BLAST output formats

metablastr's People

Contributors

Stargazers

Watchers

Forkers

michbur gogleva hongzhonglu jianshu93 alexga yongming-duan rufus-willy

metablastr's Issues

Feature Request: extract the same regions from multiple samples

It would be helpful if your command extract_random_seqs_from_genome() could be modified to include multiple samples, so that the same random sequences are extracted from multiple genomes (potentially based on scaffold/chromosome ID and position).

Reg: Reciprocal hits from already run blast outcomes

Hi!

I have performed the following to obtain blast format 6 tabular format:

A vs B with Diamond
B vs A with Diamond

I read AvB.hits and BvA.hits using read_blast() function. Will it be possible to identify the reciprocal hits? Or should I rerun the searches with metablastr?

Error in .Call2("new_input_filexp", filepath, PACKAGE = "XVector") : cannot open file ''

Hi I finally downloaded Blast+ and created the pathway to R. Now I am having issues loading my fasta file. Thanks!

I tried to load it from my hard drive and from my desktop and get the same errors:

sealionfeces <- readDNAStringSet("H:/ONRdolphinsealionpooled/sealionfecespooled/canu/medaka/consensus.fasta", + package = "rBLAST")) Error: unexpected ')' in: " sealionfeces <- readDNAStringSet("H:/ONRdolphinsealionpooled/sealionfecespooled/canu/medaka/consensus.fasta", package = "rBLAST"))"

sealionfeces <- readDNAStringSet(system.file("H:/ONRdolphinsealionpooled/sealionfecespooled/canu/medaka/consensus.fasta", + package = "rBLAST")) Error in .Call2("new_input_filexp", filepath, PACKAGE = "XVector") : cannot open file ''

sealionfeces <- readDNAStringSet(system.file("C:\Users\katie\OneDrive\Desktop\R\sealionfecespooled.consensus.fasta", Error: '\U' used without hex digits in character string starting ""C:\U"

sealionfeces <- readDNAStringSet(system.file("C:/Users/katie/OneDrive/Desktop/R/sealionfecespooled.consensus.fasta", + package = "rBLAST")) Error in .Call2("new_input_filexp", filepath, PACKAGE = "XVector") : cannot open file ''
--

| >

additional functionality when extracting random sequences

Hi,

I just stumbled across your package today while looking for a way to extract the same set of 10,000 1000kb loci randomly from five different genomes. Thank you so much for writing this package, I think it's going to be a huge help in this process. If you're still open to hearing requests for additional functionality/flexibility, I have two questions for you. For your function 'extract_random_seqs_from_genome', is it possible to set it so that replacement = FALSE once each random locus is selected and extracted (or is this already true)? Would it also be possible to set a minimum distance between randomly selected loci, e.g. if I wanted to specify that all loci are at least 50bp apart?

Thank you,
Amy

Feature Request: exclude sequences with Ns

In extract_random_seqs_from_genome(), It would be helpful to have an option that allows users to decide whether to exclude sequences with too many Ns (e.g. N > 0 or N > 10%). For me, it would be fine for this filtering step to happen after X sequences are drawn (e.g. if 100 sequences are drawn, then 10 are excluded because they have too many Ns, resulting in 90 sequences). It would be great to have a short printout at the end that says how many sequences were drawn and how many were filtered out due to an issue with Ns.

How can I load sequences from a CSV file and do the massive Blast search locally?

Hi,

I have a CSV dataset of edited DNA sequences by DADA2 pipeline and wonder how I can load and blast these sequences automatically using metablastr packages.

In the CSV file, each row represents a unique sequence and each column has a sample name(see attached image):

These COI gene sequences are clean and ready to Blast directly on NCBI website. Most of the sequences are from mammalian and avian blood. Since there are over 2000 sequences, it'd be great if I can use this package to load and blast automatically instead of manually.

Any R scripts to achieve this goal with metablastr package would highly appreciated. Thank you.

Best,

Gabriel

Feature request : add taxon id for each blast hit

Default blast tabular format output (outfmt 7) doesn't add taxon id for each blast hit. Taxon id is very important for downstream phylogenetic analysis. Indirect approach to add taxon id is to run the blastdbcmd with option %T once the results are obtained. This is very time consuming as you have to get taxon first and map back to original blast results. Can metablstr has function which can map taxon id to blast outcome ?

Package release request

Hi,
would be possible to provide a package release? It is required for creating the conda package.

Regards

blast_nt for nucleotide against NCBI database

Hello
thank you for developping this set of tools
I am trying to run my nucleotide sequences againts NCBI database
From your vignette, it should be blast_n()?
but the function is not available is the version of metablastr I just installed with

devtools::install_github("drostlab/metablastr", build_vignettes = TRUE, dependencies = TRUE)

do you have an alternative to this function for nucleotides?
https://drostlab.github.io/metablastr/reference/blast_protein_to_nr_database.html

does the database need to be stored locally? advices?

thank you!

function blast_protein_to_protein with argument is.subject.db = TRUE doesn't work with nr database

I have nr blast database downloaded from NCBI, which contains the files given in the attached snapshot. When I run the command below, it throws the as shown. I wonder, what input should I give as a blast-able database ?

blast_test <- blast_protein_to_protein(
        query   =  "aa_query.fasta",
        subject = "path/to/nr/db/nr",
        is.subject.db = TRUE,
        output.path = tempdir(),
        db.import  = FALSE ,cores = 4)

Error in .Call2("new_input_filexp", filepath, PACKAGE = "XVector") : 
  cannot open file '/Users/chiragparsania/Documents/Database/nr_protein_db/nr'

blast_best_reciprocal_hit nucleotide-protein comparison task error

I ran the following code:

blast_test_reciprocal <- blast_best_reciprocal_hit(
    query   = 'A.fasta', ##protein sequence
    subject = 'B.fasta', ##nucleotide sequence
    search_type = "protein_to_nucleotide",
    task = "tblastn",
    evalue = 0.000001,
    output.path = tempdir(),
    db.import  = FALSE)

which gives the following result:


Starting 'tblastn -task tblastn' with  query: A.fasta and subject: B.fasta using 1 core(s) ...

BLAST search finished! The BLAST output file was imported into the running R session. The BLAST output file has been stored at: C:/Users/A_B_tblastn_eval_1e-06.blast_tbl
Error: Please choose a nucleotide-protein comparison task that is supported by BLAST: task = 'blastx' or task = 'blastx-fast'.

How to specify the second blast task ('blastx') when performing tblastn?

Error: Internal error in `dict_hash_with()`: Dictionary is full.

Thank you metablastr developers for sharing this tool with the community. I'd like to seek for your help for the error I've encountered following blast_best_reciprocal_hit() run. Both BLASTp seem to have completed, but the reciprocal best hit step appears to have failed. One database I'm using has around 25M records, and I'm wondering if this could be the reason why the reciprocal best hit step failed. For reference, I'm sharing the snippets of the error:

BLAST search finished! The BLAST output file was imported into the running R session. The BLAST output file has been stored at: /expt/datb/data/HiC/Rp_RNA-Seq/embryo/timeseries-vs-RpedSuzhou/rna-seq_rpedszv-reannot/annot_gene_sym/metablastr_bbh/metazoa_refseq_biopython-validated_Riptortus_pedestris_SZV_blastp-fast_eval_1e-05.blast_tbl
Error: Internal error in dict_hash_with(): Dictionary is full.

rlang::last_error()
<error/rlang_error>
Internal error in dict_hash_with(): Dictionary is full.
Backtrace:

metablastr::blast_best_reciprocal_hit(...)

metablastr::blast_best_hit(...)

dplyr:::group_by.data.frame(blast_res, query_id)

dplyr::grouped_df(groups$data, groups$group_names, .drop)

dplyr:::compute_groups(data, vars, drop = drop)

dplyr:::vec_split_id_order(group_vars)

vctrs::vec_group_loc(x)
Run rlang::last_trace() to see the full context.

Is it also possible to skip the BLAST step to directly proceed with the reciprocal best hit step when re-running this procedure?

Thank you very much!

drostlab / metablastr Goto Github PK

metablastr's Introduction

metablastr

Seamless Integration of BLAST Sequence Searches in R

Motivation

Short package description

Citation

Install metablastr

For Linux Users:

For all systems install metablastr by typing

Quick start

Discussions and Bug Reports

Interfaces implemented in metablastr:

Perform BLAST searches

BLAST against common NCBI databases

BLAST against a set of organisms

Analyze BLAST Report

Navigation functions

metablastr's People

Contributors

Stargazers

Watchers

Forkers

metablastr's Issues

Recommend Projects

Recommend Topics

Recommend Org

Install `metablastr`

For all systems install `metablastr` by typing

Interfaces implemented in `metablastr`: