Giter Site home page Giter Site logo

drostlab / metablastr Goto Github PK

View Code? Open in Web Editor NEW
31.0 5.0 8.0 672 KB

Seamless Integration of BLAST Sequence Searches in R

Home Page: https://drostlab.github.io/metablastr/

License: GNU General Public License v2.0

R 100.00%
blast-searches blast blastn blasting sequence-diversity biological-sequences biomartr species nucleotide blast-hits blast-search protein

metablastr's Introduction

metablastr

Seamless Integration of BLAST Sequence Searches in R

Motivation

With the rapid expansion of available sequences in biological databases, the landscape of modern life science research is being transformed. Currently, around several hundred thousand genomic sequences from a diverse array of species in the tree of life are freely accessible to the public. The R package biomartr enables automated access and retrieval of this vast data, paving the way to delve into the rich tapestry of sequence diversity, uncovering new insights into evolvability, variation, and the emergence of diseases.

The biomartr package streamlines the retrieval of a massive amount of biological sequence data in a standardized and reproducible manner. Complementing it, the metablastr package is tailored to facilitate large-scale sequence searches, also in a standardized and reproducible approach.

In synergy, biomartr and metablastr provide researchers with a comprehensive toolset, allowing them to efficiently gather thousands of biological sequences (genomes, proteomes, annotations, etc.) and conduct extensive sequence comparisons using the gold standard sequence search engine BLAST. This facilitates the extraction of novel patterns highlighting similarities and divergences among vast sets of species.

It's worth noting that the go-to instrument for large-scale sequence searches is BLAST (Basic Local Alignment Search Tool). It is purposefully built to identify regions of sequence similarity between a given query and subject sequences or sequence databases.

Building on these advancements, we have recently introduced DIAMOND2, a groundbreaking software solution designed to accelerate BLAST searches by an factor of up to 10,000x. To offer researchers even more flexibility and integration, we provide rdiamond, a dedicated interface package that allows programmatic handling of DIAMOND2 sequence searches directly through R. This not only streamlines the sequence search process but also ensures that researchers can access and utilize the power of DIAMOND2 within a familiar R environment.

Short package description

The metablastr package harnesses the power of BLAST by providing interface functions between R and the standalone (command line tool) version of this program.

Together, the metablastr package may enable a new level of data-driven genomics research by providing the computational tools and data science standards needed to perform reproducible research at scale.

Citation

The metablastr package is still under development and not formally published yet. However, we did develop parts of metablastr for this Methods chapter which you can cite until a metablastr specific manuscript is prepared.

M Benoit, HG Drost. A Predictive Approach to Infer the Activity and Natural Variation of Retrotransposon Families in Plants. In: Cho J. (eds) Plant Transposable Elements. Methods in Molecular Biology, vol 2250. Humana, New York, NY (2021).

Install metablastr

For Linux Users:

Please install the libpq-dev library on you linux machine by typing into the terminal:

sudo apt-get install libpq-dev

For all systems install metablastr by typing

# install BiocManager if required
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install()

# install package dependencies
BiocManager::install(
  c(
    "Biostrings",
    "GenomicFeatures",
    "GenomicRanges",
    "Rsamtools",
    "IRanges",
    "rtracklayer")
)

# install.packages("devtools")
# install the current version of metablastr on your system
devtools::install_github("drostlab/metablastr", build_vignettes = TRUE, dependencies = TRUE)

Please follow the Installation Vignette to install all standalone sequence search tools.

Quick start

library(metablastr)
# run blastn (nucleotide to nucleotide search) between example query and subject sequences
blast_test <- blast_nucleotide_to_nucleotide(
                 query   = system.file('seqs/qry_nn.fa', package = 'metablastr'),
                 subject = system.file('seqs/sbj_nn.fa', package = 'metablastr'),
                 output.path = tempdir(),
                 db.import  = FALSE)
                 
# look at BLAST results
blast_test
   query_id     subject_id perc_identity num_ident_match… alig_length mismatches
   <chr>        <chr>              <dbl>            <int>       <int>      <int>
 1 333554|PACi… AT1G01010…          84.2              640         760         63
 2 333554|PACi… AT1G01010…          84.0              536         638         90
 3 333554|PACi… AT1G01010…          78.6               44          56         12
 4 470181|PACi… AT1G01020…          94.7              699         738         39
 5 470180|PACi… AT1G01030…          95.2             1025        1077         40
 6 333551|PACi… AT1G01040…          96.0             3627        3779        125
 7 333551|PACi… AT1G01040…          95.5             1860        1948         82
 8 909874|PACi… AT1G01050…          96.6              617         639         22
 9 470177|PACi… AT1G01060…          92.8             1804        1944        110
10 918864|PACi… AT1G01070…          95.3             1046        1098         40
# ... with 13 more rows, and 15 more variables: gap_openings <int>,
#   n_gaps <int>, pos_match <int>, ppos <dbl>, q_start <int>, q_end <int>,
#   q_len <int>, qcov <int>, qcovhsp <int>, s_start <int>, s_end <dbl>,
#   s_len <dbl>, evalue <dbl>, bit_score <chr>, score_raw <int>

Discussions and Bug Reports

I would be very happy to learn more about potential improvements of the concepts and functions provided in this package.

Furthermore, in case you find some bugs, need additional (more flexible) functionality of parts of this package, or want to contribute to this project please let me know:

https://github.com/drostlab/metablastr/issues

Interfaces implemented in metablastr:

Perform BLAST searches

  • blast_protein_to_protein(): Perform Protein to Protein BLAST Searches (BLASTP)
  • blast_nucleotide_to_nucleotide(): Perform Nucleotide to Nucleotide BLAST Searches (BLASTN)
  • blast_nucleotide_to_protein(): Perform Nucleotide to Protein BLAST Searches (BLASTX)
  • blast_protein_to_nucleotide(): Perform Protein to Nucleotide BLAST Searches (TBLASTN)
  • blast_best_hit(): Retrieve only the best BLAST hit for each query
  • blast_best_reciprocal_hit(): Retrieve only the best reciprocal BLAST hit for each query
  • blast_rpsblast: Perform Reverse PSI-BLAST searches (rpsblast)
  • read_blast(): Import BLAST output into R session (in memory) or via PostgresSQL database connection.

BLAST against common NCBI databases

  • blast_protein_to_nr_database(): Perform Protein to Protein BLAST Searches against the NCBI non-redundant database
  • blast_nt(): Perform Nucleotide to Nucleotide BLAST Searches against the NCBI non-redundant database
  • blast_est(): Perform Nucleotide to Nucleotide BLAST Searches against the NCBI expressed sequence tags database
  • blast_pdb_protein():
  • blast_pdb_nucleotide():
  • blast_swissprot():
  • blast_delta():
  • blast_refseq_rna():
  • blast_refseq_gene():
  • blast_refseq_protein():

BLAST against a set of organisms

  • blast_nucleotide_to_genomes(): Perfrom BLAST Searches Against a Set of Genomes
  • blast_protein_to_proteomes(): Perfrom BLAST Searches Against a Set of Proteomes
  • detect_homologs_cds_to_cds(): Perform CDS to CDS BLAST Searches against a set of CDS files
  • detect_homologs_proteome_to_proteome(): Perform Proteome to Proteome BLAST Searches against a set of Proteomes
  • extract_hit_seqs_from_genomes(): Extract sequences of BLAST hits in respective genomes and store it as 'fasta' file(s)
  • extract_random_seqs_from_genome(): Extract random loci from a genome of interest
  • sample_chromosome_intervals(): Helper function to sample random intervals of length 'interval_width' from chromosomes

Analyze BLAST Report

  • filter_blast_:

Navigation functions

  • list_outformats(): List available BLAST output formats

metablastr's People

Contributors

gogleva avatar hajkd avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar

metablastr's Issues

Reg: Reciprocal hits from already run blast outcomes

Hi!

I have performed the following to obtain blast format 6 tabular format:

  1. A vs B with Diamond
  2. B vs A with Diamond

I read AvB.hits and BvA.hits using read_blast() function. Will it be possible to identify the reciprocal hits? Or should I rerun the searches with metablastr?

Error in .Call2("new_input_filexp", filepath, PACKAGE = "XVector") : cannot open file ''

Hi I finally downloaded Blast+ and created the pathway to R. Now I am having issues loading my fasta file. Thanks!

I tried to load it from my hard drive and from my desktop and get the same errors:

sealionfeces <- readDNAStringSet("H:/ONRdolphinsealionpooled/sealionfecespooled/canu/medaka/consensus.fasta", + package = "rBLAST")) Error: unexpected ')' in: " sealionfeces <- readDNAStringSet("H:/ONRdolphinsealionpooled/sealionfecespooled/canu/medaka/consensus.fasta", package = "rBLAST"))"

sealionfeces <- readDNAStringSet(system.file("H:/ONRdolphinsealionpooled/sealionfecespooled/canu/medaka/consensus.fasta", + package = "rBLAST")) Error in .Call2("new_input_filexp", filepath, PACKAGE = "XVector") : cannot open file ''

sealionfeces <- readDNAStringSet(system.file("C:\Users\katie\OneDrive\Desktop\R\sealionfecespooled.consensus.fasta", Error: '\U' used without hex digits in character string starting ""C:\U"

sealionfeces <- readDNAStringSet(system.file("C:/Users/katie/OneDrive/Desktop/R/sealionfecespooled.consensus.fasta", + package = "rBLAST")) Error in .Call2("new_input_filexp", filepath, PACKAGE = "XVector") : cannot open file ''
--
 
| >

additional functionality when extracting random sequences

Hi,

I just stumbled across your package today while looking for a way to extract the same set of 10,000 1000kb loci randomly from five different genomes. Thank you so much for writing this package, I think it's going to be a huge help in this process. If you're still open to hearing requests for additional functionality/flexibility, I have two questions for you. For your function 'extract_random_seqs_from_genome', is it possible to set it so that replacement = FALSE once each random locus is selected and extracted (or is this already true)? Would it also be possible to set a minimum distance between randomly selected loci, e.g. if I wanted to specify that all loci are at least 50bp apart?

Thank you,
Amy

Feature Request: exclude sequences with Ns

In extract_random_seqs_from_genome(), It would be helpful to have an option that allows users to decide whether to exclude sequences with too many Ns (e.g. N > 0 or N > 10%). For me, it would be fine for this filtering step to happen after X sequences are drawn (e.g. if 100 sequences are drawn, then 10 are excluded because they have too many Ns, resulting in 90 sequences). It would be great to have a short printout at the end that says how many sequences were drawn and how many were filtered out due to an issue with Ns.

How can I load sequences from a CSV file and do the massive Blast search locally?

Hi,

I have a CSV dataset of edited DNA sequences by DADA2 pipeline and wonder how I can load and blast these sequences automatically using metablastr packages.

In the CSV file, each row represents a unique sequence and each column has a sample name(see attached image):
Screen Shot 2022-08-11 at 6 53 11 PM

These COI gene sequences are clean and ready to Blast directly on NCBI website. Most of the sequences are from mammalian and avian blood. Since there are over 2000 sequences, it'd be great if I can use this package to load and blast automatically instead of manually.

Any R scripts to achieve this goal with metablastr package would highly appreciated. Thank you.

Best,

Gabriel

Feature request : add taxon id for each blast hit

Default blast tabular format output (outfmt 7) doesn't add taxon id for each blast hit. Taxon id is very important for downstream phylogenetic analysis. Indirect approach to add taxon id is to run the blastdbcmd with option %T once the results are obtained. This is very time consuming as you have to get taxon first and map back to original blast results. Can metablstr has function which can map taxon id to blast outcome ?

blast_nt for nucleotide against NCBI database

Hello
thank you for developping this set of tools
I am trying to run my nucleotide sequences againts NCBI database
From your vignette, it should be blast_n()?
but the function is not available is the version of metablastr I just installed with

devtools::install_github("drostlab/metablastr", build_vignettes = TRUE, dependencies = TRUE)

do you have an alternative to this function for nucleotides?
https://drostlab.github.io/metablastr/reference/blast_protein_to_nr_database.html

does the database need to be stored locally? advices?

thank you!

function blast_protein_to_protein with argument is.subject.db = TRUE doesn't work with nr database

I have nr blast database downloaded from NCBI, which contains the files given in the attached snapshot. When I run the command below, it throws the as shown. I wonder, what input should I give as a blast-able database ?

blast_test <- blast_protein_to_protein(
        query   =  "aa_query.fasta",
        subject = "path/to/nr/db/nr",
        is.subject.db = TRUE,
        output.path = tempdir(),
        db.import  = FALSE ,cores = 4)

Error in .Call2("new_input_filexp", filepath, PACKAGE = "XVector") : 
  cannot open file '/Users/chiragparsania/Documents/Database/nr_protein_db/nr'

image

blast_best_reciprocal_hit nucleotide-protein comparison task error

I ran the following code:

blast_test_reciprocal <- blast_best_reciprocal_hit(
    query   = 'A.fasta', ##protein sequence
    subject = 'B.fasta', ##nucleotide sequence
    search_type = "protein_to_nucleotide",
    task = "tblastn",
    evalue = 0.000001,
    output.path = tempdir(),
    db.import  = FALSE)

which gives the following result:


Starting 'tblastn -task tblastn' with  query: A.fasta and subject: B.fasta using 1 core(s) ...

BLAST search finished! The BLAST output file was imported into the running R session. The BLAST output file has been stored at: C:/Users/A_B_tblastn_eval_1e-06.blast_tbl
Error: Please choose a nucleotide-protein comparison task that is supported by BLAST: task = 'blastx' or task = 'blastx-fast'.

How to specify the second blast task ('blastx') when performing tblastn?

Error: Internal error in `dict_hash_with()`: Dictionary is full.

Thank you metablastr developers for sharing this tool with the community. I'd like to seek for your help for the error I've encountered following blast_best_reciprocal_hit() run. Both BLASTp seem to have completed, but the reciprocal best hit step appears to have failed. One database I'm using has around 25M records, and I'm wondering if this could be the reason why the reciprocal best hit step failed. For reference, I'm sharing the snippets of the error:

BLAST search finished! The BLAST output file was imported into the running R session. The BLAST output file has been stored at: /expt/datb/data/HiC/Rp_RNA-Seq/embryo/timeseries-vs-RpedSuzhou/rna-seq_rpedszv-reannot/annot_gene_sym/metablastr_bbh/metazoa_refseq_biopython-validated_Riptortus_pedestris_SZV_blastp-fast_eval_1e-05.blast_tbl
Error: Internal error in dict_hash_with(): Dictionary is full.

rlang::last_error()
<error/rlang_error>
Internal error in dict_hash_with(): Dictionary is full.
Backtrace:

  1. metablastr::blast_best_reciprocal_hit(...)
  2. metablastr::blast_best_hit(...)
  3. dplyr:::group_by.data.frame(blast_res, query_id)
  4. dplyr::grouped_df(groups$data, groups$group_names, .drop)
  5. dplyr:::compute_groups(data, vars, drop = drop)
  6. dplyr:::vec_split_id_order(group_vars)
  7. vctrs::vec_group_loc(x)
    Run rlang::last_trace() to see the full context.

Is it also possible to skip the BLAST step to directly proceed with the reciprocal best hit step when re-running this procedure?

Thank you very much!

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.