Giter Site home page Giter Site logo

rdiamond's Introduction

rdiamond

Visitors

Seamless Integration of DIAMOND2 Sequence Searches in R

Motivation

We are excited to introduce DIAMOND2, a cutting-edge pairwise protein aligner tailored to meet the extensive demands of the Earth BioGenome Project and other expansive genomics initiatives. DIAMOND2 is a groundbreaking software solution designed to accelerate BLAST searches by an factor of up to 10,000x. To offer researchers even more flexibility and integration, we provide rdiamond, a dedicated interface package that allows programmatic handling of DIAMOND2 sequence searches directly through R.

The rdiamond package offers streamlined interface functions, enabling users to seamlessly run DIAMOND2 directly within R. Notably, it's designed to handle vast outputs, processing terabytes of DIAMOND2 hit files directly from the disk on a local machine, bypassing memory limitations.

Furthermore, when paired with the biomartr R package, users have the convenience of automatically fetching large-scale genomic data and subsequently searching through it using rdiamond."

This version emphasizes the utility and integration capabilities of the rdiamond package while maintaining clarity.

Install rdiamond

For Linux Users:

Please install the libpq-dev library on you linux machine by typing into the terminal:

sudo apt-get install libpq-dev

For all systems install rdiamond by typing

# install Bioconductor
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install()

# install Biostrings -> see here for different Biostrings verions:
# http://bioconductor.org/about/release-announcements/
BiocManager::install(c("Biostrings"))

# install.packages("devtools")
# install the current version of rdiamond on your system
devtools::install_github("drostlab/rdiamond", build_vignettes = TRUE, dependencies = TRUE)

Citation

This R package is not formally published yet, but please cite the following paper when using this software for your research:

Buchfink B, Reuter K, Drost HG, "Sensitive protein alignments at tree-of-life scale using DIAMOND", Nature Methods 18, 366–368 (2021). doi:10.1038/s41592-021-01101-x

Quick start

# run diamond assuming that the diamond executable is available
# via the system path ('diamond_exec_path = NULL') and using
# sensitivity_mode = "ultra-sensitive"
diamond_example <- rdiamond::diamond_protein_to_protein(
              query   = system.file('seqs/qry_aa.fa', package = 'rdiamond'),
              subject = system.file('seqs/sbj_aa.fa', package = 'rdiamond'),
              sensitivity_mode = "ultra-sensitive",
              output_path = tempdir(),
              use_arrow_duckdb_connection  = FALSE)

# look at DIAMOND results
diamond_example
Running diamond with 'diamond blastp  with query: /library/rdiamond/seqs/qry_aa.fa and subject: /library/rdiamond/seqs/sbj_aa.fa using 2 core(s) ...


Alignment Sensitivity Mode:  --ultra-sensitive and max-target-seqs: 500


Masking of low-complexity regions: TANTAN


diamond v2.0.4.142 (C) Max Planck Society for the Advancement of Science
Documentation, support and updates available at http://www.diamondsearch.org

#CPU threads: 4
Scoring parameters: (Matrix=BLOSUM62 Lambda=0.267 K=0.041 Penalties=11/1)
Database input file: /library/rdiamond/seqs/sbj_aa.fa
Opening the database file...  [0.001s]
Loading sequences...  [0s]
Masking sequences...  [0.003s]
Writing sequences...  [0s]
Hashing sequences...  [0s]
Loading sequences...  [0s]
Writing trailer...  [0s]
Closing the input file...  [0s]
Closing the database file...  [0s]
Database hash = e7cd8e84df51b22dd27f3c01d5765fe1
Processed 20 sequences, 9444 letters.
Total time = 0.006s
diamond v2.0.4.142 (C) Max Planck Society for the Advancement of Science
Documentation, support and updates available at http://www.diamondsearch.org

#CPU threads: 2
Scoring parameters: (Matrix=BLOSUM62 Lambda=0.267 K=0.041 Penalties=11/1)
Temporary directory: /var/folders/yn/mgwl8_b56hz4v2c2vlfxj07w00076j/T//RtmpUCSsbL
Opening the database...  [0.002s]
#Target sequences to report alignments for: 500
Reference = /var/folders/yn/mgwl8_b56hz4v2c2vlfxj07w00076j/T//RtmpUCSsbL/sbj_aa.fa.dmnd
Sequences = 20
Letters = 9444
Block size = 400000000
Opening the input file...  [0s]
Opening the output file...  [0s]
Loading query sequences...  [0s]
Masking queries...  [0.002s]
Building query seed set... Algorithm: Double-indexed
 [0s]
Building query histograms...  [0.008s]
Allocating buffers...  [0s]
Loading reference sequences...  [0s]
Masking reference...  [0.001s]
Initializing temporary storage...  [0.013s]
Building reference histograms...  [0.007s]
Allocating buffers...  [0s]
Processing query block 1, reference block 1/1, shape 1/64.
Building reference seed array...  [0s]
Building query seed array...  [0s]
Computing hash join...  [0.001s]
Building seed filter...  [0.001s]
Searching alignments...  [0.002s]

... 


Building reference seed array...  [0s]
Building query seed array...  [0s]
Computing hash join...  [0s]
Building seed filter...  [0.033s]
Searching alignments...  [0.001s]
Deallocating buffers...  [0s]
Clearing query masking...  [0s]
Computing alignments...  [0.065s]
Deallocating reference...  [0s]
Loading reference sequences...  [0s]
Deallocating buffers...  [0s]
Deallocating queries...  [0s]
Loading query sequences...  [0s]
Closing the input file...  [0s]
Closing the output file...  [0s]
Closing the database file...  [0s]
Deallocating taxonomy...  [0s]
Total time = 0.797s
Reported 20 pairwise alignments, 20 HSPs.
20 queries aligned.


DIAMOND search finished successfully! 


The DIAMOND output file was imported into the running R session. The DIAMOND output file has been stored at: /var/folders/RtmpUCSsbL/qry_aa_sbj_aa_blastp_eval_0.001.blast_tbl
A tibble: 20 x 20
   query_id subject_id perc_identity num_ident_match…
   <chr>    <chr>              <dbl>            <int>
 1 333554|… AT1G01010…          73.2              347
 2 470181|… AT1G01020…          91.1              224
 3 470180|… AT1G01030…          93.3              335
 4 333551|… AT1G01040…          93.4             1840
 5 909874|… AT1G01050…         100                213
 6 470177|… AT1G01060…          87.5              567
 7 918864|… AT1G01070…          92.6              339
 8 909871|… AT1G01080…          89.3              268
 9 470171|… AT1G01090…          96.8              420
10 333544|… AT1G01110…          87.7              463
11 918858|… AT1G01120…          99.2              525
12 470161|… AT1G01140…          98.5              446
13 918855|… AT1G01150…          72.6              207
14 918854|… AT1G01160…          78.8              141
15 311317|… AT1G01170…          85.6               83
16 909860|… AT1G01180…          92.6              287
17 311315|… AT1G01190…          94.2              502
18 470156|… AT1G01200…          95.8              228
19 311313|… AT1G01210…          95.3              102
20 470155|… AT1G01220…          96.5             1019
# … with 16 more variables: alig_length <int>,
#   mismatches <int>, gap_openings <int>, n_gaps <int>,
#   pos_match <int>, ppos <dbl>, q_start <int>,
#   q_end <int>, q_len <int>, qcovhsp <dbl>, s_start <int>,
#   s_end <int>, s_len <int>, evalue <dbl>,
#   bit_score <dbl>, score_raw <dbl>

or

dplyr::glimpse(diamond_example)
Rows: 20
Columns: 20
$ query_id          <chr> "333554|PACid:16033839", "470181|PACid:16064328", "470180|PAC…
$ subject_id        <chr> "AT1G01010.1", "AT1G01020.1", "AT1G01030.1", "AT1G01040.1", "…
$ perc_identity     <dbl> 73.2, 91.1, 93.3, 93.4, 100.0, 87.5, 92.6, 89.3, 96.8, 87.7, …
$ num_ident_matches <int> 347, 224, 335, 1840, 213, 567, 339, 268, 420, 463, 525, 446, …
$ alig_length       <int> 474, 246, 359, 1969, 213, 648, 366, 300, 434, 528, 529, 453, …
$ mismatches        <int> 75, 22, 20, 58, 0, 71, 23, 25, 8, 65, 4, 6, 68, 30, 0, 20, 30…
$ gap_openings      <int> 8, 0, 2, 7, 0, 5, 2, 2, 3, 0, 0, 1, 3, 2, 1, 1, 1, 0, 0, 0
$ n_gaps            <int> 52, 0, 4, 71, 0, 10, 4, 7, 6, 0, 0, 1, 10, 8, 14, 3, 1, 0, 0,…
$ pos_match         <int> 369, 231, 338, 1870, 213, 587, 342, 275, 425, 475, 527, 448, …
$ ppos              <dbl> 77.8, 93.9, 94.2, 95.0, 100.0, 90.6, 93.4, 91.7, 97.9, 90.0, …
$ q_start           <int> 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 16, 2, 4, 1, 6, 1, 1, 1
$ q_end             <int> 466, 246, 355, 1963, 213, 640, 362, 299, 433, 528, 529, 453, …
$ q_len             <int> 466, 246, 355, 1963, 213, 640, 362, 299, 433, 528, 529, 453, …
$ qcovhsp           <dbl> 100.0, 100.0, 100.0, 99.9, 100.0, 100.0, 100.0, 100.0, 100.0,…
$ s_start           <int> 1, 1, 1, 6, 1, 1, 1, 1, 1, 1, 1, 1, 5, 4, 2, 1, 1, 1, 1, 1
$ s_end             <int> 430, 246, 359, 1910, 213, 646, 366, 294, 429, 528, 529, 452, …
$ s_len             <int> 430, 246, 359, 1910, 213, 646, 366, 294, 429, 528, 529, 452, …
$ evalue            <dbl> 1.83e-212, 2.79e-157, 4.66e-209, 0.00e+00, 4.51e-158, 0.00e+0…
$ bit_score         <dbl> 584, 428, 568, 3541, 427, 1041, 613, 499, 816, 841, 1029, 866…
$ score_raw         <dbl> 1506, 1100, 1464, 9181, 1098, 2691, 1581, 1284, 2108, 2172, 2…

Discussions and Bug Reports

I would be very happy to learn more about potential improvements of the concepts and functions provided in this package.

Furthermore, in case you find some bugs, need additional (more flexible) functionality of parts of this package, or want to contribute to this project please let me know:

https://github.com/drostlab/rdiamond/issues

rdiamond's People

Contributors

hajkd avatar lotharukpongjs avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar

rdiamond's Issues

Number of cores not recognised in `diamond_protein_to_protein_best_hits()`

Hi @HajkD ,

I was using diamond_protein_to_protein_best_hits() and I noticed that the diamond_protein_to_protein_best_hits() and probably the diamond_protein_to_protein_best_reciprocal_hits() do not recognise the number of cores added to the parameter cores.

> best_rec_hits <- rdiamond::diamond_protein_to_protein_best_hits(
+   query = file_path.ENSEMBL, 
+   subject = file_path.REFSEQ,
+   cores = 8)
diamond v2.1.8.162 (C) Max Planck Society for the Advancement of Science, Benjamin Buchfink, University of Tuebingen
Documentation, support and updates available at http://www.diamondsearch.org
Please cite: http://dx.doi.org/10.1038/s41592-021-01101-x Nature Methods (2021)

#CPU threads: 8
Scoring parameters: (Matrix=BLOSUM62 Lambda=0.267 K=0.041 Penalties=11/1)
Database input file: _uniprot_downloads/proteomes/Caenorhabditis_elegans_protein_uniprot.faa.gz
Opening the database file...  [0.028s]
Loading sequences...  [0.093s]
Masking sequences...  [1.337s]
Writing sequences...  [0.02s]
Hashing sequences...  [0.003s]
Loading sequences...  [0s]
Writing trailer...  [0s]
Closing the input file...  [0s]
Closing the database file...  [0.009s]

Database sequences  19827
  Database letters  8133982
     Database hash  6a425f34faeffe53753689f5f3d11321
        Total time  1.494000s
diamond v2.1.8.162 (C) Max Planck Society for the Advancement of Science, Benjamin Buchfink, University of Tuebingen
Documentation, support and updates available at http://www.diamondsearch.org
Please cite: http://dx.doi.org/10.1038/s41592-021-01101-x Nature Methods (2021)

#CPU threads: 1
Scoring parameters: (Matrix=BLOSUM62 Lambda=0.267 K=0.041 Penalties=11/1)

etc.

I believe this is because the internal call of rdiamond::diamond_protein_to_protein() in these two functions doesn't set the cores parameter, thus defaulting to cores = 1. This can be resolved by adding this parameter to the two functions.

hit_tbl <- rdiamond::diamond_protein_to_protein(
query = query,
subject = subject,
is_subject_db = is_subject_db,
task = "blastp",
sensitivity_mode = sensitivity_mode,
out_format = out_format,
evalue = evalue,
max_target_seqs = max_target_seqs,
hard_mask = hard_mask,
diamond_exec_path = diamond_exec_path,
add_makedb_options = add_makedb_options,
add_diamond_options = add_diamond_options,
output_path = output_path)

I will try to change this and make a pull request when I have the time.

Best,
Sodai

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.