This repository contains scripts and data for training and utilizing machine learning classifiers to predict qPCR cross-amplification.


eDNAssay: a learned model of qPCR cross-amplification

We used supervised machine learning to improve the prediction of qPCR assay specificity. Our training data were produced with two reaction chemistries: SYBR Green intercalating dye and TaqMan MGB probes. A separate model was trained for each chemistry: a primer-only model (SYBR Green results) and a full-assay model (TaqMan probe-based results). The full-assay model is also available online as eDNAssay. Degenerate bases written with IUPAC ambiguity codes are accepted. Indels are treated as N (i.e., any base), which gives a conservative estimate of assay specificity.
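To illustrate how ambiguity codes and indels can be handled when counting base-pair mismatches, here is a minimal sketch in base R. It is not taken from the repository's scripts: the helper names `is_mismatch` and `count_mismatches` are hypothetical, and the sketch assumes the oligonucleotide and template are already aligned, equal in length, and written in the same orientation. Two symbols are counted as a mismatch only when the sets of bases they can represent do not overlap, and gap characters are recoded as N so that they match any base.

```r
# IUPAC nucleotide codes mapped to the bases each code can represent.
iupac <- list(
  A = "A", C = "C", G = "G", T = "T",
  R = c("A", "G"), Y = c("C", "T"), S = c("C", "G"), W = c("A", "T"),
  K = c("G", "T"), M = c("A", "C"),
  B = c("C", "G", "T"), D = c("A", "G", "T"),
  H = c("A", "C", "T"), V = c("A", "C", "G"),
  N = c("A", "C", "G", "T")
)

# Hypothetical helper: TRUE when two aligned symbols share no possible base.
is_mismatch <- function(x, y) {
  length(intersect(iupac[[x]], iupac[[y]])) == 0
}

# Hypothetical helper: count mismatches between an oligo and an aligned
# template of equal length. Gaps ("-") are recoded as N (any base) first,
# mirroring the conservative treatment of indels described above.
count_mismatches <- function(oligo, template) {
  oligo <- gsub("-", "N", toupper(oligo), fixed = TRUE)
  template <- gsub("-", "N", toupper(template), fixed = TRUE)
  stopifnot(nchar(oligo) == nchar(template))
  oligo_bases <- strsplit(oligo, "")[[1]]
  template_bases <- strsplit(template, "")[[1]]
  sum(mapply(is_mismatch, oligo_bases, template_bases))
}

# Example calls: a degenerate R (A or G) opposite G is not a mismatch,
# while Y (C or T) opposite G is.
count_mismatches("ACRT", "ACGT")
count_mismatches("ACYT", "ACGT")
```

Treating a gap as N means an indel never adds to the mismatch count, so a template with an indel is scored as more similar to the assay than it may really be. That errs toward predicting amplification, which is the conservative direction when the goal is to rule out cross-amplification.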

The impetus for building these models was to streamline the development of environmental DNA (eDNA) assays. Environmental DNA assays need to discriminate among suites of sequences that may be very similar. To ensure assay specificity, eDNA practitioners typically evaluate sequences from all closely related taxa (e.g., confamilials) within a pre-defined geographic area. Any taxa that are not deemed "different enough" in computer-based in silico testing must be put through time- and resource-intensive, laboratory-based in vitro testing. However, the determination that an assay is "different enough" in silico is often dubious. Rather than relying on thermodynamic models and simple mismatch heuristics, as the vast majority of existing in silico tools do, our models were trained on empirical data and are therefore highly accurate. Results from the model training scripts in this repository will vary slightly from those in Kronenberger et al. (2022) depending on the random seed selected.

IMPORTANT: eDNAssay was trained on qPCR results generated at the National Genomics Center for Wildlife and Fish Conservation, and its predictions may be less accurate under different reaction conditions. For optimal performance, we recommend that users either follow the reaction conditions outlined in Kronenberger et al. (2022) or reassess/retrain the model using their conditions of choice.

File guide

  • SYBR_training_data.csv - Empirical dataset containing information on base-pair mismatches, oligonucleotide characteristics, and the results of SYBR Green-based qPCR tests. These data were used to train the primer-only model. See Training_variable_definitions.xlsx for more information.
  • SYBR_testing_data.csv - Empirical dataset containing SYBR model predictions and qPCR results of assay-template combinations used for testing. See Testing_variable_definitions.xlsx for more information.
  • SYBR_model_training.R - Script used to train a random forest model to predict cross-amplification of SYBR Green-based qPCR assays.
  • SYBR_trained_model.RData - The learned primer-only model (SYBR Green results; produced using the SYBR_model_training.R script).
  • SYBR_optimal_thresholds.R - Script used to calculate optimal class assignment probability thresholds for the primer-only model (SYBR Green results) and a range of false negative (FN) to false positive (FP) cost ratios. For a given FN:FP cost ratio, the threshold that results in the lowest total error cost is optimal.
  • SYBR_specificity_prediction - Script used to calculate base-pair mismatches between assay oligonucleotides and templates, and then assign templates probabilities of belonging to the "amplify" class via the learned primer-only model (SYBR Green results).
  • TaqMan_training_data.csv - Empirical dataset containing information on base-pair mismatches, oligonucleotide characteristics, and the results of TaqMan probe-based qPCR tests. These data were used to train the full-assay model. See Training_variable_definitions.xlsx for more information.
  • TaqMan_testing_data.csv - Empirical dataset containing TaqMan model predictions and qPCR results of assay-template combinations used for testing. See Testing_variable_definitions.xlsx for more information.
  • TaqMan_model_training.R - Script used to train a random forest model to predict cross-amplification of TaqMan probe-based qPCR assays.
  • TaqMan_trained_model.RData - The learned full-assay model (TaqMan probe-based results; produced using the TaqMan_model_training.R script), referred to as eDNAssay.
  • TaqMan_optimal_thresholds.R - Script used to calculate optimal class assignment probability thresholds for the full-assay model (TaqMan probe-based results) and a range of false negative (FN) to false positive (FP) cost ratios. For a given FN:FP cost ratio, the threshold that results in the lowest total error cost is optimal.
  • eDNAssay_offline_version.R - Script used to calculate base-pair mismatches between assay oligonucleotides and templates, and then assign templates probabilities of belonging to the "amplify" class via the learned full-assay model (TaqMan probe-based results). This may be used as an alternative to the eDNAssay Shiny app.
  • Testing_variable_definitions.xlsx - Spreadsheet defining the column names in SYBR_testing_data.csv and TaqMan_testing_data.csv.
  • Training_variable_definitions.xlsx - Spreadsheet defining the column names in SYBR_training_data.csv and TaqMan_training_data.csv.
  • app.R - Script behind the eDNAssay Shiny app.
  • eDNAssay_alignment_example.fas - An example sequence alignment file for use with eDNAssay.
  • eDNAssay_AP_stats.R - Script used to calculate summary statistics (minimum, maximum, mean, and standard deviation of the mean) of assignment probabilities when multiple sequences are included per taxon.
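
The threshold selection described for SYBR_optimal_thresholds.R and TaqMan_optimal_thresholds.R can be sketched as follows. This is an illustrative example in base R, not the code from those scripts: the function names `total_cost` and `optimal_threshold`, the 0.01 threshold grid, and the toy data are all assumptions. For a given FN:FP cost ratio, each candidate threshold is scored by its total error cost (FN cost times the number of false negatives, plus FP cost times the number of false positives), and the threshold with the lowest cost is returned.

```r
# Hypothetical helper: total error cost at a single threshold.
# prob     = assignment probability for the "amplify" class
# observed = 1 if the template amplified in vitro, 0 if it did not
total_cost <- function(threshold, prob, observed, fn_cost, fp_cost = 1) {
  predicted <- as.integer(prob >= threshold)
  fn <- sum(predicted == 0 & observed == 1)  # amplified, predicted not to
  fp <- sum(predicted == 1 & observed == 0)  # did not amplify, predicted to
  fn_cost * fn + fp_cost * fp
}

# Hypothetical helper: scan a grid of thresholds (assumed step of 0.01)
# and return the first one with the lowest total error cost.
optimal_threshold <- function(prob, observed, fn_cost, fp_cost = 1,
                              thresholds = seq(0, 1, by = 0.01)) {
  costs <- vapply(thresholds, total_cost, numeric(1),
                  prob = prob, observed = observed,
                  fn_cost = fn_cost, fp_cost = fp_cost)
  thresholds[which.min(costs)]
}

# Toy data, invented for illustration only (not from the repository).
prob     <- c(0.05, 0.20, 0.35, 0.60, 0.80, 0.95)
observed <- c(0,    0,    1,    0,    1,    1)

# A high FN:FP cost ratio pushes the optimal threshold down;
# a low ratio pushes it up.
optimal_threshold(prob, observed, fn_cost = 10)
optimal_threshold(prob, observed, fn_cost = 0.1)
```

When several thresholds tie for the lowest cost, this sketch returns the smallest of them. That tie-breaking rule is a choice made here for simplicity and may differ from the repository's scripts.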

Contact information

Please reach out to us at the National Genomics Center for Wildlife and Fish Conservation with any questions or comments. Scripts and models were created by John Kronenberger at [email protected] and Taylor Wilcox at [email protected].
