sage-bionetworks / challengescoring

This R package provides scoring mechanisms for computational challenges and implements the bayesBootLadderBoot approach for avoiding test data leakage.

License: Apache License 2.0

R 100.00%

dream challenges bootstrapping-statistics

challengescoring's Introduction

Installation

remotes::install_github("Sage-Bionetworks/challengescoring@master")

Usage

View vignettes.

challengescoring's Issues

Refactor package to make it easier to reuse scoring functions directly in challenge pipeline

The challengescoring functions were written with post-challenge analysis in mind.

Most challengescoring score functions are written as score_fun(prediction_vector, gold_vector)
Most challenge pipeline scoring functions are written as score_fun(path_to_pred, path_to_gold)

It would be much easier if the same scoring functions could both be bootstrapped and be called directly from the challenge pipeline.

One complication is that the scoring functions are called thousands of times when bootstrapping, so re-reading the CSV on every call could be slow. Additionally, the resampling must be paired across all prediction files being tested, so the resampling step has to happen outside the scoring function, not inside it. This is how it is currently implemented, but it would be harder to do if the scoring function read from a path rather than from a list of resampled prediction data frames.
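One possible way to reconcile the two calling conventions is sketched here in base R. It is not the package's implementation: score_rmse, as_path_scorer, paired_bootstrap, and the column names "pred" and "gold" are all hypothetical names chosen for illustration. The idea is to keep the vector-based scorer as the core, wrap it for the pipeline, and draw bootstrap indices once per iteration so that every prediction vector is resampled identically.

```r
# Hypothetical vector-based scorer (illustration only, not a package function).
score_rmse <- function(pred, gold) sqrt(mean((pred - gold)^2))

# Hypothetical adapter: wraps a vector scorer so the pipeline can call it with paths.
# The column names are assumptions about the file layout.
as_path_scorer <- function(score_fun, pred_col = "pred", gold_col = "gold") {
  function(path_to_pred, path_to_gold) {
    pred <- utils::read.csv(path_to_pred)
    gold <- utils::read.csv(path_to_gold)
    score_fun(pred[[pred_col]], gold[[gold_col]])
  }
}

# Paired bootstrap on data that has already been read into memory.
# The same resampled indices are applied to every prediction vector.
paired_bootstrap <- function(pred_list, gold, score_fun, n_boot = 1000) {
  n <- length(gold)
  scores <- matrix(NA_real_, nrow = n_boot, ncol = length(pred_list),
                   dimnames = list(NULL, names(pred_list)))
  for (b in seq_len(n_boot)) {
    idx <- sample.int(n, size = n, replace = TRUE)  # drawn once per iteration
    for (j in seq_along(pred_list)) {
      scores[b, j] <- score_fun(pred_list[[j]][idx], gold[idx])
    }
  }
  scores
}
```

With this split, the files are read once before bootstrapping, and the pipeline can call as_path_scorer(score_rmse)(path_to_pred, path_to_gold) without duplicating the scoring logic.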

Issue with challengescoring vignette

I'm getting this error trying to get the vignette to run:

library(survival)
goldStandard <- Surv(d.time, death) %>% as.character() %>% as.data.frame() %>% magrittr::set_colnames(c("gold"))
predictions <- age_new %>% as.data.frame() %>% magrittr::set_colnames(c("pred"))
prevPredictions <- age_previous %>% as.data.frame() %>% magrittr::set_colnames(c("pred"))

bootLadderBoot(predictions = predictions,
               predictionColname = "pred",
               prevPredictions = prevPredictions,
               goldStandard = goldStandard,
               goldStandardColname = "gold",
               scoreFun = c_statistic,
               verbose = T)

Error: `by` required, because the data sources have no common variables
16. stop(fallback)
15. signal_abort(cnd)
14. .abort(text)
13. glubort(fmt_args(args), ..., .envir = .envir)
12. bad_args("by", "required, because the data sources have no common variables")
11. common_by.NULL(by, x, y)
10. common_by(by, x, y)
9. full_join.tbl_df(tbl_df(x), y, by = by, copy = copy, ...)
8. full_join(tbl_df(x), y, by = by, copy = copy, ...)
7. as.data.frame(full_join(tbl_df(x), y, by = by, copy = copy, ...))
6. full_join.data.frame(goldStandardDF, predictionsDF)
5. dplyr::full_join(goldStandardDF, predictionsDF)
4. eval(lhs, parent, parent)
3. eval(lhs, parent, parent)
2. dplyr::full_join(goldStandardDF, predictionsDF) %>% dplyr::full_join(prevPredictionsDF) %>% dplyr::select_(goldStandardColname, predictionColname, "prevpred") %>% purrr::set_names("gold", "pred", "prevpred") at bootstrap.R#71
1. bootLadderBoot(predictions = predictions, predictionColname = "pred", prevPredictions = prevPredictions, goldStandard = goldStandard, goldStandardColname = "gold", scoreFun = c_statistic, verbose = T)
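The trace shows the failure in the join at bootstrap.R#71. The data frames passed in have one column each ("gold" and "pred"), so there is no shared variable to join on. A likely cause, though not one confirmed in this issue, is that the inputs need a common identifier column. The sketch below uses made-up values, a hypothetical key column named id, and a hypothetical "prevpred" column name. It uses base R merge() as an analogue of the chained full joins and does not call bootLadderBoot itself.

```r
# Toy stand-ins for the vignette objects (values are made up for illustration).
gold_values <- c(0.2, 0.8, 0.5)
new_values  <- c(0.1, 0.9, 0.4)
prev_values <- c(0.3, 0.7, 0.6)

# Hypothetical shared key; the column name `id` is an assumption.
ids <- seq_along(gold_values)
goldStandard    <- data.frame(id = ids, gold = gold_values)
predictions     <- data.frame(id = ids, pred = new_values)
prevPredictions <- data.frame(id = ids, prevpred = prev_values)

# Base-R analogue of the two chained full joins in the trace:
# with a common `id` column, the join key is unambiguous.
joined <- merge(merge(goldStandard, predictions, by = "id", all = TRUE),
                prevPredictions, by = "id", all = TRUE)
print(joined)
```

Whether bootLadderBoot accepts inputs shaped like this depends on the package version, so treat the example as a diagnostic aid rather than a confirmed fix.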

Set up Docker Hub build

Add Dockerfile

Add README

We should add a README for this repository. It should include:

What this package is
Contributing guide
How to run tests


# Truncate a vector for display: keep the first `trim` elements and append "...".
trim_vec <- function(vec, trim = 10){
  if(length(vec) > trim){
    vec <- c(as.character(vec[seq_len(trim)]), "...")
  }
  vec
}

# Validate a prediction CSV against a template CSV.
# Returns a named list of error messages (empty if no problems were found).
validate <- function(prediction_path, template_path){

  pred <- readr::read_csv(prediction_path)
  temp <- readr::read_csv(template_path)

  ### configure validation
  ncol_req <- ncol(temp)
  nrow_req <- nrow(temp)
  colnames_req <- colnames(temp)
  target_ids <- unique(temp$target)

  errs <- list()

  if(ncol(pred) < ncol_req){
    errs["ncol_short"] <- paste0("Prediction file is missing columns. Only ", ncol(pred), " columns detected.")
  }

  if(ncol(pred) > ncol_req){
    errs["ncol_long"] <- paste0("Prediction file has extra columns. ", ncol(pred), " columns detected.")
  }

  if(nrow(pred) < nrow_req){
    errs["nrow_short"] <- paste0("Prediction file is missing rows. Only ", nrow(pred), " rows detected.")
  }

  if(nrow(pred) > nrow_req){
    errs["nrow_long"] <- paste0("Prediction file has extra rows. ", nrow(pred), " rows detected.")
  }

  # Flag any column name that does not appear in the template.
  if(!all(colnames(pred) %in% colnames_req)){
    errs["colnames"] <- paste0("Column names are not correct. Column names must be ", paste(colnames_req, collapse = ", "))
  }

  # All columns after the first are treated as confidence values.
  values <- pred[-1]

  # Check types column by column before comparing values.
  if(!all(vapply(values, is.numeric, logical(1)))){
    errs["non_numeric"] <- "Predictions are not all numeric values."
  }else if(!isTRUE(all(values >= 0 & values <= 1))){
    # Missing values (NA) are also reported here.
    errs["wrong_range"] <- "Confidence values are not between 0 and 1."
  }

  if(any(!pred$target %in% target_ids)){
    invalid <- trim_vec(unique(pred$target[!pred$target %in% target_ids]))
    errs["non_target"] <- paste0("Invalid target identifiers were included in your prediction file (up to 10 displayed): ", paste(invalid, collapse = ", "))
  }

  return(errs)
}

Add CI to automatically test the package

It would be helpful to run all the tests on each pull request.