Giter Site home page Giter Site logo

tariff's People

Contributors

mboyas-mitre avatar richardli avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar

tariff's Issues

Throw a warning if intersect removes some columns

The code works and is harmless in intended case but it would be a good idea to alert the user. If they inadvertently passed in the wrong dataset or something, a warning would catch it.

Tariff/Tariff/R/Tariff.r

Lines 78 to 81 in 2324381

# make sure train and test has the same columns
joint <- intersect(colnames(symps.train), colnames(symps.test))
symps.test <- symps.test[, joint]
symps.train <- symps.train[, joint]

CSMF not being reported in correct order

It appears that when you use the reported csmf from the tariff function, this is not identical to calculating it directly from the most likely COD for each test subject. Furthermore, it appears that this has to do with the naming of the items:

data("RandomVA3")
test <- RandomVA3[1:200, ]
train <- RandomVA3[201:400, ]
allcauses <- unique(train$cause)
set.seed(123)
fit <- tariff(causes.train = "cause", symps.train = train, 
              symps.test = test, causes.table = allcauses)
fit$csmf
prop.table(table(fit$causes.test[,2]))

You can see above that in fact the percentages are the same, but the names associated with each one are not. This is important to know which is the right way to get the CSMF, as it may confuse users.

Check to see if column 1 is ID before assignment

An assumption is being made that ID is column 1. Safer to add a check first or alternatively could access by column name (this may require name to be part of the function input, though, if name may not be consistent)

Tariff/Tariff/R/Tariff.r

Lines 74 to 77 in 2324381

id.train <- symps.train[, 1]
symps.train <- symps.train[, -1]
id.test <- symps.test[, 1]
symps.test <- symps.test[, -1]

Question: First Bootstrap steps unclear

The paper documenting this algorithm doesn't seem to provide the process to select significant tariffs in the first bootstrap. Can you clarify this process for us to aid in our review?

Tariff/Tariff/R/Tariff.r

Lines 203 to 233 in 2324381

##################################################################
# first bootstrap step, removing insignificant cause-symptom combo
if(use.sig){
cat("\nStart re-sampling for significant Tariff cells\n")
all.tariff.boot <- array(0, dim = c(nboot.sig, C, S))
# all.tariff.score.boot <- matrix(0, N.boot*N.train, C)
for(i in 1 : nboot.sig){
sample.boot <- sample(1:N.train, size = N.train, replace = TRUE)
symps.boot <- symps.num[sample.boot, ]
cause.boot <- causes.train[sample.boot]
count.boot <- count.combo(symps.boot, cause.boot, causes.table2, binary=T)
all.tariff.boot[i, , ] <- getTariff(count.boot)
# if(i %% 10 == 0) cat(".")
}
# get the lower bound for tariff, C by S matrix
lower <- apply(all.tariff.boot, c(2,3), function(x){quantile(x, 0.025)})
# get the upper bound for tariff, C by S matrix
upper <- apply(all.tariff.boot, c(2,3), function(x){quantile(x, 1-0.025)})
# check if it covers zero
cover <- sign(lower * upper)
# define the indicator matrix of which tariff to be removed
insig <- cover * 0
insig[which(cover != -1)] <- 1
if(sum(insig) == 0){
warning("No Tariff is significant, remove bootstrapping step")
insig <- matrix(1, C, S)
}
}else{
insig <- matrix(1, C, S)
}

Question: Ranking steps unclear

The paper documenting this algorithm does not provide details for the ranking process. What is the pseudocode for this part of the algorithm? At a first glance it appears this code (as well as the code leading up to calling this code) may be able to be simplified.

Tariff/Tariff/R/Tariff.r

Lines 153 to 176 in 2324381

# function to convert score to rank compared only to Resampled training set
# @para
# mat : score matrix (C by N)
# all : all score matrix (K by C)
# return
# mat.rank: score rank matrix (C by N)
toRank <- function(mat, all){
cat("Calculating ranks\n")
N <- dim(mat)[2]
C <- dim(mat)[1]
# notice rank from small to large, so take negative
mat.rank <- lapply(seq(1:C),
function(kk){
# cat(".");
return(
sapply(mat[kk, ], function(x, y){rank(-c(x, y))[1]}, all[,kk])
)})
out <- mat
for(i in 1:C){
out[i, ] <- mat.rank[[i]]
}
return(out)
}

Alert user column is being removed

Although it may be harmless in the expected scenario, there is a possibility the user accidentally passed in the wrong dataset. It is never a bad idea to throw a warning to alert the user something is being deleted.

Tariff/Tariff/R/Tariff.r

Lines 63 to 67 in 2324381

# also remove this from testing data if it is provided
if(!is.na(colindex2)){
causes.test <- symps.test[, colindex2]
symps.test <- symps.test[, -colindex2]
}

Rename indices for readability

The names colindex and colindex2 could be renamed to something having to do with train and test, respectively. This would improve readability and maintainability.

Tariff/Tariff/R/Tariff.r

Lines 54 to 55 in 2324381

colindex <- match(causes.train, colnames(symps.train))
colindex2 <- match(causes.train, colnames(symps.test))

if condition contains duplicated code

if condition contains duplicated code

The code inside the if statement on line 323 is the same as what is outside at line 321 - is this intentional?

Tariff/Tariff/R/Tariff.r

Lines 321 to 324 in 5144ad8

CSMF <- (table(c(causes.test, causes.table2)) - 1) / length(causes.test)
if(use.rank){
CSMF <- (table(c(causes.test, causes.table2)) - 1)/length(causes.test)
}

Code could be condensed

Could condense the following code to
causes.train.out <- data.frame(ID = id.train, cause = causes.train, stringsAsFactors = F)

Tariff/Tariff/R/Tariff.r

Lines 324 to 325 in 2324381

causes.train.out <- data.frame(ID = id.train)
causes.train.out$cause <- as.character(causes.train)

Enhancement: Style Guide

We recommend adhering to the Tidyverse R style guide (https://style.tidyverse.org) to improve readability and consistency of code. The tidyverse style guide sums up the need for a consistent style in a particularly effective way: “Good coding style is like correct punctuation: you can manage without it, butitsuremakesthingseasiertoread.”

As the project continues to grow, particularly in an open source environment, having consistent style standards enables easy onboarding of new staff as the code can be quickly read and understood. While a specific style guide is inherently personal preference, we like the tidyverse style guide because it specifically applies to R and works with the lintR tool that can perform automated adherence checks.

Two examples where the style guide could be helpful:

tariff <- function(causes.train, symps.train, symps.test, causes.table = NULL, use.rank = TRUE, nboot.rank = 1, use.sig = TRUE, nboot.sig = 500, use.top = FALSE, ntop = 40, ...){

In general across the Tariff code, it appears that period separated variable names are used throughout for object names while functions are named in camelCase. However, some functions (like this one) are also named using period separation. Having a consistent naming convention will enable easy differentiation between code functionalities. Even more importantly, perhaps, the tidyverse style guide specifically recommends against using periods in names because R uses periods in other base functionality. Having periods in object/function/class names can result in confusing code such as as.data.frame.data.frame().

count.combo <- function(symps, causes, causelist, binary){

Sometimes code lines can get extremely long, and having to scroll to the right adds unnecessary complexity. Limiting line length and adding line breaks can make the code significantly easier to read. The tidyverse style guide specifically recommends:

Strive to limit your code to 80 characters per line. This fits comfortably on a printed page with a reasonably sized font […] If a function call is too long to fit on a single line, use one line each for the function name, each argument, and the closing ). This makes the code easier to read and to change later.

Code incorrectly sets top tariffs to zero

The algorithm describes taking the top 40 Tariffs, but this code seems to be setting the top 40 Tariffs to zero thereby only looking at 41 to the end

Tariff/Tariff/R/Tariff.r

Lines 235 to 251 in 2324381

###########################################################################
# calculate Tariff from training data
# function to remove tail tariff values
cleanTariff <- function(tariff, nonzero){
for(i in 1:dim(tariff)[1]){
order.tmp <- order(abs(tariff[i, ]), decreasing = TRUE)
tariff[i, order.tmp[1:nonzero]] <- 0
}
return(tariff)
}
# calculate actual tariff and delete the insignificant combo
X.train <- count.combo(symps.num, causes.train, causes.table2, binary=T)
tariff <- getTariff(X.train) * insig
if(use.top){
tariff <- cleanTariff(tariff, ntop)
}

Unnecessary assignment code can be removed

Because of the join performed at line 318, it seems that line 321 is unnecessary. There should not be anything in CSMF that is not in causes.table2

Tariff/Tariff/R/Tariff.r

Lines 317 to 321 in 2324381

# find CSMF for testing set
CSMF <- (table(c(causes.test, causes.table2)) - 1) / length(causes.test)
# names(CSMF) <- causes.table2
CSMF <- CSMF[causes.table2]

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.