The tariff from verbal-autopsy-software

if statement that has no code attached

Logic is either missing or the entire statement can be removed to clean up the code.

Lines 115 to 117 in 5144ad8

    
           if(length(list) == 1){ 
        
           }

Throw a warning if intersect removes some columns

The code works and is harmless in intended case but it would be a good idea to alert the user. If they inadvertently passed in the wrong dataset or something, a warning would catch it.

Tariff/Tariff/R/Tariff.r

Lines 78 to 81 in 2324381

    
           # make sure train and test has the same columns 
        
           joint <- intersect(colnames(symps.train), colnames(symps.test)) 
        
           symps.test <- symps.test[, joint] 
        
           symps.train <- symps.train[, joint]

CSMF not being reported in correct order

It appears that when you use the reported csmf from the tariff function, this is not identical to calculating it directly from the most likely COD for each test subject. Furthermore, it appears that this has to do with the naming of the items:

data("RandomVA3")
test <- RandomVA3[1:200, ]
train <- RandomVA3[201:400, ]
allcauses <- unique(train$cause)
set.seed(123)
fit <- tariff(causes.train = "cause", symps.train = train, 
              symps.test = test, causes.table = allcauses)
fit$csmf
prop.table(table(fit$causes.test[,2]))

You can see above that in fact the percentages are the same, but the names associated with each one are not. This is important to know which is the right way to get the CSMF, as it may confuse users.

Check to see if column 1 is ID before assignment

An assumption is being made that ID is column 1. Safer to add a check first or alternatively could access by column name (this may require name to be part of the function input, though, if name may not be consistent)

Tariff/Tariff/R/Tariff.r

Lines 74 to 77 in 2324381

    
           id.train <- symps.train[, 1] 
        
           symps.train <- symps.train[, -1] 
        
           id.test <- symps.test[, 1] 
        
           symps.test <- symps.test[, -1]

Question: First Bootstrap steps unclear

The paper documenting this algorithm doesn't seem to provide the process to select significant tariffs in the first bootstrap. Can you clarify this process for us to aid in our review?

Tariff/Tariff/R/Tariff.r

Lines 203 to 233 in 2324381

    
           ################################################################## 
        
           # first bootstrap step, removing insignificant cause-symptom combo  
        
           if(use.sig){ 
        
           	cat("\nStart re-sampling for significant Tariff cells\n") 
        
           	all.tariff.boot <- array(0, dim = c(nboot.sig, C, S)) 
        
           	# all.tariff.score.boot <- matrix(0, N.boot*N.train, C) 
        
           	for(i in 1 : nboot.sig){ 
        
           		sample.boot <- sample(1:N.train, size = N.train, replace = TRUE) 
        
           		symps.boot <- symps.num[sample.boot, ] 
        
           		cause.boot <- causes.train[sample.boot] 
        
           		count.boot <- count.combo(symps.boot, cause.boot, causes.table2, binary=T) 
        
           		all.tariff.boot[i, , ] <- getTariff(count.boot) 
        
           		# if(i %% 10 == 0) cat(".") 
        
           	} 
        
           	# get the lower bound for tariff, C by S matrix 
        
           	lower <- apply(all.tariff.boot, c(2,3), function(x){quantile(x, 0.025)}) 
        
           	# get the upper bound for tariff, C by S matrix 
        
           	upper <- apply(all.tariff.boot, c(2,3), function(x){quantile(x, 1-0.025)}) 
        
           	# check if it covers zero 
        
           	cover <- sign(lower * upper) 
        
           	# define the indicator matrix of which tariff to be removed 
        
           	insig <- cover * 0 
        
           	insig[which(cover != -1)] <- 1 
        
           	if(sum(insig) == 0){ 
        
           		warning("No Tariff is significant, remove bootstrapping step") 
        
           		insig <- matrix(1, C, S) 
        
           	} 
        
           }else{ 
        
           	insig <- matrix(1, C, S) 
        
           }

Define all variables for clarity and documentation purposes

The documentation should define all variables for users who might not be familiar with the algorithms at a detailed level

Tariff/Tariff/R/Tariff.r

Lines 10 to 12 in 2324381

    
           #' @param symps.train N.train by S matrix  
        
           #' @param symps.test N.test by S matrix  
        
           #' @param causes.table list of causes in the data

Check for upper and lower case

Is it possible for y to be lower case here? If so, recommendation to compare using 'toUpper()' just in case.

Tariff/Tariff/R/Tariff.r

Lines 118 to 120 in 2324381

    
           }else{ 
        
           	count <- apply(cases, 2, function(x){length(which(x == "Y"))/length(x)}) 
        
           }

Documentation error

We believe N causes is a vector here, not a matrix

Tariff/Tariff/R/Tariff.r

Line 99 in 2324381

# causes : N matrix

Question: Ranking steps unclear

The paper documenting this algorithm does not provide details for the ranking process. What is the pseudocode for this part of the algorithm? At a first glance it appears this code (as well as the code leading up to calling this code) may be able to be simplified.

Tariff/Tariff/R/Tariff.r

Lines 153 to 176 in 2324381

    
           # function to convert score to rank compared only to Resampled training set 
        
           # @para 
        
           #	mat     : score matrix (C by N) 
        
           #   all     : all score matrix (K by C) 
        
           # return 
        
           #	mat.rank: score rank matrix (C by N) 
        
           toRank <- function(mat, all){ 
        
           	cat("Calculating ranks\n") 
        
           	N <- dim(mat)[2] 
        
           	C <- dim(mat)[1] 
        
           	# notice rank from small to large, so take negative 
        
           	mat.rank <- lapply(seq(1:C),  
        
           		function(kk){ 
        
           			# cat(".");  
        
           			return( 
        
           				sapply(mat[kk, ], function(x, y){rank(-c(x, y))[1]}, all[,kk]) 
        
           		          )}) 
        
           	out <- mat 
        
           	for(i in 1:C){ 
        
           		out[i, ] <- mat.rank[[i]] 
        
           	} 
        
           	return(out) 
        
           }

Rename causes.table2 to something more descriptive

The name causes.table2 does not help the reader/maintainer of the code to understand its meaning. Suggestion is to rename to something more descriptive.

Tariff/Tariff/R/Tariff.r

Lines 193 to 198 in 2324381

    
           	causes.table2 <- causes.table[-nonexist] 
        
           }else{ 
        
           	causes.table2 <- causes.table 
        
           } 
        
           S <- dim(symps.train)[2] 
        
           C <- length(causes.table2)

Alert user column is being removed

Although it may be harmless in the expected scenario, there is a possibility the user accidentally passed in the wrong dataset. It is never a bad idea to throw a warning to alert the user something is being deleted.

Tariff/Tariff/R/Tariff.r

Lines 63 to 67 in 2324381

    
           # also remove this from testing data if it is provided 
        
           if(!is.na(colindex2)){ 
        
           	causes.test <- symps.test[, colindex2] 
        
           	symps.test <- symps.test[, -colindex2]		 
        
           }

Rename indices for readability

The names colindex and colindex2 could be renamed to something having to do with train and test, respectively. This would improve readability and maintainability.

Tariff/Tariff/R/Tariff.r

Lines 54 to 55 in 2324381

    
           colindex <- match(causes.train, colnames(symps.train)) 
        
           colindex2 <- match(causes.train, colnames(symps.test))

if condition contains duplicated code

The code inside the if statement on line 323 is the same as what is outside at line 321 - is this intentional?

Tariff/Tariff/R/Tariff.r

Lines 321 to 324 in 5144ad8

    
           CSMF <- (table(c(causes.test, causes.table2)) - 1) / length(causes.test) 
        
           if(use.rank){ 
        
           	CSMF <- (table(c(causes.test, causes.table2)) - 1)/length(causes.test) 
        
           }

Code could be condensed

Could condense the following code to
causes.train.out <- data.frame(ID = id.train, cause = causes.train, stringsAsFactors = F)

Tariff/Tariff/R/Tariff.r

Lines 324 to 325 in 2324381

    
           causes.train.out <- data.frame(ID = id.train) 
        
           causes.train.out$cause <- as.character(causes.train)

Enhancement: Style Guide

We recommend adhering to the Tidyverse R style guide (https://style.tidyverse.org) to improve readability and consistency of code. The tidyverse style guide sums up the need for a consistent style in a particularly effective way: “Good coding style is like correct punctuation: you can manage without it, butitsuremakesthingseasiertoread.”

As the project continues to grow, particularly in an open source environment, having consistent style standards enables easy onboarding of new staff as the code can be quickly read and understood. While a specific style guide is inherently personal preference, we like the tidyverse style guide because it specifically applies to R and works with the lintR tool that can perform automated adherence checks.

Two examples where the style guide could be helpful:

Tariff/Tariff/R/Tariff.r

Line 49 in 2324381

    
           tariff <- function(causes.train, symps.train, symps.test, causes.table = NULL,  use.rank = TRUE, nboot.rank = 1, use.sig = TRUE, nboot.sig = 500, use.top = FALSE, ntop = 40, ...){

In general across the Tariff code, it appears that period separated variable names are used throughout for object names while functions are named in camelCase. However, some functions (like this one) are also named using period separation. Having a consistent naming convention will enable easy differentiation between code functionalities. Even more importantly, perhaps, the tidyverse style guide specifically recommends against using periods in names because R uses periods in other base functionality. Having periods in object/function/class names can result in confusing code such as as.data.frame.data.frame().

Tariff/Tariff/R/Tariff.r

Line 104 in 2324381

count.combo <- function(symps, causes, causelist, binary){

Sometimes code lines can get extremely long, and having to scroll to the right adds unnecessary complexity. Limiting line length and adding line breaks can make the code significantly easier to read. The tidyverse style guide specifically recommends:

Strive to limit your code to 80 characters per line. This fits comfortably on a printed page with a reasonably sized font […] If a function call is too long to fit on a single line, use one line each for the function name, each argument, and the closing ). This makes the code easier to read and to change later.

Comment mentions converting InterVA input

This mention of converting InterVA input is confusing. Should be fixed for better understanding and maintenance.

Tariff/Tariff/R/Tariff.r

Line 296 in 2324381

# convert InterVA input back into numeric form (N by S matrix)

Function defined but never used

The following code defines a function that is never used. Code should be removed.

Tariff/Tariff/R/Tariff.r

Lines 179 to 182 in 2324381

    
           # function to get the n-th largest element's index 
        
           which.ordern <- function(x, n, isMax){ 
        
           	return(order(x, decreasing = isMax)[n]) 
        
           }

Add more detail for use.rank in documentation

It would be helpful for the user documentation to have more detailed information about what a 'TRUE' value means for use.rank in case they are not completely familiar with the algorithm.

Tariff/Tariff/R/Tariff.r

Line 13 in 2324381

#' @param use.rank logical indicator for whether using ranks instead of scores

Code incorrectly sets top tariffs to zero

The algorithm describes taking the top 40 Tariffs, but this code seems to be setting the top 40 Tariffs to zero thereby only looking at 41 to the end

Tariff/Tariff/R/Tariff.r

Lines 235 to 251 in 2324381

    
           ########################################################################### 
        
           # calculate Tariff from training data 
        
           # function to remove tail tariff values 
        
           cleanTariff <- function(tariff, nonzero){ 
        
           	for(i in 1:dim(tariff)[1]){ 
        
           		order.tmp <- order(abs(tariff[i, ]), decreasing = TRUE) 
        
           		tariff[i, order.tmp[1:nonzero]] <- 0 
        
           	} 
        
           	return(tariff) 
        
           } 
        
           # calculate actual tariff and delete the insignificant combo 
        
           X.train <- count.combo(symps.num, causes.train, causes.table2, binary=T) 
        
           tariff <- getTariff(X.train) * insig 
        
           if(use.top){ 
        
           	tariff <- cleanTariff(tariff, ntop) 
        
           }

Unnecessary assignment code can be removed

Because of the join performed at line 318, it seems that line 321 is unnecessary. There should not be anything in CSMF that is not in causes.table2

Tariff/Tariff/R/Tariff.r

Lines 317 to 321 in 2324381

    
           # find CSMF for testing set 
        
           CSMF <- (table(c(causes.test, causes.table2)) - 1) / length(causes.test) 
        
           # names(CSMF) <- causes.table2 
        
           CSMF <- CSMF[causes.table2]

	# make sure train and test has the same columns
	joint <- intersect(colnames(symps.train), colnames(symps.test))
	symps.test <- symps.test[, joint]
	symps.train <- symps.train[, joint]

	id.train <- symps.train[, 1]
	symps.train <- symps.train[, -1]
	id.test <- symps.test[, 1]
	symps.test <- symps.test[, -1]

	##################################################################
	# first bootstrap step, removing insignificant cause-symptom combo
	if(use.sig){
	cat("\nStart re-sampling for significant Tariff cells\n")
	all.tariff.boot <- array(0, dim = c(nboot.sig, C, S))
	# all.tariff.score.boot <- matrix(0, N.boot*N.train, C)
	for(i in 1 : nboot.sig){
	sample.boot <- sample(1:N.train, size = N.train, replace = TRUE)
	symps.boot <- symps.num[sample.boot, ]
	cause.boot <- causes.train[sample.boot]
	count.boot <- count.combo(symps.boot, cause.boot, causes.table2, binary=T)
	all.tariff.boot[i, , ] <- getTariff(count.boot)
	# if(i %% 10 == 0) cat(".")
	}
	# get the lower bound for tariff, C by S matrix
	lower <- apply(all.tariff.boot, c(2,3), function(x){quantile(x, 0.025)})
	# get the upper bound for tariff, C by S matrix
	upper <- apply(all.tariff.boot, c(2,3), function(x){quantile(x, 1-0.025)})
	# check if it covers zero
	cover <- sign(lower * upper)
	# define the indicator matrix of which tariff to be removed
	insig <- cover * 0
	insig[which(cover != -1)] <- 1
	if(sum(insig) == 0){
	warning("No Tariff is significant, remove bootstrapping step")
	insig <- matrix(1, C, S)
	}

	}else{
	insig <- matrix(1, C, S)
	}

	#' @param symps.train N.train by S matrix
	#' @param symps.test N.test by S matrix
	#' @param causes.table list of causes in the data

	}else{
	count <- apply(cases, 2, function(x){length(which(x == "Y"))/length(x)})
	}

	# function to convert score to rank compared only to Resampled training set
	# @para
	# mat : score matrix (C by N)
	# all : all score matrix (K by C)
	# return
	# mat.rank: score rank matrix (C by N)
	toRank <- function(mat, all){
	cat("Calculating ranks\n")
	N <- dim(mat)[2]
	C <- dim(mat)[1]

	# notice rank from small to large, so take negative
	mat.rank <- lapply(seq(1:C),
	function(kk){
	# cat(".");
	return(
	sapply(mat[kk, ], function(x, y){rank(-c(x, y))[1]}, all[,kk])
	)})
	out <- mat
	for(i in 1:C){
	out[i, ] <- mat.rank[[i]]
	}
	return(out)
	}

	causes.table2 <- causes.table[-nonexist]
	}else{
	causes.table2 <- causes.table
	}
	S <- dim(symps.train)[2]
	C <- length(causes.table2)

	# also remove this from testing data if it is provided
	if(!is.na(colindex2)){
	causes.test <- symps.test[, colindex2]
	symps.test <- symps.test[, -colindex2]
	}

	colindex <- match(causes.train, colnames(symps.train))
	colindex2 <- match(causes.train, colnames(symps.test))

	CSMF <- (table(c(causes.test, causes.table2)) - 1) / length(causes.test)
	if(use.rank){
	CSMF <- (table(c(causes.test, causes.table2)) - 1)/length(causes.test)
	}

	causes.train.out <- data.frame(ID = id.train)
	causes.train.out$cause <- as.character(causes.train)

	# function to get the n-th largest element's index
	which.ordern <- function(x, n, isMax){
	return(order(x, decreasing = isMax)[n])
	}

	###########################################################################
	# calculate Tariff from training data
	# function to remove tail tariff values
	cleanTariff <- function(tariff, nonzero){
	for(i in 1:dim(tariff)[1]){
	order.tmp <- order(abs(tariff[i, ]), decreasing = TRUE)
	tariff[i, order.tmp[1:nonzero]] <- 0
	}
	return(tariff)
	}
	# calculate actual tariff and delete the insignificant combo
	X.train <- count.combo(symps.num, causes.train, causes.table2, binary=T)
	tariff <- getTariff(X.train) * insig

	if(use.top){
	tariff <- cleanTariff(tariff, ntop)
	}

	# find CSMF for testing set
	CSMF <- (table(c(causes.test, causes.table2)) - 1) / length(causes.test)

	# names(CSMF) <- causes.table2
	CSMF <- CSMF[causes.table2]

verbal-autopsy-software / tariff Goto Github PK

tariff's People

Contributors

Watchers

Forkers

tariff's Issues

if statement that has no code attached

if condition contains duplicated code

Recommend Projects

Recommend Topics

Recommend Org