Now that we've started discussing classification, it's time to look at how to compare models with each other and choose the one that fits best. In regression we predicted continuous values, so it made sense to describe error as the distance between our estimates and the true values. When classifying a binary variable, however, each prediction is either correct or incorrect. As a result, we tend to break errors down into false positives and false negatives.
In particular, we look at a few specific measurements when evaluating the performance of a classification algorithm: precision, recall, and accuracy.
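For reference, the standard definitions of these measurements can be written in terms of the counts of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). The symbols are conventional shorthand introduced here, not names used elsewhere in this lab:

$$
\text{Precision} = \frac{TP}{TP + FP}
\qquad
\text{Recall} = \frac{TP}{TP + FN}
\qquad
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
$$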
At times, we may wish to tune a classification algorithm to optimize for precision or recall rather than overall accuracy. For example, imagine predicting whether a patient is at risk for cancer and should be brought in for additional testing. In cases such as this, we often want to cast a slightly wider net, so it is much preferable to optimize for recall, the percentage of actual cancer-positive cases that we flag, than to optimize for precision, the percentage of our predicted cancer-risk patients who are indeed positive.
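To make the "wider net" trade-off concrete, here is a minimal sketch in plain Python. The labels, scores, and thresholds are made-up illustrative values, and `count_outcomes` and `precision_recall` are hypothetical helper names rather than part of this lab's starter code.

```python
def count_outcomes(y_hat, y):
    """Count true positives, false positives and false negatives."""
    tp = sum(1 for p, t in zip(y_hat, y) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(y_hat, y) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(y_hat, y) if p == 0 and t == 1)
    return tp, fp, fn


def precision_recall(y_hat, y):
    """Return (precision, recall); each is 0.0 if its denominator is zero."""
    tp, fp, fn = count_outcomes(y_hat, y)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up data: 1 = at risk, 0 = not at risk.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
# Made-up predicted probabilities of being at risk.
scores = [0.9, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1, 0.1]

# A lower threshold flags more patients as at risk.
for threshold in (0.5, 0.25):
    y_pred = [1 if s >= threshold else 0 for s in scores]
    p, r = precision_recall(y_pred, y_true)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```

With these illustrative numbers, lowering the threshold from 0.5 to 0.25 raises recall from 0.67 to 1.00 while precision falls from 0.67 to 0.60: every at-risk patient is caught, at the cost of more false alarms.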
import pandas as pd
df = pd.read_csv()
#Your code here
#Your code here
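def precision(y_hat, y):
    # Your code here (y_hat: predicted labels, y: true labels)
    pass
def recall(y_hat, y):
    # Your code here (y_hat: predicted labels, y: true labels)
    pass
def accuracy(y_hat, y):
    # Your code here (y_hat: predicted labels, y: true labels)
    pass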
Calculate the precision, recall, and accuracy for both the train and the test set.
#Your code here
Plot the precision, recall and accuracy for test and train splits using different train set sizes. What do you notice?
import matplotlib.pyplot as plt
%matplotlib inline
training_Precision = []
testing_Precision = []
training_Recall = []
testing_Recall = []
training_Accuracy = []
testing_Accuracy = []
for i in range(10,95):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= None) #replace the "None" here
    logreg = LogisticRegression(fit_intercept = False, C = 1e12)
    model_log = None
    y_hat_test = None
    y_hat_train = None
    # 6 lines of code here
Create 3 scatter plots: test and train precision in the first one, test and train recall in the second one, and test and train accuracy in the third one.
# code for test and train precision
# code for test and train recall
# code for test and train accuracy