Chapter 10 Tuning in Classification
What metrics are we going to use when we train our classification models? In kNN, for example, our hyperparameter is \(k\), the number of observations in each bin. In our applications with the mnist_27 and Adult datasets, \(k\) was determined by a metric called accuracy. What is it? If the choice of \(k\) depends on the metric we use in tuning, can we improve our prediction performance by using a different metric? Moreover, accuracy is calculated from the confusion table, and the confusion table changes with the discriminating threshold used for labeling predicted probabilities. These are important questions in classification problems. We will begin answering them in this chapter.
10.1 Confusion matrix
In general, whether it is done for training or for evaluating a fitted model, measuring the performance of a classification model is an important issue that has to be well understood before fitting or training a model.
To evaluate a model’s fit, we can look at its predictive accuracy. In classification problems, this requires predicting \(Y\), as either 0 or 1, from the predicted value of \(p(x)\), such as
\[ \hat{Y}=\left\{\begin{array}{ll}{1,} & {\hat{p}\left(x_{1}, \ldots, x_{k}\right)>\frac{1}{2}} \\ {0,} & {\hat{p}\left(x_{1}, \ldots, x_{k}\right)<\frac{1}{2}}\end{array}\right. \]
From this transformation of \(\hat{p}(x)\) to \(\hat{Y}\), the overall predictive accuracy can be summarized with a matrix,
\[ \begin{array}{ccc}{\text { Predicted vs. Reality}} & {Y=1} & {Y=0} \\ {\hat{Y}=1} & {\text { TP }_{}} & {\text { FP }_{}} \\ {\hat{Y}=0} & {\text { FN }_{}} & {\text { TN }_{}}\end{array} \]
where TP, FP, FN, and TN are True Positives, False Positives, False Negatives, and True Negatives, respectively. This table is also known as the confusion table or confusion matrix. The name is intuitive because the table makes it easy to see whether, and how, the system is confusing the two classes.
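To make this transformation concrete, here is a minimal R sketch with simulated (hypothetical) data: y stands for the observed 0/1 outcome and phat for predicted probabilities from some fitted model. The 0.5 threshold turns phat into \(\hat{Y}\), and table() builds the confusion matrix (note that table() orders the 0 row and column first, unlike the layout above).

# Hypothetical data: y is the observed outcome, phat is a stand-in for model-based probabilities
set.seed(123)
y <- rbinom(100, 1, 0.4)
phat <- runif(100)
yhat <- ifelse(phat > 0.5, 1, 0)          # apply the 1/2 threshold
table("Predicted" = yhat, "Actual" = y)   # confusion matrix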
There are many metrics that can be calculated from this table. Let’s use an example given in Wikipedia:
\[ \begin{array}{ccc}{\text { Predicted vs. Reality}} & {Y=Cat} & {Y=Dog} \\ {\hat{Y}=Cat} & {\text { 5 }_{}} & {\text { 2 }_{}} \\ {\hat{Y}=Dog} & {\text { 3 }_{}} & {\text { 3 }_{}}\end{array} \]
According to this confusion matrix, there are 8 actual cats and 5 actual dogs (column totals). The learning algorithm, however, predicts only 5 cats and 3 dogs correctly. The model predicts 3 cats as dogs and 2 dogs as cats. All correct predictions are located in the diagonal of the table, so it is easy to visually inspect the table for prediction errors, as they will be represented by values outside the diagonal.
In predictive analytics, this table (matrix) allows more detailed analysis than mere proportion of correct classifications (accuracy). Accuracy (\((TP+TN)/n\)) is not a reliable metric for the real performance of a classifier, when the dataset is unbalanced in terms of numbers of observations in each class.
To see how misleading \((TP+TN)/n\) could be, suppose there were 95 cats and only 5 dogs in our example. If we chose accuracy as the performance measure in training, our learning algorithm might classify every observation as a cat, because the overall accuracy would then be 95%. In that case, however, all the dogs would be misclassified as cats.
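The same point in a few lines of R, with made-up labels for this hypothetical 95/5 split:

# Hypothetical example: 95 cats, 5 dogs, and a "model" that labels everything as a cat
actual <- c(rep("Cat", 95), rep("Dog", 5))
predicted <- rep("Cat", 100)
mean(predicted == actual)   # accuracy is 0.95, yet every dog is misclassified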
10.2 Performance measures
Which metrics should we be using in training our classification models? This question becomes more important when the classes are not balanced. Moreover, in some situations false predictions are more costly than in others. When we try to predict cancer, for example, minimizing false negatives (the model misses cancer patients) is more important than minimizing false positives (the model wrongly predicts cancer). When we build an algorithm to detect spam emails, however, false positives would be the target to minimize rather than false negatives.
Here is the full picture of various metrics using the same confusion table from Wikipedia:
Let’s summarize some of the metrics and their use with examples for detecting cancer:
- Accuracy: the number of correct predictions (with and without cancer) relative to the total number of observations (patients), \((TP+TN)/n\). This can be used when the classes are balanced, with no worse than roughly a 60-40% split.
- Balanced Accuracy: the average of the positive and negative hit rates, \((TP/P + TN/N)/2\), where \(P\) and \(N\) are the numbers of actual positives and negatives. This is preferred when the class balance is worse than a 60-40% split.
- Precision: the percentage of positive predictions that are correct. That is, the proportion of patients that we predict as having cancer who actually have cancer, \(TP/(TP+FP)\).
- Sensitivity: the percentage of actual positives that are predicted correctly. That is, the proportion of patients who actually have cancer that the algorithm correctly identifies as having cancer, \(TP/(TP+FN)\). This measure is also called the True Positive Rate or Recall.
- Specificity: the percentage of actual negatives that are predicted correctly. That is, the proportion of patients who do not have cancer that the model correctly identifies as non-cancerous, \(TN/(TN+FP)\). This measure is also called the True Negative Rate. (A short R sketch that computes these measures from a confusion table follows the summary below.)
Here is the summary:
\[ \begin{array}{ccc}{\text { Predicted vs. Reality}} & {Y=Cat} & {Y=Dog} \\ {\hat{Y}=Cat} & {\text {TPR or Sensitivity }_{}} & {\text { FPR or Fall-out }_{}} \\ {\hat{Y}=Dog} & {\text { FNR or Miss Rate }_{}} & {\text { TNR or Specificity }_{}}\end{array} \]
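As referenced above, here is a small helper, written for this chapter rather than taken from any package, that computes these measures from a 2x2 confusion matrix laid out as in the text (rows = predicted, columns = actual, positive class first):

conf_metrics <- function(ct) {
  TP <- ct[1, 1]; FP <- ct[1, 2]
  FN <- ct[2, 1]; TN <- ct[2, 2]
  c(Accuracy    = (TP + TN) / sum(ct),
    BalancedAcc = (TP / (TP + FN) + TN / (TN + FP)) / 2,
    Precision   = TP / (TP + FP),
    Sensitivity = TP / (TP + FN),
    Specificity = TN / (TN + FP))
}

# The cat-and-dog table from above ("Cat" is treated as the positive class)
ct <- matrix(c(5, 3, 2, 3), 2, 2,
             dimnames = list(Predicted = c("Cat", "Dog"), Actual = c("Cat", "Dog")))
conf_metrics(ct)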
Kappa is also reported in most cases. It is an interesting measure because it compares the actual performance of the prediction with what it would be if the prediction were made at random. For example, suppose that your model predicts \(Y\) with 95% accuracy. How impressive is that if a random assignment would also predict 70% of the \(Y\)s correctly? Let’s use an example:
\[ \begin{array}{ccc}{\text { Predicted vs. Reality}} & {Y=Cat} & {Y=Dog} \\ {\hat{Y}=Cat} & {\text { 22 }_{}} & {\text { 9 }_{}} \\ {\hat{Y}=Dog} & {\text { 7 }_{}} & {\text { 13 }_{}}\end{array} \]
In this case the accuracy is \((22+13)/51 = 0.69\). But how much of it is due to the model’s performance itself? In other words, the distribution of cats and dogs can itself give a predictive clue, so that a certain level of prediction accuracy could be achieved by chance, without any learning algorithm. For the TP cell in the table, this can be calculated as the difference between the observed accuracy (OA) and the expected accuracy (EA),
\[ \mathrm{(OA-EA)_{TP}}=\mathrm{Pr}(\hat{Y}=Cat)[\mathrm{Pr}(Y=Cat |\hat{Y}= Cat)-\mathrm{P}(Y=Cat)], \]
Remember from your statistics class that if two variables are independent, the conditional probability of \(X\) given \(Y\) has to be equal to the marginal probability of \(X\). Therefore, inside the brackets, the difference between the conditional probability, which reflects the probability of predicting cats due to the model, and the marginal probability of observing actual cats captures the true predictive power of the model, because it removes the part of the accuracy that could be achieved at random. The same logic applies to the TN cell:
\[ \mathrm{(OA-EA)_{TN}}=\mathrm{Pr}(\hat{Y}=Dog)[\mathrm{Pr}(Y=Dog |\hat{Y}= Dog)-\mathrm{P}(Y=Dog)], \]
If we use the joint and marginal probability definitions, each of these terms can be written as
\[ OA-EA=\frac{m_{ij}}{n}-\frac{m_{i \cdot}\, m_{\cdot j}}{n^{2}}, \]
where \(m_{ij}\) is the count in the corresponding diagonal cell and \(m_{i \cdot}\) and \(m_{\cdot j}\) are its row and column totals.
Here is the calculation of Kappa for our example:
Total, \(n = 51\),
\(OA-EA\) for \(TP\) = \(22/51 - 31 \times 29/51^2 = 0.0857\)
\(OA-EA\) for \(TN\) = \(13/51 - 20 \times 22/51^2 = 0.0857\)
And we normalize by \(1-EA = 1 - \left(31 \times 29/51^2 + 20 \times 22/51^2\right) = 0.4852\), which is the value that \(OA-EA\) would take if the prediction were 100% correct.
Hence, Kappa: \((0.0857+0.0857) / 0.4852 = 0.3534\)
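A quick way to verify this arithmetic is to compute Kappa directly from the table in R; this is a minimal sketch, not a replacement for the Kappa reported by packages such as caret:

# Kappa from the 2x2 table above (rows = predicted, columns = actual)
ct <- matrix(c(22, 7, 9, 13), 2, 2)
n  <- sum(ct)
OA <- sum(diag(ct)) / n                      # observed accuracy
EA <- sum(rowSums(ct) * colSums(ct)) / n^2   # accuracy expected by chance
(OA - EA) / (1 - EA)                         # Kappa, about 0.35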
Finally, Youden’s J statistic, also known as Youden’s index or Informedness, is a single statistic that captures the performance of a prediction. It is simply \(J=TPR+TNR-1\) and ranges between 0 and 1, indicating useless and perfect prediction performance, respectively. This metric is also related to Receiver Operating Characteristics (ROC) curve analysis, which is the subject of the next section.
10.3 ROC Curve
Our outcome variable is categorical (\(Y = 1\) or \(0\)). Most classification algorithms calculate the predicted probability of success (\(Y = 1\)). If the probability is larger than a fixed cut-off (the discriminating threshold), then we assume that the model predicts success (\(Y = 1\)); otherwise, we assume that it predicts failure (\(Y = 0\)). As a result, the comparison of the observed and predicted values summarized in a confusion table depends on the threshold. The predictive accuracy of a model as a function of the threshold can be summarized by the Area Under the Curve (AUC) of the Receiver Operating Characteristics (ROC) curve. The ROC curve, which is a graphical plot that illustrates the diagnostic ability of a binary classifier, reflects a trade-off between the True Positive Rate (TPR) and the False Positive Rate (FPR). Hence, a successful model is one whose predictions increase the TPR without raising the FPR. The ROC curve was first used during World War II for the analysis of radar signals, before it was employed in signal detection theory.
Here is a visualization:
Let’s start with an example where we have 100 individuals, 50 with \(y_i=1\) and 50 with \(y_i=0\), which is well-balanced. Suppose we use a discriminating threshold (0%) that puts everybody into Category 1, or a threshold (100%) that puts everybody into Category 0, that is,
\[ \hat{Y}=\left\{\begin{array}{ll}{1,} & {\hat{p}\left(x_{1}, \ldots, x_{k}\right)>0 \%} \\ {0,} & {\hat{p}\left(x_{1}, \ldots, x_{k}\right)\leq0 \%}\end{array}\right. \]
and,
\[ \hat{Y}=\left\{\begin{array}{ll}{1,} & {\hat{p}\left(x_{1}, \ldots, x_{k}\right)>100 \%} \\ {0,} & {\hat{p}\left(x_{1}, \ldots, x_{k}\right)\leq100 \%}\end{array}\right. \]
these would have led to the following confusion tables, respectively:
\[ \begin{array}{ccc}{\text { Predicted vs. Reality}} & {Y=1} & {Y=0} \\ {\hat{Y}=1} & {\text { 50 }_{}} & {\text { 50 }_{}} \\ {\hat{Y}=0} & {\text { 0 }_{}} & {\text { 0 }_{}}\end{array} \] \[ \begin{array}{ccc}{\text { Predicted vs. Reality}} & {Y=1} & {Y=0} \\ {\hat{Y}=1} & {\text { 0 }_{}} & {\text { 0 }_{}} \\ {\hat{Y}=0} & {\text { 50 }_{}} & {\text { 50 }_{}}\end{array} \]
In the first case, \(TPR = 1\) and \(FPR = 1\); in the second case, \(TPR = 0\) and \(FPR = 0\). So when we calculate all possible confusion tables for threshold values ranging from 0% to 100%, we obtain a set of (\(TPR, FPR\)) points, each corresponding to one threshold. The ROC curve is the curve that connects these points.
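Before moving to a real dataset, here is a toy check of these two extreme thresholds with hypothetical probabilities; every observation ends up in a single predicted class, which is what produces the (1, 1) and (0, 0) corners of the ROC curve:

# 50 ones and 50 zeros with made-up predicted probabilities
set.seed(1)
y <- rep(c(1, 0), each = 50)
phat <- runif(100)
table("yhat" = phat > 0, y)   # threshold 0%: everyone predicted 1, so TPR = FPR = 1
table("yhat" = phat > 1, y)   # threshold 100%: everyone predicted 0, so TPR = FPR = 0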
Let’s use an example with the Boston Housing Market dataset to illustrate ROC:
library(MASS)
data(Boston)

# Create our binary outcome
data <- Boston[, -14] #Dropping "medv"
data$dummy <- c(ifelse(Boston$medv > 25, 1, 0))

# Use logistic regression for classification
model <- glm(dummy ~ ., data = data, family = "binomial")
summary(model)
##
## Call:
## glm(formula = dummy ~ ., family = "binomial", data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.3498 -0.2806 -0.0932 -0.0006 3.3781
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.312511 4.876070 1.090 0.275930
## crim -0.011101 0.045322 -0.245 0.806503
## zn 0.010917 0.010834 1.008 0.313626
## indus -0.110452 0.058740 -1.880 0.060060 .
## chas 0.966337 0.808960 1.195 0.232266
## nox -6.844521 4.483514 -1.527 0.126861
## rm 1.886872 0.452692 4.168 3.07e-05 ***
## age 0.003491 0.011133 0.314 0.753853
## dis -0.589016 0.164013 -3.591 0.000329 ***
## rad 0.318042 0.082623 3.849 0.000118 ***
## tax -0.010826 0.004036 -2.682 0.007314 **
## ptratio -0.353017 0.122259 -2.887 0.003884 **
## black -0.002264 0.003826 -0.592 0.554105
## lstat -0.367355 0.073020 -5.031 4.88e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 563.52 on 505 degrees of freedom
## Residual deviance: 209.11 on 492 degrees of freedom
## AIC: 237.11
##
## Number of Fisher Scoring iterations: 7
And our prediction (in-sample):
# Classified Y's by TRUE and FALSE
yHat <- model$fitted.values > 0.5
conf_table <- table(yHat, data$dummy)

#let's change the order of cells
ctt <- as.matrix(conf_table)
ct <- matrix(0, 2, 2)
ct[1,1] <- ctt[2,2]
ct[2,2] <- ctt[1,1]
ct[1,2] <- ctt[2,1]
ct[2,1] <- ctt[1,2]

rownames(ct) <- c("Yhat = 1", "Yhat = 0")
colnames(ct) <- c("Y = 1", "Y = 0")
ct
## Y = 1 Y = 0
## Yhat = 1 100 16
## Yhat = 0 24 366
It would be much easier if we created our own function to rotate a matrix/table:
rot <- function(x){
  t <- apply(x, 2, rev)
  tt <- apply(t, 1, rev)
  return(t(tt))
}

ct <- rot(conf_table)
rownames(ct) <- c("Yhat = 1", "Yhat = 0")
colnames(ct) <- c("Y = 1", "Y = 0")
ct
##
## yHat Y = 1 Y = 0
## Yhat = 1 100 16
## Yhat = 0 24 366
Now we calculate our TPR, FPR, and J-Index:
#TPR
TPR <- ct[1,1]/(ct[1,1]+ct[2,1])
TPR
## [1] 0.8064516

#FPR
FPR <- ct[1,2]/(ct[1,2]+ct[2,2])
FPR
## [1] 0.04188482

#J-Index
TPR - FPR
## [1] 0.7645668
These rates are calculated for a threshold of 0.5. We can obtain a pair of \(TPR\) and \(FPR\) for every possible discriminating threshold. What is the set of possible thresholds? We will use our fitted values, \(\hat{p}\), for this.
#We create an ordered grid from our fitted values
summary(model$fitted.values)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
##  0.000000  0.004205  0.035602  0.245059  0.371758  0.999549

phat <- model$fitted.values[order(model$fitted.values)]
length(phat)
## [1] 506

#We need to have containers for the pairs of TPR and FPR
TPR <- c()
FPR <- c()

#Now the loop
for (i in 1:length(phat)) {
  yHat <- model$fitted.values > phat[i]
  conf_table <- table(yHat, data$dummy)
  ct <- as.matrix(conf_table)
  if(sum(dim(ct)) > 3){ #here we ignore the thresholds 0 and 1
    TPR[i] <- ct[2,2]/(ct[2,2]+ct[1,2])
    FPR[i] <- ct[2,1]/(ct[1,1]+ct[2,1])
  }
}

plot(FPR, TPR, col = "blue", type = "l", main = "ROC", lwd = 3)
abline(a = 0, b = 1, col = "red")
We observe several things on this curve. First, there is a trade-off between TPR and FPR. Roughly after a TPR of 70%, further increases in TPR can only be achieved by accepting a higher FPR; hence, if we care most about keeping the FPR as low as possible, we can fix the discriminating threshold around that point.
Second, we can identify the best discriminating threshold, the one that makes the distance between TPR and FPR largest. In other words, we can identify the threshold where the marginal gain in TPR equals the marginal cost in FPR. This can be done with Youden’s J statistic, \(J=TPR+TNR-1\), which identifies the best discriminating threshold. Note that \(TNR = 1-FPR\); hence \(J = TPR-FPR\).
# Youden's J Statistics
J <- TPR - FPR

# The best discriminating threshold
phat[which.max(J)]
##       231
## 0.1786863

#TPR and FPR at this threshold
TPR[which.max(J)]
## [1] 0.9354839
FPR[which.max(J)]
## [1] 0.1361257
J[which.max(J)]
## [1] 0.7993582
This simple example shows that the best (in-sample) fit can be achieved by
\[ \hat{Y}=\left\{\begin{array}{ll}{1,} & {\hat{p}\left(x_{1}, \ldots, x_{k}\right)>17.86863 \%} \\ {0,} & {\hat{p}\left(x_{1}, \ldots, x_{k}\right)\leq17.86863 \%}\end{array}\right. \]
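As an in-sample check, we can apply this threshold to the fitted values and rebuild the confusion table, reusing the model, phat, J, and rot() objects created above; this is only a sanity check of the threshold choice, not a new estimation step.

# Re-classify with the best threshold found by Youden's J
yHat_best <- model$fitted.values > phat[which.max(J)]
ct_best <- rot(table(yHat_best, data$dummy))
rownames(ct_best) <- c("Yhat = 1", "Yhat = 0")
colnames(ct_best) <- c("Y = 1", "Y = 0")
ct_best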
10.4 AUC - Area Under the Curve
Finally, we measure predictive accuracy by the area under the ROC curve. An area of 1 represents perfect performance, while an area of 0.5 represents a worthless prediction, because an area of 0.5 suggests the performance is no better than random chance.
For example, an accepted rough guide for classifying the accuracy of a diagnostic test in medical procedures is
0.90-1.00 = Excellent (A)
0.80-0.90 = Good (B)
0.70-0.80 = Fair (C)
0.60-0.70 = Poor (D)
0.50-0.60 = Fail (F)
Since the formula and its derivation are beyond the scope of this chapter, we will use the package ROCR to calculate it.
library(ROCR)

data$dummy <- c(ifelse(Boston$medv > 25, 1, 0))
model <- glm(dummy ~ ., data = data, family = "binomial")
phat <- model$fitted.values

phat_df <- data.frame(phat, "Y" = data$dummy)
pred_rocr <- prediction(phat_df[,1], phat_df[,2])
perf <- performance(pred_rocr, "tpr", "fpr")

plot(perf, colorize = TRUE)
abline(a = 0, b = 1)

auc_ROCR <- performance(pred_rocr, measure = "auc")
AUC <- auc_ROCR@y.values[[1]]
AUC
## [1] 0.9600363
This ROC curve is the same as the one that we developed earlier.
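As a rough cross-check, we can approximate the AUC without ROCR by applying the trapezoidal rule to the (FPR, TPR) pairs from our earlier loop, dropping the NA entries created by the skipped thresholds; the result should come out very close to the value above.

# Trapezoidal approximation of the area under our hand-built ROC curve
ok  <- !is.na(TPR) & !is.na(FPR)
ord <- order(FPR[ok], TPR[ok])
x <- FPR[ok][ord]   # FPR sorted in increasing order
y <- TPR[ok][ord]
sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)   # approximate AUC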
When we train a model, each run (with different training and test sets) will produce a different AUC. These differences in AUC across training and validation sets create uncertainty about the AUC. Consequently, the asymptotic properties of the AUC for comparing alternative models have become a subject of discussion in the literature.
Another important point is that, while the AUC summarizes the entire area under the curve, our interest is often in a specific region of TPR or FPR. Hence it is possible that, of two competing algorithms, one has a higher overall AUC while the other has a better AUC in that specific region. This issue can be seen in the following figure taken from Bad practices in evaluation methodology relevant to class-imbalanced problems by Jan Brabec and Lukas Machlica (2018):
For example, in the domain of network traffic intrusion-detection, the imbalance ratio is often higher than 1:1000, and the cost of a false alarm for an applied system is very high. This is due to increased analysis and remediation costs of infected devices. In such systems, the region of interest on the ROC curve is for false positive rate at most 0.0001. If AUC was computed in the usual way over the complete ROC curve then 99.99% of the area would be irrelevant and would represent only noise in the final outcome. We demonstrate this phenomenon in Figure 1.
If AUC has to be used, we suggest to discuss the region of interest, and eventually compute the area only at this region. This is even more important if ROC curves are not presented, but only AUCs of the compared algorithms are reported.
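Following this advice, here is a sketch of how such a region-of-interest calculation could look with the objects we already created; we simply restrict the trapezoidal area to FPR ≤ 0.1 (an assumed region of interest chosen for illustration, not one prescribed by ROCR), using the coordinates stored in the perf object.

# Partial area under the ROC curve, over FPR in [0, 0.1] only
fpr <- perf@x.values[[1]]
tpr <- perf@y.values[[1]]
keep <- fpr <= 0.1                    # assumed region of interest
x <- fpr[keep]
y <- tpr[keep]
pAUC <- sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)
pAUC         # area over FPR in [0, 0.1]
pAUC / 0.1   # rescaled so that 1 is the maximum attainable in this region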
Most of the challenges in classification problems are related to class imbalances in the data. We look at this issue in Chapter 39.