Chapter 11 Classification Example
We can conclude this section with a classification example. We will use Adult
dataset. The information on the dataset is given at the Machine Learning Repository at UCI (Kohavi and Becker 1996):
The prediction task is to determine whether a person makes over $50K a year. This question would be similar to the question of whether the person makes less than 50K. However, we need to be careful in defining which class will be positive or negative. Suppose we have \(Y\), 0 and 1, and we define 1 as a positive class:
\[ \begin{array}{ccc}{\text { Predicted vs. Reality}} & {Y=1+} & {Y=0-} \\ {\hat{Y}=1+} & {\text { TP }_{}} & {\text { FP }_{}} \\ {\hat{Y}=0-} & {\text { FN }_{}} & {\text { TN }_{}}\end{array} \] Now suppose we define 1 as a negative class:
\[ \begin{array}{ccc}{\text { Predicted vs. Reality}} & {Y=0+} & {Y=1-} \\ {\hat{Y}=0+} & {\text { TP }_{}} & {\text { FP }_{}} \\ {\hat{Y}=1-} & {\text { FN }_{}} & {\text { TN }_{}}\end{array} \] Of course this is just a notational difference and nothing changes in calculations. But some performance measures, especially, sensitivity (TPR) and fall-out (FPR) will be different.
We are going to use the original train set again to avoid some data cleaning jobs that we mentioned in Chapter 5.
# Download adult income data
# url.train <- "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
# url.names <- "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.names"
# download.file(url.train, destfile = "adult_train.csv")
# download.file(url.names, destfile = "adult_names.txt")
# Read the training set into memory
<- read.csv("adult_train.csv", header = FALSE)
df
<- c("Age",
varNames "WorkClass",
"fnlwgt",
"Education",
"EducationNum",
"MaritalStatus",
"Occupation",
"Relationship",
"Race",
"Sex",
"CapitalGain",
"CapitalLoss",
"HoursPerWeek",
"NativeCountry",
"IncomeLevel")
names(df) <- varNames
<- df data
In each machine learning application, the data preparation stage (i.e. cleaning the data, organizing the columns and rows, checking out the columns’ names, checking the types of each feature, identifying and handling the missing observations, etc) is a very important step and should be dealt with a good care.
First, let’s see if the data balanced or not:
<- table(data$IncomeLevel)
tbl tbl
##
## <=50K >50K
## 24720 7841
2] / tbl[1] tbl[
## >50K
## 0.3171926
There are multiple variables that are chr
in the data.
str(data)
## 'data.frame': 32561 obs. of 15 variables:
## $ Age : int 39 50 38 53 28 37 49 52 31 42 ...
## $ WorkClass : chr " State-gov" " Self-emp-not-inc" " Private" " Private" ...
## $ fnlwgt : int 77516 83311 215646 234721 338409 284582 160187 209642 45781 159449 ...
## $ Education : chr " Bachelors" " Bachelors" " HS-grad" " 11th" ...
## $ EducationNum : int 13 13 9 7 13 14 5 9 14 13 ...
## $ MaritalStatus: chr " Never-married" " Married-civ-spouse" " Divorced" " Married-civ-spouse" ...
## $ Occupation : chr " Adm-clerical" " Exec-managerial" " Handlers-cleaners" " Handlers-cleaners" ...
## $ Relationship : chr " Not-in-family" " Husband" " Not-in-family" " Husband" ...
## $ Race : chr " White" " White" " White" " Black" ...
## $ Sex : chr " Male" " Male" " Male" " Male" ...
## $ CapitalGain : int 2174 0 0 0 0 0 0 0 14084 5178 ...
## $ CapitalLoss : int 0 0 0 0 0 0 0 0 0 0 ...
## $ HoursPerWeek : int 40 13 40 40 40 40 16 45 50 40 ...
## $ NativeCountry: chr " United-States" " United-States" " United-States" " United-States" ...
## $ IncomeLevel : chr " <=50K" " <=50K" " <=50K" " <=50K" ...
table(data$WorkClass)
##
## ? Federal-gov Local-gov Never-worked
## 1836 960 2093 7
## Private Self-emp-inc Self-emp-not-inc State-gov
## 22696 1116 2541 1298
## Without-pay
## 14
table(data$NativeCountry)
##
## ? Cambodia
## 583 19
## Canada China
## 121 75
## Columbia Cuba
## 59 95
## Dominican-Republic Ecuador
## 70 28
## El-Salvador England
## 106 90
## France Germany
## 29 137
## Greece Guatemala
## 29 64
## Haiti Holand-Netherlands
## 44 1
## Honduras Hong
## 13 20
## Hungary India
## 13 100
## Iran Ireland
## 43 24
## Italy Jamaica
## 73 81
## Japan Laos
## 62 18
## Mexico Nicaragua
## 643 34
## Outlying-US(Guam-USVI-etc) Peru
## 14 31
## Philippines Poland
## 198 60
## Portugal Puerto-Rico
## 37 114
## Scotland South
## 12 80
## Taiwan Thailand
## 51 18
## Trinadad&Tobago United-States
## 19 29170
## Vietnam Yugoslavia
## 67 16
We can see that there is only one observation in Holand-Netherlands
. This is a problem because it will be either in the training set or the test set. Therefore, when you estimate without taking care of it, it will give this error:
Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) : factor NativeCountry has new levels Holand-Netherlands
We will see later how to take care of these issues in a loop with several error handling options. But now, let’s drop this observation:
<- which(data$NativeCountry ==" Holand-Netherlands")
ind <- data[-ind, ] data
Although some packages like lm()
and glm()
can use character variables, we should take care of them properly before any type of data analysis. Here is an example:
<- data
df #converting by a loop
for (i in 1:ncol(df)) {
if (is.character(df[, i]))
<- as.factor(df[, i])
df[, i]
}
<- data
df #Converting with `apply()` family
sapply(df, is.character)] <- lapply(df[sapply(df, is.character)],
df[ as.factor)
The job is to use LPM, Logistic, and kNN models to see which one could be a better predictive model for the data. In LPM and Logistic, we do not (yet) have any parameter to tune for a better prediction. Although we could use a degree of polynomials for selected features, we will set aside that option for now. We will later see regularization methods for parametric models, which will make LPM and logistic models “trainable”. In kNN, \(k\) is the hyperparameter to train the model.
There are several key points to keep in mind in this classification practice:
- What performance metric(s) are we going to use for comparing the alternative models?
- How are we going to transform the predicted probabilities to classes (0’s and 1’s) so that we can have the confusion matrix?
Let’s start with LPM first.
11.1 LPM
anyNA(data)
## [1] FALSE
# Our LPM requires
$Y <- ifelse(data$IncomeLevel==" <=50K", 0, 1)
data<- data[, -15] data
Now, we are ready. We will use ROC and AUC for comparing the models.
library(ROCR)
<- c()
AUC = 100 # number of times we loop
t
for (i in 1:t) {
set.seed(i)
<- sample(nrow(data), nrow(data), replace = FALSE)
shuffle <- 5
k <- shuffle[1:(nrow(data) / k)]
testind <- shuffle[-testind]
trainind <- data[trainind, ] #80% of the data
trdf <- data[testind, ] #20% of data set a side
tsdf
#LPM
<- glm(Y ~ ., data = trdf, family = "gaussian")
model1 <- predict(model1, tsdf)
phat < 0] <- 0
phat[phat > 1] <- 1
phat[phat
# ROC & AUC (from ROCR)
<- data.frame(phat, "Y" = tsdf$Y)
phat_df <- prediction(phat_df[, 1], phat_df[, 2])
pred_rocr
<- performance(pred_rocr, measure = "auc")
auc_ROCR <- auc_ROCR@y.values[[1]]
AUC[i]
}
plot(AUC, col = "grey")
abline(a = mean(AUC), b = 0, col = "red")
mean(AUC)
## [1] 0.8936181
sqrt(var(AUC))
## [1] 0.003810335
Let’s see the ROC curve from the last run.
# ROC from the last run by `ROCR`
<- performance(pred_rocr, "tpr", "fpr")
perf plot(perf, colorize = TRUE)
abline(a = 0, b = 1)
# And our "own" ROC
<- phat[order(phat)]
phator < 0] <- 0
phator[phator > 1] <- 1
phator[phator <- unique(phator)
phator
<- c()
TPR <- c()
FPR
for (i in 1:length(phator)) {
<- phat > phator[i]
yHat <- table(yHat, tsdf$Y)
conf_table <- as.matrix(conf_table)
ct if (sum(dim(ct)) > 3) {
#here we ignore the min and max thresholds
<- ct[2, 2] / (ct[2, 2] + ct[1, 2])
TPR[i] <- ct[2, 1] / (ct[1, 1] + ct[2, 1])
FPR[i]
}
}
# Flat and vertical sections are omitted
plot(FPR,
TPR,col = "blue",
type = "l",
main = "ROC")
abline(a = 0, b = 1, col = "red")
What’s the confusion table at the “best” discriminating threshold? The answer is the one where the difference between TPR and FPR is maximized: Youden’s J Statistics. Note that this answers would be different if we have different weights in TPR and FPR. We may also have different targets, maximum FPR, for example.
# Youden's J Statistics
<- TPR - FPR
J
# The best discriminating threshold
<- phator[which.max(J)]
opt_th opt_th
## [1] 0.318723
#TPR and FPR at this threshold
which.max(J)] TPR[
## [1] 0.8494898
which.max(J)] FPR[
## [1] 0.2024676
which.max(J)] J[
## [1] 0.6470222
And the confusion table (from the last run):
<- phat > opt_th
yHat <- table(yHat, tsdf$Y)
conf_table
# Function to rotate the table
<- function(x){
rot <- apply(x, 2, rev)
t <- apply(t, 1, rev)
tt return(t(tt))
}
# Better looking table
<- rot(conf_table)
ct rownames(ct) <- c("Yhat = 1", "Yhat = 0")
colnames(ct) <- c("Y = 1", "Y = 0")
ct
##
## yHat Y = 1 Y = 0
## Yhat = 1 1332 1001
## Yhat = 0 236 3943
Note that the optimal threshold is almost the ratio of cases in the data around 31%. We will come back to this issue later.
11.2 Logistic Regression
library(ROCR)
<- c()
AUC = 100
t
for (i in 1:t) {
set.seed(i)
<- sample(nrow(data), nrow(data), replace = FALSE)
shuffle <- 5
k <- shuffle[1:(nrow(data) / k)]
testind <- shuffle[-testind]
trainind <- data[trainind,] #80% of the data
trdf <- data[testind,] #20% of data set a side
tsdf
#Logistic
<- glm(Y ~ ., data = trdf, family = "binomial")
model2 #Note "response". It predicts phat. Another option is "class"
#which predicts yhat by using 0.5
<- predict(model2, tsdf, type = "response")
phat < 0] <- 0
phat[phat > 1] <- 1
phat[phat
# ROC & AUC (from ROCR)
<- data.frame(phat, "Y" = tsdf$Y)
phat_df <- prediction(phat_df[, 1], phat_df[, 2])
pred_rocr
<- performance(pred_rocr, measure = "auc")
auc_ROCR <- auc_ROCR@y.values[[1]]
AUC[i]
}
plot(AUC, col = "grey")
abline(a = mean(AUC), b = 0, col = "red")
mean(AUC)
## [1] 0.908179
sqrt(var(AUC))
## [1] 0.003593404
Both LPM and Logistic methods are linear classifiers. We can add polynomials and interactions manually to capture possible nonlinearities in the data but that would be an impossible job as the number of features would grow exponentially. This brings us to a nonparametric classifier, kNN.
11.3 kNN
We will train kNN with the choice of \(k\) and use AUC as our performance criteria in choosing \(k\).
11.3.1 kNN 10-fold CV
There are several packages in R for kNN applications: knn()
from the class
package and knn3()
in the caret
package. We will use knn3()
in the caret package. Since kNN use distances, we should scale the numerical variables first to make their magnitudes on the same scale.
rm(list = ls())
<- read.csv("adult_train.csv", header = FALSE)
data
<- c("Age",
varNames "WorkClass",
"fnlwgt",
"Education",
"EducationNum",
"MaritalStatus",
"Occupation",
"Relationship",
"Race",
"Sex",
"CapitalGain",
"CapitalLoss",
"HoursPerWeek",
"NativeCountry",
"IncomeLevel")
names(data) <- varNames
<- data
df
# Dropping single observation
<- which(df$NativeCountry==" Holand-Netherlands")
ind <- df[-ind, ]
df
#Scaling the numerical variables
for(i in 1:ncol(df))
if(is.integer(df[,i])) df[,i] <- scale(df[,i])
#Converting the character variables to factor
sapply(df, is.character)] <- lapply(df[sapply(df, is.character)],
df[
as.factor)str(df)
## 'data.frame': 32560 obs. of 15 variables:
## $ Age : num [1:32560, 1] 0.0307 0.8371 -0.0427 1.057 -0.7758 ...
## ..- attr(*, "scaled:center")= num 38.6
## ..- attr(*, "scaled:scale")= num 13.6
## $ WorkClass : Factor w/ 9 levels " ?"," Federal-gov",..: 8 7 5 5 5 5 5 7 5 5 ...
## $ fnlwgt : num [1:32560, 1] -1.064 -1.009 0.245 0.426 1.408 ...
## ..- attr(*, "scaled:center")= num 189783
## ..- attr(*, "scaled:scale")= num 105548
## $ Education : Factor w/ 16 levels " 10th"," 11th",..: 10 10 12 2 10 13 7 12 13 10 ...
## $ EducationNum : num [1:32560, 1] 1.13 1.13 -0.42 -1.2 1.13 ...
## ..- attr(*, "scaled:center")= num 10.1
## ..- attr(*, "scaled:scale")= num 2.57
## $ MaritalStatus: Factor w/ 7 levels " Divorced"," Married-AF-spouse",..: 5 3 1 3 3 3 4 3 5 3 ...
## $ Occupation : Factor w/ 15 levels " ?"," Adm-clerical",..: 2 5 7 7 11 5 9 5 11 5 ...
## $ Relationship : Factor w/ 6 levels " Husband"," Not-in-family",..: 2 1 2 1 6 6 2 1 2 1 ...
## $ Race : Factor w/ 5 levels " Amer-Indian-Eskimo",..: 5 5 5 3 3 5 3 5 5 5 ...
## $ Sex : Factor w/ 2 levels " Female"," Male": 2 2 2 2 1 1 1 2 1 2 ...
## $ CapitalGain : num [1:32560, 1] 0.148 -0.146 -0.146 -0.146 -0.146 ...
## ..- attr(*, "scaled:center")= num 1078
## ..- attr(*, "scaled:scale")= num 7385
## $ CapitalLoss : num [1:32560, 1] -0.217 -0.217 -0.217 -0.217 -0.217 ...
## ..- attr(*, "scaled:center")= num 87.2
## ..- attr(*, "scaled:scale")= num 403
## $ HoursPerWeek : num [1:32560, 1] -0.0354 -2.2221 -0.0354 -0.0354 -0.0354 ...
## ..- attr(*, "scaled:center")= num 40.4
## ..- attr(*, "scaled:scale")= num 12.3
## $ NativeCountry: Factor w/ 41 levels " ?"," Cambodia",..: 39 39 39 39 6 39 23 39 39 39 ...
## $ IncomeLevel : Factor w/ 2 levels " <=50K"," >50K": 1 1 1 1 1 1 1 2 2 2 ...
Now we are ready. Here is our kNN training:
library(caret)
library(ROCR)
set.seed(123)
<- sample(nrow(df), nrow(df), replace = FALSE)
sh <- 10
h
<- sh[1:(nrow(df) / h)]
ind_test <- sh[-ind_test]
ind_train
# Put 10% a side as a test set
<- df[ind_train, ]
trdf <- df[ind_test, ]
tsdf
# h - fold CV
<- floor(nrow(trdf) / h)
nval <- seq(from = 3, to = 50, by = 2)
k
<- c()
AUC <- c()
MAUC2 <- c()
k_opt
for (i in 1:h) {
if (i < h) {
<- c(((i - 1) * nval + 1):(i * nval))
ind_val else{
} <- c(((i - 1) * nval + 1):length(ind))
ind_val
}<- c(1:nrow(trdf))[-ind_val]
ind_train
<- trdf[ind_train, ]
df_train <- trdf[ind_val, ]
df_val
for (s in 1:length(k)) {
<- knn3(IncomeLevel ~ ., data = df_train, k = k[s])
model <- predict(model, df_val, type = "prob")
phat
#AUC
<- prediction(phat[, 2], df_val$IncomeLevel)
pred_rocr <- performance(pred_rocr, measure = "auc")
auc_ROCR <- auc_ROCR@y.values[[1]]
AUC[s]
}<- AUC[which.max(AUC)]
MAUC2[i] <- k[which.max(AUC)]
k_opt[i] }
Note that kNN would best fit on data sets with true numeric variables. Now we can find the tuned kNN (i.e.the best “k”) and apply the trained kNN for prediction using the test data we split at the beginning
cbind(k_opt, MAUC2)
## k_opt MAUC2
## [1,] 49 0.9020390
## [2,] 37 0.9015282
## [3,] 27 0.8911303
## [4,] 45 0.8967005
## [5,] 47 0.9035859
## [6,] 21 0.9004941
## [7,] 33 0.8937860
## [8,] 37 0.8985006
## [9,] 43 0.8918030
## [10,] 39 0.8862083
mean(k_opt)
## [1] 37.8
mean(MAUC2)
## [1] 0.8965776
We can compare kNN with LPM (and Logistic) by AUC (not the one given above!) but “k” is not stable. Although, we can go with the mean of “k” or the mode of “k”, we can address this problem by changing the order of loops and using bootstrapping in our training instead of 10-fold CV, which would also increase the number or loops hence the running time.
Before jumping into this possible solution, we need to think about what we have done so far. We trained our kNN. That is, we got the value of our hyperparameter. We should use our tuned kNN to test it on the test data that we put aside at the beginning. The proper way to that, however, is to have several loops, instead of one like what we did here, and calculate the test AUC for comparison, which is similar to what we did in LPM and Logistic before. We will not do it here as the running time would be very long, which, by the way, shows the importance of having fast “machines” as well as efficient algorithms.
A more stable, but much longer, suggestion for tuning our kNN application is using a bootstrapping method. It runs multiple loops and takes the average of AUC with the same “k”. The example below is restricted to 20 runs for each “k”. Note that bootstrapping (See Chapter 37.5) is a process of resampling with replacement (all values in the sample have an equal probability of being selected, including multiple times, so a value could have duplicates).
#### Test/Train split - as before!########
set.seed(123)
<- sample(nrow(df), nrow(df), replace = FALSE)
sh <- 10 # should be set to 100 or more
h
<- sh[1:(nrow(df) / h)]
ind_test <- sh[-ind_test]
ind_train
# Put 10% a side as a test set
<- df[ind_train,]
trdf <- df[ind_test,]
tsdf
########## Bootstrapping ############
# We use `by=2` to reduce the running time
# With a faster machine, that could be set to 1.
<- seq(from = 3, to = 50, by = 2)
k <- 20 # number of bootstrap loops (could be higher too, like 50)
m
<- c()
MAUC <- c()
k_opt
for (i in 1:length(k)) {
<- c()
AUC for (l in 1:m) {
#Here is the heart of bootstrapped tuning
set.seed(l)
<- sample(nrow(trdf), nrow(trdf), replace = TRUE)
bind <- unique(bind)
uind <- df[uind,]
df_train <- df[-uind,]
df_val
<- knn3(IncomeLevel ~ ., data = df_train, k = k[i])
model <- predict(model, df_val, type = "prob")
phat
#AUC
<- prediction(phat[, 2], df_val$IncomeLevel)
pred_rocr <- performance(pred_rocr, measure = "auc")
auc_ROCR <- auc_ROCR@y.values[[1]]
AUC[l]
}<- mean(AUC)
MAUC[i] }
OK … now finding the optimal “k”
plot(k, MAUC, col = "red", type = "o")
which.max(MAUC)] MAUC[
## [1] 0.895667
which.max(MAUC)] k[
## [1] 49
This algorithm can be more efficient with parallel processing using multicore loop applications, which we will see In Chapter 14 (14.4.2). The other way to reduce the running time is to make the increments in the grid (for “k”) larger, like 10, and then find the region where AUC is highest. Then, we can have a finer grid for that specific region to identify the best “k”.
Before concluding this chapter, note that knn3()
handles factor variables itself. This is an internal process, a good one. Remember, knn()
could not do that and requires all features to be numeric. How could we do that? One way to handle it is to convert all factor variables to dummy (binary numerical) codes as shown below. This is also called as “one-hot encoding” in practice. This type of knowledge, what type of data handling is required by a package and how we can achieve it, is very important in data analytics.
<- df[,-15]
dftmp
<- which(sapply(dftmp, is.factor)==TRUE)
ind <- dftmp[,ind]
fctdf <- dftmp[, -ind]
numdf
#dummy coding
<- model.matrix(~. - 1, data = fctdf)
fctdum
#Binding
<- cbind(Y = df$IncomeLevel, numdf, fctdum) df_dum
Now, it can also be used with knn()
from the class
package. Note that kNN gets unstable as the number of variables increases. We can see it by calculating test AUC multiple times by adding an outer loop to our algorithm.
11.3.2 kNN with caret
# kNN needs a proper levels with caret!
levels(df$IncomeLevel)[levels(df$IncomeLevel)==" <=50K"] <- "Less"
levels(df$IncomeLevel)[levels(df$IncomeLevel)==" >50K"] <- "More"
levels(df$IncomeLevel)
## [1] "Less" "More"
#### Test/Train split ########
set.seed(123)
<- sample(nrow(df), nrow(df), replace = FALSE)
sh <- 10
h
<- sh[1:(nrow(df)/h)]
ind_test <- sh[-ind_test]
ind_train
<- df[ind_train, ]
trdf <- df[ind_test, ]
tsdf
########## CARET SET-UP ##################
# Here we use class probabilities, which is required for ROC training
#`twoClassSummary` will compute the sensitivity, specificity, AUC, ROC
<- trainControl(method = "cv", number = 10, p = 0.9, classProbs = TRUE,
cv summaryFunction = twoClassSummary)
#The main training process
set.seed(5) # for the same results, no need otherwise
<- train(IncomeLevel ~ ., method = "knn", data = trdf,
model_knn3 tuneGrid = data.frame(k=seq(3, 50, 2)),
trControl = cv,
metric = "ROC") #Here is the key difference.
#we are asking caret to use ROC
#as our main performance criteria
#Optimal k
ggplot(model_knn3, highlight = TRUE)
model_knn3
## k-Nearest Neighbors
##
## 29304 samples
## 14 predictor
## 2 classes: 'Less', 'More'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 26374, 26373, 26374, 26374, 26373, 26374, ...
## Resampling results across tuning parameters:
##
## k ROC Sens Spec
## 3 0.8262230 0.8941795 0.5961648
## 5 0.8586528 0.9031179 0.6005682
## 7 0.8724161 0.9090915 0.6086648
## 9 0.8800527 0.9111126 0.6088068
## 11 0.8848148 0.9129091 0.6079545
## 13 0.8884125 0.9150201 0.6035511
## 15 0.8904958 0.9174006 0.6041193
## 17 0.8915694 0.9167720 0.6063920
## 19 0.8923858 0.9171763 0.6036932
## 21 0.8936219 0.9179848 0.6035511
## 23 0.8940702 0.9159186 0.6042614
## 25 0.8947602 0.9174457 0.6039773
## 27 0.8952041 0.9176254 0.6026989
## 29 0.8955018 0.9179398 0.6019886
## 31 0.8956911 0.9180746 0.6017045
## 33 0.8959661 0.9187034 0.6029830
## 35 0.8960988 0.9179398 0.5998580
## 37 0.8963903 0.9182991 0.5994318
## 39 0.8968082 0.9191523 0.5977273
## 41 0.8967777 0.9192421 0.5977273
## 43 0.8968486 0.9204999 0.5977273
## 45 0.8970198 0.9202755 0.5944602
## 47 0.8972242 0.9205450 0.5944602
## 49 0.8971898 0.9208593 0.5923295
##
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was k = 47.
$results model_knn3
## k ROC Sens Spec ROCSD SensSD SpecSD
## 1 3 0.8262230 0.8941795 0.5961648 0.004815409 0.008727063 0.012670476
## 2 5 0.8586528 0.9031179 0.6005682 0.005448823 0.007349595 0.011200694
## 3 7 0.8724161 0.9090915 0.6086648 0.005166892 0.006160801 0.015110770
## 4 9 0.8800527 0.9111126 0.6088068 0.004782100 0.006779982 0.015820347
## 5 11 0.8848148 0.9129091 0.6079545 0.005121326 0.006691569 0.009772617
## 6 13 0.8884125 0.9150201 0.6035511 0.004814653 0.007291597 0.006261826
## 7 15 0.8904958 0.9174006 0.6041193 0.004443550 0.006821105 0.011003812
## 8 17 0.8915694 0.9167720 0.6063920 0.004336396 0.006641748 0.009964578
## 9 19 0.8923858 0.9171763 0.6036932 0.004357410 0.007690924 0.009156761
## 10 21 0.8936219 0.9179848 0.6035511 0.004689076 0.007526214 0.009644457
## 11 23 0.8940702 0.9159186 0.6042614 0.004753603 0.007840512 0.008710062
## 12 25 0.8947602 0.9174457 0.6039773 0.004637773 0.007644920 0.009151863
## 13 27 0.8952041 0.9176254 0.6026989 0.004438855 0.007110203 0.009279578
## 14 29 0.8955018 0.9179398 0.6019886 0.004414619 0.006857247 0.007080142
## 15 31 0.8956911 0.9180746 0.6017045 0.004228545 0.007160469 0.006567629
## 16 33 0.8959661 0.9187034 0.6029830 0.004194696 0.007855452 0.007342833
## 17 35 0.8960988 0.9179398 0.5998580 0.004149906 0.007520967 0.008654547
## 18 37 0.8963903 0.9182991 0.5994318 0.004319967 0.007271261 0.007426320
## 19 39 0.8968082 0.9191523 0.5977273 0.004422126 0.007898694 0.007080142
## 20 41 0.8967777 0.9192421 0.5977273 0.004740533 0.007711601 0.007745494
## 21 43 0.8968486 0.9204999 0.5977273 0.004691945 0.007227390 0.007420280
## 22 45 0.8970198 0.9202755 0.5944602 0.004919464 0.007125413 0.008207829
## 23 47 0.8972242 0.9205450 0.5944602 0.004863486 0.007306936 0.008180470
## 24 49 0.8971898 0.9208593 0.5923295 0.004929357 0.007144172 0.007335196
Confusion matrix:
# Performance metrics
confusionMatrix(predict(model_knn3, tsdf, type = "raw"),
$IncomeLevel) tsdf
## Confusion Matrix and Statistics
##
## Reference
## Prediction Less More
## Less 2303 300
## More 179 474
##
## Accuracy : 0.8529
## 95% CI : (0.8403, 0.8649)
## No Information Rate : 0.7623
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.571
##
## Mcnemar's Test P-Value : 4.183e-08
##
## Sensitivity : 0.9279
## Specificity : 0.6124
## Pos Pred Value : 0.8847
## Neg Pred Value : 0.7259
## Prevalence : 0.7623
## Detection Rate : 0.7073
## Detection Prevalence : 0.7994
## Balanced Accuracy : 0.7701
##
## 'Positive' Class : Less
##
# If we don't specify "More" as our positive class, the first level
# "Less" will be "positive".
confusionMatrix(predict(model_knn3, tsdf, type = "raw"),
$IncomeLevel, positive = "More") tsdf
## Confusion Matrix and Statistics
##
## Reference
## Prediction Less More
## Less 2303 300
## More 179 474
##
## Accuracy : 0.8529
## 95% CI : (0.8403, 0.8649)
## No Information Rate : 0.7623
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.571
##
## Mcnemar's Test P-Value : 4.183e-08
##
## Sensitivity : 0.6124
## Specificity : 0.9279
## Pos Pred Value : 0.7259
## Neg Pred Value : 0.8847
## Prevalence : 0.2377
## Detection Rate : 0.1456
## Detection Prevalence : 0.2006
## Balanced Accuracy : 0.7701
##
## 'Positive' Class : More
##
We now know two things: (1) how good the prediction is with kNN; (2) how good it is relative to other “base” or “benchmark” models. These two questions must be answered every time to evaluate the prediction performance of a machine learning algorithm. Although we didn’t calculate the test AUC in our own kNN algorithm, we can accept that kNN performance is good with AUC that is close to 90%. However, it is not significantly better than LPM and Logistic