Chapter 7 Smoothing
We can define \(Y_i\) with the following model:
\[ Y_{i}=f\left(x_{i}\right)+\varepsilon_{i} \]
We do not want to (and cannot) predict \(Y_i\) exactly, since we do not know the random part, the "noise" \(\varepsilon_i\). If we predict \(f(x)\) well, however, we obtain a good approximation of \(Y_i\). Hence we need a good estimate of \(f(x)\). Nonparametric estimation can be used to recover \(f(x)\); this is called "smoothing" when the outcome variable is quantitative. In general, the purpose of smoothing is two-fold: building a forecasting model and learning the shape of the trend embedded in the data (\(Y\)). You can think of smoothing as a process that reduces the effect of noise in the data.
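To make this concrete, here is a small simulation of the model above; the function \(f\) and the noise level are arbitrary choices made only for illustration:
set.seed(1)
x <- seq(0, 1, length.out = 100)
f <- sin(2 * pi * x)               # the unknown "true" function
y <- f + rnorm(100, sd = 0.3)      # observed outcome = signal + noise
plot(x, y)
lines(x, f, lwd = 2, col = "red")  # smoothing tries to recover this curve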
We will use several datasets from the MASS package. The mcycle dataset contains \(n=133\) pairs of time points (in ms) and observed head accelerations (in g) that were recorded in a simulated motorcycle accident. We will apply several smoothing methods to explore the relationship between time and acceleration. First, let's visualize the relationship between time \((X)\) and acceleration \((Y)\) and see if we can assume that \(f(x_i)\) is a linear function of time:
library(tidyverse)
library(MASS)
data(mcycle)
head(mcycle)
## times accel
## 1 2.4 0.0
## 2 2.6 -1.3
## 3 3.2 -2.7
## 4 3.6 0.0
## 5 4.0 -2.7
## 6 6.2 -2.7
plot(mcycle$times, mcycle$accel,
cex.axis = 0.75, cex.main = 0.8)
# Let's use linear regression
lines(mcycle$times, predict(lm(accel ~ times, mcycle)), lwd = 2, col = "red")
The line does not appear to describe the trend very well. Let’s try an alternative model:
7.1 Using bins
As we have seen before, the main idea is to group data points into equal-size bins in which the value of \(f(x)\) can be assumed to be constant. This assumption is realistic because \(f(x)\) changes little within a small window of time. After choosing a window size (say, 10 ms), we identify the observations (\(Y_i\)) that fall in the 10-ms window centered around each point \(x\), that is, the \(x_i\) satisfying the following condition:
\[ \left(x-\frac{10}{2}<x_{i}<x+\frac{10}{2}\right) \] Applying a condition such as this at each point creates a moving 10-ms window: the window is established by adding half of 10 ms to the center point to get the forward half and subtracting it to get the backward half. Once we have identified the observations in a window, we estimate \(f(x)\) as the average of the \(Y_i\) values in that window. If we define \(A_0\) as the set of indexes in the window centered at \(x_0\) and \(N_0\) as the number of observations in that window, the estimate can be expressed as
\[\begin{equation} \hat{f}\left(x_{0}\right)=\frac{1}{N_{0}} \sum_{i \in A_{0}} Y_{i}, \tag{7.1} \end{equation}\]
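Before turning to a built-in function, a direct translation of Equation (7.1) into R might look like the sketch below; bin_smoother is a name introduced here purely for illustration, not the internal implementation of any R function.
bin_smoother <- function(x, y, h = 10) {
  sapply(x, function(x0) {
    inside <- x > x0 - h / 2 & x < x0 + h / 2  # the index set A_0
    mean(y[inside])                            # average of Y_i for i in A_0
  })
}
fhat <- bin_smoother(mcycle$times, mcycle$accel, h = 10)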
This gives us an estimate of the underlying curve. The ksmooth() function implements the same idea for us; here is its application to our data:
# With ksmooth(); pay attention to kernel = "box"
fit1 <- with(mcycle, ksmooth(times, accel, kernel = "box", bandwidth = 7))
fit2 <- with(mcycle, ksmooth(times, accel, kernel = "box", bandwidth = 10))
fit3 <- with(mcycle, ksmooth(times, accel, kernel = "box", bandwidth = 21))
plot(mcycle$times, mcycle$accel,
xlab = "Time (ms)", ylab = "Acceleration (g)")
lines(mcycle$times, fit1$y, lwd = 2, col = "blue")
lines(mcycle$times, fit2$y, lwd = 2, col = "red")
lines(mcycle$times, fit3$y, lwd = 2, col = "green")
As you can see, the fitted lines are quite wiggly, even when we use a wider bandwidth.
7.2 Kernel smoothing
We can take care of this by taking weighted averages that give points near the center of the window more weight than points far away. With a normal (Gaussian) kernel, the weights decay smoothly with distance, which produces a smoother estimate.
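Conceptually, this replaces the plain average in Equation (7.1) with a kernel-weighted average. A minimal sketch with Gaussian weights follows; kernel_smoother is an illustrative name, and note that ksmooth() scales its bandwidth argument differently, so the two will not match exactly.
kernel_smoother <- function(x, y, h) {
  sapply(x, function(x0) {
    w <- dnorm(x - x0, sd = h)  # weights decay smoothly with distance from x0
    sum(w * y) / sum(w)         # weighted average of the Y_i
  })
}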
# With ksmooth(); this time the kernel is "normal"
fit1 <- with(mcycle, ksmooth(times, accel, kernel = "normal", bandwidth = 7))
fit2 <- with(mcycle, ksmooth(times, accel, kernel = "normal", bandwidth = 10))
fit3 <- with(mcycle, ksmooth(times, accel, kernel = "normal", bandwidth = 21))
plot(mcycle$times, mcycle$accel,
xlab = "Time (ms)", ylab = "Acceleration (g)")
lines(mcycle$times, fit1$y, lwd = 2, col = "blue")
lines(mcycle$times, fit2$y, lwd = 2, col = "red")
lines(mcycle$times, fit3$y, lwd = 2, col = "green")
Now the lines look smoother. There are several functions in R that implement bin smoothers; one example is ksmooth(), shown above. In practice, however, we typically prefer methods that are slightly more complex than fitting a constant within each window. Methods such as loess(), which we explain next, improve on this.
7.3 Locally weighted regression loess()
A limitation of the bin smoother approach used by ksmooth() is that we need small windows for the approximately-constant assumption to hold. By fitting a line (or a parabola) locally instead of a constant, loess() permits us to consider larger window sizes.
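To see what "locally weighted regression" means, the sketch below fits a weighted linear regression at a single target point using tricube weights; loess() repeats this at every point. The window definition here is simplified for illustration: loess() actually selects the span fraction of nearest points rather than a fixed half-width.
x0 <- 20                                              # a target point (ms)
h  <- 0.25 * diff(range(mcycle$times))                # illustrative window half-width
w  <- pmax(1 - (abs(mcycle$times - x0) / h)^3, 0)^3   # tricube weights
local_fit <- lm(accel ~ times, data = mcycle, weights = w)
predict(local_fit, newdata = data.frame(times = x0))  # f-hat at x0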
# With loess()
fit1 <- loess(accel ~ times, degree = 1, span = 0.1, data = mcycle)
fit2 <- loess(accel ~ times, degree = 1, span = 0.9, data = mcycle)
summary(fit1)
## Call:
## loess(formula = accel ~ times, data = mcycle, span = 0.1, degree = 1)
##
## Number of Observations: 133
## Equivalent Number of Parameters: 18.57
## Residual Standard Error: 22.93
## Trace of smoother matrix: 22.01 (exact)
##
## Control settings:
## span : 0.1
## degree : 1
## family : gaussian
## surface : interpolate cell = 0.2
## normalize: TRUE
## parametric: FALSE
## drop.square: FALSE
plot(mcycle$times, mcycle$accel,
xlab = "Time (ms)", ylab = "Acceleration (g)")
lines(mcycle$times, fit1$fitted, lwd = 2, col = "blue")
lines(mcycle$times, fit2$fitted, lwd = 2, col = "red")
It seems the blue line (span = 0.1) describes the trend better than the red line (span = 0.9). We can make our windows even larger by fitting parabolas instead of lines:
fit1 <- loess(accel ~ times, degree = 1, span = 0.1, data = mcycle)
fit2 <- loess(accel ~ times, degree = 2, span = 0.1, data = mcycle)
plot(mcycle$times, mcycle$accel,
xlab = "Time (ms)", ylab = "Acceleration (g)")
lines(mcycle$times, fit1$fitted, lwd = 2, col = "blue")
lines(mcycle$times, fit2$fitted, lwd = 2, col = "green")
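Because the fitted values above are evaluated only at the observed time points, the lines can look jagged where observations are sparse. A common trick is to predict on a fine grid instead (a usage sketch, reusing fit2 from above):
grid <- seq(min(mcycle$times), max(mcycle$times), length.out = 200)
lines(grid, predict(fit2, newdata = data.frame(times = grid)),
      lwd = 2, lty = 2)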
7.4 Smooth Spline Regression
We can also use the ss() function from the npreg package for automated smoothing spline regression:
library(npreg)
mod.ss <- with(mcycle, ss(times, accel))
mod.ss
##
## Call:
## ss(x = times, y = accel)
##
## Smoothing Parameter spar = 0.1585867 lambda = 8.337283e-07
## Equivalent Degrees of Freedom (Df) 12.20781
## Penalized Criterion (RSS) 62034.66
## Generalized Cross-Validation (GCV) 565.4684
summary(mod.ss)
##
## Call:
## ss(x = times, y = accel)
##
## Residuals:
## Min 1Q Median 3Q Max
## -76.7951 -12.5654 -0.8346 12.5823 50.5576
##
## Approx. Signif. of Parametric Effects:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -14.234 2.313 -6.154 1.018e-08 ***
## x 9.549 21.603 0.442 6.593e-01
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approx. Signif. of Nonparametric Effects:
## Df Sum Sq Mean Sq F value Pr(>F)
## s(x) 10.21 210130 20585.3 40.08 0 ***
## Residuals 120.79 62035 513.6
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.66 on 120.8 degrees of freedom
## Multiple R-squared: 0.7999, Adjusted R-squared: 0.7801
## F-statistic: 41.21 on 11.21 and 120.8 DF, p-value: <2e-16
plot(mod.ss, xlab = "Time (ms)", ylab = "Acceleration (g)", col = "orange")
rug(mcycle$times) # add rug to plot for the precise location of each point
The gray shaded area denotes a 95% Bayesian “confidence interval” for the unknown function.
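If we want to overlay the spline fit on the raw scatter plot ourselves, the ss object appears to store the fitted curve in its x and y components, much like smooth.spline; a quick sketch under that assumption:
plot(mcycle$times, mcycle$accel,
     xlab = "Time (ms)", ylab = "Acceleration (g)")
lines(mod.ss$x, mod.ss$y, lwd = 2, col = "orange")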
7.5 Multivariate Loess
Loess can handle up to four predictor variables, although when there are more than two it is usually advisable to use an additive model (GAM or MARS) instead. Let's try loess() with three predictors. The economics dataset was produced from US economic time series data available from the St. Louis Federal Reserve. It is a data frame with 574 rows and 6 variables: date; psavert, personal savings rate; pce, personal consumption expenditures, in billions of dollars; unemploy, number of unemployed, in thousands; uempmed, median duration of unemployment, in weeks; and pop, total population, in thousands. Although we have a time variable, date, let's create an index for time.
data(economics, package="ggplot2")
str(economics)
## spc_tbl_ [574 × 6] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ date : Date[1:574], format: "1967-07-01" "1967-08-01" ...
## $ pce : num [1:574] 507 510 516 512 517 ...
## $ pop : num [1:574] 198712 198911 199113 199311 199498 ...
## $ psavert : num [1:574] 12.6 12.6 11.9 12.9 12.8 11.8 11.7 12.3 11.7 12.3 ...
## $ uempmed : num [1:574] 4.5 4.7 4.6 4.9 4.7 4.8 5.1 4.5 4.1 4.6 ...
## $ unemploy: num [1:574] 2944 2945 2958 3143 3066 ...
economics$index <- 1:nrow(economics)
fit1 <- loess(uempmed ~ index, data = economics, span = 0.5)  # 50% smoothing span
plot(economics$index, economics$uempmed,
cex.axis = 0.75, cex.main = 0.8)
lines(economics$index, fit1$fitted, lwd = 2, col = "red")
# Now more predictors
fit2 <- loess(uempmed ~ pce + psavert + pop,
              data = economics, span = 1)
RRSS_2 <- sqrt(sum((fit2$residuals)^2))
RRSS_2
## [1] 121.3815
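To gauge how the span affects this multivariate fit, we can refit with a smaller span and compare the same criterion; fit3 and RRSS_3 are illustrative names introduced here.
fit3 <- loess(uempmed ~ pce + psavert + pop,
              data = economics, span = 0.5)  # smaller span, more local fit
RRSS_3 <- sqrt(sum((fit3$residuals)^2))
RRSS_3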