Chapter 42 Algorithmic Optimization
Here is a definition of algorithmic optimization from Wikipedia:
An optimization algorithm is a procedure which is executed iteratively by comparing various solutions until an optimum or a satisfactory solution is found. Optimization algorithms help us to minimize or maximize an objective function \(F(x)\) with respect to the internal parameters of a model mapping a set of predictors (\(X\)) to target values (\(Y\)). There are three types of optimization algorithms which are widely used: zero-order algorithms, first-order optimization algorithms, and second-order optimization algorithms. Zero-order (or derivative-free) algorithms use only the criterion value at some positions. They are popular when the gradient and Hessian information are difficult to obtain, e.g., when no explicit function forms are given. First-order optimization algorithms minimize or maximize a loss function \(F(x)\) using its gradient values with respect to the parameters. The most widely used first-order optimization algorithm is gradient descent. The first-order derivative tells us whether the function is decreasing or increasing at a particular point.
In this chapter, we will review some important concepts in algorithmic optimization.
42.1 Brute-force optimization
Let’s look at a simplified example about an optimal retirement plan and solve it with a zero-order algorithm.
Suppose that there are two groups of workers who are planning for their retirement at the age of 65. Both consider spending 40,000 dollars each year for the rest of their lives after retirement. On average, people in both groups expect to live 20 more years after retirement, with some uncertainty. The people in the first group (A) have the following risk profile: an 85% chance to live 20 years and a 15% chance to live 30 years. The risk profile for the people in the second group (B) is: 99% for 20 years and 1% for 30 years. Suppose that, in each group, the utility (objective) function is \(U=C^{0.5}\).
What’s the maximum premium (lump-sum payment) that a person in each group would be willing to pay for a lifetime annuity of 40K?
Without a pension plan, people in each group have the following utilities:
#For people in group A
U_A = 0.85*sqrt(40000*20) + 0.15*sqrt(10*0)
U_A
## [1] 760.2631

#For people in group B
U_B = 0.99*sqrt(40000*20) + 0.01*sqrt(10*0)
U_B
## [1] 885.4829
For example, they would not pay 200,000 dollars to cover their retirement, because that would make them worse off than their current situation (without a pension plan).
#For people in group A
U_A = 0.85*sqrt(40000*20 - 200000) + 0.15*sqrt(40000*10 - 200000)
U_A
## [1] 725.4892

#For people in group B
U_B = 0.99*sqrt(40000*20 - 200000) + 0.01*sqrt(40000*10 - 200000)
U_B
## [1] 771.3228
Hence, the payment they would be willing to make for the reduction in uncertainty during their retirement should not make them worse off. Or, more technically, their utility should not fall below its current level. Therefore Pmax, the maximum premium that a person would be willing to pay, can be found as the root of the following cost function (the point where it equals zero) for people, for example, in Group A:
\[ \begin{aligned} F(Pmax) = \; & pr \sqrt{\$ 40000 \times 20\, years - Pmax} \\ & + (1-pr) \sqrt{\$ 40000 \times 10\, years - Pmax} - pr \sqrt{\$ 40000 \times 20\, years} \end{aligned} \]
Here is the iteration to solve for Pmax for people in Group A. We created a cost function, costf, that we try to drive to zero. Change the parameters to play with it. The same algorithm can be used to find Pmax for people in Group B.
pr <- 0.85
w1 <- 800000
w2 <- 400000

converged = F
iterations = 0
maxiter <- 600000
learnrate <- 0.5
Pmax <- 10000

while(converged == FALSE){
  costf <- pr*sqrt(w1 - Pmax) + (1-pr)*sqrt(w2 - Pmax) - pr*sqrt(w1)
  if(costf > 0){
    Pmax <- Pmax + learnrate
    iterations = iterations + 1
    #cat("Iterations: ", iterations, "\r")
    if(iterations > maxiter) {
      print("It cannot converge before finding the optimal Pmax")
      break
    }
    converged = FALSE
  } else {
    converged = TRUE
    print(paste("Maximum Premium:", Pmax, "achieved with", iterations, "iterations"))
  }
}
## [1] "Maximum Premium: 150043 achieved with 280086 iterations"
#let's verify it with uniroot(), which finds the root of f(x) = 0
library(stats)
costf <- function(x){0.85*sqrt(800000 - x) + (1-0.85)*sqrt(400000 - x) - 0.85*sqrt(800000)}
paste("Uniroot, i.e. f(x) = 0, is ", uniroot(costf, c(10000, 200000))$root)
## [1] "Uniroot, i.e. f(x) = 0, is 150042.524874307"
There are better functions we could use for this purpose, but this example works well for our experiment. There are a couple of important parameters in our algorithm. The first one is the starting value of Pmax, which is set manually. If the starting value is too low, the iteration may fail to converge within the maximum number of iterations; if it is too high, the square roots become undefined and the code throws an error. Another issue is that our iteration does not know whether the step should move up or down when the starting value is too low or too high. This can be handled with additional lines of code, but we will not address it here beyond the one-line sketch below. This leads us to the learning rate: the incremental change in the value of the parameter at each step. Ideally, this parameter should be conditioned on the value of the cost function; if the cost function for a given Pmax is negative, for example, the step should be negative. Finally, the maximum number of iterations must be set properly; otherwise, the algorithm may not converge or may take too long to converge. In the next section, we will address these issues with a smarter algorithm.
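For instance, a one-line variant (a sketch, not part of the loop above) would let the sign of the cost function choose the direction of the step:

#Move Pmax up when the cost is positive and down when it is negative
Pmax <- Pmax + sign(costf) * learnrate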
There are other types of approaches as well. In one of these, the algorithm creates a grid of Pmax values and then tries all of them to see which one brings the cost function closest to zero, as the sketch below shows.
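Here is a minimal sketch of this grid-search idea, assuming the Group A numbers from above:

#Grid search: evaluate the cost on a grid of candidate premiums
grid <- seq(10000, 200000, by = 1)
cost <- 0.85*sqrt(800000 - grid) + 0.15*sqrt(400000 - grid) - 0.85*sqrt(800000)
grid[which.min(abs(cost))] # close to the uniroot() answer, ~150043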
42.2 Derivative-based methods
One of the derivative-based methods is the Newton-Raphson method. If we assume that the function is differentiable and has only one minimum (or maximum), we can develop an optimization algorithm that looks for the point in the parameter space where the derivative of the function is zero. There are other methods, like Fisher Scoring and Iteratively Reweighted Least Squares, that we will not cover here.
First, let’s see the Newton-Raphson method. It is a well-known application of the derivatives covered in high-school calculus. The method is very simple and is used to find the roots of \(f(x)=0\) by iteration. In first-year computer science courses, this method is used to teach loop algorithms that calculate values such as \(e^{0.71}\) or \(\sqrt{12}\) (a short sketch for \(\sqrt{12}\) follows the example below). It is a simple iteration that converges in a few steps.
\[ x_{n+1}=x_{n}-\frac{f\left(x_{n}\right)}{f^{\prime}\left(x_{n}\right)} \] To understand it, let’s look at the function \(y=f(x)\) shown in the following graph:
It has a zero at \(x=x_r\), which is not known. To find it, we start with \(x_0\) as an initial estimate of \(x_r\). The tangent line to the graph at the point \(\left(x_0, f\left(x_0\right)\right)\) crosses the \(x\)-axis at the point \(x_1\), which is a better estimate of the root. The slope of this line can be defined as
\[ \frac{y-f\left(x_0\right)}{x-x_0}=f^{\prime}\left(x_0\right) \] Hence,
\[ y-f\left(x_0\right)=f^{\prime}\left(x_0\right)\left(x-x_0\right) \] At the point where the tangent line crosses the \(x\)-axis, \(y=0\) and \(x=x_1\). Hence, solving the equation for \(x_1\), we get
\[ x_{1}=x_{0}-\frac{f\left(x_{0}\right)}{f^{\prime}\left(x_{0}\right)} \] And the second approximation:
\[ x_{2}=x_{1}-\frac{f\left(x_{1}\right)}{f^{\prime}\left(x_{1}\right)} \] With multiple iterations, one can find the solution. Here is an example:
newton <- function(f, x0, tol = 1e-5, n = 1000) {
  require(numDeriv) # Package for computing f'(x)
  for (i in 1:n) {
    dx <- genD(func = f, x = x0)$D[1] # First-order derivative f'(x0)
    x1 <- x0 - (f(x0) / dx) # Calculate next value x1
    if (abs(x1 - x0) < tol) {
      res <- paste("Root approximation is ", x1, " in ", i, " iterations")
      return(res)
    }
    # If Newton-Raphson has not yet reached convergence, set x1 as x0 and continue
    x0 <- x1
  }
  print('Too many iterations in method')
}

func2 <- function(x) {
  x^15 - 2
}

newton(func2, 1)
## [1] "Root approximation is 1.04729412282063 in 5 iterations"
#Check it
paste("Calculator result: ", 2^(1/15))
## [1] "Calculator result: 1.04729412282063"
Newton’s method is often used to solve two different, but related, problems:
- Finding \(x\) such that \(f(x)=0\) (try to solve our insurance problem with this method; a sketch follows below)
- Finding \(x\) such that \(g'(x)=0\), i.e., finding the \(x\) that minimizes/maximizes \(g(x)\).
The relation between these two problems is obvious when we define \(f(x) = g'(x)\). Hence, for the second problem, the Newton-Raphson method becomes:
\[ x_{n+1}=x_{n}-\frac{g^{\prime}\left(x_{n}\right)}{g^{\prime \prime}\left(x_{n}\right)} \]
The connection between these two problems is explained very nicely in this post (Gulzar 2018).
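As suggested above, our insurance problem from Section 42.1 is exactly the first type of problem, so newton() can solve it. A minimal sketch, assuming the Group A numbers from above:

#Newton-Raphson on the insurance cost function (Group A)
costfA <- function(x) {
  0.85*sqrt(800000 - x) + 0.15*sqrt(400000 - x) - 0.85*sqrt(800000)
}
newton(costfA, 10000) # should land near the uniroot() answer, ~150042.5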
Let’s pretend that we are interested in determining the parameters of a random variable \(X \sim N(\mu, \sigma^{2})\). Here is the log-likelihood function for \(X\):
\[ \log (\mathcal{L}(\mu, \sigma))=-\frac{n}{2} \log \left(2 \pi \sigma^{2}\right)-\frac{1}{2 \sigma^{2}} \sum_{i=1}^{n}\left(x_{i}-\mu\right)^{2} \]
We have seen it in Chapter 2 before. But this time we will use dnorm(), which calculates the pdf of a normal variable. First, let’s have the data and the log-likelihood:
# Let's create a sample of normal variables
set.seed(2019)
X <- rnorm(100, 0, 1)

# And the log-likelihood of this function.
# Remember, the likelihood function would be prod(dnorm()) with log = F
normalLL <- function(prmt){
  sum(dnorm(X, mean = prmt[1], sd = prmt[2], log = TRUE))
}
# Let's try several parameters
normalLL(prmt = c(1,1.5))
## [1] -176.078
normalLL(prmt = c(2,1))
## [1] -347.4119
normalLL(prmt = c(mean(X),sd(X)))
## [1] -131.4619
As you can see, the last one is the best. And we can verify it because we had created \(X\) with 0 mean and 1 sd, approximately. Now we will use the Newton-Raphson method to calculate those parameters that minimize the negative log-likelihood.
First, let’s build a function that numerically estimates the slope of the function (the first derivative) at any arbitrary point in the parameter space, for the mean and the sd separately. Don’t forget: the log-likelihood is a function of the parameters (mean and sd), not of X.
# First partial (numerical) derivative w.r.t. mean
firstM <- function(p1, p2, change = 0.0001){
  prmt <- c(p1, p2)
  high <- normalLL(prmt + c(change, 0))
  low <- normalLL(prmt - c(change, 0))
  slope <- (high - low)/(change*2)
  return(slope)
}

firstM(mean(X), sd(X))
## [1] 0
# First partial (numerical) derivative w.r.t. sd
firstSD <- function(p1, p2, change = 0.0001){
  prmt <- c(p1, p2)
  high <- normalLL(prmt + c(0, change))
  low <- normalLL(prmt - c(0, change))
  slope <- (high - low)/(change*2)
  return(slope)
}
firstSD(mean(X), sd(X))
## [1] -1.104417
#Verify them with grad()
library(numDeriv)
f <- function(x) {
  a <- x[1]; b <- x[2]
  sum(dnorm(X, mean = a, sd = b, log = TRUE))
}

grad(f, c(mean(X), sd(X)))[1]
## [1] 0
grad(f,c(mean(X),sd(X)))[2]
## [1] -1.104419
# Or better
round(jacobian(f,c(mean(X),sd(X))), 4) #First derivatives
## [,1] [,2]
## [1,] 0 -1.1044
round(hessian(f,c(mean(X),sd(X))), 4) #Second derivatives
## [,1] [,2]
## [1,] -121.9741 0.000
## [2,] 0.0000 -240.289
Let’s try them now in the Newton-Raphson method.
\[ x_{n+1}=x_{n}-\frac{g^{\prime}\left(x_{n}\right)}{g^{\prime \prime}\left(x_{n}\right)} \]
Similar to the first one, we can also develop a function that calculates the second derivatives. However, instead of using our own functions, let’s use grad() and hessian() from the numDeriv package.
set.seed(2019)
X <- rnorm(100, 2, 2)

NR <- function(f, x0, y0, tol = 1e-5, n = 1000) {
  for (i in 1:n) {
    dx <- grad(f, c(x0, y0))[1] # First-order derivative f'(x0)
    ddx <- hessian(f, c(x0, y0))[1,1] # Second-order derivative f''(x0)
    x1 <- x0 - (dx / ddx) # Calculate next value x1
    if (abs(x1 - x0) < tol) {
      res <- paste("The mean approximation is ", x1, " in ", i, " iterations")
      return(res)
    }
    # If Newton-Raphson has not yet reached convergence, set x1 as x0 and continue
    x0 <- x1
  }
  print('Too many iterations in method')
}

func <- function(x) {
  a <- x[1]; b <- x[2]
  sum(dnorm(X, mean = a, sd = b, log = TRUE))
}

NR(func, -3, 1.5)
## [1] "The mean approximation is 1.85333200308383 in 2 iterations"
#Let's verify it
mean(X)
## [1] 1.853332
Finding the sd is left to the practice questions, but the way to do it should be obvious: use our approximation of the mean (1.853332) as a fixed parameter in the function and run the same algorithm for the sd; a sketch follows below. When the power of computers and the elegance of mathematics intersect, beautiful things happen.
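Here is a minimal sketch of that practice question, reusing grad() and hessian() in the same pattern as NR() above. One caveat worth hedging: the maximizer is the MLE of the sd, which divides by \(n\) rather than \(n-1\), so the result will be close to, but slightly below, sd(X).

#Same Newton-Raphson update, now moving the sd while the mean stays fixed
NRsd <- function(f, x0, y0, tol = 1e-5, n = 1000) {
  for (i in 1:n) {
    dy <- grad(f, c(x0, y0))[2] # First-order derivative w.r.t. sd
    ddy <- hessian(f, c(x0, y0))[2,2] # Second-order derivative w.r.t. sd
    y1 <- y0 - (dy / ddy) # Calculate next sd value
    if (abs(y1 - y0) < tol) {
      return(paste("The sd approximation is ", y1, " in ", i, " iterations"))
    }
    y0 <- y1
  }
  print('Too many iterations in method')
}

NRsd(func, 1.853332, 1.5) # compare with sd(X)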
42.3 ML Estimation with logistic regression
The pdf of the Bernoulli distribution is
\[ P(Y=y)=p^y(1-p)^{1-y} \] Its likelihood is
\[ \begin{aligned} L(\boldsymbol{\beta} \mid \mathbf{y} ; \mathbf{x}) &=L\left(\beta_0, \beta_1 \mid\left(y_1, \ldots, y_n\right) ;\left(x_1, \ldots, x_n\right)\right) \\ &=\prod_{i=1}^n p_i^{y_i}\left(1-p_i\right)^{1-y_i} \end{aligned} \] And log-likelihood
\[ \begin{aligned} \ell(\boldsymbol{\beta} \mid \mathbf{y} ; \mathbf{x}) &=\log \left(\prod_{i=1}^n p_i^{y_i}\left(1-p_i\right)^{1-y_i}\right) \\ &=\sum_{i=1}^n \log \left(p_i^{y_i}\right)+\log \left(\left(1-p_i\right)^{1-y_i}\right) \\ &=\sum_{i=1}^n y_i \log \left(p_i\right)+\left(1-y_i\right) \log \left(1-p_i\right) \end{aligned} \] where
\[ \begin{aligned} \operatorname{logit}\left(p_i\right) &=\log \left(\frac{p_i}{1-p_i}\right) \\ &=\beta_0+\beta_1 x_1 \end{aligned} \]
So,
\[ p_i=\frac{\exp \left(\beta_0+x_1 \beta_1\right)}{1+\exp \left(\beta_0+x_1 \beta_1\right)} \] First partial derivative with respect to \(\beta_0\) \[ \begin{aligned} \frac{\partial p_i}{\partial \beta_0} &=\frac{\exp \left(\beta_0+x_1 \beta_1\right)}{\left(1+\exp \left(\beta_0+x_1 \beta_1\right)\right)^2} \\ &=p_i\left(1-p_i\right) \end{aligned} \] And
\[ \begin{aligned} \frac{\partial p_i}{\partial \beta_1} &=\frac{x_1 \exp \left(\beta_0+x_1 \beta_1\right)}{\left(1+\exp \left(\beta_0+x_1 \beta_1\right)\right)^2} \\ &=x_1 p_i\left(1-p_i\right) \end{aligned} \] Newton-Raphson’s equation is
\[ \boldsymbol{\beta}^{(t+1)}=\boldsymbol{\beta}^{(t)}-\left(\boldsymbol{H}^{(t)}\right)^{-1} \boldsymbol{u}^{(t)}, \]
where
\[ \boldsymbol{\beta}^{(t)}=\left[\begin{array}{c} \beta_0^{(t)} \\ \beta_1^{(t)} \end{array}\right] \]
\[ \boldsymbol{u}^{(t)}=\left[\begin{array}{c} u_0^{(t)} \\ u_1^{(t)} \end{array}\right]=\left[\begin{array}{c} \frac{\partial \ell\left(\boldsymbol{\beta}^{(t)} \mid \mathbf{y} ; \mathbf{x}\right)}{\partial \beta_0} \\ \frac{\partial \ell\left(\boldsymbol{\beta}^{(t)} \mid \mathbf{y} ; \mathbf{x}\right)}{\partial \beta_1} \end{array}\right]=\left[\begin{array}{c} \sum_{i=1}^n y_i-p_i^{(t)} \\ \sum_{i=1}^n x_i\left(y_i-p_i^{(t)}\right) \end{array}\right] \] \[ p_i^{(t)}=\frac{\exp \left(\beta_0^{(t)}+x_1 \beta_1^{(t)}\right)}{1+\exp \left(\beta_0^{(t)}+x_1 \beta_1^{(t)}\right)} \]
\(\boldsymbol{H}^{(t)}\) can be considered as the Jacobian matrix of \(\boldsymbol{u}(\cdot)\),
\[ \boldsymbol{H}^{(t)}=\left[\begin{array}{ll} \frac{\partial u_0^{(t)}}{\partial \beta_0} & \frac{\partial u_0^{(t)}}{\partial \beta_1} \\ \frac{\partial u_1^{(t)}}{\partial \beta_0} & \frac{\partial u_1^{(t)}}{\partial \beta_1} \end{array}\right] \]
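For completeness, differentiating \(\boldsymbol{u}\) with the partial derivatives of \(p_i\) above gives the closed-form entries (a step added here for clarity):

\[ \boldsymbol{H}^{(t)}=-\left[\begin{array}{cc} \sum_{i=1}^n p_i^{(t)}\left(1-p_i^{(t)}\right) & \sum_{i=1}^n x_i p_i^{(t)}\left(1-p_i^{(t)}\right) \\ \sum_{i=1}^n x_i p_i^{(t)}\left(1-p_i^{(t)}\right) & \sum_{i=1}^n x_i^2 p_i^{(t)}\left(1-p_i^{(t)}\right) \end{array}\right] \]

In the code below, however, we let jacobian() from the numDeriv package compute \(\boldsymbol{H}^{(t)}\) numerically instead of coding these sums by hand.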
rm(list=ls())

#Creating random data
set.seed(1)
n <- 500
xx = rnorm(n) # this is our x
z = -2 + 3 * xx

#Probability is defined by a logistic function
p = 1 / (1 + exp(-z))

#Bernoulli is the special case of the binomial distribution with size = 1
y = rbinom(n, size = 1, prob = p)

#And we create our data
df1 <- data.frame(y, xx)
head(df1)
## y xx
## 1 0 -0.6264538
## 2 0 0.1836433
## 3 0 -0.8356286
## 4 0 1.5952808
## 5 0 0.3295078
## 6 0 -0.8204684
logis <- glm(y ~ xx, data = df1, family = binomial)
summary(logis)
##
## Call:
## glm(formula = y ~ xx, family = binomial, data = df1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3813 -0.4785 -0.2096 0.2988 2.4274
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.8253 0.1867 -9.776 <2e-16 ***
## xx 2.7809 0.2615 10.635 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 605.69 on 499 degrees of freedom
## Residual deviance: 328.13 on 498 degrees of freedom
## AIC: 332.13
##
## Number of Fisher Scoring iterations: 6
func.u <- function(x) c(sum(df1$y - exp(x[1] + x[2] * df1$xx) / (1 + exp(x[1] + x[2] * df1$xx))),
                        sum(df1$xx * (df1$y - exp(x[1] + x[2] * df1$xx) / (1 + exp(x[1] + x[2] * df1$xx)))))

delta <- matrix(1:2, nrow = 2)
x <- array(c(-2, 3))

while(abs(sum(delta)) > 0.0001){
  xx <- x ## current x value
  x <- as.matrix(x) - solve(jacobian(func.u, x = x)) %*% func.u(x) ## new x value
  delta <- x - as.matrix(xx)
}

x
## [,1]
## [1,] -1.825347
## [2,] 2.780929
42.4 Gradient Descent Algorithm
Let’s start with a regression problem. The cost function in OLS is the residual sum of squares, \(\mathrm{RSS}=\sum_{i=1}^n\left(\widehat{e}_i\right)^2=\sum_{i=1}^n\left(y_i-\hat{y}_i\right)^2=\sum_{i=1}^n\left(y_i-\left(b_1+b_2 x_i\right)\right)^2\), which is a convex function. Our objective is to find \(b_1\) and \(b_2\) that minimize RSS. How can we find those parameters if we don’t know much about the cost function? The trick is to start at some point and move a bit (locally) in the direction that reduces the value of the cost function. In general, this search process has two components: the direction and the step size. The direction tells us which way we move next, and the step size determines how far we move in that direction. For example, the iterative search for \(b_2\) in gradient descent can be described by the following recursive rule:
\[ b_2^{(k+1)}=b_2^{(k)}-lr \nabla RSS^{k} \] Here, \(lr\) is the learning rate and \(\nabla RSS^{k}\) is the slope of RSS at step \(k\). Hence, \(lr \nabla RSS^{k}\) is the total step size at step \(k\). Note that, as we move from either direction towards \(b^*_2\), \(\nabla RSS^{k}\) gets smaller; in fact, it becomes zero at \(b^*_2\). Therefore, \(\nabla RSS^{k}\) helps the iterations find the proper adjustment at each step in terms of both direction and magnitude. Since RSS is a convex function, it is easy to see how the sign of \(\nabla RSS^{k}\) directs an arbitrary \(b_2^{''}\) towards the optimal \(b^*_2\).
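For our simple regression, this gradient has a closed form, which is worth making explicit:

\[ \nabla RSS^{k}=\frac{\partial RSS}{\partial b_2}\bigg|_{b_2^{(k)}}=-2 \sum_{i=1}^n x_i\left(y_i-b_1-b_2^{(k)} x_i\right) \]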
Since the first-order approximation at \(b_2^{''}\) is good only for small \(\Delta b_2\), a small \(lr>0\) is needed to make \(\Delta b_2\) small in magnitude. Moreover, a high learning rate leads to “overshooting” past the local minimum and may cause the algorithm to diverge.
Below, we first use a simple linear regression function on simulated data and estimate its parameters with lm(). Let’s simulate a sample with our DGM.
set.seed(1001)
N <- 100
int <- rep(1, N)
x1 <- rnorm(N, mean = 10, sd = 2)
Y <- rnorm(N, 2*x1 + int, 1)
model <- lm(Y ~ x1)
b <- coef(model)
b
## (Intercept) x1
## 1.209597 1.979643
plot(x1, Y, col = "blue", pch = 20)
abline(b)
42.4.1 One-variable
Below, we create a function, gradientDesc, to show how sensitive gradient descent algorithms can be to different calibrations:
gradientDesc <- function(x, y, learnrate, conthresh, maxiter) {
  #starting points
  set.seed(234)
  b <- runif(1, 0, 1)
  c <- runif(1, 0, 1)
  n <- length(x)
  #function
  yhat <- b * x + c
  #gradient
  MSE <- sum((y - yhat) ^ 2) / n
  converged = F
  iterations = 0
  #while loop
  while (converged == F) {
    b_new <- b + learnrate * (1 / n) * (sum((y - yhat) * x))
    c_new <- c + learnrate * (1 / n) * (sum(y - yhat))
    b <- b_new
    c <- c_new
    yhat <- b * x + c
    MSE_new <- sum((y - yhat) ^ 2) / n
    if (abs(MSE - MSE_new) <= conthresh) {
      converged = T
      return(paste("Intercept:", c, "Slope:", b))
    }
    iterations = iterations + 1
    if (iterations > maxiter) {
      converged = T
      return(paste("Max. iter. reached, ", "Intercept:", c, "Slope:", b))
    }
  }
}
Note that the key part in this algorithm is b_new <- b + learnrate * (1 / n) * (sum((y - yhat) * x)). The first \(b\), picked randomly by b <- runif(1, 0, 1), is adjusted by learnrate * (1 / n) * (sum((y - yhat) * x)). Note that sum((y - yhat) * x) is one of the first-order conditions of the cost function (RSS, the residual sum of squares), namely the “normal” equation for the slope coefficient. The cost function (RSS) is convex, so its minimum is achieved at the optimal \(b\). The linear Taylor approximation of RSS at \(b\) provides the direction of steepest descent: a simple adjustment that steers \(b\) in the right direction, step by step, until the minimum of RSS is reached.
Now we will see if this function gives us the same intercept and slope coefficients as those calculated with lm() above.
gradientDesc(x1, Y, 0.000293, 0.001, 2500000) # Perfect
## [1] "Max. iter. reached, Intercept: 1.20959654028224 Slope: 1.97964316494781"
gradientDesc(x1, Y, 0.0000293, 0.001, 2500000) # Ehhhh
## [1] "Max. iter. reached, Intercept: 1.20102677446618 Slope: 1.980458385393"
gradientDesc(x1, Y, 0.00000293, 0.001, 2500000) # Too little
## [1] "Max. iter. reached, Intercept: 0.994504430596522 Slope: 2.00010434080475"
gradientDesc(x1, Y, 0.000293, 0.001, 2500000) # Perfect
## [1] "Max. iter. reached, Intercept: 1.20959654028224 Slope: 1.97964316494781"
gradientDesc(x1, Y, 0.00293, 0.001, 2500000) # perfect
## [1] "Max. iter. reached, Intercept: 1.2095965402898 Slope: 1.97964316494709"
The main question, of course, is how we find the right learning rate. This is an active research question, and answering it is beyond this chapter. A general suggestion, however, is to keep it small and tune it within the training process.
42.4.2 Multivariable
We will expand the gradient descent algorithm to a multivariable example using matrix algebra. First, the data and model simulation:
set.seed(1001)
N <- 100
int <- rep(1, N)
x1 <- rnorm(N, mean = 10, sd = 2)
x2 <- rnorm(N, mean = 5, sd = 1)
x3 <- rbinom(N, 1, 0.5)
x4 <- rbinom(N, 1, 0.5)
x5 <- rbinom(N, 1, 0.5)
x6 <- rnorm(N, 1, 0.25)
x7 <- rnorm(N, 1, 0.2)
x2x3 <- x2*x3
x4x5 <- x4*x5
x4x6 <- x5*x6
x3x7 <- x3*x7

Y <- rnorm(N, 2*x1 - 0.5*x2 - 1.75*x2x3 + 2*x4x5 - 3*x4x6 + 1.2*x3x7 + int, 1)
X <- cbind(int, x1, x2, x2x3, x4x5, x4x6, x3x7)
We can solve it with linear algebra manually:
betaOLS <- solve(t(X)%*%X)%*%t(X)%*%Y
print(betaOLS)
## [,1]
## int 0.4953323
## x1 1.9559022
## x2 -0.3511182
## x2x3 -1.9112623
## x4x5 1.7424723
## x4x6 -2.8323934
## x3x7 2.1015442
We can also solve it with lm():

model1.lm <- lm(Y ~ X - 1)
summary(model1.lm)
##
## Call:
## lm(formula = Y ~ X - 1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.84941 -0.45289 -0.09686 0.57679 2.07154
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## Xint 0.49533 0.75410 0.657 0.51290
## Xx1 1.95590 0.03868 50.571 < 2e-16 ***
## Xx2 -0.35112 0.12600 -2.787 0.00645 **
## Xx2x3 -1.91126 0.13358 -14.308 < 2e-16 ***
## Xx4x5 1.74247 0.24396 7.143 2.01e-10 ***
## Xx4x6 -2.83239 0.18831 -15.041 < 2e-16 ***
## Xx3x7 2.10154 0.64210 3.273 0.00149 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8448 on 93 degrees of freedom
## Multiple R-squared: 0.9972, Adjusted R-squared: 0.997
## F-statistic: 4677 on 7 and 93 DF, p-value: < 2.2e-16
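Before writing the function, note the matrix form of the update it will apply; the factor of 2 in the gradient is absorbed into the learning rate:

\[ \nabla RSS(\mathbf{b})=-2 \mathbf{X}^{\top}(\mathbf{y}-\mathbf{X} \mathbf{b}), \qquad \mathbf{b}^{(k+1)}=\mathbf{b}^{(k)}-lr \frac{1}{n} \mathbf{X}^{\top}\left(\mathbf{X} \mathbf{b}^{(k)}-\mathbf{y}\right) \]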
Now the function for gradient descent:
gradientDescALL <- function(x, y, learnrate, conthresh, maxiter) {
  set.seed(123)
  b <- runif(ncol(x), 0, 1)
  yhat <- x%*%b
  e <- y - yhat
  RSS <- t(e)%*%e
  converged = F
  iterations = 0
  n <- length(y)
  while(converged == F) {
    b_new <- b - (learnrate*(1/n))*t(x)%*%(x%*%b-y)
    b <- b_new
    yhat <- x%*%b
    e <- y - yhat
    RSS_new <- t(e)%*%e
    if(RSS - RSS_new <= conthresh) {
      converged = T
      return(b)
    }
    iterations = iterations + 1
    if(iterations > maxiter) {
      converged = T
      return(b)
    }
  }
}
gradientDescALL(X, Y, 0.0000293, 0.001, 2500000) # Ehhhh
## [,1]
## int 0.7166117
## x1 1.9536609
## x2 -0.3908202
## x2x3 -1.8183423
## x4x5 1.7662375
## x4x6 -2.8377073
## x3x7 1.6405013
gradientDescALL(X, Y, 0.00000293, 0.001, 2500000) # Too little
## [,1]
## int 0.3243014
## x1 2.0080590
## x2 -0.4608975
## x2x3 -1.5596021
## x4x5 1.0183122
## x4x6 -2.1427840
## x3x7 0.3929369
gradientDescALL(X, Y, 0.000293, 0.001, 2500000) # Close
## [,1]
## int 0.4960033
## x1 1.9558851
## x2 -0.3512160
## x2x3 -1.9111656
## x4x5 1.7425020
## x4x6 -2.8324060
## x3x7 2.1010627
gradientDescALL(X, Y, 0.00293, 0.001, 2500000) # Perfect
## [,1]
## int 0.4953323
## x1 1.9559022
## x2 -0.3511182
## x2x3 -1.9112623
## x4x5 1.7424723
## x4x6 -2.8323934
## x3x7 2.1015442
gradientDescALL(X, Y, 0.0293, 0.001, 2500000) # Too big
## [,1]
## int 0.21935351
## x1 0.56900009
## x2 -0.05797943
## x2x3 -0.29563662
## x4x5 0.90067425
## x4x6 -0.06914816
## x3x7 0.30046247
As before, the issue here is the magnitude of the learning rate.
42.5 Optimization with R
A good summary of the tools for optimization in R is given in this guide: Optimization and Mathematical Programming. There are many optimization methods, each of which is appropriate only for specific cases. In choosing a numerical optimization method, we need to consider the following points:
- We need to know whether it’s a constrained or unconstrained problem. For example, the MLE method is an unconstrained problem, while most regularization problems, like Lasso or Ridge, are constrained optimization problems.
- Do we know how the objective function is shaped a priori? MLE and OLS methods have well-known objective functions (the residual sum of squares and the log-likelihood). A maximization problem can be turned into a minimization problem, and vice versa, by flipping the sign of the objective function.
- Multivariate optimization problems are much harder than single-variable optimization problems. There is, however, a large set of available optimization methods for multivariate problems.
- In multivariate cases, the critical point is whether the objective function has available gradients. If only the objective function is available without gradient or Hessian, the Nelder-Mead algorithm is the most common method for numerical optimization. If gradients are available, the best and most used method is the gradient descent algorithm. We have seen its application for OLS. This method can be applied to MLE as well. It is also called a Steepest Descent Algorithm. In general, the gradient descent method has three types: Batch Gradient Descent, Stochastic Gradient Descent, and Mini-Batch Gradient Descent.
- If the gradient and Hessian are available, we can also use the Newton-Raphson Method. This is only possible if the dataset is not high-dimensional, as calculating the Hessian would otherwise be very expensive.
- Usually the Nelder-Mead method is slower than the Gradient Descent method.
The optim() function uses the Nelder-Mead method by default, but the optimization method can be chosen via its arguments.
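As a minimal sketch (reusing the data and log-likelihood idea from Section 42.2, with the sign flipped for minimization; note that, in practice, one often optimizes log(sd) to keep the sd positive):

#optim() on the negative log-likelihood; Nelder-Mead is the default method
set.seed(2019)
X <- rnorm(100, 2, 2)
negLL <- function(prmt) -sum(dnorm(X, mean = prmt[1], sd = prmt[2], log = TRUE))
optim(par = c(1, 2), fn = negLL)$par # Nelder-Mead (default)
optim(par = c(1, 2), fn = negLL, method = "BFGS")$par # gradient-based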
More details can be found in these educational slides. The most detailed and advanced source is Numerical Recipes, which uses C++ and R.