Numerical Optimization

Andrew Fogarty

5/23/2020

1 Introduction

Maximum likelihood fixes our observed data and asks: What parameter values most likely generated the data we have? In a likelihood framework, the joint probability of the observed data is treated as a function of the parameter values of a specified mass or density function, and it is this joint probability that we maximize with respect to the parameters. The maximum likelihood estimate is the parameter value whose density or mass function has the highest likelihood of generating the observed data.
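
Formally, for \(n\) independent observations with density or mass \(f(y_{i}|\theta)\), the MLE maximizes the joint probability of the sample:

\[ \hat{\theta}_{MLE} = \underset{\theta}{\arg\max}\ L(\theta|y) = \underset{\theta}{\arg\max} \prod_{i=1}^n f(y_{i}|\theta) \]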

Numerical optimizers provide the means to estimate our parameters by finding the values that maximize the likelihood of generating our data. This guide helps us understand how optimization algorithms find the best estimates for our coefficients in a step-by-step process.

2 Numerical Optimization

In the chunk below, some binary synthetic data is created for use in logistic regression.

y <- sample(c(0, 1), size = 500, replace = TRUE)  # binary outcome
x1 <- rnorm(500)  # first predictor
x2 <- rnorm(500)  # second predictor
intercept <- rep(1, 500)  # column of 1s for the intercept
X <- data.frame(intercept, x1, x2)
X <- data.matrix(X)  # convert to a matrix for the matrix algebra below
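
Because y, x1, and x2 are drawn randomly, the numbers printed later in this guide will vary from run to run; setting a seed before the chunk above (the value here is arbitrary) makes the results reproducible:

set.seed(2020)  # arbitrary seed; run before generating the data above for reproducibility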

When calculating the MLE, most programs iterate some numerical optimization procedure. These algorithms repeatedly recalculate the parameters until the change in the current value falls below a specified tolerance, at which point the algorithm is said to have converged. The workhorses for finding extrema of log-likelihood functions are hill-climbing algorithms that use the information in the derivative of the log-likelihood function to climb to a maximum or descend to a minimum. There are many algorithms for computing the MLE, such as Newton-Raphson (NR), Quasi-Newton (BFGS), and gradient descent.

We can get coefficients by either of two equivalent routes (as shown below):

  1. Maximizing the log-likelihood, or
  2. Minimizing the negative log-likelihood
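
Because flipping the sign of the objective turns a maximum into a minimum, the two routes yield the same estimate:

\[ \underset{\theta}{\arg\max}\ \log L(\theta|y) = \underset{\theta}{\arg\min}\ \left[-\log L(\theta|y)\right] \]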

3 Gradient Descent

In this section, we minimize the negative log-likelihood to arrive at the parameters that most likely generated our data, using gradient descent. Gradient descent is attractive because it avoids computing the Hessian (the matrix of second derivatives), which can be computationally expensive or difficult for certain functions; instead, it uses only the gradient to choose the direction of the next update. This is one reason gradient descent is so popular in machine learning, particularly for high-dimensional problems.
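
Concretely, each iteration nudges the current estimate in the direction of the negative gradient, scaled by a step size \(\alpha\) (the learning rate used below):

\[ \beta^{(t+1)} = \beta^{(t)} - \alpha \, \nabla_{\beta}\left[-\log L(\beta^{(t)}|y)\right] \]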

To find our coefficients, we need to compute two quantities of interest: (1) the negative log-likelihood, and (2) its first derivative, the gradient. Gradient descent, often called “standard” or “batch” gradient descent, uses all data points to compute the next search direction. In Section 4, we show how stochastic gradient descent circumvents this.

3.1 Quantity of Interest: Negative Log-Likelihood

The log-likelihood is: \[\log L(\theta|y) = \sum_{i=1}^n \left[y_{i}\log(\theta_{i}) + (1-y_{i})\log(1-\theta_{i})\right] \] where \(\theta_{i} = 1/(1+e^{-X_{i}\beta})\) is the fitted probability for observation \(i\). Its negative, programmed into R, looks like the following:

negLL <- function(b, X, y){  # b = betas
    p <- as.vector(1/(1 + exp(-X %*% b)))  # standard logistic function: 1/(1 + exp(-Xb))
    -sum(y*log(p) + (1-y)*log(1-p))  # negative log-likelihood
    }
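
As a quick sanity check (a small sketch using the objects just defined), at the all-zero starting values used below every fitted probability is 0.5, so the negative log-likelihood should equal \(500\log 2 \approx 346.57\), the same initial value optim reports in Section 6:

negLL(c(0, 0, 0), X, y)  # every p is 0.5, so this reduces to -500*log(0.5)
500 * log(2)  # = 346.57...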

3.2 Quantity of Interest: Gradient

The gradient of the log-likelihood with respect to \(\beta_{j}\) is: \[\frac{\partial \log L}{\partial \beta_{j}} = \sum_{i=1}^n (y_{i} - \theta_{i})x_{ij} \] and the gradient of the negative log-likelihood is simply its negation, which, programmed into R, looks like the following:

gradient <- function(b, X, y){
  p <- as.vector(1/(1 + exp(-X %*% b)))  # fitted probabilities (y-hat)
  -apply((y - p)*X, 2, sum)  # negate the log-likelihood gradient above
  } # gradient of the negative log-likelihood; a vector of partial derivatives
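
To verify the analytic gradient (an optional check, not part of the estimation workflow), we can compare it with a central finite-difference approximation of negLL at an arbitrary point:

b <- c(0, 0, 0)  # arbitrary evaluation point
h_fd <- 1e-6  # small step for the finite difference
fd_grad <- sapply(seq_along(b), function(j){
  e <- rep(0, length(b)); e[j] <- h_fd
  (negLL(b + e, X, y) - negLL(b - e, X, y)) / (2*h_fd)  # central difference
})
round(cbind(analytic = gradient(b, X, y), finite.diff = fd_grad), 6)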

3.3 Gradient Descent Optimization: In Steps

Next, we step through the optimization algorithm manually so we can better understand how our coefficients get updated as we minimize our negative log-likelihood.

First, we instantiate some initial values for our coefficients and a learning rate.

beta = c(0,0,0)  # starting values: intercept and two betas
learning_rate = 0.0001  # step size (alpha) for each update

Next, we step through one iteration of the process manually before proceeding to a while loop. In the code below, we:

  1. Save the current beta coefficients separately as the old values,

  2. Multiply our learning rate by the gradient, obtaining the increment by which we nudge our coefficients, either positively or negatively,

  3. Update our current beta by subtracting the increment (we subtract because we are minimizing),

  4. Use the new betas to update our negative log-likelihood, and

  5. Calculate the Euclidean distance between the new and old beta estimates, which we will use to stop iterating once it falls below a specified tolerance.

# save the previous value
beta0 = beta

# calculate h, the increment
h = learning_rate*gradient(beta, X, y)

# update beta
beta = beta - h # subtract to minimize

# update the log likelihood 
logL = negLL(beta, X, y)

# calculate the euclidean distance
eps  = sqrt(sum((beta - beta0)^2))
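
After a single pass through these five steps, we can inspect the updated quantities (a quick check, assuming the chunks above have been run in order):

round(beta, 4)  # nudged slightly away from the starting values
logL  # current negative log-likelihood
eps  # size of the move just taken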

3.4 Gradient Descent Optimization: Automated

In this section, we utilize a while loop to automate our numerical optimization using gradient descent. The loop continues until the Euclidean distance between successive \(\beta\) estimates falls below our specified tolerance (or the iteration limit is reached).

tol = 10^-6  # set the threshold for convergence
beta = c(0,0,0)  # for intercept and two betas
maxit = 1000
iter = 0
learning_rate = 0.0001  # learning rate
eps = Inf

start = Sys.time()
while(eps > tol & iter < maxit){
  # save the previous value
  beta0 = beta
  
  # calculate h, the increment
  h = learning_rate*gradient(beta, X, y)
  
  # update beta
  beta = beta - h # subtract to minimize
  
  # update the log likelihood 
  logL = negLL(beta, X, y)
  
  # calculate the euclidean distance
  eps  = sqrt(sum((beta - beta0)^2))
  
  # update the iteration number
  iter = iter + 1
  if(iter == maxit) warning("Iteration limit reached without convergence")
  
  # print out info to keep track
  if(floor(iter/20) == ceiling(iter/20)) cat(sprintf("Iter: %d logL: %.2f beta0: %.3f beta1: %.3f beta2: %.3f eps:%f\n",iter, logL,beta[1],beta[2],beta[3],eps))
}
## Iter: 20 logL: 346.06 beta0: 0.007 beta1: -0.010 beta2: -0.030 eps:0.001427
## Iter: 40 logL: 345.74 beta0: 0.012 beta1: -0.017 beta2: -0.053 eps:0.001116
## Iter: 60 logL: 345.55 beta0: 0.017 beta1: -0.023 beta2: -0.071 eps:0.000873
## Iter: 80 logL: 345.43 beta0: 0.020 beta1: -0.028 beta2: -0.086 eps:0.000683
## Iter: 100 logL: 345.36 beta0: 0.022 beta1: -0.032 beta2: -0.097 eps:0.000535
## Iter: 120 logL: 345.31 beta0: 0.024 beta1: -0.035 beta2: -0.106 eps:0.000419
## Iter: 140 logL: 345.29 beta0: 0.025 beta1: -0.038 beta2: -0.112 eps:0.000328
## Iter: 160 logL: 345.27 beta0: 0.027 beta1: -0.039 beta2: -0.118 eps:0.000257
## Iter: 180 logL: 345.26 beta0: 0.027 beta1: -0.041 beta2: -0.122 eps:0.000202
## Iter: 200 logL: 345.25 beta0: 0.028 beta1: -0.042 beta2: -0.125 eps:0.000158
## Iter: 220 logL: 345.25 beta0: 0.028 beta1: -0.043 beta2: -0.128 eps:0.000124
## Iter: 240 logL: 345.25 beta0: 0.029 beta1: -0.044 beta2: -0.130 eps:0.000097
## Iter: 260 logL: 345.24 beta0: 0.029 beta1: -0.045 beta2: -0.131 eps:0.000076
## Iter: 280 logL: 345.24 beta0: 0.029 beta1: -0.045 beta2: -0.133 eps:0.000060
## Iter: 300 logL: 345.24 beta0: 0.030 beta1: -0.045 beta2: -0.134 eps:0.000047
## Iter: 320 logL: 345.24 beta0: 0.030 beta1: -0.046 beta2: -0.134 eps:0.000037
## Iter: 340 logL: 345.24 beta0: 0.030 beta1: -0.046 beta2: -0.135 eps:0.000029
## Iter: 360 logL: 345.24 beta0: 0.030 beta1: -0.046 beta2: -0.135 eps:0.000023
## Iter: 380 logL: 345.24 beta0: 0.030 beta1: -0.046 beta2: -0.136 eps:0.000018
## Iter: 400 logL: 345.24 beta0: 0.030 beta1: -0.046 beta2: -0.136 eps:0.000014
## Iter: 420 logL: 345.24 beta0: 0.030 beta1: -0.046 beta2: -0.136 eps:0.000011
## Iter: 440 logL: 345.24 beta0: 0.030 beta1: -0.047 beta2: -0.137 eps:0.000009
## Iter: 460 logL: 345.24 beta0: 0.030 beta1: -0.047 beta2: -0.137 eps:0.000007
## Iter: 480 logL: 345.24 beta0: 0.030 beta1: -0.047 beta2: -0.137 eps:0.000005
## Iter: 500 logL: 345.24 beta0: 0.030 beta1: -0.047 beta2: -0.137 eps:0.000004
## Iter: 520 logL: 345.24 beta0: 0.030 beta1: -0.047 beta2: -0.137 eps:0.000003
## Iter: 540 logL: 345.24 beta0: 0.030 beta1: -0.047 beta2: -0.137 eps:0.000003
## Iter: 560 logL: 345.24 beta0: 0.030 beta1: -0.047 beta2: -0.137 eps:0.000002
## Iter: 580 logL: 345.24 beta0: 0.030 beta1: -0.047 beta2: -0.137 eps:0.000002
## Iter: 600 logL: 345.24 beta0: 0.030 beta1: -0.047 beta2: -0.137 eps:0.000001
## Iter: 620 logL: 345.24 beta0: 0.030 beta1: -0.047 beta2: -0.137 eps:0.000001

As we can see, we have:

  1. Minimized our negative log-likelihood, and

  2. Found our estimates of beta iteratively until we reached our specified tolerance (0.000001).

To compare our results with those of R’s glm function, consider the following:

fit.glm <- glm(y ~ x1 + x2, data = data.frame(X), family=binomial(link="logit")) 
round(fit.glm$coefficients, 3)
## (Intercept)          x1          x2 
##       0.030      -0.047      -0.137

As we can see, we get identical results (to three decimal places).

4 Stochastic Gradient Descent

Stochastic gradient descent performs each update of \(\beta\) using a small random subsample of the data (here, 10 observations at a time) rather than a full pass over the data. The procedure otherwise continues as above: we sample the data and iterate until convergence. While in this small example stochastic gradient descent runs more slowly than standard gradient descent, with more data and higher-dimensional features it can outperform gradient descent significantly by reducing the computation required for each update.

5 Stochastic Gradient Descent Optimization: Automated

# the start
tol = 10^-9  # lowered threshold to ensure convergence
beta = c(0,0,0)  # for intercept and two betas
maxit = 100000  # increased iteration limit
iter = 0
learning_rate = 0.0001
subsample.size = 10  # sample 10 observations at a time
eps = 100000
epscount = 0
logL = rep(NA, maxit)

# given these settings, the result below should closely match the NR/glm estimates
start = Sys.time()
while(eps > tol & iter < maxit & epscount < 4){
  # save the previous value
  beta0 = beta
  
  # take subsample
  index = sample(1:500, size = subsample.size, replace = TRUE)
  
  # calculate h, the increment
  h = learning_rate*gradient(beta, X[index,], y[index])
  
  # update beta
  beta = beta - h
  
  # update the log likelihood 
  logL[iter] = negLL(beta, X, y)
  
  # use the relative change in logL from 1000 iterations prior;
  # randomness between single iterations is large, so this smooths it out
  if(iter > 1000) eps  = abs(logL[iter] - logL[iter-1000])/abs(logL[iter-1000])
  
  # we use this count to protect against randomly hitting the convergence threshold early
  if(eps < tol) epscount = epscount+1
  
  # update the iteration number
  iter = iter + 1
  if(iter == maxit) warning("Iteration limit reached without convergence")
  
  # print out info to keep track
  if(floor(iter/200) == ceiling(iter/200)) cat(sprintf("Iter: %d logL: %.2f beta0: %.3f beta1: %.3f beta2: %.3f eps:%f\n",iter, logL[iter-1],beta[1],beta[2],beta[3],eps))
}
## Iter: 200 logL: 346.45 beta0: 0.001 beta1: -0.001 beta2: -0.007 eps:100000.000000
## Iter: 400 logL: 346.30 beta0: 0.001 beta1: -0.001 beta2: -0.017 eps:100000.000000
## Iter: 600 logL: 346.22 beta0: 0.004 beta1: -0.001 beta2: -0.022 eps:100000.000000
## Iter: 800 logL: 346.10 beta0: 0.003 beta1: -0.004 beta2: -0.029 eps:100000.000000
## Iter: 1000 logL: 346.05 beta0: 0.005 beta1: -0.006 beta2: -0.032 eps:100000.000000
## Iter: 1200 logL: 345.93 beta0: 0.006 beta1: -0.005 beta2: -0.042 eps:0.001495
## Iter: 1400 logL: 345.88 beta0: 0.005 beta1: -0.004 beta2: -0.048 eps:0.001206
## Iter: 1600 logL: 345.80 beta0: 0.004 beta1: -0.007 beta2: -0.053 eps:0.001202
## Iter: 1800 logL: 345.73 beta0: 0.008 beta1: -0.013 beta2: -0.057 eps:0.001075
## Iter: 2000 logL: 345.69 beta0: 0.008 beta1: -0.017 beta2: -0.060 eps:0.001047
## Iter: 2200 logL: 345.65 beta0: 0.007 beta1: -0.019 beta2: -0.064 eps:0.000809
## Iter: 2400 logL: 345.62 beta0: 0.007 beta1: -0.020 beta2: -0.066 eps:0.000738
## Iter: 2600 logL: 345.57 beta0: 0.012 beta1: -0.022 beta2: -0.071 eps:0.000675
## Iter: 2800 logL: 345.52 beta0: 0.015 beta1: -0.024 beta2: -0.075 eps:0.000607
## Iter: 3000 logL: 345.50 beta0: 0.011 beta1: -0.026 beta2: -0.079 eps:0.000550
## Iter: 3200 logL: 345.44 beta0: 0.014 beta1: -0.028 beta2: -0.086 eps:0.000602
## Iter: 3400 logL: 345.41 beta0: 0.013 beta1: -0.028 beta2: -0.090 eps:0.000604
## Iter: 3600 logL: 345.38 beta0: 0.015 beta1: -0.031 beta2: -0.094 eps:0.000538
## Iter: 3800 logL: 345.38 beta0: 0.012 beta1: -0.030 beta2: -0.096 eps:0.000407
## Iter: 4000 logL: 345.38 beta0: 0.011 beta1: -0.031 beta2: -0.096 eps:0.000330
## Iter: 4200 logL: 345.37 beta0: 0.015 beta1: -0.036 beta2: -0.096 eps:0.000208
## Iter: 4400 logL: 345.35 beta0: 0.014 beta1: -0.039 beta2: -0.099 eps:0.000180
## Iter: 4600 logL: 345.34 beta0: 0.017 beta1: -0.041 beta2: -0.101 eps:0.000137
## Iter: 4800 logL: 345.34 beta0: 0.019 beta1: -0.040 beta2: -0.099 eps:0.000112
## Iter: 5000 logL: 345.34 beta0: 0.021 beta1: -0.039 beta2: -0.100 eps:0.000134
## Iter: 5200 logL: 345.33 beta0: 0.020 beta1: -0.038 beta2: -0.101 eps:0.000103
## Iter: 5400 logL: 345.31 beta0: 0.021 beta1: -0.038 beta2: -0.105 eps:0.000109
## Iter: 5600 logL: 345.31 beta0: 0.016 beta1: -0.039 beta2: -0.107 eps:0.000063
## Iter: 5800 logL: 345.29 beta0: 0.016 beta1: -0.041 beta2: -0.113 eps:0.000142
## Iter: 6000 logL: 345.28 beta0: 0.016 beta1: -0.045 beta2: -0.116 eps:0.000157
## Iter: 6200 logL: 345.27 beta0: 0.019 beta1: -0.043 beta2: -0.119 eps:0.000181
## Iter: 6400 logL: 345.27 beta0: 0.023 beta1: -0.042 beta2: -0.117 eps:0.000125
## Iter: 6600 logL: 345.27 beta0: 0.023 beta1: -0.042 beta2: -0.118 eps:0.000132
## Iter: 6800 logL: 345.26 beta0: 0.023 beta1: -0.044 beta2: -0.124 eps:0.000106
## Iter: 7000 logL: 345.25 beta0: 0.022 beta1: -0.045 beta2: -0.126 eps:0.000080
## Iter: 7200 logL: 345.25 beta0: 0.022 beta1: -0.045 beta2: -0.126 eps:0.000050
## Iter: 7400 logL: 345.25 beta0: 0.021 beta1: -0.042 beta2: -0.129 eps:0.000055
## Iter: 7600 logL: 345.25 beta0: 0.021 beta1: -0.041 beta2: -0.131 eps:0.000050
## Iter: 7800 logL: 345.25 beta0: 0.022 beta1: -0.043 beta2: -0.137 eps:0.000025
## Iter: 8000 logL: 345.25 beta0: 0.021 beta1: -0.040 beta2: -0.135 eps:0.000012
## Iter: 8200 logL: 345.25 beta0: 0.023 beta1: -0.042 beta2: -0.135 eps:0.000020
## Iter: 8400 logL: 345.25 beta0: 0.022 beta1: -0.041 beta2: -0.136 eps:0.000012
## Iter: 8600 logL: 345.25 beta0: 0.021 beta1: -0.044 beta2: -0.139 eps:0.000008
## Iter: 8800 logL: 345.25 beta0: 0.022 beta1: -0.045 beta2: -0.139 eps:0.000000
## Iter: 9000 logL: 345.25 beta0: 0.022 beta1: -0.045 beta2: -0.139 eps:0.000011
## Iter: 9200 logL: 345.25 beta0: 0.024 beta1: -0.046 beta2: -0.141 eps:0.000004
## Iter: 9400 logL: 345.25 beta0: 0.022 beta1: -0.049 beta2: -0.140 eps:0.000006
## Iter: 9600 logL: 345.25 beta0: 0.023 beta1: -0.047 beta2: -0.142 eps:0.000005
## Iter: 9800 logL: 345.25 beta0: 0.024 beta1: -0.047 beta2: -0.141 eps:0.000004
## Iter: 10000 logL: 345.24 beta0: 0.026 beta1: -0.047 beta2: -0.138 eps:0.000011
## Iter: 10200 logL: 345.25 beta0: 0.023 beta1: -0.049 beta2: -0.139 eps:0.000001
## Iter: 10400 logL: 345.25 beta0: 0.023 beta1: -0.049 beta2: -0.139 eps:0.000003
## Iter: 10600 logL: 345.25 beta0: 0.023 beta1: -0.052 beta2: -0.141 eps:0.000004
## Iter: 10800 logL: 345.25 beta0: 0.023 beta1: -0.052 beta2: -0.135 eps:0.000005
## Iter: 11000 logL: 345.24 beta0: 0.025 beta1: -0.051 beta2: -0.135 eps:0.000005
## Iter: 11200 logL: 345.24 beta0: 0.026 beta1: -0.050 beta2: -0.136 eps:0.000005
## Iter: 11400 logL: 345.24 beta0: 0.025 beta1: -0.047 beta2: -0.139 eps:0.000005
## Iter: 11600 logL: 345.25 beta0: 0.024 beta1: -0.043 beta2: -0.141 eps:0.000005
## Iter: 11800 logL: 345.25 beta0: 0.024 beta1: -0.046 beta2: -0.144 eps:0.000000
## Iter: 12000 logL: 345.24 beta0: 0.029 beta1: -0.045 beta2: -0.142 eps:0.000004
## Iter: 12200 logL: 345.24 beta0: 0.031 beta1: -0.042 beta2: -0.138 eps:0.000001
## Iter: 12400 logL: 345.25 beta0: 0.034 beta1: -0.042 beta2: -0.139 eps:0.000004
## Iter: 12600 logL: 345.25 beta0: 0.036 beta1: -0.041 beta2: -0.139 eps:0.000002
## Iter: 12800 logL: 345.25 beta0: 0.037 beta1: -0.041 beta2: -0.140 eps:0.000003
## Iter: 13000 logL: 345.25 beta0: 0.038 beta1: -0.040 beta2: -0.132 eps:0.000017
## Iter: 13200 logL: 345.25 beta0: 0.037 beta1: -0.041 beta2: -0.133 eps:0.000014
## Iter: 13400 logL: 345.25 beta0: 0.036 beta1: -0.042 beta2: -0.130 eps:0.000010
## Iter: 13600 logL: 345.25 beta0: 0.037 beta1: -0.046 beta2: -0.131 eps:0.000000
## Iter: 13800 logL: 345.25 beta0: 0.038 beta1: -0.048 beta2: -0.127 eps:0.000013
## Iter: 14000 logL: 345.25 beta0: 0.040 beta1: -0.045 beta2: -0.129 eps:0.000008
## Iter: 14200 logL: 345.25 beta0: 0.041 beta1: -0.048 beta2: -0.128 eps:0.000016
## Iter: 14400 logL: 345.25 beta0: 0.038 beta1: -0.047 beta2: -0.127 eps:0.000011
## Iter: 14600 logL: 345.25 beta0: 0.040 beta1: -0.049 beta2: -0.130 eps:0.000012
## Iter: 14800 logL: 345.25 beta0: 0.038 beta1: -0.052 beta2: -0.130 eps:0.000003
## Iter: 15000 logL: 345.25 beta0: 0.034 beta1: -0.052 beta2: -0.133 eps:0.000016
## Iter: 15200 logL: 345.25 beta0: 0.037 beta1: -0.055 beta2: -0.135 eps:0.000012
## Iter: 15400 logL: 345.25 beta0: 0.034 beta1: -0.056 beta2: -0.136 eps:0.000010
## Iter: 15600 logL: 345.25 beta0: 0.032 beta1: -0.054 beta2: -0.138 eps:0.000015
## Iter: 15800 logL: 345.24 beta0: 0.030 beta1: -0.050 beta2: -0.141 eps:0.000023
## Iter: 16000 logL: 345.24 beta0: 0.025 beta1: -0.050 beta2: -0.140 eps:0.000006
## Iter: 16200 logL: 345.24 beta0: 0.027 beta1: -0.050 beta2: -0.138 eps:0.000018
## Iter: 16400 logL: 345.25 beta0: 0.028 beta1: -0.048 beta2: -0.145 eps:0.000009
## Iter: 16600 logL: 345.24 beta0: 0.030 beta1: -0.048 beta2: -0.143 eps:0.000006
## Iter: 16800 logL: 345.25 beta0: 0.029 beta1: -0.051 beta2: -0.149 eps:0.000026
## Iter: 17000 logL: 345.26 beta0: 0.031 beta1: -0.053 beta2: -0.151 eps:0.000031
## Iter: 17200 logL: 345.26 beta0: 0.030 beta1: -0.050 beta2: -0.155 eps:0.000052
## Iter: 17400 logL: 345.27 beta0: 0.027 beta1: -0.050 beta2: -0.158 eps:0.000066
## Iter: 17600 logL: 345.28 beta0: 0.029 beta1: -0.051 beta2: -0.161 eps:0.000093
## Iter: 17800 logL: 345.27 beta0: 0.029 beta1: -0.050 beta2: -0.158 eps:0.000051
## Iter: 18000 logL: 345.26 beta0: 0.029 beta1: -0.050 beta2: -0.155 eps:0.000016
## Iter: 18200 logL: 345.26 beta0: 0.029 beta1: -0.050 beta2: -0.155 eps:0.000004
## Iter: 18400 logL: 345.27 beta0: 0.027 beta1: -0.052 beta2: -0.157 eps:0.000005
## Iter: 18600 logL: 345.26 beta0: 0.027 beta1: -0.053 beta2: -0.155 eps:0.000033
## Iter: 18800 logL: 345.27 beta0: 0.027 beta1: -0.056 beta2: -0.155 eps:0.000009
## Iter: 19000 logL: 345.27 beta0: 0.024 beta1: -0.059 beta2: -0.154 eps:0.000027
## Iter: 19200 logL: 345.27 beta0: 0.023 beta1: -0.056 beta2: -0.156 eps:0.000019
## Iter: 19400 logL: 345.27 beta0: 0.023 beta1: -0.059 beta2: -0.151 eps:0.000003
## Iter: 19600 logL: 345.27 beta0: 0.023 beta1: -0.061 beta2: -0.148 eps:0.000002
## Iter: 19800 logL: 345.26 beta0: 0.022 beta1: -0.060 beta2: -0.148 eps:0.000009
## Iter: 20000 logL: 345.25 beta0: 0.024 beta1: -0.056 beta2: -0.144 eps:0.000051
## Iter: 20200 logL: 345.26 beta0: 0.022 beta1: -0.056 beta2: -0.149 eps:0.000029
## Iter: 20400 logL: 345.26 beta0: 0.021 beta1: -0.058 beta2: -0.148 eps:0.000013
## Iter: 20600 logL: 345.27 beta0: 0.017 beta1: -0.055 beta2: -0.150 eps:0.000001
## Iter: 20800 logL: 345.27 beta0: 0.017 beta1: -0.055 beta2: -0.151 eps:0.000012
## Iter: 21000 logL: 345.27 beta0: 0.016 beta1: -0.055 beta2: -0.152 eps:0.000050
## Iter: 21200 logL: 345.26 beta0: 0.019 beta1: -0.054 beta2: -0.148 eps:0.000001
## Iter: 21400 logL: 345.26 beta0: 0.018 beta1: -0.054 beta2: -0.146 eps:0.000008
## Iter: 21600 logL: 345.25 beta0: 0.022 beta1: -0.050 beta2: -0.146 eps:0.000039
## Iter: 21800 logL: 345.26 beta0: 0.020 beta1: -0.048 beta2: -0.149 eps:0.000033
## Iter: 22000 logL: 345.25 beta0: 0.020 beta1: -0.049 beta2: -0.147 eps:0.000050
## Iter: 22200 logL: 345.26 beta0: 0.018 beta1: -0.047 beta2: -0.149 eps:0.000002
## Iter: 22400 logL: 345.26 beta0: 0.020 beta1: -0.049 beta2: -0.148 eps:0.000010
## Iter: 22600 logL: 345.25 beta0: 0.020 beta1: -0.048 beta2: -0.145 eps:0.000000
## Iter: 22800 logL: 345.25 beta0: 0.026 beta1: -0.047 beta2: -0.145 eps:0.000028
## Iter: 23000 logL: 345.25 beta0: 0.026 beta1: -0.048 beta2: -0.151 eps:0.000004
## Iter: 23200 logL: 345.25 beta0: 0.026 beta1: -0.045 beta2: -0.147 eps:0.000032
## Iter: 23400 logL: 345.25 beta0: 0.022 beta1: -0.042 beta2: -0.143 eps:0.000018
## Iter: 23600 logL: 345.25 beta0: 0.024 beta1: -0.039 beta2: -0.137 eps:0.000010
## Iter: 23800 logL: 345.25 beta0: 0.024 beta1: -0.040 beta2: -0.136 eps:0.000005
## Iter: 24000 logL: 345.25 beta0: 0.025 beta1: -0.038 beta2: -0.130 eps:0.000010
## Iter: 24200 logL: 345.25 beta0: 0.025 beta1: -0.041 beta2: -0.134 eps:0.000007
## Iter: 24400 logL: 345.25 beta0: 0.023 beta1: -0.040 beta2: -0.134 eps:0.000001
## Iter: 24600 logL: 345.25 beta0: 0.024 beta1: -0.041 beta2: -0.133 eps:0.000001
## Iter: 24800 logL: 345.25 beta0: 0.025 beta1: -0.040 beta2: -0.134 eps:0.000000
## Iter: 25000 logL: 345.25 beta0: 0.024 beta1: -0.039 beta2: -0.135 eps:0.000008
## Iter: 25200 logL: 345.26 beta0: 0.023 beta1: -0.035 beta2: -0.134 eps:0.000024
## Iter: 25400 logL: 345.25 beta0: 0.026 beta1: -0.034 beta2: -0.132 eps:0.000017
## Iter: 25600 logL: 345.27 beta0: 0.022 beta1: -0.030 beta2: -0.131 eps:0.000055
## Iter: 25800 logL: 345.27 beta0: 0.021 beta1: -0.028 beta2: -0.131 eps:0.000073
## Iter: 26000 logL: 345.28 beta0: 0.019 beta1: -0.028 beta2: -0.133 eps:0.000077
## Iter: 26200 logL: 345.27 beta0: 0.023 beta1: -0.029 beta2: -0.134 eps:0.000034
## Iter: 26400 logL: 345.26 beta0: 0.027 beta1: -0.031 beta2: -0.134 eps:0.000015
## Iter: 26600 logL: 345.26 beta0: 0.028 beta1: -0.032 beta2: -0.134 eps:0.000028
## Iter: 26800 logL: 345.26 beta0: 0.027 beta1: -0.033 beta2: -0.135 eps:0.000050
## Iter: 27000 logL: 345.25 beta0: 0.029 beta1: -0.037 beta2: -0.134 eps:0.000077
## Iter: 27200 logL: 345.25 beta0: 0.033 beta1: -0.038 beta2: -0.137 eps:0.000054
## Iter: 27400 logL: 345.25 beta0: 0.034 beta1: -0.039 beta2: -0.137 eps:0.000037
## Iter: 27600 logL: 345.25 beta0: 0.036 beta1: -0.040 beta2: -0.138 eps:0.000028
## Iter: 27800 logL: 345.24 beta0: 0.034 beta1: -0.043 beta2: -0.137 eps:0.000034
## Iter: 28000 logL: 345.25 beta0: 0.035 beta1: -0.040 beta2: -0.139 eps:0.000004
## Iter: 28200 logL: 345.24 beta0: 0.033 beta1: -0.043 beta2: -0.141 eps:0.000010
## Iter: 28400 logL: 345.25 beta0: 0.035 beta1: -0.044 beta2: -0.143 eps:0.000003
## Iter: 28600 logL: 345.25 beta0: 0.037 beta1: -0.048 beta2: -0.145 eps:0.000003
## Iter: 28800 logL: 345.25 beta0: 0.036 beta1: -0.047 beta2: -0.145 eps:0.000010

To compare with the GLM function’s output again:

round(fit.glm$coefficients, 3)
## (Intercept)          x1          x2 
##       0.030      -0.047      -0.137

As we can see, we arrive at similar results, albeit in a slower and more circuitous manner given the “stochastic” nature of our sampling process.
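
Because the loop stored the negative log-likelihood at every iteration in logL, we can also visualize this circuitous path (a quick sketch, assuming the stochastic gradient descent loop above has just been run):

# plot the stored negative log-likelihood trace; it trends downward but
# wiggles because each update uses only a 10-observation subsample
plot(logL[!is.na(logL)], type = "l",
     xlab = "iteration", ylab = "negative log-likelihood")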

6 Quasi Newton: BFGS

To show how we can maximize our log-likelihood and arrive at similar results using R’s optim function, we make two slight changes in the code below:

  1. We remove the negative sign from the log-likelihood function and its gradient, and

  2. We specify fnscale=-1 inside optim to tell the algorithm to maximize rather than minimize.

LL <- function(b, X, y){  # b = betas
    p <- as.vector(1/(1 + exp(-X %*% b)))  # standard logistic function: 1/(1 + exp(-Xb))
    sum(y*log(p) + (1-y)*log(1-p))  # log-likelihood (no negative sign)
}

gradient <- function(b, X, y){  # note: this overwrites the gradient() used for descent above
  p <- as.vector(1/(1 + exp(-X %*% b)))
  apply((y - p)*X, 2, sum)  # gradient of the log-likelihood above (no negative sign)
}

Next, we incorporate these functions into optim and output its results.

results <- optim(rep(0, ncol(X)), fn=LL, gr=gradient,
                   hessian=T, method='BFGS', X=X, y=y,
                 control=list(trace=1,
                              REPORT=1,
                              fnscale=-1))
## initial  value 346.573590 
## iter   2 value 345.243859
## iter   3 value 345.242765
## iter   4 value 345.242756
## iter   5 value 345.242153
## iter   6 value 345.242004
## iter   6 value 345.242004
## iter   6 value 345.242004
## final  value 345.242004 
## converged
list(coefficients=results$par,var=solve(results$hessian),
       deviance=2*results$value,
       converged=results$convergence==0)
## $coefficients
## [1]  0.03005923 -0.04681860 -0.13721082
## 
## $var
##               [,1]          [,2]          [,3]
## [1,] -8.045266e-03  5.211218e-05 -0.0001299256
## [2,]  5.211218e-05 -7.451854e-03 -0.0004355625
## [3,] -1.299256e-04 -4.355625e-04 -0.0081648738
## 
## $deviance
## [1] -690.484
## 
## $converged
## [1] TRUE
round(results$par, 3)
## [1]  0.030 -0.047 -0.137
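
One caveat about the printed list: because fnscale=-1 makes optim maximize, results$hessian here is the Hessian of the maximized log-likelihood. The usual variance-covariance estimate of the MLE is therefore the inverse of its negation (which is why the diagonal of $var above is negative), and the deviance is \(-2\times\) results$value, i.e. 690.48 rather than -690.48. A quick check of the implied standard errors against glm (reusing fit.glm from above) might look like:

se.bfgs <- sqrt(diag(-solve(results$hessian)))  # sqrt of the corrected variances
se.glm <- summary(fit.glm)$coefficients[, "Std. Error"]  # glm's standard errors
round(cbind(se.bfgs, se.glm), 3)
-2 * results$value  # the (positive) deviance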

To compare with the GLM function’s output again:

round(fit.glm$coefficients, 3)
## (Intercept)          x1          x2 
##       0.030      -0.047      -0.137

As we can see, we can find identical results through either maximization or minimization. Further, BFGS outperforms gradient descent here in terms of speed, converging in a handful of iterations rather than several hundred.
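
The loops above store start = Sys.time() but never report the elapsed time; a small sketch of how that timing comparison could be made explicit (reusing the functions and data defined above):

start <- Sys.time()
results.bfgs <- optim(rep(0, ncol(X)), fn = LL, gr = gradient,
                      method = "BFGS", X = X, y = y,
                      control = list(fnscale = -1))  # same fit as above, without the trace
Sys.time() - start  # wrap the while loops above in the same pattern to compare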

7 Sources