Using a Genetic Algorithm to Minimize an OLS Regression in R

A genetic algorithm allows you to optimize parameters by using an algorithm that mimics biological evolution. It will run through several generations of values trying to find the values that minimizes [or maximizes depending on the algorithm] its fitness or evaluation function, which is just any function that returns a value from the parameters the algorithm is optimizing.

There is a lot of literature on how genetic algorithms work, and I would recommend reading those if you want the technical details on how they work. Genetic algorithms are typically demonstrated by the knapsack algorithm problem [Numb3rs Scene Youtube], where you look to optimize the survival points by seeking the right combination of survival items weighing under a specified amount to fit in a knapsack. This R-bloggers site has a good demonstration of that example and code. However, I find it more interesting to use a genetic algorithm on something more familiar to analytics and statistics, and that’s the ordinary least squares regression (OLS).

OLS minimizes the sum of squared error (SSE) to find the best fit line or regression line for the data set. This is derived using matrix calculus, and it’s computational efficient, easy to understand, and ubiquitous.

Since OLS essentially is an algorithm that uses calculus to minimize SSE, we can use a genetic algorithm to accomplish the same task. R’s GA (genetic algorithm) package allows you to use either binary or real numbers as parameters for the fitness function. Traditionally, genetic algorithms use binary parameters [see the knapsack algorithm], but for this problem, real numbers will be much more useful since the regression coefficients will be real numbers.

The GA algorithm will create a vector of real numbers between -100 and 100, then use that vector to evaluation a regression equation in the fitness function. The fitness function returns the SSE. Since the GA algorithm seeks to maximize the fitness function, the function has a negative sign in front of it, so the lowest absolute SSE will at the maximum if it’s negative. The GA has a population of 500 vectors which are evaluated with the fitness function, and the best solutions are generally kept and children vectors are created, the process is repeated 500 times. The results is a SSE that is very close the OLS solution, and parameter estimates that match up as well.

I’ve included two different linear models. The first has only two variables which play significant roles in the OLS regression, and a second model which has every variable with not all being significant. You can run it a few times and see how the GA solutions differ. The first model’s GA estimates will be a lot closer to the OLS’ estimates than the second model’s.

All of this is rather academic for well-behaved linear regression problems, since GA are computationally expensive taking forever relative to your standard OLS procedure.

The full annotated R code follows:



#install.packages('GA')
library(GA)

#loads an airquality dataframe
data(airquality)

#removes missing data
airquality <- na.omit(airquality) 


#### create a function to evaluate a linear regression
#### takes intercept and the two best variables to compute the predicted y_hat
#### then computes and returns the SSE for each chromosome
#### we will try to minimize the SSE like OLS does

OLS <- function(data, b0, b1, b2){
  
  attach(data, warn.conflicts=F)
  
  Y_hat <- b0  + b1*Wind + b2*Temp
  
  SSE = t(Ozone-Y_hat) %*% (Ozone-Y_hat) #matrix formulation for SSE
  
  detach(data)
  
  return(SSE)
  
}


#### this sets up a real-value GA using 3 parameters all from -100 to 100
#### the parameters use real numbers (so floating decimals) and passes those to
#### the linear regression equation/function
#### the real-value GA requires a min and max
#### this takes a while to run

ga.OLS <- ga(type='real-valued', min=c(-100,-100, -100), 
             max=c(100, 100, 100), popSize=500, maxiter=500, names=c('intercept', 'Wind', 'Temp'),
             keepBest=T, fitness = function(b) -OLS(airquality, b[1],b[2], b[3]))

#### summary of the ga with solution
ga.model <- summary(ga.OLS)
ga.model


#### check against the results against the typical OLS procedure
lm.model <- lm(formula= Ozone ~ Wind + Temp, data=airquality)
summary(lm.model)
lm.model$res %*% lm.model$res ### SSE.lm
-ga.model$fitness ### SSE.ga

lm.model$res %*% lm.model$res + ga.model$fitness  ### difference between OLS and GA's SSE



#### FULL MODEL ####

OLS.FULL <- function(data, b0, b1, b2, b3, b4, b5){
  
  attach(data, warn.conflicts=F)
  
  Y_hat <- b0 + b1*Solar.R + b2*Wind + b3*Temp + b4*Month + b5*Day  # linear regression equation
  
  SSE = t(Ozone-Y_hat) %*% (Ozone-Y_hat) #matrix formulation for SSE
  
  detach(data)
  
  return(SSE)
  
}


#### this sets up a real-value GA using 6 parameters all from -100 to 100
#### the parameters use real numbers (so floating decimals) and passes those to
#### the linear regression equation/function
#### the real-value GA requires a min and max
#### this takes a while to run related to the survival pack
#### this will produce some values that vary a lot from OLS estimates since not all values are significant
#### some estimates should have high standard error
ga.OLS <- ga(type='real-valued', min=c(-100,-100, -100, -100, -100, -100), 
             max=c(100,100, 100, 100, 100, 100), popSize=500, maxiter=500, 
             keepBest=T, fitness = function(b) -OLS.FULL(airquality, b[1],b[2], b[3], b[4], b[5], b[6]))

#### summary of the ga with solution
summary(ga.OLS)

#### check against the results against the typical OLS procedure
summary(lm(formula= Ozone ~ Wind + Temp, data=airquality))