Many statistical techniques involve optimization. The path from a set of data to a statistical estimate often runs through a patch of code whose purpose is to find the minimum (or maximum) of a function. Likelihood-based methods (such as structural equation modeling or logistic regression) and least squares estimates all depend on optimizers for their estimates and for certain goodness-of-fit tests. Base R offers the optim function for general-purpose optimization. In this conversation with John Nash, author and maintainer of optim and the newer optimx, learn about the pitfalls of optimization and some of the tools that R offers.
Statistics is nothing if not an exercise in optimization, beginning with the sample average and moving on from there. The average of a group of numbers minimizes the sum of squared deviations. In other words, the average minimizes the sum of terms like (x_i-avg)^2. The median minimizes the sum of absolute deviations—terms like |x_i-med|.
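The claim about the average and the median can be checked numerically. The following is a minimal sketch in Python, not a recommended estimation method: the data values are made up, and `grid_minimizer` is a hypothetical helper that simply tries many candidate values and keeps the one with the smallest loss.

```python
from statistics import mean, median

def sum_squared(data, c):
    """Sum of squared deviations of the data from a candidate center c."""
    return sum((x - c) ** 2 for x in data)

def sum_absolute(data, c):
    """Sum of absolute deviations of the data from a candidate center c."""
    return sum(abs(x - c) for x in data)

def grid_minimizer(loss, data, lo, hi, steps=10000):
    """Brute-force search over an evenly spaced grid (crude, but needs no calculus)."""
    candidates = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    return min(candidates, key=lambda c: loss(data, c))

# Made-up data for illustration only.
data = [2.0, 3.0, 5.0, 7.0, 20.0]

best_sq = grid_minimizer(sum_squared, data, min(data), max(data))
best_abs = grid_minimizer(sum_absolute, data, min(data), max(data))

print("squared-loss minimizer: ", best_sq, " sample mean:  ", mean(data))
print("absolute-loss minimizer:", best_abs, " sample median:", median(data))
```

Up to the resolution of the grid, the first minimizer should agree with the mean and the second with the median. Note how the single large value pulls the mean upward but leaves the median alone.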
The underlying strategy for most statistical reasoning is:
1. Write down a probability model that should account for the data.
2. This model will contain some unknown constants, or parameters.
3. Collect the data.
4. Find values of the parameters that best account for the data.
That last point contains the optimization. Maximum likelihood is an optimization procedure that selects the most plausible parameter values for the data you got. Parameters can be estimated in a number of ways, but all of them involve an optimization.
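The four steps above can be sketched in a few lines of Python. Everything specific here is an assumption made for illustration: the model (exponentially distributed waiting times with an unknown rate), the made-up data, and the choice of a golden-section search as the optimizer. The exponential model happens to have a closed-form answer, which makes it easy to check the numerical result.

```python
import math

def log_likelihood(rate, data):
    """Log-likelihood of an exponential model with the given rate parameter."""
    return len(data) * math.log(rate) - rate * sum(data)

def golden_section_max(f, lo, hi, tol=1e-8):
    """Maximize a single-peaked function f on [lo, hi] by golden-section search."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if f(c) > f(d):
            b = d   # the peak lies in [a, d]
        else:
            a = c   # the peak lies in [c, b]
    return (a + b) / 2.0

# Step 3: "collect" the data (made-up waiting times, for illustration only).
data = [0.8, 1.9, 0.4, 2.7, 1.2, 0.6, 3.1, 1.5]

# Step 4: find the rate that makes the observed data most plausible.
rate_hat = golden_section_max(lambda r: log_likelihood(r, data), 0.01, 10.0)

# For this particular model the answer is also available in closed form.
closed_form = len(data) / sum(data)

print("numerical estimate:  ", rate_hat)
print("closed-form estimate:", closed_form)
```

The search interval (0.01 to 10) is itself an assumption. A poorly chosen interval or starting point is one of the ways a numerical optimizer can quietly return the wrong answer.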
Interestingly, you can optimize most of the models taught in a beginning statistics course with a closed-form expression. Mathematically, finding the mean and variance of normal data, estimating a proportion, fitting a regression line, and modeling treatment and block effects in experimental design are all optimization problems. Their solutions emerge as the solution of a system of linear equations. The statistician hands over the problem to a software package, trusting that the linear algebra algorithm will produce stable estimates of the parameters.
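As an illustration of a closed-form solution, here is a minimal Python sketch of fitting a regression line. It solves the two normal equations directly, with no iterative search. The function name `fit_line` and the data are made up for this example.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope from the closed-form normal equations."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

def sum_squared_errors(xs, ys, intercept, slope):
    """The quantity that least squares minimizes."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

# Made-up data for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, slope = fit_line(xs, ys)
print("intercept:", intercept, "slope:", slope)
print("sum of squared errors:", sum_squared_errors(xs, ys, intercept, slope))
```

No starting values, tolerances, or iteration limits appear anywhere in this code. That is what a closed-form solution buys you, although production software typically uses more numerically stable linear algebra than these textbook formulas.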
For just about any model outside of the linear models I mentioned, closed-form solutions do not exist. You need not go far to find practical examples. Consider a bank that wants to predict whether prospective customers will default on their mortgages. The bank's historical data show the default status (a binary outcome coded Yes-No) of numerous customers, together with personal and financial information, such as employment status, years with current employer, salary, and marital status. Estimating the contribution of each of these pieces of information boils down to estimating the parameters of a logistic equation. No closed-form expression exists for the best parameters of this model, so a numerical procedure is required to maximize the likelihood and thereby estimate the parameters.
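To make the contrast with the linear case concrete, here is a minimal Python sketch of one common numerical procedure for logistic regression, Newton's method applied to the log-likelihood. It is an illustration under stated assumptions, not how any particular package fits the model: there is a single predictor (years with current employer), the data are made up, and the function `fit_logistic` is hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, max_iter=50, tol=1e-10):
    """Newton's method for a logistic model with an intercept and one slope."""
    b0, b1 = 0.0, 0.0   # starting values (an assumption; poor starts can fail)
    for iteration in range(1, max_iter + 1):
        ps = [sigmoid(b0 + b1 * x) for x in xs]
        # Gradient of the log-likelihood.
        g0 = sum(y - p for y, p in zip(ys, ps))
        g1 = sum((y - p) * x for x, y, p in zip(xs, ys, ps))
        # Negative Hessian (the observed information matrix), a 2x2 matrix.
        ws = [p * (1.0 - p) for p in ps]
        h00 = sum(ws)
        h01 = sum(w * x for w, x in zip(ws, xs))
        h11 = sum(w * x * x for w, x in zip(ws, xs))
        det = h00 * h11 - h01 * h01
        # Solve the 2x2 linear system (information matrix) * step = gradient.
        step0 = (h11 * g0 - h01 * g1) / det
        step1 = (h00 * g1 - h01 * g0) / det
        b0 += step0
        b1 += step1
        if abs(step0) + abs(step1) < tol:
            return b0, b1, iteration
    raise RuntimeError("Newton iterations did not converge")

# Made-up data: years with current employer, and default status (1 = Yes, 0 = No).
years = [0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 5.0, 6.0, 8.0, 10.0]
default = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0]

b0, b1, n_iter = fit_logistic(years, default)
print("intercept:", b0, "slope:", b1, "iterations:", n_iter)
```

Unlike the closed-form case, this code needs starting values, a convergence tolerance, and an iteration limit, and it can fail outright. For example, if the outcomes were perfectly separated by the predictor, the likelihood would have no finite maximum and the iterations would not settle down.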
The same is true of other generalized linear models, structural equation models, and many other models that are used in modern statistics. Most of the heavy lifting in these problems, and the software that delivers the results, relies on a numerical optimization algorithm.
The usual graduate program in statistics, even at a good school, teaches you a lot about the theoretical properties of these estimates. And the theory, to be sure, paints an optimistic picture. It shows that under certain conditions (which most people forget with time), and given a large enough sample size, the solution to the optimization problem (the maximum likelihood estimate) is the solution to the estimation problem. Solving the maximum likelihood problem gives you the estimates needed to complete the specification of the model. Even better, the estimates behave nicely (they are approximately normally distributed). You can even estimate how accurate they are from the curvature of the likelihood function near its maximum. At least, if the sample size is large enough, these nice properties will hold.
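The link between curvature and accuracy can be illustrated with a small Python sketch. It assumes a simple model that is not taken from the text above (a proportion estimated from made-up counts of successes and trials) and approximates the second derivative of the log-likelihood with a finite difference. The standard error is then taken as the square root of the negative inverse curvature at the maximum.

```python
import math

def log_likelihood(p, successes, trials):
    """Binomial log-likelihood for a proportion p (constant term omitted)."""
    return successes * math.log(p) + (trials - successes) * math.log(1.0 - p)

def second_derivative(f, x, h=1e-4):
    """Central finite-difference approximation to the second derivative of f at x."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / (h * h)

# Made-up counts for illustration only.
successes, trials = 30, 100

# For this model the maximum likelihood estimate is known in closed form.
p_hat = successes / trials

# Curvature of the log-likelihood at its maximum (a negative number).
curvature = second_derivative(
    lambda p: log_likelihood(p, successes, trials), p_hat
)

# A sharper peak (more negative curvature) means a smaller standard error.
se_from_curvature = math.sqrt(-1.0 / curvature)

# Textbook formula for the standard error of a proportion, for comparison.
se_textbook = math.sqrt(p_hat * (1.0 - p_hat) / trials)

print("standard error from curvature:", se_from_curvature)
print("standard error from formula:  ", se_textbook)
```

The two numbers should agree closely here. In models without a closed form, the curvature has to be evaluated at whatever point the optimizer reports, so a poorly converged optimum yields misleading standard errors as well as misleading estimates.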
You then hand the problem over to an optimization algorithm, confident that your work is done. But is that confidence well placed?