Keywords
We can use least-squares regression to generate a curve that sufficiently describes the relationship between $x$ and $y$. This method minimizes the discrepancy between the data points and the curve produced; thus, it can be useful for objectively illustrating the general trend in data.
Linear Regression
Best Criteria
For a set of paired observations $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, we can express the residual error of a particular straight-line approximation using the formula
$$e = y - a_0 - a_1 x$$
where $a_1$ and $a_0$ are coefficients characterizing the slope and intercept, respectively.¹ Consequently, we can find the line that best fits the given data points using a specific criterion: the line with the smallest sum of the squares of the residual errors is the best fit (least-squares fit). We can determine this sum using the expression
$$S_r = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left(y_i - a_0 - a_1 x_i\right)^2$$
where $n$ is the total number of points.
Least-Squares Fit
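The expression for a straight line is
$$y = a_0 + a_1 x + e$$
however, to obtain the least-squares line, we set the partial derivatives of the sum of the squares of the residuals, $S_r$, with respect to each coefficient to zero: $\partial S_r / \partial a_0 = 0$ and $\partial S_r / \partial a_1 = 0$. Solving the resulting pair of equations gives $a_1$ and $a_0$:
$$a_1 = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - \left(\sum x_i\right)^2}$$
$$a_0 = \bar{y} - a_1 \bar{x}$$
where $\bar{y}$ and $\bar{x}$ are the means of $y$ and $x$.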
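As a minimal sketch, the closed-form slope and intercept can be computed directly from the running sums. The function name `fit_line` and the sample data are illustrative assumptions, not part of the source text.

```python
def fit_line(xs, ys):
    """Return (a0, a1) for the least-squares line y = a0 + a1*x."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)

    # Slope: a1 = (n*Sxy - Sx*Sy) / (n*Sxx - Sx^2)
    a1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    # Intercept: a0 = mean(y) - a1 * mean(x)
    a0 = sum_y / n - a1 * sum_x / n
    return a0, a1


# Illustrative sample data (an assumption for this sketch)
xs = [1, 2, 3, 4, 5, 6, 7]
ys = [0.5, 2.5, 2.0, 4.0, 3.5, 6.0, 5.5]

a0, a1 = fit_line(xs, ys)
print(f"y = {a0:.4f} + {a1:.4f} x")  # prints "y = 0.0714 + 0.8393 x"
```

The denominator is zero when all $x$ values are identical, so a more defensive version would check for that case before dividing.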
Quantification of Error
The least-squares regression provides the best estimate of $a_0$ and $a_1$ when the data meet the following criteria:
- Along the entire range of data, the distance between the regression line and the data points is of similar magnitude.
- The distribution of the points with respect to the line is normal.²
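Furthermore, assuming that these criteria are met, the regression line’s standard deviation can be obtained using
$$s_{y/x} = \sqrt{\frac{S_r}{n - 2}}$$
where $s_{y/x}$ is called the standard error of the estimate and $S_r$ is the sum of the squares of the residuals around the regression line.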
The standard error of the estimate tells us the spread of data around the regression line; in contrast, the standard deviation only tells us the spread of data around the mean. This assists us in assessing the accuracy of our fit, especially when we compare other types of regression methods.
To determine the goodness of our fit:
- Determine $S_t$, the total sum of the squares around the mean for the dependent variable.
  - This is the magnitude of the residual error associated with the dependent variable before regression.
- Solve for the regression line.
- Determine $S_r$, the sum of the squares of the residuals around the regression line.
  - This is the magnitude of the residual error that remains after the regression; it is also referred to as the unexplained sum of the squares.
- Solve for the coefficient of determination, $r^2$.
  - A perfect fit has $S_r = 0$ and $r^2 = 1$.
  - The result indicates the percentage of the original uncertainty explained by the linear model.
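The coefficient of determination can be expressed mathematically as
$$r^2 = \frac{S_t - S_r}{S_t}$$
where its square root, $r$, is referred to as the correlation coefficient. The correlation coefficient can also be computed directly from the data using the formula
$$r = \frac{n \sum x_i y_i - \left(\sum x_i\right)\left(\sum y_i\right)}{\sqrt{n \sum x_i^2 - \left(\sum x_i\right)^2} \, \sqrt{n \sum y_i^2 - \left(\sum y_i\right)^2}}$$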
WARNING
Just because $r$ is close to $1$ does not mean that the fit is necessarily good (as in cases where the relationship between the variables is not even linear).
TIP
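If $s_{y/x} < s_y$, where $s_y = \sqrt{S_t / (n - 1)}$ is the standard deviation of $y$ around its mean, then the linear regression model has merit; otherwise, it does not.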
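The error measures in this section can be sketched in a few lines. The variable names and the sample data below are assumptions chosen for illustration; the fit itself uses the same closed-form slope and intercept as the straight-line least-squares fit.

```python
import math

# Illustrative sample data (an assumption for this sketch)
xs = [1, 2, 3, 4, 5, 6, 7]
ys = [0.5, 2.5, 2.0, 4.0, 3.5, 6.0, 5.5]
n = len(xs)

# Least-squares line y = a0 + a1*x
x_mean = sum(xs) / n
y_mean = sum(ys) / n
a1 = (n * sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys)) / (
    n * sum(x * x for x in xs) - sum(xs) ** 2
)
a0 = y_mean - a1 * x_mean

# S_t: spread of y around its mean (before regression)
s_t = sum((y - y_mean) ** 2 for y in ys)
# S_r: spread of y around the regression line (after regression)
s_r = sum((y - a0 - a1 * x) ** 2 for x, y in zip(xs, ys))

s_y = math.sqrt(s_t / (n - 1))   # standard deviation around the mean
s_yx = math.sqrt(s_r / (n - 2))  # standard error of the estimate
r2 = (s_t - s_r) / s_t           # coefficient of determination
# r carries the sign of the slope
r = math.copysign(math.sqrt(r2), a1)

print(f"S_t = {s_t:.4f}, S_r = {s_r:.4f}")
print(f"s_y = {s_y:.4f}, s_y/x = {s_yx:.4f}")
print(f"r^2 = {r2:.4f}, r = {r:.4f}")
print("model has merit" if s_yx < s_y else "model has no merit")
```

For this data the standard error of the estimate is smaller than the standard deviation around the mean, so the straight line explains part of the original spread.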
Polynomial Regression
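Besides linear regression, we can also apply the least-squares procedure to polynomial regression, which is handy for trends in data that a straight line cannot acceptably represent. For example, we can extend the least-squares procedure to a quadratic polynomial using the following equation:
$$y = a_0 + a_1 x + a_2 x^2 + e$$
To compute the three unknowns (i.e., $a_0$, $a_1$, and $a_2$), we need to solve the following system of three linear equations
$$\begin{aligned}
n \, a_0 + \left(\sum x_i\right) a_1 + \left(\sum x_i^2\right) a_2 &= \sum y_i \\
\left(\sum x_i\right) a_0 + \left(\sum x_i^2\right) a_1 + \left(\sum x_i^3\right) a_2 &= \sum x_i y_i \\
\left(\sum x_i^2\right) a_0 + \left(\sum x_i^3\right) a_1 + \left(\sum x_i^4\right) a_2 &= \sum x_i^2 y_i
\end{aligned}$$
For an $m$th-order polynomial
$$y = a_0 + a_1 x + a_2 x^2 + \cdots + a_m x^m + e$$
we can express its standard error of the estimate as
$$s_{y/x} = \sqrt{\frac{S_r}{n - (m + 1)}}$$
Similar to linear regression, the coefficient of determination is
$$r^2 = \frac{S_t - S_r}{S_t}$$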
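A minimal sketch of polynomial regression follows, assuming the normal equations are assembled from power sums and solved with a small Gaussian-elimination helper. The helper names `solve_linear_system` and `fit_polynomial` and the sample data are assumptions for illustration; in practice a linear-algebra library would usually replace the hand-written solver.

```python
import math


def solve_linear_system(a, b):
    """Solve a*z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are left untouched.
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    # Back substitution
    z = [0.0] * n
    for row in range(n - 1, -1, -1):
        known = sum(aug[row][k] * z[k] for k in range(row + 1, n))
        z[row] = (aug[row][n] - known) / aug[row][row]
    return z


def fit_polynomial(xs, ys, m):
    """Return [a0, ..., am] for the least-squares polynomial of order m."""
    # Normal equations: entry (j, k) is sum(x^(j+k)); right side is sum(x^j * y).
    normal = [[sum(x ** (j + k) for x in xs) for k in range(m + 1)]
              for j in range(m + 1)]
    rhs = [sum((x ** j) * y for x, y in zip(xs, ys)) for j in range(m + 1)]
    return solve_linear_system(normal, rhs)


# Illustrative sample data (an assumption for this sketch)
xs = [0, 1, 2, 3, 4, 5]
ys = [2.1, 7.7, 13.6, 27.2, 40.9, 61.1]
m = 2
n = len(xs)

coeffs = fit_polynomial(xs, ys, m)
predict = lambda x: sum(a * x ** j for j, a in enumerate(coeffs))

y_mean = sum(ys) / n
s_t = sum((y - y_mean) ** 2 for y in ys)
s_r = sum((y - predict(x)) ** 2 for x, y in zip(xs, ys))
s_yx = math.sqrt(s_r / (n - (m + 1)))
r2 = (s_t - s_r) / s_t

print("coefficients:", [round(a, 5) for a in coeffs])
print(f"s_y/x = {s_yx:.4f}, r^2 = {r2:.5f}")
```

The normal equations become ill-conditioned as the order grows, so this direct approach is best kept to low-order polynomials.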
Multiple Linear Regression
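In a case where there are two independent variables and only a single dependent variable, the regression line becomes a regression plane. For example, consider the following function:
$$y = a_0 + a_1 x_1 + a_2 x_2 + e$$
As with previous cases, we can obtain the best values of the coefficients by minimizing the sum of the squares of the residuals,
$$S_r = \sum_{i=1}^{n} \left(y_i - a_0 - a_1 x_{1,i} - a_2 x_{2,i}\right)^2$$
To obtain the coefficients for the minimum sum of the squares of the residuals, we can use the following equation
$$\begin{bmatrix}
n & \sum x_{1,i} & \sum x_{2,i} \\
\sum x_{1,i} & \sum x_{1,i}^2 & \sum x_{1,i} x_{2,i} \\
\sum x_{2,i} & \sum x_{1,i} x_{2,i} & \sum x_{2,i}^2
\end{bmatrix}
\begin{bmatrix} a_0 \\ a_1 \\ a_2 \end{bmatrix}
=
\begin{bmatrix} \sum y_i \\ \sum x_{1,i} y_i \\ \sum x_{2,i} y_i \end{bmatrix}$$
Multiple linear regression is not limited to two variables; it can be extended to $m$ dimensions, as in
$$y = a_0 + a_1 x_1 + a_2 x_2 + \cdots + a_m x_m + e$$
where the standard error is
$$s_{y/x} = \sqrt{\frac{S_r}{n - (m + 1)}}$$
and the coefficient of determination is
$$r^2 = \frac{S_t - S_r}{S_t}$$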
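The same normal-equation idea can be sketched for multiple linear regression. The helper names `solve_linear_system` and `fit_multiple` are assumptions for illustration, and the sample data was constructed to lie exactly on the plane $y = 5 + 4x_1 - 3x_2$ so that the recovered coefficients are easy to check.

```python
def solve_linear_system(a, b):
    """Solve a*z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    z = [0.0] * n
    for row in range(n - 1, -1, -1):
        known = sum(aug[row][k] * z[k] for k in range(row + 1, n))
        z[row] = (aug[row][n] - known) / aug[row][row]
    return z


def fit_multiple(rows, ys):
    """Return [a0, a1, ..., am] for y = a0 + a1*x1 + ... + am*xm.

    rows holds one tuple (x1, ..., xm) per observation.
    """
    # Prepend a constant 1 to each observation so a0 acts as the intercept.
    design = [[1.0] + list(r) for r in rows]
    p = len(design[0])
    # Normal equations: sums of pairwise products of the design columns.
    normal = [[sum(z[j] * z[k] for z in design) for k in range(p)]
              for j in range(p)]
    rhs = [sum(z[j] * y for z, y in zip(design, ys)) for j in range(p)]
    return solve_linear_system(normal, rhs)


# Illustrative sample data lying exactly on y = 5 + 4*x1 - 3*x2 (an assumption)
rows = [(0, 0), (2, 1), (2.5, 2), (1, 3), (4, 6), (7, 2)]
ys = [5, 10, 9, 0, 3, 27]

coeffs = fit_multiple(rows, ys)
residuals = [y - sum(a * z for a, z in zip(coeffs, [1.0] + list(r)))
             for r, y in zip(rows, ys)]
s_r = sum(e * e for e in residuals)

print("coefficients:", [round(a, 6) for a in coeffs])
print(f"S_r = {s_r:.3e}")
```

Because the data is noise-free, $S_r$ is zero up to rounding error; with real measurements it would be positive and would feed into the standard error and $r^2$ formulas.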
Nonlinear Regression
TODO
Gauss-Newton Method
TODO
Sources
- Numerical Methods for Engineers by Steven Chapra and Raymond Canale (Chapter 17)
Footnotes
- Residual error is the difference between the true value and the approximated value.