Assessing the Quality of Regression Models
See Also: Linear & Polynomial Regression, Multiple Linear Regression, Nonlinear Regression
The statistical indicators and plots provided by the Polymath program can be used to assess the quality of regression models and to compare models with one another. A practical guide to using these indicators and plots follows:
Graph: This plots the calculated and measured values of the dependent variable. When the two sets of values show different trends, the model is usually inappropriate. If the differences between the measured and calculated points are large but show no clear trend, the data may be very noisy (excessive experimental error) and cannot be modeled accurately.
Residual plot: The residual plot shows the difference between the calculated and measured values of the dependent variable as a function of the measured values. If the regression model represents the data correctly, the residuals should be randomly distributed around the line err=0 with zero mean. If the residuals show a clear trend, an inappropriate model is being used (for example, a straight line where a polynomial is needed). If the residual (error) at one point is much larger, in absolute value, than at the other points, this may be an indication of an outlier, which can be removed under certain circumstances.
Confidence intervals: For the regression model to be stable and statistically valid, the confidence intervals must be much smaller (or at least smaller) than the respective parameter values (in absolute values). An unstable model may yield very inaccurate derivative values and absurd results for even a small range of extrapolation. For an unstable model, a small change in the data (by adding or removing a data point, for example) may lead to large changes of the parameter values.
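The confidence-interval criterion above can be sketched as a simple check. This is a minimal illustration, not a Polymath feature: the function name `unstable_parameters` is hypothetical, and it assumes the "95% confidence" column of the report is the half-width of the interval. The numbers are copied from the Example 2 report further down.

```python
def unstable_parameters(params):
    """Return the names of parameters whose 95% confidence half-width
    is not smaller than the absolute value of the parameter."""
    return [name for name, (value, ci) in params.items() if ci >= abs(value)]

# (value, 95% confidence) pairs from the Example 2 report
# (3rd order polynomial including a free parameter).
example2 = {
    "a0": (-4.710354, 10.32547),
    "a1": (0.3273543, 0.2104189),
    "a2": (-0.002356, 0.0014169),
    "a3": (5.887e-06, 3.153e-06),
}

flagged = unstable_parameters(example2)
print(flagged)
```

Only the free parameter a0 fails the criterion here, which matches the discussion in Example 2.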
R^2 and R^2adj: The correlation coefficients are frequently used to judge whether the model represents the data correctly, on the assumption that a correlation coefficient close to one means the regression model is correct. There are, however, many examples where the correlation coefficient is close to one but the model is still not appropriate. The residual plot should be used for judging the appropriateness of the model, while the correlation coefficients can be used for comparing various models representing the same dependent variable.
In the following formulas, n is the number of scores (or observations) and yi is a specific observation. The notation "obs" relates to observed data and the notation "calc" relates to calculated data.
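For reference, the commonly used definitions of the two correlation coefficients are sketched below in the notation just introduced. The symbol p (number of fitted parameters) and the mean of the observations are assumptions added here. The R^2adj expression reproduces the values in the reports of the Examples section, but the R^2 expression for "Through origin" regressions may be defined differently in Polymath.

```latex
R^2 = 1 - \frac{\sum_{i=1}^{n} \left( y_{i,\mathrm{obs}} - y_{i,\mathrm{calc}} \right)^2}
               {\sum_{i=1}^{n} \left( y_{i,\mathrm{obs}} - \bar{y}_{\mathrm{obs}} \right)^2},
\qquad
\bar{y}_{\mathrm{obs}} = \frac{1}{n} \sum_{i=1}^{n} y_{i,\mathrm{obs}}

R^2_{\mathrm{adj}} = 1 - \frac{\left( 1 - R^2 \right) \left( n - 1 \right)}{n - p}
```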
Variance and Rmsd: Just like the correlation coefficients, these two indicators are recommended to be used for comparing various models representing the same dependent variable. A model with smaller variance and Rmsd represents the data more accurately than a model with larger values of these indicators.
The following equations describe the Variance (s^2), Standard Deviation (s), and Chi-Square, respectively:
Reference: Selection of the most appropriate regression model is discussed in detail by Shacham et al. (1996b).
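The relationship between Rmsd and the variance can be checked numerically against the reports in the Examples section. The sketch below assumes two definitions that are inferred from the reported numbers rather than stated in the text: Rmsd = sqrt(SS)/n and s^2 = SS/(n - p), where SS is the sum of squared differences between observed and calculated values and p is the number of fitted parameters.

```python
# (n, p, Rmsd, Variance) copied from the reports in the Examples section.
reports = {
    "Example 1": (10, 2, 0.024644, 0.0075916),
    "Example 2, free parameter": (18, 4, 0.007293, 0.0012309),
    "Example 2, through origin": (18, 3, 0.0075383, 0.0012274),
    "Example 3, free parameter": (13, 5, 0.5322918, 5.985442),
    "Example 3, through origin": (13, 4, 0.5568439, 5.822523),
}

def variance_from_rmsd(n, p, rmsd):
    ss = (n * rmsd) ** 2   # assumed definition: Rmsd = sqrt(SS) / n
    return ss / (n - p)    # assumed definition: s^2 = SS / (n - p)

# Relative difference between the recomputed and the reported variance.
rel_errors = {
    name: abs(variance_from_rmsd(n, p, rmsd) - var) / var
    for name, (n, p, rmsd, var) in reports.items()
}
for name, err in rel_errors.items():
    print(f"{name}: relative difference {err:.1e}")
```

Under these assumptions the recomputed variances agree with the reported ones to within the rounding of the printed Rmsd values.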
Examples
Example 1: Insufficient Number of Terms/parameters in the Model
Note that this problem is Example 2 under the Examples drop-down menu in the Data Table window.
| TC | P | TK | logP | Trec | logT | T2 |
| -36.7 | 1 | 236.45 | 0 | 0.004229 | 2.373739 | 55908.6 |
| -19.6 | 5 | 253.55 | 0.69897 | 0.003944 | 2.404064 | 64287.6 |
| -11.5 | 10 | 261.65 | 1 | 0.003822 | 2.417721 | 68460.72 |
| -2.6 | 20 | 270.55 | 1.30103 | 0.003696 | 2.432248 | 73197.3 |
| 7.6 | 40 | 280.75 | 1.60206 | 0.003562 | 2.44832 | 78820.56 |
| 15.4 | 60 | 288.55 | 1.778151 | 0.003466 | 2.460221 | 83261.1 |
| 26.1 | 100 | 299.25 | 2 | 0.003342 | 2.476034 | 89550.56 |
| 42.2 | 200 | 315.35 | 2.30103 | 0.003171 | 2.498793 | 99445.62 |
| 60.6 | 400 | 333.75 | 2.60206 | 0.002996 | 2.523421 | 1.11E+05 |
| 80.1 | 760 | 353.25 | 2.880814 | 0.002831 | 2.548082 | 1.25E+05 |
In this example, a model with logP as the dependent variable and Trec as the independent variable (Clapeyron's equation) is fitted to data that have been transformed in the Data Table prior to the Linear Regression. The model also includes a free parameter. Some of the results appearing on the "Report" display are shown below.
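The transformed columns of the table can be reproduced from TC and P. The sketch below infers the transformations from the tabulated values (they are not stated in the text): TK = TC + 273.15, logP and logT are base-10 logarithms, Trec = 1/TK, and T2 = TK^2.

```python
import math

# Measured columns from the data table.
TC = [-36.7, -19.6, -11.5, -2.6, 7.6, 15.4, 26.1, 42.2, 60.6, 80.1]
P = [1, 5, 10, 20, 40, 60, 100, 200, 400, 760]

rows = []
for tc, p in zip(TC, P):
    tk = tc + 273.15                 # assumed Celsius-to-Kelvin offset
    rows.append({
        "TC": tc,
        "P": p,
        "TK": tk,
        "logP": math.log10(p),       # base-10 logarithm of P
        "Trec": 1.0 / tk,            # reciprocal absolute temperature
        "logT": math.log10(tk),      # base-10 logarithm of TK
        "T2": tk ** 2,               # square of TK
    })

for row in rows:
    print(" | ".join(f"{row[key]:.7g}" for key in row))
```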
Model: logP = a0 + a1*Trec
| Variable | Value | 95% confidence |
| a0 | 8.752017 | 0.5423357 |
| a1 | -2035.333 | 153.6285 |
General
Regression including a free parameter
Number of observations = 10
Statistics
| R^2 | R^2adj | Rmsd | Variance |
| 0.9915016 | 0.9904393 | 0.024644 | 0.0075916 |
The "Report" above shows that both R^2 and R^2adj are very close to one, the variance is small, and the 95% confidence intervals are much smaller than the parameter values. These results give the (false) impression that the data are represented correctly by the linear equation involving the transformed variables. The residual plot (shown below), however, shows a clear curvature, indicating that this model is insufficient for precise representation of the data.
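The curvature in the residuals can be reproduced outside Polymath. The sketch below uses NumPy's `polyfit` as a stand-in for the Polymath linear regression and recomputes Trec and logP from the TC and P columns (assuming TK = TC + 273.15 and base-10 logarithms). The residual is taken as observed minus calculated, which is a sign convention chosen here.

```python
import numpy as np

# Measured columns from the data table.
TC = np.array([-36.7, -19.6, -11.5, -2.6, 7.6, 15.4, 26.1, 42.2, 60.6, 80.1])
P = np.array([1, 5, 10, 20, 40, 60, 100, 200, 400, 760], dtype=float)

Trec = 1.0 / (TC + 273.15)   # assumed Celsius-to-Kelvin offset
logP = np.log10(P)

# Straight-line least-squares fit: logP = a0 + a1*Trec.
a1, a0 = np.polyfit(Trec, logP, 1)
residuals = logP - (a0 + a1 * Trec)   # observed minus calculated

print(f"a0 = {a0:.6f}, a1 = {a1:.3f}")
for t, r in zip(TC, residuals):
    print(f"TC = {t:6.1f}  residual = {r:+.4f}")
```

The residuals keep one sign at both ends of the temperature range and the opposite sign in the middle. This is the systematic trend that R^2 alone does not reveal.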

Better results are obtained when more of the transformed variables are utilized in a multiple linear regression.
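As one possible illustration, the sketch below adds logT as a second independent variable and compares the sum of squared residuals with that of the straight-line model. The choice of logT is an assumption made here; the text does not say which of the transformed variables to add. NumPy's `lstsq` stands in for the Polymath multiple linear regression, and the variance is computed as SS/(n - p), also an assumption.

```python
import numpy as np

TC = np.array([-36.7, -19.6, -11.5, -2.6, 7.6, 15.4, 26.1, 42.2, 60.6, 80.1])
P = np.array([1, 5, 10, 20, 40, 60, 100, 200, 400, 760], dtype=float)

TK = TC + 273.15             # assumed Celsius-to-Kelvin offset
logP = np.log10(P)

def fit(columns):
    """Least-squares fit of logP on a free parameter plus the given columns.
    Returns the coefficients, the sum of squared residuals, and SS/(n - p)."""
    X = np.column_stack([np.ones_like(TK)] + columns)
    coef, *_ = np.linalg.lstsq(X, logP, rcond=None)
    res = logP - X @ coef
    ss = float(res @ res)
    return coef, ss, ss / (len(logP) - X.shape[1])

_, ss_lin, var_lin = fit([1.0 / TK])                    # logP = a0 + a1*Trec
_, ss_multi, var_multi = fit([1.0 / TK, np.log10(TK)])  # adds a2*logT

print(f"straight line: SS = {ss_lin:.3e}, variance = {var_lin:.3e}")
print(f"with logT:     SS = {ss_multi:.3e}, variance = {var_multi:.3e}")
```

Whether the extra term is justified should still be judged from the residual plot and the confidence intervals, as described in the guide above.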
Example 2: Excessive Number of Terms/parameters in a Polynomial Model
Note that this problem is Example 3 - Heat Capacity under the Examples drop-down menu in the Data Table window.

In this example, a 3rd order polynomial with Cp as the dependent variable and T as the independent variable is fitted to the data. The model also includes a free parameter. Some of the results appearing on the "Report" screen are shown below.
Model: Cp = a0 + a1*T + a2*T^2 + a3*T^3
| Variable | Value | 95% confidence |
| a0 | -4.710354 | 10.32547 |
| a1 | 0.3273543 | 0.2104189 |
| a2 | -0.002356 | 0.0014169 |
| a3 | 5.887E-06 | 3.153E-06 |
General
Degree of polynomial = 3
Regression including a free parameter
Number of observations = 18
Statistics
| R^2 | R^2adj | Rmsd | Variance |
| 0.9951909 | 0.9941604 | 0.007293 | 0.0012309 |
For this example, both R^2 and R^2adj are very close to one and the variance is small, but the 95% confidence interval for the free parameter a0 is more than twice as large as the absolute value of the parameter itself. This may indicate that the free parameter is not needed. Carrying out the regression again with the "Through origin" option marked yields the following results:
Model: Cp = a1*T + a2*T^2 + a3*T^3
| Variable | Value | 95% confidence |
| a1 | 0.2314379 | 0.0082055 |
| a2 | -0.0017116 | 0.0001102 |
| a3 | 4.459E-06 | 3.64E-07 |
General
Degree of polynomial = 3
Regression not including a free parameter
Number of observations = 18
Statistics
| R^2 | R^2adj | Rmsd | Variance |
| 0.994862 | 0.994177 | 0.0075383 | 0.0012274 |
Removing the free parameter has very little effect on R^2 and the variance, but now all the 95% confidence intervals are much smaller than the parameter values. This indicates that the 3rd order polynomial without a free parameter is a more statistically valid and stable representation of the data.
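The two reports of this example can be compared through R^2adj, which accounts for the number of fitted parameters. The sketch below assumes the usual adjustment R^2adj = 1 - (1 - R^2)(n - 1)/(n - p), where p is the number of parameters (4 with the free parameter, 3 without). The formula is an assumption, but it reproduces the reported values.

```python
def r2_adjusted(r2, n, p):
    """Adjusted R^2 for n observations and p fitted parameters
    (assumed formula; see the note above)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p)

n = 18
with_a0 = r2_adjusted(0.9951909, n, 4)        # a0, a1, a2, a3
through_origin = r2_adjusted(0.994862, n, 3)  # a1, a2, a3

print(f"with free parameter: {with_a0:.7f}")
print(f"through origin:      {through_origin:.7f}")
```

Although R^2 drops slightly when a0 is removed, R^2adj does not, because the model uses one fewer parameter.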
Example 3: Excessive Number of Terms/parameters in Multiple Linear Regression
Note that this problem is Example 4 - Heat of hardening under the Examples drop-down menu in the Data Table window.

This example illustrates a multiple linear regression with hard_heat as the dependent variable and Wpc1, Wpc2, Wpc3, and Wpc4 as independent variables. The model also includes a free parameter. Some of the results appearing on the "Report" screen are shown below.
Model: hard_heat = a0 + a1*Wpc1 + a2*Wpc2 + a3*Wpc3 + a4*Wpc4
| Variable | Value | 95% confidence |
| a0 | 60.89893 | 161.6172 |
| a1 | 1.562729 | 1.717796 |
| a2 | 0.5265028 | 1.669402 |
| a3 | 0.1125452 | 1.740721 |
| a4 | -0.1266181 | 1.635414 |
General
Number of independent variables = 4
Regression including a free parameter
Number of observations = 13
Statistics
| R^2 | R^2adj | Rmsd | Variance |
| 0.9823245 | 0.9734867 | 0.5322918 | 5.985442 |
This regression produces R^2 and R^2adj values that are both close to one, but the fairly large value of the variance indicates that the data are noisy. The 95% confidence intervals for all the parameters are larger than the parameter values themselves. A trend can also be seen in the residual plot (see below).

In this case, physical considerations indicate that the data must be modeled with a multiple linear regression model that does not include a free parameter. Repeating the regression with the "Through origin" option marked yields the following results:
Model: hard_heat = a1*Wpc1 + a2*Wpc2 + a3*Wpc3 + a4*Wpc4
| Variable | Value | 95% confidence |
| a1 | 2.189177 | 0.4182687 |
| a2 | 1.154136 | 0.1082325 |
| a3 | 0.7532949 | 0.3601112 |
| a4 | 0.4885452 | 0.093483 |
General
Number of independent variables = 4
Regression not including a free parameter
Number of observations = 13
Statistics
| R^2 | R^2adj | Rmsd | Variance |
| 0.9806563 | 0.9742084 | 0.5568439 | 5.822523 |
The elimination of the free parameter has very little effect on R^2 and the variance, but now all the 95% confidence intervals are considerably smaller than the parameter values. This indicates that the linear model without a free parameter is a statistically valid and stable representation of the data. It is interesting to compare the parameter values for the cases with and without the free parameter. It can be seen that the differences are very substantial, and the derivative of "hard_heat" with respect to Wpc4 (the value of parameter a4) may change from positive to negative if the inappropriate model is used. The error distribution when using the statistically valid model is fairly random, as shown in the residual plot below:
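The comparison of the two parameter sets can be tabulated with a few lines of code. This is only an illustration using the values from the two reports above; the variable names are hypothetical.

```python
# Parameter values copied from the two Example 3 reports.
with_free_parameter = {"a1": 1.562729, "a2": 0.5265028,
                       "a3": 0.1125452, "a4": -0.1266181}
through_origin = {"a1": 2.189177, "a2": 1.154136,
                  "a3": 0.7532949, "a4": 0.4885452}

# Parameters whose sign differs between the two models.
sign_flips = [name for name in with_free_parameter
              if with_free_parameter[name] * through_origin[name] < 0]

for name in with_free_parameter:
    old, new = with_free_parameter[name], through_origin[name]
    print(f"{name}: {old:+.4f} -> {new:+.4f} (difference {new - old:+.4f})")
print("sign changes:", sign_flips)
```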
