Variable selection in multiple linear regression: The influence of individual cases

The influence of individual cases in a data set is studied when variable selection is applied in multiple linear regression. Two different influence measures, based on the Cp criterion and Akaike’s information criterion, are introduced. The relative change in the selection criterion when an individual case is omitted is proposed as the selection influence of the specific omitted case. Four standard examples from the literature are considered and the selection influence of the cases is calculated. It is argued that the selection procedure may be improved by taking the selection influence of individual data cases into account.


Introduction
The literature on statistical variable selection, frequently one of the first steps in a multiple linear regression analysis, is considerable. Recent references are Burnham & Anderson (2002) and Murtaugh (1998). The purpose of regression variable selection is to reduce the predictors to some "optimal" subset of the available regressors. This may be important because (i) a smaller set of predictors may provide more accurate predictions of future cases, or (ii) it may be important to identify those predictor variables significantly influencing the response (for example, in a clinical trial). There are many techniques which are routinely used for variable selection in multiple linear regression: stepwise routines such as forward selection (Miller, 2002), Bayesian techniques (an extensive list of references is given in Burnham & Anderson, 1998, p. 127), cross-validation selection techniques (Liu et al., 1999), an all possible subsets approach based, for example, on Mallows' Cp criterion (Mallows, 1973, 1995), or Breiman's little bootstrap (Breiman, 1992; Venter & Snyman, 1997). In this paper attention is restricted to selection using an all possible subsets approach based on either the Cp criterion or Akaike's information criterion (Akaike, 1973).
The literature on measures of the influence of individual cases in a data set on a multiple linear regression fit is equally impressive. Some contributions include Cook (1977, 1986) and Belsley et al. (1980). These and other contributions on influence measures assume that the predictor variables are identified beforehand. If an initial selection step takes place, the influence measures are therefore conditional, i.e. the specific predictors in the model are given.
Only a few papers dealing with the influence of individual data cases in regression explicitly take an initial variable selection step into account. In this context, Léger & Altman (1993) distinguish between conditional and unconditional selection versions of Cook's distance. To explain the difference between the two versions, let V be the set of indices corresponding to the predictor variables selected from the full data set, and let ŷ(V) be the prediction vector based on the selected variables and calculated from the full data set. Also, let ŷ_(−i)(V) be the prediction vector based on the variables corresponding to V, but calculated from the full data set without case i. Note that ŷ_(−i)(V) contains a prediction for case i, although this case is not used in calculating ŷ_(−i)(V). The conditional Cook's distance for the i-th case is ‖ŷ(V) − ŷ_(−i)(V)‖², appropriately scaled. Here, ‖·‖ denotes the Euclidean norm. To obtain the unconditional Cook's distance, Léger & Altman (1993) argue that it is necessary to repeat the variable selection using the data without case i. This selection yields a subset V_(−i) of indices, with V_(−i) possibly different from V. The unconditional distance is ‖ŷ(V) − ŷ_(−i)(V_(−i))‖², appropriately scaled. Léger & Altman (1993) discuss the differences between these two selection versions of Cook's distance and argue that the unconditional version is preferable, since it explicitly takes the selection effect into account.
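To make the distinction concrete, the following Python sketch computes both versions on synthetic data. All function names and the data are illustrative; the scaling by (v + 1)σ̂² is our assumption, since the distances above are only described as "appropriately scaled". Selection is done by all-possible-subsets Cp, the approach used later in the paper.

```python
import itertools
import numpy as np

def predict(X, y, cols, X_new):
    # least-squares fit with intercept on the columns in `cols`
    A = np.column_stack([np.ones(len(y)), X[:, list(cols)]])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return np.column_stack([np.ones(len(X_new)), X_new[:, list(cols)]]) @ beta

def cp_select(X, y):
    # all-possible-subsets selection by Mallows' Cp
    n, p = X.shape
    rss_full = float(np.sum((y - predict(X, y, range(p), X)) ** 2))
    sigma2 = rss_full / (n - p - 1)
    best_V, best_cp = None, np.inf
    for v in range(p + 1):
        for V in itertools.combinations(range(p), v):
            rss = float(np.sum((y - predict(X, y, V, X)) ** 2))
            cp = rss / sigma2 + 2 * (len(V) + 1) - n
            if cp < best_cp:
                best_V, best_cp = V, cp
    return best_V, sigma2

def cook_distances(X, y, i):
    # conditional: subset V kept fixed; unconditional: selection repeated
    n, p = X.shape
    V, sigma2 = cp_select(X, y)
    yhat = predict(X, y, V, X)                    # prediction from full data
    keep = np.arange(n) != i
    yhat_cond = predict(X[keep], y[keep], V, X)   # same V, case i omitted
    d_cond = np.sum((yhat - yhat_cond) ** 2) / ((len(V) + 1) * sigma2)
    V_i, _ = cp_select(X[keep], y[keep])          # V_(-i): reselected subset
    yhat_unc = predict(X[keep], y[keep], V_i, X)  # reselected fit, case i omitted
    d_unc = np.sum((yhat - yhat_unc) ** 2) / ((len(V_i) + 1) * sigma2)
    return d_cond, d_unc

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 3))
y = 1.5 * X[:, 0] + rng.normal(scale=0.5, size=30)
d_cond, d_unc = cook_distances(X, y, 5)
```

The two distances coincide whenever V_(−i) = V; they differ only when dropping case i changes the selected subset, which is exactly the selection effect the unconditional version captures.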
In this paper we introduce two new measures of the selection influence of an individual data case in multiple linear regression analysis. Our point of view is that such a measure should provide an indication of whether the fit of the selected model improves or deteriorates owing to the presence of the case. This is the main contribution of the paper. Section 2 is devoted to a brief exposition of Mallows' Cp statistic and Akaike's information criterion (AIC). The new measures of selection influence, based on Cp and AIC respectively, are introduced in Section 3. It is indicated how variable selection based on Cp or on AIC may be modified, and hopefully improved, by making use of the influence measures. Four practical examples are discussed in Section 4. We close with some recommendations in Section 5.
Variable selection using Cp or AIC

Let y_1, ..., y_n be observations of the response in a multiple linear regression, with corresponding observations x_ij, i = 1, ..., n; j = 1, ..., p of p explanatory variables. We make the customary assumption that

y_i = β_0 + β_1 x_i1 + · · · + β_p x_ip + ε_i,    (1)

for i = 1, ..., n. In (1), ε_1, ..., ε_n are independent normal(0, σ²) random variables, and β_0, ..., β_p and σ² are unknown parameters. We assume that {x_i1, ..., x_ip} may contain redundant variables, so that it is advisable to perform variable selection as a first step in the analysis. Let RSS be the residual sum of squares from the least squares fit of (1). Then σ̂² = RSS/(n − p − 1) is the commonly used unbiased estimator of σ². Consider a subset V of {1, ..., p}, and let RSS(V) be the residual sum of squares from the least squares fit using only the regressors corresponding to the indices in V, together with an intercept. The Cp statistic for the corresponding model is

Cp(V) = RSS(V)/σ̂² + 2(v + 1) − n,    (2)

where v is the number of indices in V. Variable selection based on (2) entails calculating Cp(V) for each subset of {1, ..., p}, and selecting the variables corresponding to V, the subset minimizing (2). This approach is based on the fact that, for a given V, σ̂²Cp(V) is an estimate of the expected squared error if a (multiple) linear regression function based on the variables corresponding to V is used to predict Y*, a new (future) observation of the response random vector Y. Therefore, choosing V to minimize (2) is equivalent to selecting the variables which minimize the estimated expected prediction error.
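A minimal all-subsets implementation of this procedure might look as follows (Python sketch on synthetic data; all names are illustrative). A useful sanity check is that Cp of the full model reduces algebraically to p + 1, since RSS/σ̂² = n − p − 1 by construction.

```python
import itertools
import numpy as np

def cp_select(X, y):
    """All-possible-subsets selection by Mallows' Cp.
    Returns the minimising subset, its Cp value, and Cp of the full model."""
    n, p = X.shape

    def rss(cols):
        A = np.column_stack([np.ones(n), X[:, list(cols)]])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        return float(np.sum((y - A @ beta) ** 2))

    sigma2 = rss(range(p)) / (n - p - 1)       # unbiased estimate of sigma^2
    best_V, best_cp, cp_full = None, np.inf, None
    for v in range(p + 1):
        for V in itertools.combinations(range(p), v):
            cp = rss(V) / sigma2 + 2 * (len(V) + 1) - n   # Cp(V)
            if cp < best_cp:
                best_V, best_cp = V, cp
            if len(V) == p:
                cp_full = cp
    return best_V, best_cp, cp_full

# synthetic illustration: only the first two regressors enter the true model
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 4))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=60)
V_hat, cp_hat, cp_full = cp_select(X, y)
# here p = 4, so Cp of the full model equals 5 up to rounding
```

Enumerating all 2^p subsets is feasible only for moderate p; for large p the stepwise or bootstrap methods cited in the introduction become more attractive.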
The AIC is based on the maximized log-likelihood function of the model under consideration. Given independent normal errors, and ignoring constant terms, the maximized log-likelihood for the model corresponding to a subset V is given by −n log[RSS(V)/n]/2. This is a non-decreasing function of the number of selected regressors. Akaike (1973) therefore included a penalty term, viz. v + 2, which equals the number of parameters which have to be estimated. Multiplying the resulting expression by −2 yields

AIC(V) = n log[RSS(V)/n] + 2(v + 2).    (3)

For an up-to-date discussion and further motivation of (3), see Burnham and Anderson (2002). It is known that AIC(V) does not perform well when the number of parameters to be estimated is large compared to the sample size (typically cases where n/(v + 2) < 40). In such cases a modified version of (3) should be used, viz.

AICc(V) = AIC(V) + 2(v + 2)(v + 3)/(n − v − 3).    (4)
Variable selection based on (3) or (4) entails calculating the criterion for each subset V of {1, ..., p}, and selecting the variables corresponding to the minimizing subset. This is equivalent to selecting the variables which maximize a penalized version of the maximized log-likelihood.
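The two criteria are direct transcriptions of the formulas above. The sketch below (Python; synthetic data and illustrative names) uses the standard small-sample correction 2k(k + 1)/(n − k − 1) with k = v + 2 parameters, which is an assumption on the exact form intended here:

```python
import numpy as np

def aic(rss, n, v, corrected=False):
    # AIC(V) = n*log(RSS(V)/n) + 2*(v + 2); the small-sample correction
    # adds 2*k*(k + 1)/(n - k - 1) with k = v + 2 estimated parameters
    k = v + 2
    value = n * np.log(rss / n) + 2 * k
    if corrected:
        value += 2 * k * (k + 1) / (n - k - 1)
    return float(value)

# synthetic comparison: intercept-only model versus intercept + x
rng = np.random.default_rng(2)
x = rng.normal(size=40)
y = 3.0 * x + rng.normal(scale=0.5, size=40)
rss0 = float(np.sum((y - y.mean()) ** 2))        # v = 0 regressors
A = np.column_stack([np.ones(40), x])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
rss1 = float(np.sum((y - A @ beta) ** 2))        # v = 1 regressor
```

With a strong signal the log-likelihood gain from including x dwarfs the extra penalty unit, so the one-regressor model attains the smaller AIC; the corrected version is always at least as large as the uncorrected one.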

New measures of selection influence
It is standard statistical practice in multiple linear regression to study the influence of a single data case in an analysis as follows: analyse the complete data set and calculate a (summary) measure, say M; repeat the analysis after omitting the case under consideration and calculate M_(−i); quantify the influence of case i in terms of a function, f(·), of M and M_(−i). An example is provided by Cook's statistic, where M is the vector of predictions of the response, and f(x) = ‖x‖²/RSS.
The measures of selection influence which we propose are also based on a leave-one-out strategy. The following questions need to be resolved: (i) What is meant by an analysis of the complete or reduced data set? (ii) What measure M should be used? (iii) How should we define f(·)? Regarding the first question, in a variable selection context an analysis of a data set entails applying a given variable selection technique, and fitting the model corresponding to V to the data. Consequently, if we wish to study the influence of a single data case in such an analysis, it is necessary to apply the selection technique under consideration to the full data set and again to the reduced data set. This is in line with the unconditional approach recommended by Léger and Altman (1993). Turning to the second question, different choices of M can be made, depending on the aspect of the fitted model which is of interest. In variable selection the number of selected variables and the lack of fit of the corresponding model are typically of interest. These quantities are combined in selection criteria such as Cp and the AIC. It therefore seems reasonable to take M equal to the criterion employed in the selection method. This implies that f(·) has to be based on the difference in the value of the selection criterion before and after omitting case i. This difference, M − M_(−i), may then be divided by M in order to calculate the relative change in the selection criterion. The proposed selection influence measure for the i-th case is therefore given by

|M − M_(−i)| / M,    (5)

where M denotes the selection criterion under consideration. Note that (5) may be calculated for any selection criterion which combines some sort of goodness-of-fit measure with a penalty function (such a penalty function usually includes the number of predictors of the selected model as one of its components; see Kundu and Murali (1996)).
The proposed influence measure for the i-th case when the Cp criterion is used becomes

|Cp(V) − Cp(V_(−i))| / Cp(V),    (6)

where Cp(V_(−i)) is calculated as in (2), but with the i-th case omitted. Note that in calculating Cp(V_(−i)) the estimator of the error variance is obtained from the full data set. The use of this error variance estimator is supported by considerations given by Léger and Altman (1993) for using σ̂² in the denominator of the unconditional Cook's distance.
The proposed influence measure based on the Cp criterion in (6) is large if the relative difference between Cp(V) and Cp(V_(−i)) is large. If this is true for an omitted data case i, the particular case is considered possibly selection influential. Note that negative values of Cp(V) may occur. These negative values may misrepresent the relative difference between Cp(V) and Cp(V_(−i)), i.e. the relative differences for certain data cases may incorrectly be larger than others if, for example, Cp(V) − Cp(V_(−i)) in the numerator and Cp(V) in the denominator of (6) are both negative. We overcome this difficulty by omitting the subtraction of n and (n − 1) in the calculation of Cp(V) and Cp(V_(−i)) respectively. Thus, for instances where V and V_(−i) are similar, or where V and V_(−i) have the same number of indices, the sign of the numerator in (6) depends on the sizes of RSS(V) and RSS(V_(−i)) respectively. Using Akaike's information criterion (AIC) for variable selection, the corresponding influence measure for the i-th case in (5) is

|AIC(V) − AIC(V_(−i))| / AIC(V).    (7)

The value of AIC(V_(−i)) in (7) is obtained by using either (3) or (4), but with the i-th case omitted.
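A leave-one-out sketch of the Cp-based measure, using the modified Cp (without the subtraction of n or n − 1) and the full-data σ̂² throughout, as described above, follows. The data are synthetic, with one planted outlier in the response, and all names are illustrative.

```python
import itertools
import numpy as np

def rss_subset(X, y, cols):
    A = np.column_stack([np.ones(len(y)), X[:, list(cols)]])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(np.sum((y - A @ beta) ** 2))

def modified_cp_min(X, y, sigma2):
    # minimum over all subsets of RSS(V)/sigma2 + 2(v + 1); the "- n"
    # term is dropped, as recommended for the influence measure
    n, p = X.shape
    best = np.inf
    for v in range(p + 1):
        for V in itertools.combinations(range(p), v):
            best = min(best, rss_subset(X, y, V) / sigma2 + 2 * (len(V) + 1))
    return best

def selection_influence(X, y):
    # |M - M_(-i)| / M for every case i; sigma2 is always the full-data value
    n, p = X.shape
    sigma2 = rss_subset(X, y, range(p)) / (n - p - 1)
    M = modified_cp_min(X, y, sigma2)
    infl = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        infl[i] = abs(M - modified_cp_min(X[keep], y[keep], sigma2)) / M
    return infl

# synthetic data in which case 0 is a gross outlier in the response
rng = np.random.default_rng(4)
X = rng.normal(size=(30, 3))
y = X[:, 0] + rng.normal(scale=0.5, size=30)
y[0] += 20.0
infl = selection_influence(X, y)
```

Omitting the outlier collapses the residual sum of squares of the selected model, so its leave-one-out influence value dominates those of the regular cases.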

Illustrative examples
The proposed influence measures in (6) and (7) are applied to four example data sets in this section.

The fuel data
Consider the fuel data given by Weisberg (1985, pp. 35, 36 and 126). There are 50 cases (one case for each of the 50 states in the USA). The response is the 1972 fuel consumption in gallons per person. The four predictor variables are: x1: amount of tax on a gallon of fuel in cents; x2: percentage of the population with a driver's license; x3: average income in thousands of dollars; and x4: total length of roads in thousands of miles.
Applying C p selection to the full data set resulted in variables x 2 and x 3 being selected.
The proposed influence measure in (6) is calculated for each data case i. It is clear from Table 1 that the proposed influence measure and Cook's unconditional distance are relatively large when case 40, 49 or 50 is omitted. Both reach a maximum when data case 50 is omitted. We observe a sharp reduction in the estimated average prediction error (from 9 740 to 6 053) for the reduced data set with case 50 omitted. The estimated average prediction errors for the reduced data sets without case 40 or 49 are also considerably smaller than that of the full data set, but not as low as that of the reduced data set with case 50 omitted. It would therefore seem advisable to omit data case 50 before performing variable selection and model fitting on the fuel data. Note that variables x1, x2 and x3 are selected if case 50 is omitted.
There is a strong correspondence between the values of the proposed influence measure and the estimated average prediction errors.This is reflected in the correlation coefficient of −0.9716 between these two sets of numbers.The correlation between Cook's unconditional selection distance and the estimated average prediction error is −0.9657.Finally, the correlation between the proposed influence measure and Cook's unconditional distance is 0.9276, confirming a strong positive relationship between these two measures for this example.
Very similar results are obtained in Table 2 when the influence measure based on AIC in (7) is applied to the fuel data. We report on the same reduced data sets as in Table 1. Variable selection on the full data set by means of AIC also resulted in variables x2 and x3 being selected. With case 50 omitted, variables x1, x2 and x3 are selected.
Clearly, case 50 is once again selection influential. For all the other reduced sets (also those not shown) AIC selection results in the same variables being selected as in the full data set. The proposed influence measure also identifies case 50 as the most influential. Cook's unconditional distance (also suggesting case 50 as most influential) remains unchanged from Table 1, except for the reduced data sets where cases 40 and 49 are respectively omitted. The estimated average prediction error for the full data set equals 9 613. The following correlations are also reported: between the proposed influence measure and the estimated average prediction error, −0.9809; between Cook's unconditional distance and the estimated average prediction error, −0.9199; and between the proposed influence measure and Cook's unconditional distance, 0.8634. The proposed influence measures in (6) and (7) show that the maximum relative difference between M and M_(−i) is obtained if case 50 is omitted. Cook's unconditional selection distance confirms that case 50 is selection influential. The low values obtained for the estimated average prediction errors when case 50 is omitted supply strong evidence that omitting this case before performing variable selection improves the predictive power of the resulting model.
The following question arises: how should one judge the significance of a value of the proposed influence measure? In other words, should one recommend to a practitioner analysing this data set that case 50 be omitted before performing variable selection? We attempt to answer this question by comparing the influence measures of data case 50 (0.2658 in (6) and 0.0575 in (7)) with the largest influence measures obtained in 10 000 residual bootstrap samples (Efron and Tibshirani, 1993). The bootstrap samples are obtained in the following way. Determine the vector of residuals, r = Y − Ŷ, once a linear regression model has been fitted to the complete data set. Random selection of 50 of these residuals with replacement yields a bootstrap vector of residuals, denoted by r_b. Calculating r_b + Ŷ gives a new bootstrap response vector, denoted by Y_b. This newly formed bootstrap response vector and the original set of unchanged predictor variable values constitute the bootstrap sample. Consider now 10 000 of these bootstrap samples, each of size 50, for the fuel data. The proposed influence measures in (6) and (7) are calculated for every single omitted data case in each of the 10 000 bootstrap samples. By keeping record of the largest values of (6) and (7) in every bootstrap sample, we obtain two sets of values with which we may compare the largest influence measures (i.e. those obtained when data case 50 is omitted) of the fuel data set. The distributions of these 10 000 largest bootstrap influence measures, based on (6) and on (7) respectively, are shown in Figures 1 and 2.
The vertical line drawn in the class interval (0.26; 0.29] on the histogram in Figure 1 shows the position of 0.2658 in the distribution. The proportion of bootstrap influence measures smaller than 0.2658 equals 0.8824. This implies that the value 0.2658 lies close to the 90th percentile of the bootstrap distribution. Similarly, the largest influence measure (0.0575) lies above the 83rd percentile in Figure 2. Providing this information to a practitioner who analyses the fuel data will surely be helpful in deciding whether case 50 should be omitted before subsequent analysis is performed.
It is important to bear in mind that the proposed influence measures only identify individual selection influential data cases.If it is, for example, decided to reject case 50 from the fuel data set, the influence measures should be recalculated on the n − 1 remaining observations to identify other possibly selection influential data cases.

The evaporation data
We also apply the proposed influence measures to the evaporation data given by Freund (1979). Ten predictor variables, together with the amount of evaporation from the soil (the response), were measured on 46 consecutive days.
The ten predictor variables are: x 1 : maximum daily soil temperature; x 2 : minimum daily soil temperature; x 3 : integrated area under the soil temperature curve; x 4 : maximum daily air temperature; x 5 : minimum daily air temperature; x 6 : integrated area under the daily temperature curve; x 7 : maximum daily relative humidity; x 8 : minimum daily relative humidity; x 9 : integrated area under the daily humidity curve; and x 10 : total wind measured in miles per day.
Variable selection with Cp on the full data set yields variables x1, x3, x6, x8 and x9 as the selected variables. Table 3 shows the following for some of the reduced data sets: the selected variables; values of the proposed influence measure in (6); Cook's unconditional distances; and the estimated expected squared prediction errors. Random selection of 37 of the 46 cases as a training data set, with the remaining cases as a test data set, was used as before to estimate the expected squared prediction errors.

The hill races data

The hill races data, discussed by Atkinson (1986), give the record times of 35 Scottish hill races. The response is the record time of a race, and the two predictor variables are: x1: distance covered in miles; and x2: elevation climbed during the race.
According to Atkinson (1986) the data contain a known error: observation 18 should be 18 minutes rather than 78 minutes. Here we deliberately use the data set containing the error in order to evaluate the performance of the proposed influence measures in (6) and (7). Throughout, the results obtained when either Cp or AIC is applied to the data are very similar. Both variables are selected if the two selection techniques are applied to the full data set. The one-at-a-time omission of case 7 or 11 leads to only variable x1 being selected. For all the other reduced data sets the same variables as in the full data set are selected.
The largest influence measures in (6) and (7) are obtained when data case 18 is omitted, followed by cases 11 and 7. The influence measures for these cases (especially case 18) are considerably larger than when other cases are omitted. Cook's unconditional distance reaches a maximum when case 7 is omitted, with relatively large values also when cases 11 and 18 are omitted. There is a sharp reduction in the estimated average prediction error when case 18 is omitted. The estimated average prediction errors for the reduced data sets without cases 11 and 7 are also considerably smaller than that of the full data set. For the full data set, 28 observations are used in the training data sets and 7 observations in the test data sets.
The stack loss data

Finally, consider the stack loss data, which have also been examined by several authors (Brownlee, 1965; Atkinson, 1985; Hoeting et al., 1996). The data concern the relationship between the percentage of unconverted ammonia that escapes from a plant over 21 days of operation and the following three explanatory variables: x1: air flow, which measures the rate of operation of the plant; x2: inlet temperature of the cooling water circulating through coils in the tower; and x3: a value proportional to the concentration of acid in the tower. Again, as with the hill races data, the results obtained when either Cp or AIC is applied coincide throughout. Variable selection with both techniques on the full and all reduced data sets consistently results in variables x1 and x2 being selected. The influence measures reach a maximum when case 21 is omitted, followed by cases 4, 3 and 1. The corresponding unconditional Cook's distances of these cases are also relatively larger than those of the other cases. Likewise, the estimated average prediction errors when these cases are omitted are considerably smaller than the estimated average prediction error calculated on the full data set. For the full data set, 16 observations are used in the training data sets and 5 observations in the test data sets.

Recommendations
Quantifying the influence of an individual data case in a selection context is important. The proposed influence measures may easily be calculated for any multiple regression sample that has to be analysed. Once the influence measures have been obtained, a decision has to be taken as to whether the data case with the largest influence measure should be omitted before selection is repeated on the reduced data set. The magnitude of the decrease in the estimated average prediction error, calculated for the complete and the reduced data set, provides a good indication of whether the data case should be omitted. The illustrated bootstrap approach may also be utilised to judge the significance of the largest influence measure. We strongly recommend that both these aspects (i.e. the estimated average prediction error and the bootstrap distribution) be taken into consideration before the data case with the largest influence measure is simply excluded from the regression sample.
The proposed influence measures are based on the Cp and AIC selection criteria. They may easily be extended to other selection criteria in which the criterion combines some sort of goodness-of-fit measure with a penalty function.
Since omission of individual cases does not address the problems of masking and swamping, two or more cases may be omitted at a time. This approach, however, is costly in terms of computing time.

Figure 1: Histogram of largest Cp influence measures in 10 000 bootstrap samples.

Figure 3: Histogram of largest Cp influence measures in 10 000 bootstrap samples.

Figure 4: Histogram of largest AIC influence measures obtained in 10 000 bootstrap samples.

Table 1: Fuel data: Variable selection with Mallows' Cp criterion.

The resulting values, together with the selected variables, Cook's unconditional selection distances, and estimated expected squared prediction errors are shown in Table 1 for some of the omitted data cases. The estimated average prediction errors for the reduced data sets were obtained by randomly selecting 39 of the 49 cases as a training data set and the other 10 cases as a test data set. Applying the Cp criterion to the training data set, the selected model was used to calculate an average squared prediction error for the 10 cases of the test data set. Random selection of a training data set and calculation of the average squared prediction error for the test data set, based on the selected model from the training data set, was repeated 20 000 times. Record was kept of the 20 000 average prediction errors; their average was calculated in order to obtain an estimate of the expected squared prediction error. For the full data set (using 40 observations in the training data sets and 10 observations in the test data sets) this estimate equals 9 740.
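The repeated random-split estimate of the expected squared prediction error described for Table 1 might be implemented as follows (Python sketch; synthetic data, far fewer repetitions than the 20 000 used here, and illustrative names):

```python
import itertools
import numpy as np

def cp_select_fit(X, y):
    # Cp selection on the given data; returns the chosen columns and coefficients
    n, p = X.shape

    def fit(cols):
        A = np.column_stack([np.ones(n), X[:, list(cols)]])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        return beta, float(np.sum((y - A @ beta) ** 2))

    _, rss_full = fit(tuple(range(p)))
    sigma2 = rss_full / (n - p - 1)
    best_cp, best_V, best_beta = np.inf, None, None
    for v in range(p + 1):
        for V in itertools.combinations(range(p), v):
            beta, rss = fit(V)
            cp = rss / sigma2 + 2 * (len(V) + 1) - n
            if cp < best_cp:
                best_cp, best_V, best_beta = cp, V, beta
    return best_V, best_beta

def expected_prediction_error(X, y, n_test, reps, seed=0):
    # average over random splits of the mean squared error on the test cases,
    # with selection repeated on every training set
    rng = np.random.default_rng(seed)
    n, errs = len(y), np.empty(reps)
    for r in range(reps):
        idx = rng.permutation(n)
        test, train = idx[:n_test], idx[n_test:]
        V, beta = cp_select_fit(X[train], y[train])
        A_test = np.column_stack([np.ones(n_test), X[test][:, list(V)]])
        errs[r] = np.mean((y[test] - A_test @ beta) ** 2)
    return float(errs.mean())

rng = np.random.default_rng(6)
X = rng.normal(size=(50, 4))
y = 2.0 * X[:, 0] - X[:, 2] + rng.normal(scale=1.0, size=50)
epe = expected_prediction_error(X, y, n_test=10, reps=200)
```

Because selection is redone inside every training split, the estimate accounts for the variability of the selection step itself rather than conditioning on one chosen subset.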

Table 2: Fuel data: Variable selection with AIC.