Complete validation for classification and class modeling procedures with selection of variables and/or with additional computed variables
FORINA, MICHELE;OLIVERI, PAOLO;CASALE, MONICA
2010-01-01
Abstract
Evaluating the predictive ability of a model is an essential step in all chemometric techniques, and it must be performed very carefully. However, when relevant variables are selected (an essential step for data sets with many, frequently thousands of, variables), the selection is generally performed using all the available objects. In some recent classification and class-modeling techniques, the Mahalanobis distances or the leverages from the centroids of the categories in the problem are computed from the original or from the selected variables and then added to the original variables. Here, too, the Mahalanobis distances are computed with all the objects. The consequence is an overestimate of the prediction ability, which is very large when the ratio between the number of objects and the number of variables is rather low, so that the variance-covariance matrix is unstable. This paper describes the correct validation procedures for the selection of variables and for the addition of Mahalanobis distances computed on the original or the selected variables. The resulting estimates of the prediction ability are compared with those obtained with insufficient validation strategies.
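The difference between the two validation strategies can be illustrated with a minimal sketch. It is not the authors' procedure: the ranking criterion (absolute difference of class means), the nearest-centroid classifier, the interleaved fold assignment and all function names are assumptions chosen only to keep the example self-contained. The data are pure noise, so any apparent predictive ability is an artifact of the validation scheme.

```python
import random

def select_variables(X, y, k):
    """Rank variables by |difference of the two class means|; return the top-k indices.
    (Illustrative criterion only; any selection method could stand in here.)"""
    n_vars = len(X[0])
    scores = []
    for j in range(n_vars):
        a = [row[j] for row, c in zip(X, y) if c == 0]
        b = [row[j] for row, c in zip(X, y) if c == 1]
        scores.append(abs(sum(a) / len(a) - sum(b) / len(b)))
    return sorted(range(n_vars), key=lambda j: scores[j], reverse=True)[:k]

def centroids(X, y, cols):
    """Class centroids computed on the given columns only."""
    out = {}
    for c in set(y):
        rows = [row for row, label in zip(X, y) if label == c]
        out[c] = [sum(r[j] for r in rows) / len(rows) for j in cols]
    return out

def predict(row, cents, cols):
    """Assign the object to the class with the nearest (Euclidean) centroid."""
    def dist2(c):
        return sum((row[j] - m) ** 2 for j, m in zip(cols, cents[c]))
    return min(sorted(cents), key=dist2)

def cv_accuracy(X, y, k, n_folds, select_inside):
    """Cross-validated accuracy.
    select_inside=False: variables are selected once, using ALL objects (insufficient validation).
    select_inside=True:  variables are re-selected in every fold from the training objects only."""
    n = len(X)
    all_cols = None if select_inside else select_variables(X, y, k)
    correct = 0
    for f in range(n_folds):
        test = [i for i in range(n) if i % n_folds == f]
        train = [i for i in range(n) if i % n_folds != f]
        X_train = [X[i] for i in train]
        y_train = [y[i] for i in train]
        cols = select_variables(X_train, y_train, k) if select_inside else all_cols
        cents = centroids(X_train, y_train, cols)
        correct += sum(predict(X[i], cents, cols) == y[i] for i in test)
    return correct / n

# Pure-noise data: 40 objects, 500 variables, two balanced classes with no real difference.
rng = random.Random(0)
X = [[rng.gauss(0.0, 1.0) for _ in range(500)] for _ in range(40)]
y = [i % 2 for i in range(40)]

biased = cv_accuracy(X, y, k=10, n_folds=5, select_inside=False)
honest = cv_accuracy(X, y, k=10, n_folds=5, select_inside=True)
print("selection on all objects:", biased)
print("selection inside each fold:", honest)
```

Because the classes carry no real information, a trustworthy estimate should stay close to chance level. Selecting the variables once on all objects lets the test objects influence the model, which is the source of the overestimate described above. The same reasoning would apply to any additional computed variable, such as a Mahalanobis distance, whose parameters are estimated from all the objects.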