Out-of-Sample Error Estimation: The Blessing of High Dimensionality
ONETO, LUCA; GHIO, ALESSANDRO; RIDELLA, SANDRO; ANGUITA, DAVIDE
2014-01-01
Abstract
Dealing with high dimensionality when learning from data is a tough task since, for example, similarity and correlation in data cannot be properly captured by conventional notions of distance. These issues are amplified in small-sample problems, i.e. when the cardinality of the dataset is remarkably smaller than its dimensionality: in these cases, a reliable estimate of the accuracy of the trained model on new data is difficult to derive, because standard statistical inference approaches are inefficient in this setting. In this paper, we show that high dimensionality of data, at least under some assumptions, helps improve the assessment of the performance of a model trained on empirical data in supervised classification tasks. In particular, we propose to create copies of the original dataset in which only subsets of independent and informative features are considered in turn: we show that training and combining a collection of classifiers on these sets helps close the gap between the true and the estimated error of the models. To verify the potential of the proposed approach and to gain further insight into it, we test the method on both an artificial problem and a series of real-world high-dimensional Human Gene Expression datasets.
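The idea of training and combining classifiers on feature subsets can be illustrated with a minimal sketch. This is not the authors' procedure: it assumes a random disjoint partition of the features (the abstract does not say how independent and informative features are selected), a simple nearest-centroid classifier, majority voting, and synthetic Gaussian data in which all features are independent and informative by construction. All function names and parameters are hypothetical.

```python
import numpy as np

def split_features(n_features, n_subsets, rng):
    """Randomly partition feature indices into disjoint subsets (an assumption)."""
    perm = rng.permutation(n_features)
    return np.array_split(perm, n_subsets)

def fit_centroids(X, y):
    """Nearest-centroid 'training': store one mean vector per class label."""
    return {int(label): X[y == label].mean(axis=0) for label in np.unique(y)}

def predict_centroids(centroids, X):
    """Assign each sample to the class whose centroid is closest."""
    labels = np.array(sorted(centroids))
    dists = np.stack(
        [np.linalg.norm(X - centroids[int(l)], axis=1) for l in labels], axis=1
    )
    return labels[dists.argmin(axis=1)]

def ensemble_predict(models, subsets, X):
    """Majority vote over the per-subset classifiers (labels are -1/+1)."""
    votes = np.stack(
        [predict_centroids(m, X[:, idx]) for m, idx in zip(models, subsets)]
    )
    # With an odd number of models the vote sum is never zero.
    return np.where(votes.sum(axis=0) >= 0, 1, -1)

def make_data(n, d, rng):
    """Synthetic small-sample data: balanced labels, independent features."""
    y = np.repeat([-1, 1], n // 2)
    X = rng.normal(size=(n, d)) + 0.5 * y[:, None]
    return X, y

rng = np.random.default_rng(0)
n_train, n_test, d, k = 20, 200, 200, 5  # far fewer training samples than features

X_train, y_train = make_data(n_train, d, rng)
X_test, y_test = make_data(n_test, d, rng)

# One "copy" of the dataset per feature subset, one classifier per copy.
subsets = split_features(d, k, rng)
models = [fit_centroids(X_train[:, idx], y_train) for idx in subsets]

y_pred = ensemble_predict(models, subsets, X_test)
accuracy = float((y_pred == y_test).mean())
print(f"ensemble accuracy on held-out data: {accuracy:.3f}")
```

The sketch only shows the mechanics of building and combining the per-subset classifiers; it does not reproduce the error estimation that the paper is about.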