Operative Assessment of Predicted Generalization Errors on Non-Stationary Distributions in Data-Intensive Applications

Gastaldo, Paolo; Zunino, Rodolfo
2011-01-01

Abstract

Data-intensive applications use empirical methods to extract consistent information from huge samples. When applied to classification tasks, their aim is to optimize accuracy on unseen data; hence, a reliable prediction of the generalization error is of paramount importance. Theoretical models, such as Statistical Learning Theory, and empirical estimations, such as cross-validation, can both fit data-mining classification domains very well, provided that some crucial assumptions are verified in advance. In particular, the stationarity of the distribution of the observed data is critical, although it is sometimes overlooked in practice. The paper formulates an operative criterion to verify the stationarity assumption; the method applies to both theoretical and practical predictions of generalization errors. The analysis addresses the specific case of clustering-based classifiers: the K-Winner Machine (KWM) model is used as a reference for its known theoretical bounds, and cross-validation provides an empirical counterpart for practical comparison. The criterion, based on efficient, unsupervised, clustering-based estimation of the probability distribution, is tested experimentally on a set of different data-intensive applications, including intrusion detection for computer-network security, optical character recognition, text mining, and pedestrian detection. Experimental results confirm the effectiveness of the proposed approach in efficiently detecting non-stationarity.
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11567/331246