Clustering ranked data using copulas
Nai Ruscone, Marta
2019-01-01
Abstract
Clustering of ranking data aims at identifying groups of subjects with homogeneous, common preference behavior. People naturally rank objects in everyday life, such as shops, places to live, occupations, singers and football teams, according to their preferences. More generally, ranking data arise when a number of subjects are asked to rank a list of objects according to their personal preference order. The input of a cluster analysis is a dissimilarity matrix quantifying the differences between the rankings of each pair of subjects. The choice of dissimilarity strongly affects the classification outcome, so computing an appropriate dissimilarity matrix is a key issue. Several distance measures have been proposed for ranking data. We propose generalizations of these distances using copulas, adapted to the case of missing data. We consider the case of extreme lists, where only the top-k and/or bottom-k ranks are known. We discuss an optimistic and a pessimistic imputation of missing values and show their effect on the classification. These generalizations provide a more flexible instrument for modeling different types of dependence structures in the data and for handling different situations in the classification process. Simulated and real data are used to illustrate the performance and the importance of our proposal.
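To make the role of the imputation concrete, the following is a minimal sketch of how a dissimilarity matrix for top-k lists might be built under the two imputation schemes. Everything in it is an assumption for illustration: the abstract does not define the imputation rules, so the sketch assumes that "optimistic" gives every unranked object the best free rank k+1 and "pessimistic" gives it the worst rank n. The copula-based distance is not specified in the abstract either, so the plain Spearman footrule is swapped in as a stand-in. The names `impute`, `footrule` and `dissimilarity_matrix` are hypothetical.

```python
def impute(ranks, mode):
    """Fill missing ranks (None) in a top-k list over n objects.

    Assumed convention, not taken from the paper: 'optimistic' assigns
    every unranked object the best free rank k + 1, 'pessimistic' the
    worst possible rank n.
    """
    n = len(ranks)
    k = sum(r is not None for r in ranks)  # number of observed ranks
    if mode == "optimistic":
        fill = k + 1
    elif mode == "pessimistic":
        fill = n
    else:
        raise ValueError("mode must be 'optimistic' or 'pessimistic'")
    return [fill if r is None else r for r in ranks]


def footrule(a, b):
    """Spearman footrule: sum of absolute rank differences.

    Used here only as a simple stand-in for the copula-based distance.
    """
    return sum(abs(x - y) for x, y in zip(a, b))


def dissimilarity_matrix(rankings, mode):
    """Pairwise footrule distances between imputed rankings."""
    filled = [impute(r, mode) for r in rankings]
    return [[footrule(x, y) for y in filled] for x in filled]


# Three subjects, five objects, only the top-2 ranks observed.
# Entry i of each list is the rank given to object i.
rankings = [
    [1, 2, None, None, None],
    [2, 1, None, None, None],
    [None, None, None, 1, 2],
]

for mode in ("optimistic", "pessimistic"):
    print(mode)
    for row in dissimilarity_matrix(rankings, mode):
        print(row)
```

In this toy example the distance between the first two subjects is the same under both schemes, while their distance to the third subject grows under the pessimistic fill. Any clustering method that takes a dissimilarity matrix as input would therefore see the groups as more or less separated depending on the imputation chosen.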