Clustering ranked data using copulas

Nai Ruscone, Marta
2019-01-01

Abstract

Clustering of ranking data aims to identify groups of subjects with homogeneous, common preference behavior. In everyday life, human beings naturally tend to rank objects such as shops, places to live, occupations, singers and football teams according to their preferences. More generally, ranking data arise when a number of subjects are asked to rank a list of objects according to their personal preference order. The input to cluster analysis is a dissimilarity matrix quantifying the differences between the rankings of each pair of subjects. The choice of dissimilarity dramatically affects the classification outcome, so computing an appropriate dissimilarity matrix is a central issue. Several distance measures have been proposed for ranking data. We propose generalizations of this kind of distance using copulas, adapted to the case of missing data. We consider the case of extreme lists, where only the top-k and/or bottom-k ranks are known. We discuss an optimistic and a pessimistic imputation of missing values and show their effect on the classification. These generalizations provide a more flexible instrument for modeling different types of data dependence structures and for handling different situations in the classification process. Simulated and real data are used to illustrate the performance and the importance of our proposal.
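To make the notion of a dissimilarity matrix between subjects' rankings concrete, the following is a minimal sketch that uses the normalized Kendall distance, a standard distance for complete rankings. It is not the copula-based generalization proposed in the paper, and it does not handle missing ranks. The function names and the toy data are illustrative assumptions.

```python
from itertools import combinations


def kendall_distance(rank_a, rank_b):
    """Normalized Kendall distance between two complete rankings.

    rank_a[i] is the rank one subject assigns to object i (1 = most preferred).
    Returns the share of object pairs that the two subjects order differently.
    NOTE: a standard baseline distance, not the paper's copula-based measure.
    """
    if len(rank_a) != len(rank_b):
        raise ValueError("rankings must cover the same objects")
    n = len(rank_a)
    discordant = sum(
        1
        for i, j in combinations(range(n), 2)
        if (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j]) < 0
    )
    return discordant / (n * (n - 1) / 2)


def dissimilarity_matrix(rankings):
    """Symmetric subject-by-subject matrix of pairwise Kendall distances."""
    m = len(rankings)
    d = [[0.0] * m for _ in range(m)]
    for s, t in combinations(range(m), 2):
        d[s][t] = d[t][s] = kendall_distance(rankings[s], rankings[t])
    return d


# Hypothetical toy data: three subjects ranking four objects.
rankings = [
    [1, 2, 3, 4],
    [1, 2, 4, 3],
    [4, 3, 2, 1],
]
D = dissimilarity_matrix(rankings)
```

A matrix such as `D` is the kind of input a clustering procedure (for example, hierarchical clustering) would take. Here the first two subjects are close to each other and both are far from the third.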
Use this identifier to cite or link to this document: https://hdl.handle.net/11567/1013489