Graph Laplacian-Improved Convolutional Residual Autoencoder for Unsupervised Human Action and Emotion Recognition
Giancarlo Paoletti;Alessio Del Bue
2022-01-01
Abstract
This paper presents an unsupervised feature learning approach based on 3D-skeleton data for human action recognition and discrete human emotion recognition. Performing these tasks on time series of skeleton data is effective and better preserves the individual's privacy. Such methods are also a viable alternative for emotion recognition applications, where most works use frontal or profile facial images that disclose the subject's appearance. Current unsupervised methods can encode the wide variety of contexts and the nature of the data, but often at the expense of higher model complexity or longer computation time. To lessen these shortcomings, this paper proposes a convolutional residual autoencoder that models the skeletal geometry across the temporal dynamics of the data without relying on computationally expensive recurrent architectures. The approach also implements a Graph Laplacian Regularization that leverages the implicit connectivity of the skeleton joints, further improving the robustness of the feature embeddings, which are learned without using action or emotion labels. The approach was validated on large-scale datasets that vary in domain, input skeleton data (e.g., the number of joints, adjacency matrices), and sensor technology. The results show its effectiveness: it notably surpasses the state-of-the-art unsupervised methods and also achieves better recognition scores than several fully supervised approaches. Extensive experimental analysis demonstrates the usefulness of the proposed method under various evaluation protocols, with higher-quality feature representations observed even when it is trained with less data. The results also highlight the proposed method's remarkable transferability across domains and its faster inference time.
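The abstract does not state the exact form of the Graph Laplacian Regularization, so the following is only a minimal sketch of one common formulation, given here to make the idea concrete. It assumes the combinatorial Laplacian L = D - A built from the skeleton's bone list, and a penalty tr(XᵀLX) on per-joint feature vectors X, which equals the sum of squared feature differences across connected joints. The 5-joint skeleton, the `EDGES` list, and all function names are hypothetical and are not taken from the paper.

```python
# Hypothetical 5-joint toy skeleton (an assumption for illustration only):
# joint 1 acts as a hub connected to joints 0, 2, 3 and 4.
EDGES = [(0, 1), (1, 2), (1, 3), (1, 4)]


def graph_laplacian(num_joints, edges):
    """Return the combinatorial Laplacian L = D - A as a list of lists."""
    lap = [[0.0] * num_joints for _ in range(num_joints)]
    for i, j in edges:
        lap[i][j] -= 1.0  # off-diagonal entries: minus the adjacency
        lap[j][i] -= 1.0
        lap[i][i] += 1.0  # diagonal entries: joint degree
        lap[j][j] += 1.0
    return lap


def laplacian_penalty(features, lap):
    """Compute tr(X^T L X), where features[i] is the feature row of joint i."""
    n = len(features)
    dim = len(features[0])
    total = 0.0
    for d in range(dim):
        for i in range(n):
            for j in range(n):
                total += features[i][d] * lap[i][j] * features[j][d]
    return total


def edge_penalty(features, edges):
    """Equivalent form: sum over bones of the squared feature difference."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(features[i], features[j]))
        for i, j in edges
    )


if __name__ == "__main__":
    lap = graph_laplacian(5, EDGES)
    # Identical features on every joint: no penalty.
    smooth = [[1.0, 0.0] for _ in range(5)]
    # Only the hub joint differs from its neighbours: penalised on every bone.
    rough = [[0.0, 0.0], [1.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
    print(laplacian_penalty(smooth, lap))
    print(laplacian_penalty(rough, lap), edge_penalty(rough, EDGES))
```

In a training loop, a term of this kind would typically be added to the reconstruction loss with a weighting coefficient, encouraging connected joints to have similar embeddings. How the paper weights or normalizes the term is not specified in the abstract.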