Unsupervised Human Action Recognition using 3D Skeleton Poses

PAOLETTI, GIANCARLO
2023-03-27

Abstract

Action recognition, a sub-field of computer vision, has attracted increasing attention in recent years because of its potential to address a wide range of real-world problems. By analysing people's movements and actions, researchers can gain insight into their underlying motivations, thoughts, and emotions, which in turn supports the development of algorithms that better understand and respond to human behaviour. Applications include security surveillance, human-robot and human-computer interaction, patient monitoring and assistive technologies, sign language recognition, consumer behaviour analysis, and sports analysis. Despite recent progress, building a fully automated human activity recognition (HAR) system that classifies activities accurately remains challenging because of the complexity of visual data: varying camera viewpoints, occlusions, changes in scale and appearance, background clutter, and lighting changes. A skeleton-based approach preserves privacy and lets the model focus on the essential characteristics of the body and its movements rather than on extraneous factors, which can yield a more accurate understanding of human anatomy and movement. Supervised learning approaches are effective, but they depend on sequences annotated with the corresponding actions or activities. Annotation is time-consuming, requires specialised knowledge, and is prone to human error. It is further complicated by intra-class and inter-class similarities, which make different actions difficult to distinguish. As a result, reliance on annotated data may compromise the scalability of big data systems, which motivates the exploration of unsupervised methods as an alternative.
Unsupervised learning techniques can overcome challenges faced by traditional supervised methods in this field, notably the lack of labelled data and the high variability of human actions. Nevertheless, unsupervised learning for HAR remains an emerging area of research, which has prompted the exploration of techniques such as clustering, dimensionality reduction, and deep learning. This thesis focuses on unsupervised action recognition from 3D skeleton poses, aiming to introduce new algorithms that address the limitations of previous models and to provide insight into the usefulness of unsupervised learning. It first presents a subspace clustering algorithm for classifying trimmed action sequences in skeleton-joint datasets, introducing new strategies for handling temporal data with covariance matrices. It then proposes a novel unsupervised method that uses a convolutional autoencoder to learn human action representations. This approach demonstrates the benefits of combining residual convolutions with spatio-temporal convolutions, which yields architectures that are more efficient, including in memory use, and it introduces graph Laplacian regularisation to better reconstruct skeleton-based action sequences. The thesis also examines the effectiveness of unsupervised methods for human emotion recognition from full-body movement data. Current unsupervised methods, while designed for high recognition accuracy, do not consider the resilience of the models to perturbed data, which is common in real-world scenarios. To address this, a novel framework is developed that incorporates a transformer encoder-decoder with strong denoising capabilities and additional losses to improve robustness against such data perturbation and alteration.
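To make two of the ingredients named in the abstract more concrete, the following is a minimal illustrative sketch, not the thesis's implementation. It assumes a skeleton sequence is stored as a list of frames, each frame a list of (x, y, z) joint coordinates, and that the skeleton graph is unweighted. The names `covariance_descriptor`, `laplacian_smoothness`, `toy_sequence`, and `toy_bones` are hypothetical. The first function summarises a variable-length sequence as a fixed-size covariance matrix of the flattened poses; the second evaluates the standard graph Laplacian smoothness term tr(XᵀLX), which for an unweighted graph equals the sum of squared differences across connected joints.

```python
from typing import List, Sequence, Tuple

# One pose: J joints, each given as (x, y, z). Assumed layout, for illustration only.
Frame = Sequence[Sequence[float]]


def covariance_descriptor(sequence: Sequence[Frame]) -> List[List[float]]:
    """Sample covariance (d x d, with d = 3 * J) of the flattened poses over time."""
    # Flatten every frame into a single d-dimensional vector.
    rows = [[c for joint in frame for c in joint] for frame in sequence]
    n, d = len(rows), len(rows[0])
    if n < 2:
        raise ValueError("need at least two frames to estimate a covariance")
    # Centre each coordinate on its temporal mean.
    mean = [sum(r[k] for r in rows) / n for k in range(d)]
    centred = [[r[k] - mean[k] for k in range(d)] for r in rows]
    # Unbiased estimate: divide by n - 1.
    return [
        [sum(c[a] * c[b] for c in centred) / (n - 1) for b in range(d)]
        for a in range(d)
    ]


def laplacian_smoothness(frame: Frame, bones: Sequence[Tuple[int, int]]) -> float:
    """tr(X^T L X) for an unweighted graph: squared differences summed over edges."""
    return sum(
        sum((frame[i][k] - frame[j][k]) ** 2 for k in range(len(frame[i])))
        for i, j in bones
    )


# Toy example: two joints joined by one bone, moving together along y for three frames.
toy_sequence = [
    [(0, 0, 0), (1, 0, 0)],
    [(0, 1, 0), (1, 1, 0)],
    [(0, 2, 0), (1, 2, 0)],
]
toy_bones = [(0, 1)]

cov = covariance_descriptor(toy_sequence)
penalty = laplacian_smoothness(toy_sequence[0], toy_bones)
```

The descriptor has the same 6 x 6 shape regardless of how many frames the sequence contains, which is what makes covariance matrices convenient for comparing sequences of different lengths. For real data a numerical library would be preferable to these nested loops.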
Files in this item:
phdunige_4777159.pdf

Open access

Type: Doctoral thesis
Size: 33.22 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11567/1109462