
Learning with Privileged Information using Multimodal Data

DA CRUZ GARCIA, NUNO RICARDO
2020-02-28

Abstract

Computer vision is the science of teaching machines to see and understand digital images and videos. Over the last decade, computer vision has seen tremendous progress on perception tasks such as object detection, semantic segmentation, and video action recognition, which has led to the development and improvement of important industrial applications such as self-driving cars and medical image analysis. These advances are mainly due to the fast computation offered by GPUs, the development of high-capacity models such as deep neural networks, and the availability of large datasets, often composed of a variety of modalities. In this thesis, we explore how multimodal data can be used to train deep convolutional neural networks.

Humans perceive the world through multiple senses and reason over the multimodal space of stimuli to act in and understand the environment. One way to improve the perception capabilities of deep learning methods is to use different modalities as input, as they offer different and complementary information about the scene. Recent multimodal datasets for computer vision tasks include modalities such as depth maps, infrared, skeleton coordinates, and others, in addition to traditional RGB. This thesis investigates deep learning systems that learn from multiple visual modalities. In particular, we are interested in a very practical scenario in which an input modality is missing at test time. The question we address is the following: how can we take advantage of multimodal datasets for training our model, knowing that, at test time, a modality might be missing or too noisy? The setting in which more information is available at training time than at test time is referred to as learning using privileged information. In this work, we develop methods to address this challenge, with a special focus on the tasks of action and object recognition and on the modalities of depth, optical flow, and RGB, which we use for inference at test time.

This thesis advances multimodal learning in three ways. First, we develop a deep learning method for video classification that is trained on RGB and depth data and is able to hallucinate depth features and predictions at test time. Second, building on this method, we propose a more generic mechanism based on adversarial learning that learns to mimic the predictions produced by the depth modality and can automatically switch from true depth features to generated depth features when the depth sensor is noisy. Third, we develop a method that trains a single network on RGB data, enriched at training time with additional supervision from other modalities such as depth and optical flow, and that outperforms an ensemble of networks trained independently on these modalities.
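To make the privileged-information idea in the abstract concrete, the sketch below shows one common way such a setup can be wired in PyTorch: a frozen depth "teacher" stream provides target features during training, and an RGB-driven hallucination stream is trained to mimic them (here with a simple L2 feature loss on top of the classification loss), so that at test time no depth input is required. All module names, shapes, and the loss weight alpha are illustrative assumptions, not the thesis's actual architecture.

    # Minimal, hypothetical sketch of feature hallucination with privileged depth data.
    # Encoders, shapes, and the loss weight `alpha` are illustrative assumptions.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SmallEncoder(nn.Module):
        """Toy convolutional encoder standing in for a full image/video backbone."""
        def __init__(self, in_channels, feat_dim=128):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv2d(in_channels, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1),
            )
            self.fc = nn.Linear(64, feat_dim)

        def forward(self, x):
            return self.fc(self.conv(x).flatten(1))

    num_classes = 10
    depth_teacher = SmallEncoder(in_channels=1)   # assumed pretrained on depth maps, kept frozen
    rgb_stream = SmallEncoder(in_channels=3)      # regular RGB stream
    hallucination = SmallEncoder(in_channels=3)   # takes RGB, learns to mimic depth features
    classifier = nn.Linear(2 * 128, num_classes)  # fuses RGB and (hallucinated) depth features

    params = (list(rgb_stream.parameters()) + list(hallucination.parameters())
              + list(classifier.parameters()))
    optimizer = torch.optim.Adam(params, lr=1e-4)
    alpha = 1.0  # weight of the feature-mimicking loss (assumed value)

    def train_step(rgb, depth, labels):
        with torch.no_grad():                     # privileged depth features from the frozen teacher
            depth_feat = depth_teacher(depth)
        rgb_feat = rgb_stream(rgb)
        hall_feat = hallucination(rgb)            # "depth" features predicted from RGB only
        logits = classifier(torch.cat([rgb_feat, hall_feat], dim=1))
        loss = F.cross_entropy(logits, labels) + alpha * F.mse_loss(hall_feat, depth_feat)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

    def predict(rgb):
        # At test time no depth input is needed: the hallucination stream fills in.
        with torch.no_grad():
            return classifier(torch.cat([rgb_stream(rgb), hallucination(rgb)], dim=1)).argmax(1)

    # Dummy usage with random tensors (batch of 4, 64x64 frames).
    rgb = torch.randn(4, 3, 64, 64)
    depth = torch.randn(4, 1, 64, 64)
    labels = torch.randint(0, num_classes, (4,))
    print(train_step(rgb, depth, labels))
    print(predict(rgb))

Note that this sketch only illustrates the general training pattern; according to the abstract, the thesis's second contribution replaces such a simple mimicking objective with an adversarial one and adds a mechanism that switches between true and generated depth features when the sensor is noisy.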
Files in this record:

phdunige_4317854.pdf
Access: open access
Type: Doctoral thesis
Size: 6.64 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11567/997636