Audio-Visual Learning for Scene Understanding
SANGUINETI, VALENTINA
2022-02-25
Abstract
Multimodal deep learning aims to combine the complementary information of different modalities. Among all modalities, audio and video are the predominant ones humans use to explore the world. In this thesis we therefore focus on audio-visual deep learning, so that our networks mimic how humans perceive the world. Our research involves images, audio signals, and acoustic images. The latter provide spatial audio information and are obtained from a planar array of microphones by combining the raw audio channels with a beamforming algorithm. They better mimic the human auditory system, whose spatial sound cues cannot be replicated with a single microphone. However, since microphone arrays are not widespread, we also study how to handle the missing spatialized audio modality at test time. As a solution, we propose to distill the content of acoustic images into audio features during training, so that their absence at test time can be handled. We do this for supervised audio classification using the generalized distillation framework, which we also extend to self-supervised learning. Next, we devise a method for reconstructing acoustic images given a single microphone and an RGB frame. Therefore, when only a standard video is available, we are able to synthesize spatial audio, which is useful for many audio-visual tasks, including sound localization. Lastly, as another example of restoring one modality from the available ones, we inpaint degraded images using audio features, reconstructing the missing region so that it is not only visually plausible but also semantically consistent with the related sound. This also covers cross-modal generation in the limit case of a completely missing or hidden visual modality: our method naturally handles it and can generate images from sound alone. In summary, we show how audio can help visual learning and vice versa, by transferring knowledge between the two modalities at training time in order to distill, reconstruct, or restore the missing modality at test time.
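As an illustration of how the acoustic images mentioned in the abstract can be formed, the sketch below applies delay-and-sum beamforming to the channels of a planar microphone array and maps beam energy over a grid of directions. The abstract only says "a beamforming algorithm", so the choice of delay-and-sum, the array geometry, and the direction grid here are assumptions for demonstration, not the thesis implementation.

```python
# Hedged sketch: delay-and-sum beamforming over a planar microphone array.
# All names, grid sizes, and the far-field geometry are illustrative assumptions.
import numpy as np

C = 343.0  # speed of sound in air (m/s)

def acoustic_image(signals, mic_xy, fs, n_az=36, n_el=18):
    """signals: (M, T) raw waveforms, one per microphone.
    mic_xy:  (M, 2) microphone coordinates on the array plane (m).
    Returns an (n_el, n_az) map of beamformed energy, one value per direction."""
    M, T = signals.shape
    spectra = np.fft.rfft(signals, axis=1)              # (M, F) channel spectra
    freqs = np.fft.rfftfreq(T, d=1.0 / fs)              # (F,) frequency bins
    az = np.linspace(-np.pi / 2, np.pi / 2, n_az)       # azimuth grid
    el = np.linspace(-np.pi / 4, np.pi / 4, n_el)       # elevation grid
    image = np.zeros((n_el, n_az))
    for i, e in enumerate(el):
        for j, a in enumerate(az):
            # far-field direction projected onto the array plane
            u = np.array([np.cos(e) * np.sin(a), np.sin(e)])
            delays = mic_xy @ u / C                      # (M,) seconds
            # phase-align every channel toward this direction and sum
            steer = np.exp(2j * np.pi * freqs[None, :] * delays[:, None])
            beam = (spectra * steer).sum(axis=0)
            image[i, j] = np.mean(np.abs(beam) ** 2)     # beam energy
    return image
```

Calling, e.g., `acoustic_image(wavs, mic_positions, fs=12000)` yields an energy map that can be overlaid on the RGB frame, which is how acoustic images are typically visualized for sound localization.

The "generalized distillation framework" the abstract refers to is commonly formulated as a convex combination of a hard-label loss and a soft loss against a teacher trained with privileged information (here, the acoustic images). A minimal sketch, with illustrative temperature and imitation weights:

```python
# Hedged sketch of a generalized-distillation objective: the teacher was trained
# with the privileged acoustic images, the student sees only plain audio.
# Variable names and hyperparameter values are illustrative, not from the thesis.
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def generalized_distillation_loss(student_logits, teacher_logits, labels,
                                  temperature=2.0, imitation=0.5):
    """labels: (N,) integer class ids; logits: (N, C).
    Mixes hard-label cross-entropy with a soft cross-entropy against the
    teacher's temperature-softened predictions."""
    p_student = softmax(student_logits)
    soft_targets = softmax(teacher_logits / temperature)   # teacher soft labels
    n = np.arange(len(labels))
    hard_ce = -np.log(p_student[n, labels] + 1e-12).mean()
    soft_ce = -(soft_targets * np.log(p_student + 1e-12)).sum(axis=1).mean()
    return (1.0 - imitation) * hard_ce + imitation * soft_ce
```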
| File | Description | Type | Access | Size | Format |
|---|---|---|---|---|---|
| phdunige_3950432_1.pdf | Abstract, Chapter 1 (Introduction), Chapter 2 (Related Works) | Doctoral thesis | Open access | 1.1 MB | Adobe PDF |
| phdunige_3950432_2.pdf | Chapter 3 | Doctoral thesis | Open access | 13.84 MB | Adobe PDF |
| phdunige_3950432_3.pdf | Chapter 4, first part | Doctoral thesis | Open access | 13.38 MB | Adobe PDF |
| phdunige_3950432_4.pdf | Chapter 4, second part | Doctoral thesis | Open access | 16.51 MB | Adobe PDF |
| phdunige_3950432_5.pdf | Chapter 5 | Doctoral thesis | Open access | 22.87 MB | Adobe PDF |
| phdunige_3950432_6.pdf | Chapter 6, Conclusions, References | Doctoral thesis | Open access | 15.66 MB | Adobe PDF |
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.