In the framework of a theory for invariant sensory signal representations, a signature which is invariant and selective for speech sounds can be obtained through projections in template signals and pooling over their transformations under a group. For locally compact groups, e.g., translations, the theory explains the resilience of convolutional neural networks with filter weight sharing and max pooling across their local translations in frequency or time. In this paper we propose a discriminative approach for learning an optimum set of templates, under a family of transformations, namely frequency transpositions and perturbations of the vocal tract length, which are among the primary sources of speech variability. Implicitly, we generalize convolutional networks to transformations other than translations, and derive data-specific templates by training a deep network with convolution-pooling layers and densely connected layers. We demonstrate that such a representation, combining group-generalized convolutions, theoretical invariance guarantees and discriminative template selection, improves frame classification performance over standard translation-CNNs and DNNs on TIMIT and Wall Street Journal datasets.

Discriminative template learning in group-convolutional networks for invariant speech representations

Lorenzo Rosasco;
2015

Abstract

In the framework of a theory for invariant sensory signal representations, a signature which is invariant and selective for speech sounds can be obtained through projections in template signals and pooling over their transformations under a group. For locally compact groups, e.g., translations, the theory explains the resilience of convolutional neural networks with filter weight sharing and max pooling across their local translations in frequency or time. In this paper we propose a discriminative approach for learning an optimum set of templates, under a family of transformations, namely frequency transpositions and perturbations of the vocal tract length, which are among the primary sources of speech variability. Implicitly, we generalize convolutional networks to transformations other than translations, and derive data-specific templates by training a deep network with convolution-pooling layers and densely connected layers. We demonstrate that such a representation, combining group-generalized convolutions, theoretical invariance guarantees and discriminative template selection, improves frame classification performance over standard translation-CNNs and DNNs on TIMIT and Wall Street Journal datasets.
File in questo prodotto:
File Dimensione Formato  
i15_3229.pdf

accesso aperto

Descrizione: Contributo principale
Tipologia: Documento in versione editoriale
Dimensione 2.23 MB
Formato Adobe PDF
2.23 MB Adobe PDF Visualizza/Apri
Zhang_DiscriminativeTemplateLearning_INTERSPEECH15_poster.pdf

accesso chiuso

Descrizione: poster
Tipologia: Altro materiale allegato
Dimensione 1.57 MB
Formato Adobe PDF
1.57 MB Adobe PDF   Visualizza/Apri   Richiedi una copia

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: http://hdl.handle.net/11567/888557
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus ND
  • ???jsp.display-item.citation.isi??? ND
social impact