Discriminative template learning in group-convolutional networks for invariant speech representations


Chiyuan Zhang;Stephen Voinea;Georgios Evangelopoulos;Lorenzo Rosasco;Tomaso Poggio

2015-01-01

Abstract

Within a theory of invariant sensory signal representations, a signature that is both invariant and selective for speech sounds can be obtained by projecting onto template signals and pooling over the templates' transformations under a group. For locally compact groups, e.g., translations, the theory explains the resilience of convolutional neural networks that use filter weight sharing and max pooling across local translations in frequency or time. In this paper we propose a discriminative approach for learning an optimal set of templates under a family of transformations, namely frequency transpositions and perturbations of the vocal tract length, which are among the primary sources of speech variability. Implicitly, we generalize convolutional networks to transformations other than translations, and derive data-specific templates by training a deep network with convolution-pooling layers and densely connected layers. We demonstrate that such a representation, combining group-generalized convolutions, theoretical invariance guarantees, and discriminative template selection, improves frame classification performance over standard translation-CNNs and DNNs on the TIMIT and Wall Street Journal datasets.
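The projection-and-pooling idea in the abstract can be illustrated with a toy example. The sketch below is not the paper's method: it assumes the simplest possible group (cyclic shifts of a 1-D vector), uses random rather than learned templates, and the names `signature`, `templates`, and `x` are hypothetical. It only shows why max pooling the projections over a template's whole orbit gives a value that does not change when the input itself is shifted.

```python
import numpy as np


def signature(x, templates):
    """Max-pool the projections of x onto every cyclic shift of each template.

    Toy illustration only: the group here is cyclic translation, and the
    templates are arbitrary vectors rather than discriminatively learned ones.
    """
    n = x.shape[0]
    values = []
    for t in templates:
        # Orbit of the template under the cyclic translation group (n shifts).
        orbit = np.stack([np.roll(t, s) for s in range(n)])
        # Project x onto each transformed template, then pool with max.
        values.append(np.max(orbit @ x))
    return np.array(values)


rng = np.random.default_rng(0)
x = rng.standard_normal(16)               # stand-in for one input frame
templates = rng.standard_normal((4, 16))  # four random (not learned) templates

s_original = signature(x, templates)
s_shifted = signature(np.roll(x, 5), templates)

# Shifting the input only permutes the set of projections, so the max agrees.
print(np.allclose(s_original, s_shifted))
```

Pooling over the full orbit is what yields exact invariance in this toy setting; pooling over only a local range of shifts, as in standard convolutional networks, gives invariance only to correspondingly small shifts.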


Year

2015

Appears in types:

04.01 - Conference proceedings contribution

Files in this item:

File	Access	Description	Type	Size	Format
i15_3229.pdf	open access	Main contribution	Publisher's version	2.23 MB	Adobe PDF
Zhang_DiscriminativeTemplateLearning_INTERSPEECH15_poster.pdf	closed access	Poster	Other attached material	1.57 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11567/888557
