Enhancing visual embeddings through weakly supervised captioning for zero-shot learning

Bustreo M.;Cavazza J.;Murino V.
2019-01-01

Abstract

Visual features designed for image classification have been shown to be useful in zero-shot learning (ZSL) when generalizing towards classes not seen during training. In this paper, we argue that a more effective way of building visual features for ZSL is to extract them through captioning, so as not merely to classify an image but, instead, to describe it. However, modern captioning models rely on a massive level of supervision, e.g. up to 15 extended descriptions per instance provided by humans, which is simply not available for ZSL benchmarks. In the latter, in fact, the available annotations only inform about the presence or absence of attributes within a fixed list. Worse, attributes are seldom annotated at the image level but rather at the class level only: because of this, the annotation cannot be visually grounded. In this paper, we deal with such a weakly supervised regime to train an end-to-end LSTM captioner, whose backbone CNN image encoder can provide better features for ZSL. Our enhancement of visual features, called 'VisEn', is compatible with any generic ZSL method, without requiring changes in its pipeline (apart from adapting hyper-parameters). Experimentally, VisEn sharply improves recognition performance on unseen classes, as we demonstrate through an ablation study encompassing different ZSL approaches. Further, on the challenging fine-grained CUB dataset, VisEn improves on state-of-the-art methods by a margin, while using visual descriptors one order of magnitude smaller.
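
As a rough illustration of the idea described in the abstract, the sketch below (not the authors' VisEn implementation) shows a CNN encoder feeding an LSTM decoder that is trained to emit a sequence of attribute words under weak, class-level supervision; after training, the pooled CNN features are what a downstream ZSL method would consume as the enhanced visual embedding. All names and dimensions (AttributeCaptioner, embed_dim, hidden_dim, the ResNet-101 backbone) are illustrative assumptions.

```python
# Minimal sketch, assuming a ResNet-101 backbone and a small attribute vocabulary.
# This is an illustration of "CNN encoder + LSTM captioner trained on attribute
# sequences", not the paper's exact architecture or training procedure.
import torch
import torch.nn as nn
from torchvision import models


class AttributeCaptioner(nn.Module):
    def __init__(self, num_attributes, embed_dim=256, hidden_dim=512):
        super().__init__()
        backbone = models.resnet101()  # pretrained weights would normally be loaded here
        self.encoder = nn.Sequential(*list(backbone.children())[:-1])  # drop the fc head
        feat_dim = backbone.fc.in_features  # 2048 for ResNet-101
        self.init_h = nn.Linear(feat_dim, hidden_dim)
        self.embed = nn.Embedding(num_attributes + 2, embed_dim)  # +2 for BOS/EOS tokens
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, num_attributes + 2)

    def features(self, images):
        # Pooled CNN features: after captioning training, these would serve as
        # the enhanced visual embeddings handed to any generic ZSL method.
        return self.encoder(images).flatten(1)

    def forward(self, images, attr_tokens):
        # Teacher-forced decoding of the class-level attribute sequence
        # (the weak supervision standing in for human-written captions).
        feats = self.features(images)
        h0 = torch.tanh(self.init_h(feats)).unsqueeze(0)
        c0 = torch.zeros_like(h0)
        emb = self.embed(attr_tokens)          # (batch, seq_len, embed_dim)
        hidden, _ = self.lstm(emb, (h0, c0))   # (batch, seq_len, hidden_dim)
        return self.out(hidden)                # per-step logits over the attribute vocabulary
```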
2019
978-1-7281-5023-9
Files in this product:
File: Bustreo_Enhancing_Visual_Embeddings_through_Weakly_Supervised_Captioning_for_Zero-Shot_Learning_ICCVW_2019_paper.pdf
Access: open access
Type: Post-print document
Size: 754.75 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11567/1017177
Citations
  • PMC: not available
  • Scopus: 3
  • Web of Science (ISI): 1