Enhancing visual embeddings through weakly supervised captioning for zero-shot learning

Bustreo M.;Cavazza J.;Murino V.
2019-01-01

Abstract

Visual features designed for image classification have been shown to be useful in zero-shot learning (ZSL) when generalizing towards classes not seen during training. In this paper, we argue that a more effective way of building visual features for ZSL is to extract them through captioning, so as not merely to classify an image but, instead, to describe it. However, modern captioning models rely on a massive level of supervision, e.g. up to 15 extended descriptions per instance provided by humans, which is simply not available for ZSL benchmarks. In the latter, in fact, the available annotations only inform about the presence or absence of attributes within a fixed list. Worse, attributes are seldom annotated at the image level but rather at the class level only: because of this, the annotation cannot be visually grounded. In this paper, we deal with such a weakly supervised regime to train an end-to-end LSTM captioner, whose backbone CNN image encoder can provide better features for ZSL. Our enhancement of visual features, called 'VisEn', is compatible with any generic ZSL method, without requiring changes in its pipeline (apart from adapting hyper-parameters). Experimentally, VisEn sharply improves recognition performance on unseen classes, as we demonstrate through an ablation study encompassing different ZSL approaches. Further, on the challenging fine-grained CUB dataset, VisEn improves on state-of-the-art methods by a margin, while using visual descriptors one order of magnitude smaller.
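
As a rough illustration of the idea described in the abstract, the sketch below (not the authors' VisEn implementation) shows a CNN encoder feeding an LSTM decoder that is trained to emit a sequence of attribute words under weak, class-level supervision; after training, the pooled CNN features are what a downstream ZSL method would consume as the enhanced visual embedding. All names and dimensions (AttributeCaptioner, embed_dim, hidden_dim, the ResNet-101 backbone) are illustrative assumptions.

```python
# Minimal sketch, assuming a ResNet-101 backbone and a small attribute vocabulary.
# This is an illustration of "CNN encoder + LSTM captioner trained on attribute
# sequences", not the paper's exact architecture or training procedure.
import torch
import torch.nn as nn
from torchvision import models


class AttributeCaptioner(nn.Module):
    def __init__(self, num_attributes, embed_dim=256, hidden_dim=512):
        super().__init__()
        backbone = models.resnet101()  # pretrained weights would normally be loaded here
        self.encoder = nn.Sequential(*list(backbone.children())[:-1])  # drop the fc head
        feat_dim = backbone.fc.in_features  # 2048 for ResNet-101
        self.init_h = nn.Linear(feat_dim, hidden_dim)
        self.embed = nn.Embedding(num_attributes + 2, embed_dim)  # +2 for BOS/EOS tokens
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, num_attributes + 2)

    def features(self, images):
        # Pooled CNN features: after captioning training, these would serve as
        # the enhanced visual embeddings handed to any generic ZSL method.
        return self.encoder(images).flatten(1)

    def forward(self, images, attr_tokens):
        # Teacher-forced decoding of the class-level attribute sequence
        # (the weak supervision standing in for human-written captions).
        feats = self.features(images)
        h0 = torch.tanh(self.init_h(feats)).unsqueeze(0)
        c0 = torch.zeros_like(h0)
        emb = self.embed(attr_tokens)          # (batch, seq_len, embed_dim)
        hidden, _ = self.lstm(emb, (h0, c0))   # (batch, seq_len, hidden_dim)
        return self.out(hidden)                # per-step logits over the attribute vocabulary
```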
2019
978-1-7281-5023-9
Files in this product:
File: Bustreo_Enhancing_Visual_Embeddings_through_Weakly_Supervised_Captioning_for_Zero-Shot_Learning_ICCVW_2019_paper.pdf
Access: open access
Type: Post-print document
Size: 754.75 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11567/1017177
Citations
  • PMC: not available
  • Scopus: 3
  • Web of Science (ISI): 1