
A CNN-Transformer Knowledge Distillation for Remote Sensing Scene Classification

Maggiolo L.; Moser G.; Serpico S. B.
2022-01-01

Abstract

Scene classification of remote sensing images is a challenging task due to the complexity and variety of natural scenes. In recent years, Convolutional Neural Networks (CNNs) have achieved impressive performance on many remote sensing scene classification benchmarks. However, because of their local filter design, CNNs often neglect long-range visual dependencies, leading to suboptimal performance in cluttered scenes such as urban areas. The recently proposed Transformer architecture addresses this issue by taking a broader neighborhood into account through its multi-head self-attention component. In this paper, we propose a novel method that borrows ideas from knowledge distillation and applies them to recent vision Transformers. Specifically, we propose a compound loss computed jointly on a Transformer-based student and a CNN teacher, and we use it for the task of single-label scene classification. Thanks to the student's ability to capture long-range visual dependencies, together with the inductive bias inherited from the teacher, the proposed model improves classification accuracy on four well-known datasets compared to state-of-the-art approaches.
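The compound loss described above can be sketched in the standard knowledge-distillation form: a cross-entropy term on the ground-truth labels plus a temperature-scaled KL-divergence term that pulls the student's predictions toward the teacher's soft outputs. The following is a minimal numpy sketch of that generic formulation, not the paper's exact loss; the weight `alpha` and temperature `T` are assumed hyperparameters for illustration.

```python
import numpy as np

def softmax(logits, T=1.0):
    # Temperature-scaled softmax, stabilized by subtracting the row max.
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def compound_distillation_loss(student_logits, teacher_logits, labels,
                               alpha=0.5, T=2.0):
    """Generic distillation-style compound loss (illustrative, not the
    paper's exact formulation):
      (1 - alpha) * CE(student, labels)
      + alpha * T^2 * KL(teacher_soft || student_soft)
    The T^2 factor keeps the soft-target gradient scale comparable to
    the hard-label term."""
    n = student_logits.shape[0]
    # Hard-label cross-entropy on the student's (T=1) predictions.
    p_hard = softmax(student_logits)
    ce = -np.log(p_hard[np.arange(n), labels] + 1e-12).mean()
    # Soft-target KL divergence between teacher and student at temperature T.
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = (p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12))).sum(axis=-1).mean()
    return (1 - alpha) * ce + alpha * (T ** 2) * kl
```

When the student's logits match the teacher's, the KL term vanishes and only the hard-label term remains, which is the sanity check one would expect from a distillation loss.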

Use this identifier to cite or link to this document: https://hdl.handle.net/11567/1102957