
A CNN-Transformer Knowledge Distillation for Remote Sensing Scene Classification

Maggiolo L.; Moser G.; Serpico S. B.
2022-01-01

Abstract

Scene classification of remote sensing images is a challenging task due to the complexity and variety of natural scenes. In recent years, Convolutional Neural Networks (CNNs) have achieved impressive performance on many remote sensing scene classification benchmarks. However, because of their local filter design, CNNs often neglect long-range visual dependencies, leading to suboptimal performance in cluttered scenes such as urban areas. The recently proposed Transformer architecture addresses this issue by taking a broader neighborhood into account through its multi-head self-attention component. In this paper, we propose a novel method that borrows ideas from knowledge distillation and applies them to recent vision Transformers. Specifically, we propose a compound loss computed jointly on a Transformer-based student and a CNN teacher, and we use it for the task of single-label scene classification. Thanks to the student's ability to capture long-range visual dependencies, together with the inductive bias inherited from the teacher, the proposed model improves classification accuracy on four well-known datasets compared to state-of-the-art approaches.
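The compound loss described above can be sketched in the standard knowledge-distillation form: a cross-entropy term on the ground-truth labels plus a temperature-scaled KL-divergence term that pulls the student's predictions toward the teacher's soft outputs. The following is a minimal numpy sketch of that generic formulation, not the paper's exact loss; the weight `alpha` and temperature `T` are assumed hyperparameters for illustration.

```python
import numpy as np

def softmax(logits, T=1.0):
    # Temperature-scaled softmax, stabilized by subtracting the row max.
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def compound_distillation_loss(student_logits, teacher_logits, labels,
                               alpha=0.5, T=2.0):
    """Generic distillation-style compound loss (illustrative, not the
    paper's exact formulation):
      (1 - alpha) * CE(student, labels)
      + alpha * T^2 * KL(teacher_soft || student_soft)
    The T^2 factor keeps the soft-target gradient scale comparable to
    the hard-label term."""
    n = student_logits.shape[0]
    # Hard-label cross-entropy on the student's (T=1) predictions.
    p_hard = softmax(student_logits)
    ce = -np.log(p_hard[np.arange(n), labels] + 1e-12).mean()
    # Soft-target KL divergence between teacher and student at temperature T.
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = (p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12))).sum(axis=-1).mean()
    return (1 - alpha) * ce + alpha * (T ** 2) * kl
```

When the student's logits match the teacher's, the KL term vanishes and only the hard-label term remains, which is the sanity check one would expect from a distillation loss.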

Use this identifier to cite or link to this document: https://hdl.handle.net/11567/1102957