Chameleon: A Multimodal Learning Framework Robust to Missing Modalities

IRIS - Institutional Research Information System
IRIS è il sistema di gestione integrata dei dati della ricerca (persone, progetti, pubblicazioni, attività) adottato dall'Università degli Studi dell’Insubria.

IRInSubria - Institutional Repository Insubria
IRInSubria raccoglie, conserva, documenta e dissemina le informazioni sulla produzione scientifica dell'Università degli Studi dell’Insubria anche ai fini della valutazione della ricerca.

Multimodal learning has demonstrated remarkable performance improvements over unimodal architectures. However, multimodal learning methods often exhibit deteriorated performances if one or more modalities are missing. This may be attributed to the commonly used multi-branch design containing modality-specific components, making such approaches reliant on the availability of a complete set of modalities. In this work, we propose a robust multimodal learning framework, Chameleon, that adapts a common-space visual learning network to align all input modalities. To enable this, we present the unification of input modalities into one format by encoding any non-visual modality into visual representations thus making it robust to missing modalities. Extensive experiments are performed on multimodal classification task using four textual-visual (Hateful Memes, UPMC Food-101, MM-IMDb, and Ferramenta) and two audio-visual (avMNIST, VoxCeleb) datasets. Chameleon not only achieves superior performance when all modalities are present at train/test time but also demonstrates notable resilience in the case of missing modalities.

Chameleon: A Multimodal Learning Framework Robust to Missing Modalities

Liaqat, Muhammad Irzam;Nawaz, Shah;Zaheer, Muhammad Zaigham;Saeed, Muhammad Saad;Sajjad, Hassan;De Schepper, Tom;Nandakumar, Karthik;Khan, Muhammad Haris;Gallo, Ignazio;Schedl, Markus

2025-01-01

Abstract

Multimodal learning has demonstrated remarkable performance improvements over unimodal architectures. However, multimodal learning methods often exhibit deteriorated performances if one or more modalities are missing. This may be attributed to the commonly used multi-branch design containing modality-specific components, making such approaches reliant on the availability of a complete set of modalities. In this work, we propose a robust multimodal learning framework, Chameleon, that adapts a common-space visual learning network to align all input modalities. To enable this, we present the unification of input modalities into one format by encoding any non-visual modality into visual representations thus making it robust to missing modalities. Extensive experiments are performed on multimodal classification task using four textual-visual (Hateful Memes, UPMC Food-101, MM-IMDb, and Ferramenta) and two audio-visual (avMNIST, VoxCeleb) datasets. Chameleon not only achieves superior performance when all modalities are present at train/test time but also demonstrates notable resilience in the case of missing modalities.

Scheda breve

Scheda completa

Scheda completa (DC)

	Anno
	
				2025
			
	Rivista
	
				INTERNATIONAL JOURNAL OF MULTIMEDIA INFORMATION RETRIEVAL
			
	DOI
	
				https://dx.doi.org/10.1007/s13735-025-00370-y
			
	Codice Web of Science
	
				WOS:001500480200001
			
	Codice Scopus
	
				2-s2.0-105007433645
			
	Parole chiave
	
				Multimodal learning; Vision and other modalities; Missing modalities
			
	Tutti gli autori
	
						Liaqat, Muhammad Irzam; Nawaz, Shah; Zaheer, Muhammad Zaigham; Saeed, Muhammad Saad; Sajjad, Hassan; De Schepper, Tom; Nandakumar, Karthik; Khan, Muha...espandi
						
	Appare nelle tipologie:
	
				Articolo su Rivista

File in questo prodotto:

Non ci sono file associati a questo prodotto.

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11383/2193991

Attenzione

L'Ateneo sottopone a validazione solo i file PDF allegati

Citazioni

ND

12

5

social impact