A self-supervised seed-driven approach to topic modelling and clustering

Bahrainian, Seyed Ali; Raballo, Andrea; Mira, Antonietta; Crestani, Fabio

2024-01-01

Abstract

Topic models are useful tools for extracting the most salient themes from a collection of documents and grouping the documents into clusters, each representative of a specific topic. These clusters summarize the semantic content of the documents and make the collection easier to interpret. In this work, we present a lightweight approach that learns topic representations in a self-supervised fashion. More specifically, we propose a lightweight and scalable architecture that uses a seed-word-driven approach to co-learn a representation from a document and its corresponding word embeddings. Results on a variety of datasets of different sizes and types show that our model extracts meaningful topics. Furthermore, our experiments on five benchmark datasets show that our model outperforms both traditional and neural topic modelling baselines on several coherence and clustering accuracy measures.
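To make the seed-word idea concrete, the following is a minimal sketch and not the architecture proposed in the article, which co-learns document and word representations. It only assumes fixed word vectors and assigns a document to the topic whose seed-word centroid is closest in cosine similarity. The embedding values, the seed lists and the names `EMBEDDINGS`, `SEEDS`, `mean_vector`, `cosine` and `assign_topic` are all hypothetical and chosen purely for illustration.

```python
import math

# Toy 2-D word embeddings (hypothetical values, for illustration only).
EMBEDDINGS = {
    "goal": (0.90, 0.10), "match": (0.80, 0.20), "team": (0.85, 0.15),
    "vote": (0.10, 0.90), "election": (0.20, 0.80), "party": (0.15, 0.85),
}

# Hypothetical seed words: a small hand-picked set per topic.
SEEDS = {"sports": ["goal", "team"], "politics": ["vote", "election"]}


def mean_vector(words):
    """Average the embeddings of the known words in `words`."""
    vecs = [EMBEDDINGS[w] for w in words if w in EMBEDDINGS]
    if not vecs:
        raise ValueError("no known words to embed")
    dim = len(vecs[0])
    return tuple(sum(v[i] for v in vecs) / len(vecs) for i in range(dim))


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def assign_topic(document):
    """Return the topic whose seed-word centroid is most similar to the document."""
    doc_vec = mean_vector(document.lower().split())
    centroids = {topic: mean_vector(words) for topic, words in SEEDS.items()}
    return max(centroids, key=lambda topic: cosine(doc_vec, centroids[topic]))


if __name__ == "__main__":
    for text in ("team match goal", "party election vote"):
        print(text, "->", assign_topic(text))
```

In this simplified setting the seed words act only as fixed anchors; a learned model would instead update the representations during training.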

Year: 2024
Year of online publication: 2024
Journal: Journal of Intelligent Information Systems
DOI: https://dx.doi.org/10.1007/s10844-024-00891-8
Scopus ID: 2-s2.0-85205390833
Keywords: Topic models; Bayesian optimization; Word embeddings; Seed-words learning; BERT
All authors: Ravenda, Federico; Bahrainian, Seyed Ali; Raballo, Andrea; Mira, Antonietta; Crestani, Fabio
Publication type: Journal article

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11383/2184535
