Web content mining with multi-source machine learning for intelligent web agents.

IRIS - Institutional Research Information System
IRIS è il sistema di gestione integrata dei dati della ricerca (persone, progetti, pubblicazioni, attività) adottato dall'Università degli Studi dell’Insubria.

IRInSubria - Institutional Repository Insubria
IRInSubria raccoglie, conserva, documenta e dissemina le informazioni sulla produzione scientifica dell'Università degli Studi dell’Insubria anche ai fini della valutazione della ricerca.

The web is recognized as the largest data source in the world. The nature of such data is characterized by partial or no structure, and even worse there exist no standard data schema for the even low-volumed structured data. Web Mining aims to extract useful knowledge from the Web by using a variety of techniques that have to cope with the heterogeneity and lack of a unique and fixed way of representing information. An important aspect in Web Mining is played by the automation of extraction rules with proper algorithms. Machine Learning techniques have been successfully applied toWeb Mining and Information Extraction tasks thanks to the generalization and adaptation capabilities that are a key requirement on general content, heterogeneous web pages. The World Wide Web is a graph, more precisely a directed labeled graph where the nodes are represented by the pages and the edges are represented by links between them. Recent works propose the exploitation of the web structure (Link Analysis) for content extraction, for example one can leverage the content category of neighbor pages to categorize the contents of difficult web pages where word-frequency-based techniques are not robust enough. In this thesis we propose an automated method suitable for a wide range of domains based on Machine Learning and Link Analysis. In particular we propose an inductive model able to recognize content pages where structured information is located after being trained with proper input data. In order to keep the recognition speed high enough for real-world applications an additional algorithm is proposed which lets the approach to boost both in speed and quality. The proposed method has been tested with controlled dataset in a classic train-and-test scenario and in a real-world web crawling system.

Web content mining with multi-source machine learning for intelligent web agents / Carullo, Moreno. - (2011).

Web content mining with multi-source machine learning for intelligent web agents.

Carullo, Moreno

2011-01-01

Abstract

The web is recognized as the largest data source in the world. The nature of such data is characterized by partial or no structure, and even worse there exist no standard data schema for the even low-volumed structured data. Web Mining aims to extract useful knowledge from the Web by using a variety of techniques that have to cope with the heterogeneity and lack of a unique and fixed way of representing information. An important aspect in Web Mining is played by the automation of extraction rules with proper algorithms. Machine Learning techniques have been successfully applied toWeb Mining and Information Extraction tasks thanks to the generalization and adaptation capabilities that are a key requirement on general content, heterogeneous web pages. The World Wide Web is a graph, more precisely a directed labeled graph where the nodes are represented by the pages and the edges are represented by links between them. Recent works propose the exploitation of the web structure (Link Analysis) for content extraction, for example one can leverage the content category of neighbor pages to categorize the contents of difficult web pages where word-frequency-based techniques are not robust enough. In this thesis we propose an automated method suitable for a wide range of domains based on Machine Learning and Link Analysis. In particular we propose an inductive model able to recognize content pages where structured information is located after being trained with proper input data. In order to keep the recognition speed high enough for real-world applications an additional algorithm is proposed which lets the approach to boost both in speed and quality. The proposed method has been tested with controlled dataset in a classic train-and-test scenario and in a real-world web crawling system.

Scheda breve

Scheda completa

Scheda completa (DC)

	Anno di discussione
	
				2011
			
	Parole chiave
	
				web mining, machine learning
			
	Citazione
	
				Web content mining with multi-source machine learning for intelligent web agents / Carullo, Moreno. - (2011).
			
	Appare nelle tipologie:
	
				Tesi di dottorato (ex InsubriaSPACE)

File in questo prodotto:

File	Dimensione	Formato
Phd_thesis_carullo_completa.pdf accesso aperto Descrizione: testo completo tesi Tipologia: Tesi di dottorato Licenza: Non specificato Dimensione 3.3 MB Formato Adobe PDF Visualizza/Apri	3.3 MB	Adobe PDF	Visualizza/Apri

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11383/2090208

Attenzione

L'Ateneo sottopone a validazione solo i file PDF allegati

Citazioni

ND

ND

ND

social impact