Data and Text Mining Techniques for In-Domain and Cross-Domain Applications

In the big data era, a vast amount of data has been generated in different domains, from social media to news feeds, from health care to genomic functionalities. When addressing a problem, we usually need to harness multiple disparate datasets. Data from different domains may follow different modali...

Bibliographic Details
Main Author: Domeniconi, Giacomo <1986>
Other Authors: Moro, Gianluca
Format: Doctoral Thesis
Language: en
Published: Alma Mater Studiorum - Università di Bologna 2016
Subjects: ING-INF/05 Sistemi di elaborazione delle informazioni
Online Access: http://amsdottorato.unibo.it/7494/
id ndltd-unibo.it-oai-amsdottorato.cib.unibo.it-7494
record_format oai_dc
spelling ndltd-unibo.it-oai-amsdottorato.cib.unibo.it-7494 2016-09-06T05:02:38Z Data and Text Mining Techniques for In-Domain and Cross-Domain Applications Domeniconi, Giacomo <1986> ING-INF/05 Sistemi di elaborazione delle informazioni Alma Mater Studiorum - Università di Bologna Moro, Gianluca Sartori, Claudio 2016-05-12 Doctoral Thesis PeerReviewed application/pdf en http://amsdottorato.unibo.it/7494/ info:eu-repo/semantics/embargoedAccess info:eu-repo/date/embargoEnd/2017-02-28
collection NDLTD
language en
format Doctoral Thesis
sources NDLTD
topic ING-INF/05 Sistemi di elaborazione delle informazioni
spellingShingle ING-INF/05 Sistemi di elaborazione delle informazioni
Domeniconi, Giacomo <1986>
Data and Text Mining Techniques for In-Domain and Cross-Domain Applications
description In the big data era, a vast amount of data has been generated in different domains, from social media to news feeds, from health care to genomic functionalities. When addressing a problem, we usually need to harness multiple disparate datasets. Data from different domains may follow different modalities, each of which has a different representation, distribution, scale and density. For example, text is usually represented as discrete sparse word count vectors, whereas an image is represented by pixel intensities. Many Data Mining and Machine Learning techniques have been proposed in the literature and have already achieved significant success in many knowledge engineering areas, including classification, regression and clustering. However, some challenging issues remain when tackling a new problem: How should the problem be represented? Which approach is best among the many available? What information should be used in the Machine Learning task, and how should it be represented? Are there other domains from which knowledge can be borrowed? This dissertation proposes some possible representation approaches for problems in different domains, from text mining to genomic analysis. In particular, one of the major contributions is a different way to represent a classical classification problem: instead of using one instance for each object (a document, a gene, a social post, etc.) to be classified, it proposes using a pair of objects or an object-class pair, with the relationship between them as the label. This approach is tested on both flat and hierarchical text categorization datasets, where it potentially allows the efficient addition of new categories during classification. Furthermore, the same idea is used to extract conversational threads from an unregulated pool of messages and to classify the biomedical literature based on the genomic features treated.
author2 Moro, Gianluca
author_facet Moro, Gianluca
Domeniconi, Giacomo <1986>
author Domeniconi, Giacomo <1986>
author_sort Domeniconi, Giacomo <1986>
title Data and Text Mining Techniques for In-Domain and Cross-Domain Applications
title_short Data and Text Mining Techniques for In-Domain and Cross-Domain Applications
title_full Data and Text Mining Techniques for In-Domain and Cross-Domain Applications
title_fullStr Data and Text Mining Techniques for In-Domain and Cross-Domain Applications
title_full_unstemmed Data and Text Mining Techniques for In-Domain and Cross-Domain Applications
title_sort data and text mining techniques for in-domain and cross-domain applications
publisher Alma Mater Studiorum - Università di Bologna
publishDate 2016
url http://amsdottorato.unibo.it/7494/
work_keys_str_mv AT domeniconigiacomo1986 dataandtextminingtechniquesforindomainandcrossdomainapplications
_version_ 1718382820479467520