Kernel Methods for Tree Structured Data

Machine learning comprises a series of techniques for automatic extraction of meaningful information from large collections of noisy data. In many real world applications, data is naturally represented in structured form. Since traditional methods in machine learning deal with vectorial informatio...

Full description

Bibliographic Details
Main Author:	Da San Martino, Giovanni <1979>
Other Authors:	Sperduti, Alessandro
Format:	Doctoral Thesis
Language:	en
Published:	Alma Mater Studiorum - Università di Bologna 2009
Subjects:	INF/01 Informatica
Online Access:	http://amsdottorato.unibo.it/1400/

id	ndltd-unibo.it-oai-amsdottorato.cib.unibo.it-1400
record_format	oai_dc
spelling	ndltd-unibo.it-oai-amsdottorato.cib.unibo.it-14002014-03-24T16:27:47Z Kernel Methods for Tree Structured Data Da San Martino, Giovanni <1979> INF/01 Informatica Machine learning comprises a series of techniques for automatic extraction of meaningful information from large collections of noisy data. In many real world applications, data is naturally represented in structured form. Since traditional methods in machine learning deal with vectorial information, they require an a priori form of preprocessing. Among all the learning techniques for dealing with structured data, kernel methods are recognized to have a strong theoretical background and to be effective approaches. They do not require an explicit vectorial representation of the data in terms of features, but rely on a measure of similarity between any pair of objects of a domain, the kernel function. Designing fast and good kernel functions is a challenging problem. In the case of tree structured data two issues become relevant: kernel for trees should not be sparse and should be fast to compute. The sparsity problem arises when, given a dataset and a kernel function, most structures of the dataset are completely dissimilar to one another. In those cases the classifier has too few information for making correct predictions on unseen data. In fact, it tends to produce a discriminating function behaving as the nearest neighbour rule. Sparsity is likely to arise for some standard tree kernel functions, such as the subtree and subset tree kernel, when they are applied to datasets with node labels belonging to a large domain. A second drawback of using tree kernels is the time complexity required both in learning and classification phases. Such a complexity can sometimes prevents the kernel application in scenarios involving large amount of data. This thesis proposes three contributions for resolving the above issues of kernel for trees. A first contribution aims at creating kernel functions which adapt to the statistical properties of the dataset, thus reducing its sparsity with respect to traditional tree kernel functions. Specifically, we propose to encode the input trees by an algorithm able to project the data onto a lower dimensional space with the property that similar structures are mapped similarly. By building kernel functions on the lower dimensional representation, we are able to perform inexact matchings between different inputs in the original space. A second contribution is the proposal of a novel kernel function based on the convolution kernel framework. Convolution kernel measures the similarity of two objects in terms of the similarities of their subparts. Most convolution kernels are based on counting the number of shared substructures, partially discarding information about their position in the original structure. The kernel function we propose is, instead, especially focused on this aspect. A third contribution is devoted at reducing the computational burden related to the calculation of a kernel function between a tree and a forest of trees, which is a typical operation in the classification phase and, for some algorithms, also in the learning phase. We propose a general methodology applicable to convolution kernels. Moreover, we show an instantiation of our technique when kernels such as the subtree and subset tree kernels are employed. In those cases, Direct Acyclic Graphs can be used to compactly represent shared substructures in different trees, thus reducing the computational burden and storage requirements. Alma Mater Studiorum - Università di Bologna Sperduti, Alessandro 2009-04-20 Doctoral Thesis PeerReviewed application/pdf en http://amsdottorato.unibo.it/1400/ info:eu-repo/semantics/openAccess
collection	NDLTD
language	en
format	Doctoral Thesis
sources	NDLTD
topic	INF/01 Informatica
spellingShingle	INF/01 Informatica Da San Martino, Giovanni <1979> Kernel Methods for Tree Structured Data
description	Machine learning comprises a series of techniques for automatic extraction of meaningful information from large collections of noisy data. In many real world applications, data is naturally represented in structured form. Since traditional methods in machine learning deal with vectorial information, they require an a priori form of preprocessing. Among all the learning techniques for dealing with structured data, kernel methods are recognized to have a strong theoretical background and to be effective approaches. They do not require an explicit vectorial representation of the data in terms of features, but rely on a measure of similarity between any pair of objects of a domain, the kernel function. Designing fast and good kernel functions is a challenging problem. In the case of tree structured data two issues become relevant: kernel for trees should not be sparse and should be fast to compute. The sparsity problem arises when, given a dataset and a kernel function, most structures of the dataset are completely dissimilar to one another. In those cases the classifier has too few information for making correct predictions on unseen data. In fact, it tends to produce a discriminating function behaving as the nearest neighbour rule. Sparsity is likely to arise for some standard tree kernel functions, such as the subtree and subset tree kernel, when they are applied to datasets with node labels belonging to a large domain. A second drawback of using tree kernels is the time complexity required both in learning and classification phases. Such a complexity can sometimes prevents the kernel application in scenarios involving large amount of data. This thesis proposes three contributions for resolving the above issues of kernel for trees. A first contribution aims at creating kernel functions which adapt to the statistical properties of the dataset, thus reducing its sparsity with respect to traditional tree kernel functions. Specifically, we propose to encode the input trees by an algorithm able to project the data onto a lower dimensional space with the property that similar structures are mapped similarly. By building kernel functions on the lower dimensional representation, we are able to perform inexact matchings between different inputs in the original space. A second contribution is the proposal of a novel kernel function based on the convolution kernel framework. Convolution kernel measures the similarity of two objects in terms of the similarities of their subparts. Most convolution kernels are based on counting the number of shared substructures, partially discarding information about their position in the original structure. The kernel function we propose is, instead, especially focused on this aspect. A third contribution is devoted at reducing the computational burden related to the calculation of a kernel function between a tree and a forest of trees, which is a typical operation in the classification phase and, for some algorithms, also in the learning phase. We propose a general methodology applicable to convolution kernels. Moreover, we show an instantiation of our technique when kernels such as the subtree and subset tree kernels are employed. In those cases, Direct Acyclic Graphs can be used to compactly represent shared substructures in different trees, thus reducing the computational burden and storage requirements.
author2	Sperduti, Alessandro
author_facet	Sperduti, Alessandro Da San Martino, Giovanni <1979>
author	Da San Martino, Giovanni <1979>
author_sort	Da San Martino, Giovanni <1979>
title	Kernel Methods for Tree Structured Data
title_short	Kernel Methods for Tree Structured Data
title_full	Kernel Methods for Tree Structured Data
title_fullStr	Kernel Methods for Tree Structured Data
title_full_unstemmed	Kernel Methods for Tree Structured Data
title_sort	kernel methods for tree structured data
publisher	Alma Mater Studiorum - Università di Bologna
publishDate	2009
url	http://amsdottorato.unibo.it/1400/
work_keys_str_mv	AT dasanmartinogiovanni1979 kernelmethodsfortreestructureddata
_version_	1716653917861838848

Kernel Methods for Tree Structured Data

Similar Items