Integration of heterogeneous data types using self organizing maps

With the growth of computer networks and the advancement of hardware technologies, unprecedented volumes of data have become accessible in a distributed fashion, forming heterogeneous data sources. Understanding and combining these data into data warehouses, or merging remote public data into exist...


Bibliographic Details
Main Author: Bourennani, Farid
Other Authors: Zhu, Ying
Language: en
Published: UOIT 2009
Subjects: Data integration; Information structures; Self-organizing maps; Data cleaning
Online Access: http://hdl.handle.net/10155/41
id ndltd-LACETR-oai-collectionscanada.gc.ca-OOSHDU.10155-41
record_format oai_dc
spelling ndltd-LACETR-oai-collectionscanada.gc.ca-OOSHDU.10155-41
2013-04-17T04:05:44Z
Integration of heterogeneous data types using self organizing maps
Bourennani, Farid
Data integration
Information structures
Self-organizing maps
Data cleaning
UOIT
Zhu, Ying
Pu, Ken
2009-11-12T21:45:33Z
2009-11-12T21:45:33Z
2009-07-01
Thesis
http://hdl.handle.net/10155/41
en
collection NDLTD
language en
sources NDLTD
topic Data integration
Information structures
Self-organizing maps
Data cleaning
description With the growth of computer networks and the advancement of hardware technologies, unprecedented volumes of data have become accessible in a distributed fashion, forming heterogeneous data sources. Understanding and combining these data into data warehouses, or merging remote public data into existing databases, can significantly enrich the information provided by these data. This problem is called data integration: combining data residing at different sources and providing the user with a unified view of these data. There are two issues with making use of remote data sources: (1) discovery of relevant data sources, and (2) performing the proper joins between the local data source and the relevant remote databases. Both can be solved if one can effectively identify semantically related attributes between the local data sources and the available remote data sources. However, performing these tasks manually is time-consuming because of the large data sizes and the unavailability of schema documentation; therefore, an automated tool would be more suitable. Automatically detecting similar entities based on the content is challenging due to three factors. First, because the number of records is very large, it is difficult to perceive or discover information structures or relationships. Second, the schemas of the databases are unfamiliar; therefore, detecting relevant data is difficult. Third, the database entity types are heterogeneous, and there is no existing solution for extracting a richer classification result from the processing of two different data types, or at least from textual and numerical data. We propose to utilize self-organizing maps (SOM) to aid the visual exploration of the large data volumes. The unsupervised classification property of SOM facilitates the integration of completely unfamiliar relational database tables and attributes based on the contents.
In order to accommodate heterogeneous data types found in relational databases, we extended the term frequency – inverse document frequency (TF-IDF) measure to handle numerical and textual attribute types by unified vectorization processing. The resulting map allows the user to browse the heterogeneously typed database attributes and discover clusters of documents (attributes) having similar content. The discovered clusters can significantly aid in the manual or automated construction of data integrity constraints in data cleaning or schema mappings for data integration.
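The idea in the abstract (treat each database attribute as a "document", vectorize textual and numerical attributes in one TF-IDF space, then place the vectors on a self-organizing map) can be illustrated with a minimal sketch. This is not the thesis's actual method: the abstract does not say how TF-IDF was extended to numbers, so the equal-width binning below is a stand-in assumption, and the table and column names, the value range, the 2x2 grid, and all function names (`tokens`, `tfidf`, `cosine`, `best_unit`, `train_som`) are hypothetical.

```python
import math
import random
from collections import Counter

def tokens(values, n_bins=4, lo=0.0, hi=100.0):
    """Turn one attribute's values into tokens: words for text, bin labels for numbers."""
    out = []
    for v in values:
        if isinstance(v, (int, float)):
            # Assumption: equal-width bins over a known range [lo, hi].
            b = min(n_bins - 1, max(0, int((v - lo) / (hi - lo) * n_bins)))
            out.append(f"num_bin_{b}")
        else:
            out.extend(str(v).lower().split())
    return out

def tfidf(docs):
    """docs maps attribute name -> token list; returns unit-length TF-IDF vectors."""
    vocab = sorted({t for toks in docs.values() for t in toks})
    df = Counter(t for toks in docs.values() for t in set(toks))
    vectors = {}
    for name, toks in docs.items():
        tf = Counter(toks)
        vec = [tf[t] / len(toks) * math.log(len(docs) / df[t]) for t in vocab]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors[name] = [x / norm for x in vec]
    return vectors

def cosine(a, b):
    """Cosine similarity; the inputs are already unit length, so this is a dot product."""
    return sum(x * y for x, y in zip(a, b))

def best_unit(grid, x):
    """Grid position whose weight vector is closest to x (the best matching unit)."""
    return min(grid, key=lambda p: sum((w - xi) ** 2 for w, xi in zip(grid[p], x)))

def train_som(data, rows=2, cols=2, epochs=200, seed=0):
    """Tiny online SOM with a Gaussian neighbourhood and linearly decaying rates."""
    rng = random.Random(seed)
    dim = len(data[0])
    grid = {(r, c): [rng.random() for _ in range(dim)]
            for r in range(rows) for c in range(cols)}
    for e in range(epochs):
        frac = 1.0 - e / epochs
        lr = 0.5 * frac
        radius = max(rows, cols) / 2 * frac + 0.01
        x = rng.choice(data)
        bmu = best_unit(grid, x)
        for pos, w in grid.items():
            d2 = (pos[0] - bmu[0]) ** 2 + (pos[1] - bmu[1]) ** 2
            h = math.exp(-d2 / (2 * radius ** 2))
            for i in range(dim):
                w[i] += lr * h * (x[i] - w[i])
    return grid

# Hypothetical attributes from two unfamiliar databases.
columns = {
    "customers.city": ["Toronto", "Oshawa", "Toronto"],
    "suppliers.town": ["Oshawa", "Toronto", "Ottawa"],
    "orders.amount": [12.5, 80.0, 45.0],
    "invoices.total": [15.0, 78.0, 44.0],
}
vectors = tfidf({name: tokens(vals) for name, vals in columns.items()})
grid = train_som(list(vectors.values()))
unit_of = {name: best_unit(grid, vec) for name, vec in vectors.items()}
for name, unit in sorted(unit_of.items()):
    print(name, "-> map unit", unit)
```

In this toy data the two numeric attributes fall into the same bins, so they receive identical vectors and land on the same map unit, while the two city attributes share most of their vocabulary and are therefore close in cosine similarity. A real schema-matching setting would need a principled choice of numeric encoding and a much larger map.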
author2 Zhu, Ying
author Bourennani, Farid
title Integration of heterogeneous data types using self organizing maps
publisher UOIT
publishDate 2009
url http://hdl.handle.net/10155/41