Integration of heterogeneous data types using self organizing maps

With the growth of computer networks and the advancement of hardware technologies, unprecedented volumes of data have become accessible in a distributed fashion, forming heterogeneous data sources. Understanding and combining these data into data warehouses, or merging remote public data into exist...

Bibliographic Details
Main Author:	Bourennani, Farid
Other Authors:	Zhu, Ying
Language:	en
Published:	UOIT 2009
Subjects:	Data integration; Information structures; Self-organizing maps; Data cleaning
Online Access:	http://hdl.handle.net/10155/41

id	ndltd-LACETR-oai-collectionscanada.gc.ca-OOSHDU.10155-41
record_format	oai_dc
spelling	ndltd-LACETR-oai-collectionscanada.gc.ca-OOSHDU.10155-41; 2013-04-17T04:05:44Z; Integration of heterogeneous data types using self organizing maps; Bourennani, Farid; Data integration; Information structures; Self-organizing maps; Data cleaning; UOIT; Zhu, Ying; Pu, Ken; 2009-11-12T21:45:33Z; 2009-11-12T21:45:33Z; 2009-07-01; Thesis; http://hdl.handle.net/10155/41; en
collection	NDLTD
language	en
sources	NDLTD
topic	Data integration; Information structures; Self-organizing maps; Data cleaning
spellingShingle	Data integration; Information structures; Self-organizing maps; Data cleaning; Bourennani, Farid; Integration of heterogeneous data types using self organizing maps
description	With the growth of computer networks and the advancement of hardware technologies, unprecedented volumes of data have become accessible in a distributed fashion, forming heterogeneous data sources. Understanding and combining these data into data warehouses, or merging remote public data into existing databases, can significantly enrich the information provided by these data. This problem is called data integration: combining data residing at different sources and providing the user with a unified view of these data. There are two issues with making use of remote data sources: (1) discovery of relevant data sources, and (2) performing the proper joins between the local data source and the relevant remote databases. Both can be solved if one can effectively identify semantically related attributes between the local data sources and the available remote data sources. However, performing these tasks manually is time-consuming because of the large data sizes and the unavailability of schema documentation; therefore, an automated tool would be more suitable. Automatically detecting similar entities based on their content is challenging due to three factors. First, because the number of records is very large, it is difficult to perceive or discover information structures or relationships. Second, the schemas of the databases are unfamiliar; therefore, detecting relevant data is difficult. Third, the database entity types are heterogeneous, and there is no existing solution for extracting a richer classification result from the processing of two different data types, or at least from textual and numerical data. We propose to utilize self-organizing maps (SOM) to aid the visual exploration of the large data volumes. The unsupervised classification property of SOM facilitates the integration of completely unfamiliar relational database tables and attributes based on their contents. In order to accommodate heterogeneous data types found in relational databases, we extended the term frequency – inverse document frequency (TF-IDF) measure to handle numerical and textual attribute types by unified vectorization processing. The resulting map allows the user to browse the heterogeneously typed database attributes and discover clusters of documents (attributes) having similar content. The discovered clusters can significantly aid in the manual or automated construction of data integrity constraints in data cleaning or schema mappings for data integration.
author2	Zhu, Ying
author_facet	Zhu, Ying; Bourennani, Farid
author	Bourennani, Farid
author_sort	Bourennani, Farid
title	Integration of heterogeneous data types using self organizing maps
title_short	Integration of heterogeneous data types using self organizing maps
title_full	Integration of heterogeneous data types using self organizing maps
title_fullStr	Integration of heterogeneous data types using self organizing maps
title_full_unstemmed	Integration of heterogeneous data types using self organizing maps
title_sort	integration of heterogeneous data types using self organizing maps
publisher	UOIT
publishDate	2009
url	http://hdl.handle.net/10155/41
work_keys_str_mv	AT bourennanifarid integrationofheterogeneousdatatypesusingselforganizingmaps
_version_	1716580205292683264
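
The approach summarized in the description field (treating each database attribute as a document, vectorizing textual and numerical values together with TF-IDF, and placing the attributes on a self-organizing map) can be illustrated with a minimal sketch. This is not the thesis's implementation. The record does not say how numerical values are vectorized, so mapping numbers to fixed-width bin tokens is an assumption made here for illustration, and every name below (column_tokens, tfidf_vectors, train_som, best_matching_unit, the sample columns, the 0 to 100 value range) is hypothetical.

```python
import math
import random
from collections import Counter


def column_tokens(values, lo=0.0, hi=100.0, n_bins=4):
    """Turn one attribute's values into tokens.

    ASSUMPTION: numbers become equal-width bin labels over [lo, hi];
    text is lowercased and split on whitespace.
    """
    tokens = []
    for v in values:
        if isinstance(v, (int, float)):
            clipped = min(max(float(v), lo), hi)
            idx = min(int((clipped - lo) / (hi - lo) * n_bins), n_bins - 1)
            tokens.append(f"num_bin_{idx}")
        else:
            tokens.extend(str(v).lower().split())
    return tokens


def tfidf_vectors(columns):
    """Build one L2-normalized TF-IDF vector per attribute (one 'document' each)."""
    counts = {name: Counter(column_tokens(vals)) for name, vals in columns.items()}
    vocab = sorted({t for c in counts.values() for t in c})
    n_docs = len(counts)
    # Smoothed inverse document frequency.
    idf = {
        t: math.log((1 + n_docs) / (1 + sum(t in c for c in counts.values()))) + 1.0
        for t in vocab
    }
    vectors = {}
    for name, c in counts.items():
        total = sum(c.values()) or 1
        vec = [c[t] / total * idf[t] for t in vocab]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors[name] = [x / norm for x in vec]
    return vocab, vectors


def best_matching_unit(weights, vec):
    """Return the grid cell whose weight vector is closest to vec."""
    return min(
        weights,
        key=lambda cell: sum((w - x) ** 2 for w, x in zip(weights[cell], vec)),
    )


def train_som(vectors, rows=3, cols=3, epochs=50, seed=0):
    """Train a small rectangular SOM with a Gaussian neighborhood."""
    rng = random.Random(seed)
    dim = len(next(iter(vectors.values())))
    weights = {
        (r, c): [rng.random() for _ in range(dim)]
        for r in range(rows)
        for c in range(cols)
    }
    data = list(vectors.values())
    for epoch in range(epochs):
        frac = 1.0 - epoch / epochs           # decays from 1 toward 0
        rate = 0.5 * frac                     # learning rate
        radius = max(rows, cols) / 2.0 * frac + 0.5
        for vec in data:
            br, bc = best_matching_unit(weights, vec)
            for (r, c), w in weights.items():
                dist2 = (r - br) ** 2 + (c - bc) ** 2
                influence = math.exp(-dist2 / (2.0 * radius ** 2))
                for i in range(dim):
                    w[i] += rate * influence * (vec[i] - w[i])
    return weights


# Hypothetical attributes from two unfamiliar tables.
columns = {
    "customer_city": ["Toronto", "Oshawa", "Toronto"],
    "town": ["Oshawa", "Toronto", "Ottawa"],
    "age": [23, 41, 67],
    "years_old": [23, 41, 67],
}
vocab, vectors = tfidf_vectors(columns)
som = train_som(vectors)
placement = {name: best_matching_unit(som, vec) for name, vec in vectors.items()}
```

In this sketch, attributes with identical content (here age and years_old) necessarily land on the same map cell, which is the kind of signal a user browsing the map could use to propose a join or a schema mapping. The binning range and map size are arbitrary choices and would need tuning for real data.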

Similar Items