Measuring Data Quality of Geoscience Datasets Using Data Mining Techniques

Currently there are many methods of collecting geoscience data, such as station observations, satellite images, sensor networks, etc. All of these data sources from different regions and time intervals are combined in geoscience research activities today. Using a mixture of several different data so...


Bibliographic Details
Main Authors: Cuo Cai, Kunqing Xie
Format: Article
Language: English
Published: Ubiquity Press 2007-10-01
Series: Data Science Journal
Subjects: Data quality, Geoscience data, Data mining, Data integration
Online Access: http://datascience.codata.org/articles/499
id doaj-8c27ea4a3135433f8983c9e296f3068b
record_format Article
spelling doaj-8c27ea4a3135433f8983c9e296f3068b
2020-11-24T22:58:03Z
eng
Ubiquity Press
Data Science Journal
1683-1470
2007-10-01
610.2481/dsj.6.S738501
Measuring Data Quality of Geoscience Datasets Using Data Mining Techniques
Cuo Cai (Center for Information Science, Peking University, Beijing 100871, China)
Kunqing Xie (Center for Information Science, Peking University, Beijing 100871, China)
http://datascience.codata.org/articles/499
Data quality
Geoscience data
Data mining
Data integration
collection DOAJ
language English
format Article
sources DOAJ
author Cuo Cai
Kunqing Xie
title Measuring Data Quality of Geoscience Datasets Using Data Mining Techniques
publisher Ubiquity Press
series Data Science Journal
issn 1683-1470
publishDate 2007-10-01
description Currently there are many methods of collecting geoscience data, such as station observations, satellite images, and sensor networks. Geoscience research today combines data from these sources across different regions and time intervals. Using a mixture of several data sources may have benefits but may also lead to severe data quality problems, such as inconsistent data and missing values. There have been efforts to produce more consistent data sets from multiple data sources. However, because of the large gaps in data quality among the sources, data quality in the resulting data sets still varies across regions and time intervals. As the construction methods of these data sets are quite complicated, it is difficult for users to know the overall data quality of a dataset, let alone the data quality for a specified location or a given time interval. In this paper, the authors address the problem by generating a data quality measure for all regions and time intervals of a dataset. The data quality measure is computed by comparing the constructed datasets with their sources or other relevant data, using data mining techniques. This paper also demonstrates how to handle major quality problems, such as outliers and missing values, by applying data mining techniques to geoscience data, especially global climate data.
topic Data quality
Geoscience data
Data mining
Data integration
url http://datascience.codata.org/articles/499
work_keys_str_mv AT cuocai measuringdataqualityofgeosciencedatasetsusingdataminingtechniques
AT kunqingxie measuringdataqualityofgeosciencedatasetsusingdataminingtechniques
_version_ 1725648567470653440
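
The description above mentions a data quality measure computed per region and time interval by comparing a constructed dataset with its sources, and the handling of outliers and missing values. The record does not say which techniques the authors use, so the following is only a minimal sketch under stated assumptions: the quality score is taken to be the share of source observations in a cell that the constructed dataset reproduces within a tolerance, and outliers are flagged with a median-absolute-deviation rule. The function names, the `(region, interval, station)` key layout, and the thresholds are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from statistics import median


def quality_by_cell(constructed, source, tolerance):
    """Return {(region, interval): score in [0, 1]}.

    Both inputs map (region, interval, station) -> value, with None
    marking a missing value (assumed layout, not the paper's).
    The score is the share of non-missing source observations in a cell
    that the constructed dataset matches within `tolerance`.
    """
    total = defaultdict(int)
    agree = defaultdict(int)
    for (region, interval, station), src_value in source.items():
        if src_value is None:
            continue  # no reference value to compare against
        cell = (region, interval)
        total[cell] += 1
        value = constructed.get((region, interval, station))
        # A missing constructed value counts against the cell's score.
        if value is not None and abs(value - src_value) <= tolerance:
            agree[cell] += 1
    return {cell: agree[cell] / total[cell] for cell in total}


def mad_outliers(values, threshold=3.5):
    """Return indices whose modified z-score exceeds `threshold`.

    Uses the median absolute deviation (MAD); None entries are skipped.
    """
    present = [v for v in values if v is not None]
    if not present:
        return []
    med = median(present)
    mad = median([abs(v - med) for v in present])
    if mad == 0:
        return []  # no spread, so nothing can be ranked as an outlier
    return [
        i for i, v in enumerate(values)
        if v is not None and 0.6745 * abs(v - med) / mad > threshold
    ]


source = {
    ("asia", 2000, "s1"): 10.0,
    ("asia", 2000, "s2"): 12.0,
    ("europe", 2000, "s3"): 5.0,
    ("europe", 2000, "s4"): None,
}
constructed = {
    ("asia", 2000, "s1"): 10.2,
    ("asia", 2000, "s2"): None,
    ("europe", 2000, "s3"): 5.1,
}
scores = quality_by_cell(constructed, source, tolerance=0.5)
flagged = mad_outliers([10.0, 10.5, 9.8, None, 10.1, 40.0])
```

Here the Asia cell loses half its score because one station is missing from the constructed data, while the single comparable Europe station agrees with its source. A per-cell score of this kind is one way a user could see where a merged dataset is more or less trustworthy.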