An Empirical Study of Imputation Techniques for Software Data Sets
Software Project Effort/Cost/Time Estimation has been one of the hot topics of research in the current software engineering industry. Solutions for effort/cost/time estimation are in great demand. Knowledge of accurate effort/cost/time estimates early in the software project life cycle enables proje...
Main Author: | Yenduri, Sumanth |
---|---|
Other Authors: | S.S. Iyengar |
Format: | Others |
Language: | en |
Published: | LSU 2005 |
Subjects: | Computer Science |
Online Access: | http://etd.lsu.edu/docs/available/etd-05062005-163801/ |
id |
ndltd-LSU-oai-etd.lsu.edu-etd-05062005-163801 |
---|---|
record_format |
oai_dc |
spelling |
ndltd-LSU-oai-etd.lsu.edu-etd-05062005-1638012013-01-07T22:49:01Z An Empirical Study of Imputation Techniques for Software Data Sets Yenduri, Sumanth Computer Science S.S. Iyengar R.C. Ward Donald Kraft B.B. Karki G. Gu LSU 2005-05-11 text application/pdf http://etd.lsu.edu/docs/available/etd-05062005-163801/ en unrestricted I hereby certify that, if appropriate, I have obtained and attached herein a written permission statement from the owner(s) of each third party copyrighted matter to be included in my thesis, dissertation, or project report, allowing distribution as specified below. I certify that the version I submitted is the same as that approved by my advisory committee. I hereby grant to LSU or its agents the non-exclusive license to archive and make accessible, under the conditions specified below and in appropriate University policies, my thesis, dissertation, or project report in whole or in part in all forms of media, now or hereafter known. I retain all other ownership rights to the copyright of the thesis, dissertation or project report. I also retain the right to use in future works (such as articles or books) all or part of this thesis, dissertation, or project report. |
collection |
NDLTD |
language |
en |
format |
Others |
sources |
NDLTD |
topic |
Computer Science |
description |
Software project effort/cost/time estimation has been one of the most active research topics in software engineering, and solutions for it are in great demand. Accurate effort/cost/time estimates early in the software project life cycle enable project managers to manage and exploit resources efficiently and to meet cost and time constraints. To this day, most companies rely on their historical databases of past project data to predict estimates for future projects. Like other data sets, software project data sets suffer from numerous problems, the most important being missing or incomplete data. Significant amounts of missing or incomplete data are frequently found in the data sets used to build effort/cost/time prediction models in the software industry. The reasons are numerous, and the missingness is inevitable. The traditional approaches used by companies ignore the missing data and base estimates on the remaining complete information, so the resulting estimates are prone to bias.
In this thesis, we investigate the application of a few well-known data imputation techniques (Listwise Deletion, Mean Imputation, 10 variants of Hot-Deck Imputation, and the Full Information Maximum Likelihood approach) to six real-time software project data sets. Using the imputed data sets, we build effort prediction models to evaluate the techniques' performance. We study the inherent characteristics of software project data sets, such as data set size, missingness mechanism, and pattern of missingness, and provide a generic classification schema for software project data sets based on these characteristics. We further implement a hybrid methodology for handling missing data. We perform experimental analyses and compare the impact of these methods on prediction accuracy. We also highlight the conditions to consider and the measures to take when using an imputation technique, and we note the ideal and worst conditions for each method. Finally, we discuss the findings and the appropriateness of each method for imputing software project data sets. |
author2 |
S.S. Iyengar |
author_facet |
S.S. Iyengar Yenduri, Sumanth |
author |
Yenduri, Sumanth |
title |
An Empirical Study of Imputation Techniques for Software Data Sets |
publisher |
LSU |
publishDate |
2005 |
url |
http://etd.lsu.edu/docs/available/etd-05062005-163801/ |
_version_ |
1716476493571293184 |
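
Three of the techniques named in the description (listwise deletion, mean imputation, and hot-deck imputation) can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the record does not say which ten hot-deck variants were studied, so the nearest-donor rule below is an assumed, simplified variant, and the project records, field names, and function names are all hypothetical. Full Information Maximum Likelihood is omitted because it needs a fitted statistical model.

```python
from statistics import mean

# Hypothetical project records; None marks a missing value.
projects = [
    {"size_kloc": 10.0, "team": 4, "effort_pm": 20.0},
    {"size_kloc": 12.0, "team": None, "effort_pm": 26.0},
    {"size_kloc": 50.0, "team": 12, "effort_pm": None},
    {"size_kloc": 48.0, "team": 10, "effort_pm": 130.0},
]


def listwise_deletion(rows):
    """Keep only the rows that have no missing values."""
    return [dict(r) for r in rows if all(v is not None for v in r.values())]


def mean_imputation(rows):
    """Replace each missing value with the mean of the observed values
    in the same column. Assumes every column has at least one observed value."""
    columns = list(rows[0].keys())
    means = {c: mean(r[c] for r in rows if r[c] is not None) for c in columns}
    return [
        {c: (r[c] if r[c] is not None else means[c]) for c in columns}
        for r in rows
    ]


def hot_deck_imputation(rows, match_on="size_kloc"):
    """Fill each missing value from the donor row closest on `match_on`.
    Assumed variant: nearest donor by absolute distance on one column
    that is always observed."""
    filled = []
    for r in rows:
        new_row = dict(r)
        for column, value in r.items():
            if value is None:
                # Donors are other rows where this column is observed.
                donors = [d for d in rows if d is not r and d[column] is not None]
                donor = min(donors, key=lambda d: abs(d[match_on] - r[match_on]))
                new_row[column] = donor[column]
        filled.append(new_row)
    return filled


complete_only = listwise_deletion(projects)
mean_filled = mean_imputation(projects)
hot_deck_filled = hot_deck_imputation(projects)
```

On this toy input, listwise deletion discards half of the records, mean imputation fills the missing team size with the column average regardless of project size, and the hot-deck variant borrows the value from the most similar project. Those differences are the kind of trade-off the thesis evaluates on real data sets.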