An Empirical Study of Imputation Techniques for Software Data Sets

Software Project Effort/Cost/Time Estimation has been one of the hot topics of research in the current software engineering industry. Solutions for effort/cost/time estimation are in great demand. Knowledge of accurate effort/cost/time estimates early in the software project life cycle enables proje...

Bibliographic Details
Main Author: Yenduri, Sumanth
Other Authors: S.S. Iyengar
Format: Others
Language: en
Published: LSU 2005
Subjects: Computer Science
Online Access: http://etd.lsu.edu/docs/available/etd-05062005-163801/
id ndltd-LSU-oai-etd.lsu.edu-etd-05062005-163801
record_format oai_dc
spelling ndltd-LSU-oai-etd.lsu.edu-etd-05062005-163801 2013-01-07T22:49:01Z An Empirical Study of Imputation Techniques for Software Data Sets Yenduri, Sumanth Computer Science Software project effort/cost/time estimation has been one of the hot topics of research in the current software engineering industry. Solutions for effort/cost/time estimation are in great demand. Knowledge of accurate effort/cost/time estimates early in the software project life cycle enables project managers to manage and exploit resources efficiently. The constraints of cost and time can also be met. To this day, most companies rely on their historical database of past project data sets to predict estimates for future projects. Like other data sets, software project data sets suffer from numerous problems. The most important problem is that they contain missing or incomplete data. Significant amounts of missing or incomplete data are frequently found in data sets used to build effort/cost/time prediction models in the current software industry. The reasons are numerous and the missingness is inevitable. The traditional approaches used by companies ignore all the missing data and provide estimates based on the remaining complete information. Thus, the resulting estimates are prone to bias. In this thesis, we investigate the application of a few well-known data imputation techniques (Listwise Deletion, Mean Imputation, 10 variants of Hot-Deck Imputation, and the Full Information Maximum Likelihood approach) to six real-time software project data sets. Using the imputed data sets, we build effort prediction models to evaluate their performance. We study the inherent characteristics of software project data sets, such as data set size, missing mechanism, and pattern of missingness, and provide a generic classification schema for all software project data sets based on their characteristics. We further implement a hybrid methodology to address the same problem.
We perform experimental analyses and compare the impact of these methods on prediction accuracy. We also highlight the conditions to be considered and the measures to be taken when using an imputation technique. We note the ideal and worst conditions for each method. Finally, we discuss the findings and the appropriateness of each method for imputing software project data sets. S.S. Iyengar R.C. Ward Donald Kraft B.B. Karki G. Gu LSU 2005-05-11 text application/pdf http://etd.lsu.edu/docs/available/etd-05062005-163801/ http://etd.lsu.edu/docs/available/etd-05062005-163801/ en unrestricted I hereby certify that, if appropriate, I have obtained and attached herein a written permission statement from the owner(s) of each third party copyrighted matter to be included in my thesis, dissertation, or project report, allowing distribution as specified below. I certify that the version I submitted is the same as that approved by my advisory committee. I hereby grant to LSU or its agents the non-exclusive license to archive and make accessible, under the conditions specified below and in appropriate University policies, my thesis, dissertation, or project report in whole or in part in all forms of media, now or hereafter known. I retain all other ownership rights to the copyright of the thesis, dissertation or project report. I also retain the right to use in future works (such as articles or books) all or part of this thesis, dissertation, or project report.
collection NDLTD
language en
format Others
sources NDLTD
topic Computer Science
spellingShingle Computer Science
Yenduri, Sumanth
An Empirical Study of Imputation Techniques for Software Data Sets
description Software project effort/cost/time estimation has been one of the hot topics of research in the current software engineering industry. Solutions for effort/cost/time estimation are in great demand. Knowledge of accurate effort/cost/time estimates early in the software project life cycle enables project managers to manage and exploit resources efficiently. The constraints of cost and time can also be met. To this day, most companies rely on their historical database of past project data sets to predict estimates for future projects. Like other data sets, software project data sets suffer from numerous problems. The most important problem is that they contain missing or incomplete data. Significant amounts of missing or incomplete data are frequently found in data sets used to build effort/cost/time prediction models in the current software industry. The reasons are numerous and the missingness is inevitable. The traditional approaches used by companies ignore all the missing data and provide estimates based on the remaining complete information. Thus, the resulting estimates are prone to bias. In this thesis, we investigate the application of a few well-known data imputation techniques (Listwise Deletion, Mean Imputation, 10 variants of Hot-Deck Imputation, and the Full Information Maximum Likelihood approach) to six real-time software project data sets. Using the imputed data sets, we build effort prediction models to evaluate their performance. We study the inherent characteristics of software project data sets, such as data set size, missing mechanism, and pattern of missingness, and provide a generic classification schema for all software project data sets based on their characteristics. We further implement a hybrid methodology to address the same problem. We perform experimental analyses and compare the impact of these methods on prediction accuracy.
We also highlight the conditions to be considered and the measures to be taken when using an imputation technique. We note the ideal and worst conditions for each method. Finally, we discuss the findings and the appropriateness of each method for imputing software project data sets.
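As a rough illustration of three of the technique families named in the description, the sketch below applies listwise deletion, mean imputation, and one simple random hot-deck scheme to a toy table. Everything in it is an assumption made for illustration: the `projects` records, the column names, and the function names are hypothetical and do not come from the thesis, and the random hot-deck shown is only one generic variant, not necessarily one of the ten studied. Full Information Maximum Likelihood is omitted because it needs a fitted statistical model.

```python
import random
from statistics import mean

# Hypothetical toy project records (not from the thesis); None marks a missing value.
projects = [
    {"size_kloc": 10.0, "team": 4, "effort_pm": 20.0},
    {"size_kloc": None, "team": 6, "effort_pm": 35.0},
    {"size_kloc": 25.0, "team": None, "effort_pm": 60.0},
    {"size_kloc": 40.0, "team": 8, "effort_pm": 90.0},
]


def listwise_deletion(rows):
    """Keep only the rows that have no missing values."""
    return [r for r in rows if all(v is not None for v in r.values())]


def mean_imputation(rows):
    """Replace each missing value with the mean of the observed values in its column."""
    cols = list(rows[0].keys())
    means = {c: mean(r[c] for r in rows if r[c] is not None) for c in cols}
    return [{c: (r[c] if r[c] is not None else means[c]) for c in cols} for r in rows]


def random_hot_deck(rows, seed=0):
    """Replace each missing value with an observed value drawn at random
    from the same column (a donor chosen from the same data set)."""
    rng = random.Random(seed)
    cols = list(rows[0].keys())
    donors = {c: [r[c] for r in rows if r[c] is not None] for c in cols}
    return [
        {c: (r[c] if r[c] is not None else rng.choice(donors[c])) for c in cols}
        for r in rows
    ]


print(len(listwise_deletion(projects)))          # 2 complete rows remain
print(mean_imputation(projects)[1]["size_kloc"])  # 25.0
print(random_hot_deck(projects)[1]["size_kloc"])  # one of the observed sizes
```

Listwise deletion halves this toy data set, which hints at why discarding incomplete records can be costly for small project databases; the two imputation functions keep all four rows at the price of introducing estimated values.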
author2 S.S. Iyengar
author_facet S.S. Iyengar
Yenduri, Sumanth
author Yenduri, Sumanth
author_sort Yenduri, Sumanth
title An Empirical Study of Imputation Techniques for Software Data Sets
title_short An Empirical Study of Imputation Techniques for Software Data Sets
title_full An Empirical Study of Imputation Techniques for Software Data Sets
title_fullStr An Empirical Study of Imputation Techniques for Software Data Sets
title_full_unstemmed An Empirical Study of Imputation Techniques for Software Data Sets
title_sort empirical study of imputation techniques for software data sets
publisher LSU
publishDate 2005
url http://etd.lsu.edu/docs/available/etd-05062005-163801/
work_keys_str_mv AT yendurisumanth anempiricalstudyofimputationtechniquesforsoftwaredatasets
AT yendurisumanth empiricalstudyofimputationtechniquesforsoftwaredatasets
_version_ 1716476493571293184