Summary: | Open source projects rely on bug triagers to help assign bug reports
to developers. One of a triager's tasks is to identify whether an incoming
bug report is a duplicate of a pre-existing report. To detect duplicate bug reports,
a triager relies either on memory and experience or on the search capabilities of the bug
repository. Both approaches can be time-consuming for the triager and may also
lead to the misidentification of duplicates. It has also been suggested that duplicate bug
reports are not necessarily harmful; instead, they can complement each other and provide
additional information for developers investigating the defect at hand. This motivates the
need for automated or semi-automated techniques for duplicate bug detection.
In the literature, two main approaches have been proposed to solve this problem. The
first approach prevents duplicate reports from reaching developers by automatically
filtering them, while the second approach provides the triager with a list of the top-N
most similar bug reports, allowing the triager to compare the incoming bug report with the
ones in the list. Previous works have tried to enhance the quality of the suggested
lists, but these approaches either suffered from a poor recall rate or incurred additional
runtime overhead, making the deployment of a retrieval system impractical. To the best
of our knowledge, little work has been done to exhaustively compare
the performance of different Information Retrieval models (especially more recent
techniques such as topic modeling) on this problem, or to understand the effectiveness of
different heuristics across various application domains.
In this thesis, we compare the performance of word-based models (derivatives of the
Vector Space Model) such as TF-IDF and Log-Entropy with that of topic-based models such as
Latent Semantic Indexing (LSI), Latent Dirichlet Allocation (LDA) and Random Indexing
(RI). We leverage heuristics that incorporate exception stack frames, surface features,
and the summary and long description from the free-form text in the bug report. We perform
experiments on subsets of bug reports from Eclipse and Firefox and achieve recall rates of
60% and 58%, respectively. We find that word-based models, in particular a Log-Entropy
based weighting scheme, outperform topic-based ones such as LSI and LDA.
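As an illustration of the word-based approach, the following is a minimal sketch of
Log-Entropy weighting combined with cosine-similarity retrieval of a top-N list. It uses
the common textbook formulation (local weight log(1 + tf), global weight
1 + sum(p log p) / log n); the function names, the tokenized toy reports, and the choice
of cosine similarity are assumptions made for this example, not the thesis implementation.

```python
import math
from collections import Counter


def log_entropy_vectors(docs):
    """Build Log-Entropy weighted vectors for a list of token lists.

    Returns (vectors, global_weights). Hypothetical helper for illustration.
    """
    n = len(docs)
    counts = [Counter(d) for d in docs]
    # Global frequency of each term over the whole collection.
    gf = Counter()
    for c in counts:
        gf.update(c)
    g = {}
    for term, total in gf.items():
        # Sum of p * log(p) over documents containing the term (always <= 0).
        ent = 0.0
        for c in counts:
            if term in c:
                p = c[term] / total
                ent += p * math.log(p)
        # Global weight is 1 for a term confined to one document and
        # 0 for a term spread evenly over all documents.
        g[term] = 1.0 + ent / math.log(n) if n > 1 else 1.0
    vectors = [{t: math.log(1 + tf) * g[t] for t, tf in c.items()} for c in counts]
    return vectors, g


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def top_n(query_tokens, vectors, g, n=3):
    """Return indices of the n stored reports most similar to the query."""
    # Terms unseen in the collection have no global weight and are skipped.
    q = {t: math.log(1 + tf) * g[t]
         for t, tf in Counter(query_tokens).items() if t in g}
    ranked = sorted(range(len(vectors)),
                    key=lambda i: cosine(q, vectors[i]), reverse=True)
    return ranked[:n]


# Toy reports (made up for this example).
reports = [
    ["crash", "on", "save"],
    ["editor", "crash", "null", "pointer"],
    ["toolbar", "icon", "missing"],
]
vectors, weights = log_entropy_vectors(reports)
ranking = top_n(["crash", "when", "save"], vectors, weights, n=3)
```

In this sketch the term "crash" occurs in two of the three reports and therefore receives
a lower global weight than "save", which occurs in only one, so the first report ranks
ahead of the second for the query.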
Using historical bug data from Eclipse and NetBeans, we determine the optimal time
frame for a desired level of duplicate bug report coverage. We implement an online duplicate
detection framework that uses a sliding window of a constant time frame as a first step
towards simulating incoming bug reports and recommending duplicates to the end user.
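The sliding-window idea can be sketched as follows. This is a hedged illustration only: the
class name, the method name, and the 30-day window are hypothetical, and the sketch assumes
that reports arrive in chronological order. Only reports filed within the window before
the incoming report are kept as duplicate candidates.

```python
from collections import deque
from datetime import datetime, timedelta


class SlidingWindow:
    """Keeps the reports filed within a fixed time frame (hypothetical helper)."""

    def __init__(self, days):
        self.span = timedelta(days=days)
        self.reports = deque()  # (created, report_id) pairs, oldest first

    def candidates_then_add(self, report_id, created):
        """Return candidate ids for an incoming report, then store the report."""
        # Evict reports that have fallen out of the window.
        while self.reports and created - self.reports[0][0] > self.span:
            self.reports.popleft()
        candidates = [rid for _, rid in self.reports]
        self.reports.append((created, report_id))
        return candidates


window = SlidingWindow(days=30)  # window length chosen only for the example
first = window.candidates_then_add(1, datetime(2010, 1, 1))
second = window.candidates_then_add(2, datetime(2010, 1, 20))
third = window.candidates_then_add(3, datetime(2010, 2, 10))
```

Here the third report is compared only against report 2, because report 1 was filed more
than 30 days earlier. A longer window covers more duplicates at the cost of a larger
candidate set.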
|