Two datasets of defect reports labeled by a crowd of annotators of unknown reliability
Classifying software defects according to any defined taxonomy is not straightforward. In order to be used for automatizing the classification of software defects, two sets of defect reports were collected from public issue tracking systems from two different real domains. Due to the lack of a domai...
Main Authors: | Jerónimo Hernández-González, Daniel Rodriguez, Iñaki Inza, Rachel Harrison, Jose A. Lozano |
---|---|
Format: | Article |
Language: | English |
Published: | Elsevier 2018-06-01 |
Series: | Data in Brief |
Online Access: | http://www.sciencedirect.com/science/article/pii/S2352340918303226 |
id | doaj-1f14f365a6e1476a8cf857d2b1c025ab |
---|---|
record_format | Article |
spelling | doaj-1f14f365a6e1476a8cf857d2b1c025ab · 2020-11-24T21:48:05Z · eng · Elsevier · Data in Brief · 2352-3409 · 2018-06-01 · vol. 18, pp. 840-845 · Two datasets of defect reports labeled by a crowd of annotators of unknown reliability · Jerónimo Hernández-González [0], Daniel Rodriguez [1], Iñaki Inza [2], Rachel Harrison [3], Jose A. Lozano [4] · [0] Department of Computer Science and Artificial Intelligence, University of the Basque Country UPV/EHU, Donostia, Spain; Corresponding author. [1] Department of Computer Science, University of Alcala, Madrid, Spain. [2] Department of Computer Science and Artificial Intelligence, University of the Basque Country UPV/EHU, Donostia, Spain. [3] Department of Computing, Oxford Brookes University, Oxford, UK. [4] Department of Computer Science and Artificial Intelligence, University of the Basque Country UPV/EHU, Donostia, Spain; Basque Center for Applied Mathematics BCAM, Bilbao, Spain. · Classifying software defects according to any defined taxonomy is not straightforward. In order to be used for automatizing the classification of software defects, two sets of defect reports were collected from public issue tracking systems from two different real domains. Due to the lack of a domain expert, the collected defects were categorized by a set of annotators of unknown reliability according to their impact from IBM's orthogonal defect classification taxonomy. Both datasets are prepared to solve the defect classification problem by means of techniques of the learning from crowds paradigm (Hernández-González et al. [1]). Two versions of both datasets are publicly shared. In the first version, the raw data is given: the text description of defects together with the category assigned by each annotator. In the second version, the text of each defect has been transformed to a descriptive vector using text-mining techniques. · http://www.sciencedirect.com/science/article/pii/S2352340918303226 |
collection | DOAJ |
language | English |
format | Article |
sources | DOAJ |
author | Jerónimo Hernández-González, Daniel Rodriguez, Iñaki Inza, Rachel Harrison, Jose A. Lozano |
spellingShingle | Jerónimo Hernández-González, Daniel Rodriguez, Iñaki Inza, Rachel Harrison, Jose A. Lozano · Two datasets of defect reports labeled by a crowd of annotators of unknown reliability · Data in Brief |
author_facet | Jerónimo Hernández-González, Daniel Rodriguez, Iñaki Inza, Rachel Harrison, Jose A. Lozano |
author_sort | Jerónimo Hernández-González |
title | Two datasets of defect reports labeled by a crowd of annotators of unknown reliability |
publisher | Elsevier |
series | Data in Brief |
issn | 2352-3409 |
publishDate | 2018-06-01 |
description | Classifying software defects according to any defined taxonomy is not straightforward. In order to be used for automatizing the classification of software defects, two sets of defect reports were collected from public issue tracking systems from two different real domains. Due to the lack of a domain expert, the collected defects were categorized by a set of annotators of unknown reliability according to their impact from IBM's orthogonal defect classification taxonomy. Both datasets are prepared to solve the defect classification problem by means of techniques of the learning from crowds paradigm (Hernández-González et al. [1]). Two versions of both datasets are publicly shared. In the first version, the raw data is given: the text description of defects together with the category assigned by each annotator. In the second version, the text of each defect has been transformed to a descriptive vector using text-mining techniques. |
url | http://www.sciencedirect.com/science/article/pii/S2352340918303226 |