Two datasets of defect reports labeled by a crowd of annotators of unknown reliability

Classifying software defects according to any defined taxonomy is not straightforward. To support automating the classification of software defects, two sets of defect reports were collected from public issue tracking systems in two different real domains. Due to the lack of a domai...

Bibliographic Details
Main Authors: Jerónimo Hernández-González, Daniel Rodriguez, Iñaki Inza, Rachel Harrison, Jose A. Lozano
Format: Article
Language: English
Published: Elsevier 2018-06-01
Series: Data in Brief
Online Access: http://www.sciencedirect.com/science/article/pii/S2352340918303226
id doaj-1f14f365a6e1476a8cf857d2b1c025ab
record_format Article
spelling doaj-1f14f365a6e1476a8cf857d2b1c025ab
2020-11-24T21:48:05Z
eng
Elsevier
Data in Brief
2352-3409
2018-06-01
Volume 18, pages 840-845
Two datasets of defect reports labeled by a crowd of annotators of unknown reliability
Jerónimo Hernández-González (Department of Computer Science and Artificial Intelligence, University of the Basque Country UPV/EHU, Donostia, Spain; corresponding author)
Daniel Rodriguez (Department of Computer Science, University of Alcala, Madrid, Spain)
Iñaki Inza (Department of Computer Science and Artificial Intelligence, University of the Basque Country UPV/EHU, Donostia, Spain)
Rachel Harrison (Department of Computing, Oxford Brookes University, Oxford, UK)
Jose A. Lozano (Department of Computer Science and Artificial Intelligence, University of the Basque Country UPV/EHU, Donostia, Spain; Basque Center for Applied Mathematics BCAM, Bilbao, Spain)
Classifying software defects according to any defined taxonomy is not straightforward. To support automating the classification of software defects, two sets of defect reports were collected from public issue tracking systems in two different real domains. Due to the lack of a domain expert, the collected defects were categorized by a set of annotators of unknown reliability according to their impact, as defined in IBM's orthogonal defect classification taxonomy. Both datasets are prepared for solving the defect classification problem with techniques from the learning from crowds paradigm (Hernández-González et al. [1]). Two versions of both datasets are publicly shared. The first version gives the raw data: the text description of each defect together with the category assigned by each annotator. In the second version, the text of each defect has been transformed into a descriptive vector using text-mining techniques.
http://www.sciencedirect.com/science/article/pii/S2352340918303226
collection DOAJ
language English
format Article
sources DOAJ
author Jerónimo Hernández-González
Daniel Rodriguez
Iñaki Inza
Rachel Harrison
Jose A. Lozano
title Two datasets of defect reports labeled by a crowd of annotators of unknown reliability
publisher Elsevier
series Data in Brief
issn 2352-3409
publishDate 2018-06-01
description Classifying software defects according to any defined taxonomy is not straightforward. To support automating the classification of software defects, two sets of defect reports were collected from public issue tracking systems in two different real domains. Due to the lack of a domain expert, the collected defects were categorized by a set of annotators of unknown reliability according to their impact, as defined in IBM's orthogonal defect classification taxonomy. Both datasets are prepared for solving the defect classification problem with techniques from the learning from crowds paradigm (Hernández-González et al. [1]). Two versions of both datasets are publicly shared. The first version gives the raw data: the text description of each defect together with the category assigned by each annotator. In the second version, the text of each defect has been transformed into a descriptive vector using text-mining techniques.
url http://www.sciencedirect.com/science/article/pii/S2352340918303226
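
The description above mentions two dataset versions: raw text with one category per annotator, and a vectorized form of each text. The sketch below illustrates both ideas on toy data only. The record layout, the example texts, and the function names are hypothetical, since the record does not describe the actual file format. Majority voting is used here as a simple baseline for combining annotator labels; it is not the learning from crowds technique of the cited paper, and the term-count vector is only one possible text-mining transformation.

```python
from collections import Counter

# Hypothetical toy records: each defect report has its text and one label
# per annotator. The real datasets' layout may differ.
reports = [
    {"text": "App crashes when saving a file",
     "labels": ["Reliability", "Reliability", "Capability"]},
    {"text": "Button label is hard to read",
     "labels": ["Usability", "Usability", "Usability"]},
]


def majority_vote(labels):
    """Return the most frequent label; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]


def build_vocabulary(texts):
    """Map each distinct lowercase token to a column index, in sorted order."""
    tokens = sorted({tok for text in texts for tok in text.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}


def to_count_vector(text, vocab):
    """Turn one text into a list of term counts over the vocabulary."""
    counts = Counter(text.lower().split())
    return [counts.get(tok, 0) for tok in vocab]


# One consensus label per report (baseline aggregation of the crowd labels).
consensus = [majority_vote(r["labels"]) for r in reports]

# One fixed-length numeric vector per report.
vocab = build_vocabulary(r["text"] for r in reports)
vectors = [to_count_vector(r["text"], vocab) for r in reports]
```

Majority voting treats every annotator as equally reliable, which is exactly the assumption that learning from crowds methods relax, so it is best read as a reference point rather than a recommended approach.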