Two datasets of defect reports labeled by a crowd of annotators of unknown reliability

Classifying software defects according to any defined taxonomy is not straightforward. To support automating the classification of software defects, two sets of defect reports were collected from public issue tracking systems in two different real domains. Due to the lack of a domai...

Bibliographic Details
Main Authors:	Jerónimo Hernández-González, Daniel Rodriguez, Iñaki Inza, Rachel Harrison, Jose A. Lozano
Format:	Article
Language:	English
Published:	Elsevier 2018-06-01
Series:	Data in Brief
Online Access:	http://www.sciencedirect.com/science/article/pii/S2352340918303226

id	doaj-1f14f365a6e1476a8cf857d2b1c025ab
record_format	Article
spelling	doaj-1f14f365a6e1476a8cf857d2b1c025ab; 2020-11-24T21:48:05Z; eng; Elsevier; Data in Brief; ISSN 2352-3409; 2018-06-01; vol. 18, pp. 840-845; Two datasets of defect reports labeled by a crowd of annotators of unknown reliability; Jerónimo Hernández-González (Department of Computer Science and Artificial Intelligence, University of the Basque Country UPV/EHU, Donostia, Spain; corresponding author), Daniel Rodriguez (Department of Computer Science, University of Alcala, Madrid, Spain), Iñaki Inza (Department of Computer Science and Artificial Intelligence, University of the Basque Country UPV/EHU, Donostia, Spain), Rachel Harrison (Department of Computing, Oxford Brookes University, Oxford, UK), Jose A. Lozano (Department of Computer Science and Artificial Intelligence, University of the Basque Country UPV/EHU, Donostia, Spain; Basque Center for Applied Mathematics BCAM, Bilbao, Spain); http://www.sciencedirect.com/science/article/pii/S2352340918303226
collection	DOAJ
language	English
format	Article
sources	DOAJ
author	Jerónimo Hernández-González, Daniel Rodriguez, Iñaki Inza, Rachel Harrison, Jose A. Lozano
author_sort	Jerónimo Hernández-González
title	Two datasets of defect reports labeled by a crowd of annotators of unknown reliability
title_sort	two datasets of defect reports labeled by a crowd of annotators of unknown reliability
publisher	Elsevier
series	Data in Brief
issn	2352-3409
publishDate	2018-06-01
description	Classifying software defects according to any defined taxonomy is not straightforward. To support automating the classification of software defects, two sets of defect reports were collected from public issue tracking systems in two different real domains. Due to the lack of a domain expert, the collected defects were categorized by a set of annotators of unknown reliability according to their impact, as defined in IBM's Orthogonal Defect Classification (ODC) taxonomy. Both datasets are prepared for solving the defect classification problem with techniques from the learning from crowds paradigm (Hernández-González et al. [1]). Two versions of both datasets are publicly shared. The first version gives the raw data: the text description of each defect together with the category assigned by each annotator. In the second version, the text of each defect has been transformed into a descriptive vector using text-mining techniques.
url	http://www.sciencedirect.com/science/article/pii/S2352340918303226
_version_	1725893512806793216

Similar Items