Directions in abusive language training data, a systematic review: Garbage in, garbage out.

Data-driven and machine learning based approaches for detecting, categorising and measuring abusive content such as hate speech and harassment have gained traction due to their scalability, robustness and increasingly high performance. Making effective detection systems for abusive content relies on having the right training datasets, reflecting a widely accepted mantra in computer science: Garbage In, Garbage Out. However, creating training datasets which are large, varied, theoretically informed and which minimize biases is difficult, laborious and requires deep expertise. This paper systematically reviews 63 publicly available training datasets which have been created to train abusive language classifiers. It also reports on the creation of a dedicated website for cataloguing abusive language data, hatespeechdata.com. We discuss the challenges and opportunities of open science in this field, and argue that although more dataset sharing would bring many benefits, it also poses social and ethical risks which need careful consideration. Finally, we provide evidence-based recommendations for practitioners creating new abusive content training datasets.

Bibliographic Details
Main Authors: Bertie Vidgen, Leon Derczynski
Format: Article
Language: English
Published: Public Library of Science (PLoS), 2020-01-01
Series: PLoS ONE, Vol. 15, No. 12 (2020), e0243300
ISSN: 1932-6203
Online Access: https://doi.org/10.1371/journal.pone.0243300