A study on real-time low-quality content detection on Twitter from the users' perspective.

Techniques for detecting malicious content such as spam and phishing on Online Social Networks (OSN) are common, but little attention has been paid to other types of low-quality content, which actually affect users' content browsing experience the most. The aim of our work is to detect low-quality content fr...

Bibliographic Details
Main Authors: Weiling Chen, Chai Kiat Yeo, Chiew Tong Lau, Bu Sung Lee
Format: Article
Language: English
Published: Public Library of Science (PLoS) 2017-01-01
Series: PLoS ONE
Online Access: http://europepmc.org/articles/PMC5549928?pdf=render
id doaj-4361f062db954048b03cceb8f651ef53
record_format Article
spelling doaj-4361f062db954048b03cceb8f651ef53; 2020-11-25T02:41:25Z; eng; Public Library of Science (PLoS); PLoS ONE; ISSN 1932-6203; 2017-01-01; vol. 12, no. 8, e0182487; doi:10.1371/journal.pone.0182487
collection DOAJ
language English
format Article
sources DOAJ
author Weiling Chen
Chai Kiat Yeo
Chiew Tong Lau
Bu Sung Lee
title A study on real-time low-quality content detection on Twitter from the users' perspective.
publisher Public Library of Science (PLoS)
series PLoS ONE
issn 1932-6203
publishDate 2017-01-01
description Techniques for detecting malicious content such as spam and phishing on Online Social Networks (OSN) are common, but little attention has been paid to other types of low-quality content, which actually affect users' content browsing experience the most. The aim of our work is to detect low-quality content from the users' perspective in real time. To define low-quality content comprehensively, the Expectation Maximization (EM) algorithm is first used to coarsely classify low-quality tweets into four categories. Based on this preliminary study, a survey is carefully designed to gather users' opinions on the different categories of low-quality content. Both direct and indirect features, including newly proposed ones, are identified to characterize all types of low-quality content. We then combine word-level analysis with the identified features and build a keyword blacklist dictionary to improve detection performance. We manually label an extensive Twitter dataset of 100,000 tweets and perform low-quality content detection in real time based on the significant features identified and word-level analysis. The results show that our method, based on a random forest classifier, achieves an accuracy of 0.9711 and an F1 score of 0.8379 while detecting low-quality content in tweets in real time. Our work therefore has a positive impact on the user experience of browsing social media content.
url http://europepmc.org/articles/PMC5549928?pdf=render