Classifying patient and professional voice in social media health posts

Abstract. Background: Patient-based analysis of social media is a growing research field with the aim of delivering precision medicine but it requires accurate classification of posts relating to patients’ experiences. We motivate the need for this type of classification as a pre-processing step for f...

Bibliographic Details
Main Authors: Beatrice Alex, Donald Whyte, Daniel Duma, Roma English Owen, Elizabeth A. L. Fairley
Format: Article
Language: English
Published: BMC 2021-08-01
Series: BMC Medical Informatics and Decision Making
Subjects: Patient voice; Professional voice; Social media; Classification; Reddit; Twitter
Online Access: https://doi.org/10.1186/s12911-021-01577-9
id doaj-725abd2379164cdc9ab74e65a23a02f4
record_format Article
spelling doaj-725abd2379164cdc9ab74e65a23a02f4 | 2021-08-22T11:32:39Z | eng | BMC | BMC Medical Informatics and Decision Making | 1472-6947 | 2021-08-01 | 21 | 1 | 1 | 10 | 10.1186/s12911-021-01577-9 | Classifying patient and professional voice in social media health posts | Beatrice Alex (0); Donald Whyte (1); Daniel Duma (2); Roma English Owen (3); Elizabeth A. L. Fairley (4) | Talking Medicines Limited (SC447227); Talking Medicines Limited (SC447227); Talking Medicines Limited (SC447227); Talking Medicines Limited (SC447227); Talking Medicines Limited (SC447227)
Abstract
Background: Patient-based analysis of social media is a growing research field with the aim of delivering precision medicine but it requires accurate classification of posts relating to patients’ experiences. We motivate the need for this type of classification as a pre-processing step for further analysis of social media data in the context of related work in this area. In this paper we present experiments for a three-way document classification by patient voice, professional voice or other. We present results for a convolutional neural network classifier trained on English data from two different data sources (Reddit and Twitter) and two domains (cardiovascular and skin diseases).
Results: We found that document classification by patient voice, professional voice or other can be done consistently manually (0.92 accuracy). Annotators agreed roughly equally for each domain (cardiovascular and skin) but they agreed more when annotating Reddit posts compared to Twitter posts. Best classification performance was obtained when training two separate classifiers for each data source, one for Reddit and one for Twitter posts, when evaluating on in-source test data for both test sets combined with an overall accuracy of 0.95 (and macro-average F1 of 0.92) and an F1-score of 0.95 for patient voice only.
Conclusion: The main conclusion resulting from this work is that combining social media data from platforms with different characteristics for training a patient and professional voice classifier does not result in best possible performance. We showed that it is best to train separate models per data source (Reddit and Twitter) instead of a model using the combined training data from both sources. We also found that it is preferable to train separate models per domain (cardiovascular and skin) while showing that the difference to the combined model is only minor (0.01 accuracy). Our highest overall F1-score (0.95) obtained for classifying posts as patient voice is a very good starting point for further analysis of social media data reflecting the experience of patients.
https://doi.org/10.1186/s12911-021-01577-9 | Patient voice; Professional voice; Social media; Classification; Reddit; Twitter
collection DOAJ
language English
format Article
sources DOAJ
author Beatrice Alex
Donald Whyte
Daniel Duma
Roma English Owen
Elizabeth A. L. Fairley
author_sort Beatrice Alex
title Classifying patient and professional voice in social media health posts
publisher BMC
series BMC Medical Informatics and Decision Making
issn 1472-6947
publishDate 2021-08-01
description Abstract
Background: Patient-based analysis of social media is a growing research field with the aim of delivering precision medicine but it requires accurate classification of posts relating to patients’ experiences. We motivate the need for this type of classification as a pre-processing step for further analysis of social media data in the context of related work in this area. In this paper we present experiments for a three-way document classification by patient voice, professional voice or other. We present results for a convolutional neural network classifier trained on English data from two different data sources (Reddit and Twitter) and two domains (cardiovascular and skin diseases).
Results: We found that document classification by patient voice, professional voice or other can be done consistently manually (0.92 accuracy). Annotators agreed roughly equally for each domain (cardiovascular and skin) but they agreed more when annotating Reddit posts compared to Twitter posts. Best classification performance was obtained when training two separate classifiers for each data source, one for Reddit and one for Twitter posts, when evaluating on in-source test data for both test sets combined with an overall accuracy of 0.95 (and macro-average F1 of 0.92) and an F1-score of 0.95 for patient voice only.
Conclusion: The main conclusion resulting from this work is that combining social media data from platforms with different characteristics for training a patient and professional voice classifier does not result in best possible performance. We showed that it is best to train separate models per data source (Reddit and Twitter) instead of a model using the combined training data from both sources. We also found that it is preferable to train separate models per domain (cardiovascular and skin) while showing that the difference to the combined model is only minor (0.01 accuracy). Our highest overall F1-score (0.95) obtained for classifying posts as patient voice is a very good starting point for further analysis of social media data reflecting the experience of patients.
topic Patient voice
Professional voice
Social media
Classification
Reddit
Twitter
url https://doi.org/10.1186/s12911-021-01577-9
work_keys_str_mv AT beatricealex classifyingpatientandprofessionalvoiceinsocialmediahealthposts
AT donaldwhyte classifyingpatientandprofessionalvoiceinsocialmediahealthposts
AT danielduma classifyingpatientandprofessionalvoiceinsocialmediahealthposts
AT romaenglishowen classifyingpatientandprofessionalvoiceinsocialmediahealthposts
AT elizabethalfairley classifyingpatientandprofessionalvoiceinsocialmediahealthposts
_version_ 1721199654313394176
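
Note on the metrics quoted in the abstract: it reports an overall accuracy (0.95), a macro-average F1 (0.92) and a per-class F1 for patient voice (0.95). The sketch below illustrates how these three figures relate for a three-way labelling. It is not the authors' evaluation code; the label strings, the function names and the toy gold/predicted lists are all assumptions made up for illustration, and the toy numbers do not reproduce the paper's results.

```python
# Minimal sketch: accuracy, per-class F1 and macro-average F1 for a
# three-way classification (patient voice, professional voice, other).
# Label names and data below are hypothetical, for illustration only.

LABELS = ("patient", "professional", "other")


def accuracy(gold, pred):
    """Fraction of posts whose predicted label equals the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def f1_per_label(gold, pred, labels=LABELS):
    """F1 for each label, treating that label as the positive class."""
    scores = {}
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(gold, pred, labels=LABELS):
    """Unweighted mean of the per-label F1 scores."""
    scores = f1_per_label(gold, pred, labels)
    return sum(scores.values()) / len(scores)


# Toy example: ten posts, one professional post mislabelled as "other".
gold = ["patient"] * 6 + ["professional"] * 2 + ["other"] * 2
pred = ["patient"] * 6 + ["professional", "other"] + ["other"] * 2

print("accuracy:", accuracy(gold, pred))
print("per-label F1:", f1_per_label(gold, pred))
print("macro F1:", macro_f1(gold, pred))
```

Because macro-averaging gives each class equal weight regardless of how many posts it contains, an error on a small class lowers macro F1 more than it lowers accuracy, which is one way a macro-average F1 can sit below overall accuracy as in the abstract. The same `accuracy` function could also be applied to two annotators' label lists to obtain a raw agreement figure of the kind the abstract quotes for manual annotation.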