Using Twitter to Measure Public Discussion of Diseases: A Case Study

Background: Twitter is increasingly used to estimate disease prevalence, but such measurements can be biased, due to both biased sampling and inherent ambiguity of natural language. Objective: We characterized the extent of these biases and how they vary with disease....


Bibliographic Details
Main Authors:	Weeg, Christopher; Schwartz, H Andrew; Hill, Shawndra; Merchant, Raina M; Arango, Catalina; Ungar, Lyle
Format:	Article
Language:	English
Published:	JMIR Publications 2015-06-01
Series:	JMIR Public Health and Surveillance
Online Access:	http://publichealth.jmir.org/2015/1/e6/

id	doaj-5936e435c7b04f20913684abfb190ffe
record_format	Article
spelling	doaj-5936e435c7b04f20913684abfb190ffe; 2021-05-02T19:28:13Z; eng; JMIR Publications; JMIR Public Health and Surveillance; 2369-2960; 2015-06-01; 1; 1; e6; 10.2196/publichealth.3953; Using Twitter to Measure Public Discussion of Diseases: A Case Study; Weeg, Christopher; Schwartz, H Andrew; Hill, Shawndra; Merchant, Raina M; Arango, Catalina; Ungar, Lyle; http://publichealth.jmir.org/2015/1/e6/
collection	DOAJ
language	English
format	Article
sources	DOAJ
author	Weeg, Christopher; Schwartz, H Andrew; Hill, Shawndra; Merchant, Raina M; Arango, Catalina; Ungar, Lyle
title	Using Twitter to Measure Public Discussion of Diseases: A Case Study
publisher	JMIR Publications
series	JMIR Public Health and Surveillance
issn	2369-2960
publishDate	2015-06-01
description	Background: Twitter is increasingly used to estimate disease prevalence, but such measurements can be biased, due to both biased sampling and inherent ambiguity of natural language. Objective: We characterized the extent of these biases and how they vary with disease. Methods: We correlated self-reported prevalence rates for 22 diseases from Experian’s Simmons National Consumer Study (n=12,305) with the number of times these diseases were mentioned on Twitter during the same period (2012). We also identified and corrected for two types of bias present in Twitter data: (1) demographic variance between US Twitter users and the general US population; and (2) natural language ambiguity, which creates the possibility that mention of a disease name may not actually refer to the disease (eg, “heart attack” on Twitter often does not refer to myocardial infarction). We measured the correlation between disease prevalence and Twitter disease mentions both with and without bias correction. This allowed us to quantify each disease’s overrepresentation or underrepresentation on Twitter, relative to its prevalence. Results: Our sample included 80,680,449 tweets. Adjusting disease prevalence to correct for Twitter demographics more than doubles the correlation between Twitter disease mentions and disease prevalence in the general population (from .113 to .258, P<.001). In addition, diseases varied widely in how often mentions of their names on Twitter actually referred to the diseases, from 14.89% (3827/25,704) of instances (for stroke) to 99.92% (5044/5048) of instances (for arthritis). Applying ambiguity correction to our Twitter corpus achieves a correlation between disease mentions and prevalence of .208 (P<.001). Simultaneously applying correction for both demographics and ambiguity more than triples the baseline correlation to .366 (P<.001). Compared with prevalence rates, cancer appeared most overrepresented in Twitter, whereas high cholesterol appeared most underrepresented. Conclusions: Twitter is a potentially useful tool to measure public interest in and concerns about different diseases, but when comparing diseases, improvements can be made by adjusting for population demographics and word ambiguity.
url	http://publichealth.jmir.org/2015/1/e6/
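
The arithmetic reported in the abstract can be checked with a short sketch. The counts and correlation values below are taken from the abstract; the function names and the idea of scaling a raw mention count by the share of valid mentions are illustrative assumptions, not a description of the authors' actual procedure.

```python
# Counts from the abstract: (mentions that referred to the disease, mentions examined).
VALID_MENTIONS = {
    "stroke": (3827, 25704),
    "arthritis": (5044, 5048),
}


def valid_fraction(disease):
    """Share of mentions of a disease name that actually refer to the disease."""
    valid, total = VALID_MENTIONS[disease]
    return valid / total


def ambiguity_corrected_count(raw_mentions, disease):
    """Hypothetical correction: scale a raw mention count by the valid share."""
    return raw_mentions * valid_fraction(disease)


def correlation_gain(baseline, corrected):
    """Ratio of a corrected correlation to the uncorrected baseline."""
    return corrected / baseline


if __name__ == "__main__":
    for disease in VALID_MENTIONS:
        print(f"{disease}: {valid_fraction(disease):.2%} of mentions are valid")
    # Correlations reported in the abstract: baseline .113,
    # demographics-corrected .258, both corrections .366.
    print(f"demographic correction gain: {correlation_gain(0.113, 0.258):.2f}x")
    print(f"combined correction gain: {correlation_gain(0.113, 0.366):.2f}x")
```

The two gain ratios correspond to the abstract's statements that demographic correction "more than doubles" and combined correction "more than triples" the baseline correlation.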


Similar Items