Using Twitter data for demographic research


Bibliographic Details
Main Authors: Dilek Yildiz, Jo Munson, Agnese Vitali, Ramine Tinati, Jennifer A. Holland
Format: Article
Language: English
Published: Max Planck Institute for Demographic Research 2017-11-01
Series: Demographic Research
Subjects:
Online Access: https://www.demographic-research.org/volumes/vol37/46/
Description
Summary: <b>Background</b>: Social media data is a promising source of social science data. However, deriving the demographic characteristics of users and dealing with the nonrandom, nonrepresentative populations from which they are drawn represent challenges for social scientists. <b>Objective</b>: Given the growing use of social media data in social science research, this paper asks two questions: 1) to what extent are findings obtained with social media data generalizable to broader populations, and 2) what is the best practice for estimating demographic information from Twitter data? <b>Methods</b>: Our analyses use information gathered from 979,992 geo-located Tweets sent by 22,356 unique users in South East England between 23 June and 4 July 2014. We estimate demographic characteristics of the Twitter users with the crowd-sourcing platform CrowdFlower and the image-recognition software Face++. To evaluate bias in the data, we run a series of log-linear models with offsets and calibrate the nonrepresentative sample of Twitter users with mid-year population estimates for South East England. <b>Results</b>: CrowdFlower proves to be more accurate than Face++ for the measurement of age, whereas both tools are highly reliable for measuring the sex of Twitter users. The calibration exercise allows bias correction in the age-, sex-, and location-specific population counts obtained from the Twitter population by augmenting Twitter data with mid-year population estimates. <b>Contribution</b>: The paper proposes best practices for estimating Twitter users' basic demographic characteristics and a calibration method to address the selection bias in the Twitter population, allowing researchers to generalize findings based on Twitter to the general population.
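The summary mentions log-linear models with offsets but does not give the model specification, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a main-effects model, log(expected Twitter count) = log(population) + age effect + sex effect, and fits it by iterative proportional fitting seeded with the population table, which gives the same fitted values as Poisson maximum likelihood for this model. All counts, the age-by-sex layout, and the function name `fit_loglinear_offset` are hypothetical.

```python
def fit_loglinear_offset(counts, population, iterations=200, tol=1e-10):
    """Fit log(mu_ij) = log(population_ij) + a_i + b_j.

    Uses iterative proportional fitting: start from the population table
    (the offset) and rescale rows and columns until the fitted margins
    match the observed margins of `counts`.
    """
    rows, cols = len(counts), len(counts[0])
    fitted = [[float(population[i][j]) for j in range(cols)] for i in range(rows)]
    row_targets = [sum(counts[i]) for i in range(rows)]
    col_targets = [sum(counts[i][j] for i in range(rows)) for j in range(cols)]

    for _ in range(iterations):
        # Rescale each row (age group) to the observed row total.
        for i in range(rows):
            factor = row_targets[i] / sum(fitted[i])
            fitted[i] = [value * factor for value in fitted[i]]
        # Rescale each column (sex) to the observed column total.
        max_change = 0.0
        for j in range(cols):
            factor = col_targets[j] / sum(fitted[i][j] for i in range(rows))
            max_change = max(max_change, abs(factor - 1.0))
            for i in range(rows):
                fitted[i][j] *= factor
        if max_change < tol:
            break
    return fitted


# Hypothetical data: rows are age groups, columns are (female, male).
counts = [[400, 350], [300, 320], [60, 50]]            # observed Twitter users
population = [[100000, 95000], [150000, 155000], [120000, 140000]]  # census-style estimates

fitted = fit_loglinear_offset(counts, population)

# Modelled coverage rate: expected Twitter users per resident in each cell.
coverage = [[fitted[i][j] / population[i][j] for j in range(2)] for i in range(3)]

# Calibration weight: residents represented by one Twitter user in each cell.
weights = [[1.0 / coverage[i][j] for j in range(2)] for i in range(3)]

for i, row in enumerate(weights):
    print(f"age group {i}: weights {[round(w, 1) for w in row]}")
```

Under these assumptions, the offset means the model describes coverage rates rather than raw counts, so cells where Twitter users are scarce relative to the population receive larger weights. A richer specification (for example, adding location or interaction terms) would need a general Poisson regression routine rather than this two-way rescaling.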
ISSN:1435-9871