Location inference for hidden population with online text analysis

Abstract Background: Understanding the geographic distribution of hidden populations, such as men who have sex with men (MSM), sex workers, or injecting drug users, is of great importance for the adequate deployment of intervention strategies and public health decision making. However, due to the har...


Bibliographic Details
Main Authors: Chuchu Liu, Ziqiang Cao, Xin Lu
Format: Article
Language: English
Published: BMC 2020-12-01
Series: International Journal of Health Geographics
Subjects:
MSM
Online Access: https://doi.org/10.1186/s12942-020-00245-x
id doaj-b93d2d340d484e8780204b18282df8f6
record_format Article
spelling doaj-b93d2d340d484e8780204b18282df8f6 | 2020-12-13T12:06:59Z | eng | BMC | International Journal of Health Geographics | 1476-072X | 2020-12-01 | 191112 | 10.1186/s12942-020-00245-x | Location inference for hidden population with online text analysis | Chuchu Liu [0]; Ziqiang Cao [1]; Xin Lu [2] | [0] College of Systems Engineering, National University of Defense Technology; [1] College of Systems Engineering, National University of Defense Technology; [2] College of Systems Engineering, National University of Defense Technology | https://doi.org/10.1186/s12942-020-00245-x | Location inference; Hidden population; MSM; Text analysis; Geographic distribution
collection DOAJ
language English
format Article
sources DOAJ
author Chuchu Liu
Ziqiang Cao
Xin Lu
title Location inference for hidden population with online text analysis
publisher BMC
series International Journal of Health Geographics
issn 1476-072X
publishDate 2020-12-01
description Abstract. Background: Understanding the geographic distribution of hidden populations, such as men who have sex with men (MSM), sex workers, or injecting drug users, is of great importance for the adequate deployment of intervention strategies and public health decision making. However, because such populations are hard to access (e.g., lack of a sampling frame, sensitivity issues, reporting error), traditional survey methods are of limited use for studying them. Using data extracted from a very active online community of MSM in China, this study adopts and develops location inference methods to achieve high-resolution mapping of the community's users at the national level. Methods: We collect a comprehensive dataset from the largest sub-community related to MSM topics in Baidu Tieba, covering 628,360 MSM-related users. Based on users’ publicly available posts, we evaluate and compare the performance of mainstream location inference algorithms on the problem of locating the Chinese MSM population online. To improve inference accuracy, other natural language processing approaches, such as context analysis and pattern recognition, are introduced into the location extraction step. In addition, we develop a hybrid voting algorithm (HVA-LI) in which different approaches vote to determine the best inference result, yielding a more effective approach to location inference for hidden populations. Results: Comparing the performance of popular inference algorithms, we find that the classic gazetteer-based algorithm achieves better results. Among the HVA-LI algorithms, the hybrid of the simple gazetteer-based method and named entity recognition (NER) proves best at inferring users’ locations disclosed in short texts in online communities, improving inference accuracy from 50.3% to 71.3% on the MSM-related dataset. Conclusions: In this study, we have explored the possibility of inferring location by analyzing textual content posted by online users. A more effective hybrid algorithm, the Gazetteer & NER algorithm, is proposed; it helps overcome the sparse location labeling problem in user profiles and can be extended to the inference of geo-statistics for other hidden populations.
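The voting idea in the description can be illustrated with a minimal sketch. This is not the authors' HVA-LI implementation: the toy gazetteer, the English context patterns, the weights, and every function name below are hypothetical, and a simple pattern matcher stands in for the NER component, which the record does not specify. The sketch assumes each extractor casts votes for candidate places and the place with the highest weighted total wins.

```python
import re
from collections import Counter

# Hypothetical toy gazetteer: surface form -> canonical place name.
GAZETTEER = {
    "Beijing": "Beijing",
    "Shanghai": "Shanghai",
    "Guangzhou": "Guangzhou",
    "Canton": "Guangzhou",
}

# Hypothetical context patterns signalling a self-disclosed location.
# They stand in for the second extractor (NER in the paper).
PATTERNS = [
    re.compile(r"\bI live in (\w+)"),
    re.compile(r"\bI am in (\w+)"),
    re.compile(r"\banyone (?:from|in) (\w+)", re.IGNORECASE),
]


def gazetteer_votes(posts):
    """One vote per post for each gazetteer name that post mentions."""
    votes = Counter()
    for post in posts:
        for name, place in GAZETTEER.items():
            if re.search(rf"\b{re.escape(name)}\b", post):
                votes[place] += 1
    return votes


def pattern_votes(posts):
    """One vote per pattern match whose captured word is a known place."""
    votes = Counter()
    for post in posts:
        for pattern in PATTERNS:
            for candidate in pattern.findall(post):
                place = GAZETTEER.get(candidate)
                if place is not None:
                    votes[place] += 1
    return votes


def infer_location(posts, weights=(1.0, 2.0)):
    """Sum weighted votes from both extractors; return the top place or None."""
    combined = Counter()
    extractors = (gazetteer_votes(posts), pattern_votes(posts))
    for weight, votes in zip(weights, extractors):
        for place, count in votes.items():
            combined[place] += weight * count
    if not combined:
        return None
    return combined.most_common(1)[0][0]
```

Weighting the context patterns above bare mentions reflects the assumption that "I live in X" is stronger evidence of residence than a passing reference to X; the actual weighting used in the paper is not given in this record.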
topic Location inference
Hidden population
MSM
Text analysis
Geographic distribution
url https://doi.org/10.1186/s12942-020-00245-x