Vigi4Med Scraper: A Framework for Web Forum Structured Data Extraction and Semantic Representation.

The extraction of information from social media is an essential yet complicated step for data analysis in multiple domains. In this paper, we present Vigi4Med Scraper, a generic open source framework for extracting structured data from web forums. Our framework is highly configurable; using a config...

Bibliographic Details
Main Authors: Bissan Audeh, Michel Beigbeder, Antoine Zimmermann, Philippe Jaillon, Cédric Bousquet
Format: Article
Language: English
Published: Public Library of Science (PLoS) 2017-01-01
Series: PLoS ONE
Online Access: http://europepmc.org/articles/PMC5266266?pdf=render
ISSN: 1932-6203
Volume/Issue: 12(1), article e0169658
DOI: 10.1371/journal.pone.0169658
Collection: DOAJ
Description: The extraction of information from social media is an essential yet complicated step for data analysis in multiple domains. In this paper, we present Vigi4Med Scraper, a generic open source framework for extracting structured data from web forums. Our framework is highly configurable; using a configuration file, the user can freely choose the data to extract from any web forum. The extracted data are anonymized and represented in a semantic structure using Resource Description Framework (RDF) graphs. This representation enables efficient manipulation by data analysis algorithms and allows the collected data to be directly linked to any existing semantic resource. To avoid server overload, an integrated proxy with caching functionality imposes a minimal delay between sequential requests. Vigi4Med Scraper represents the first step of Vigi4Med, a project to detect adverse drug reactions (ADRs) from social networks, funded by the French drug safety agency Agence Nationale de Sécurité du Médicament (ANSM). Vigi4Med Scraper has successfully extracted more than 200 gigabytes of data from the web forums of over 20 different websites.
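Illustrative note: the description mentions a proxy that caches pages and enforces a minimal delay between sequential requests. The sketch below shows one way such behaviour could look in Python. It is not the Vigi4Med Scraper code; the class name `PoliteCachingFetcher`, its parameters, and the injected `fetch` function are all hypothetical and chosen only for illustration.

```python
import time


class PoliteCachingFetcher:
    """Hypothetical helper (not part of Vigi4Med Scraper).

    Returns cached pages immediately and spaces out uncached
    requests by at least `min_delay` seconds.
    """

    def __init__(self, fetch, min_delay=1.0,
                 clock=time.monotonic, sleep=time.sleep):
        self._fetch = fetch            # callable: url -> page content
        self._min_delay = min_delay    # seconds between real requests
        self._clock = clock            # injectable for testing
        self._sleep = sleep            # injectable for testing
        self._cache = {}               # url -> page content
        self._last_request = None      # time of the last real request

    def get(self, url):
        # A cache hit never reaches the server, so no delay is needed.
        if url in self._cache:
            return self._cache[url]

        # Wait until at least min_delay has passed since the last request.
        if self._last_request is not None:
            wait = self._min_delay - (self._clock() - self._last_request)
            if wait > 0:
                self._sleep(wait)

        page = self._fetch(url)
        self._last_request = self._clock()
        self._cache[url] = page
        return page
```

In use, `fetch` could be any function that downloads a URL and returns its content. Repeated calls to `get` with the same URL are answered from the cache, so only new pages count against the delay.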