Collecting specialty-related medical terms: Development and evaluation of a resource for Spanish

Abstract. Background: Controlled vocabularies are fundamental resources for information extraction from clinical texts using natural language processing (NLP). Standard language resources available in the healthcare domain, such as the UMLS Metathesaurus or SNOMED CT, are widely used for this purpose, but with limitations such as lexical ambiguity of clinical terms.

Bibliographic Details
Main Authors: Pilar López-Úbeda, Alexandra Pomares-Quimbaya, Manuel Carlos Díaz-Galiano, Stefan Schulz
Format: Article
Language: English
Published: BMC 2021-05-01
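Series: BMC Medical Informatics and Decision Making
Subjects: Natural language processing; Vocabulary; Medical sub-language; Clinical specialty; Medical sub-domain
Online Access: https://doi.org/10.1186/s12911-021-01495-w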
id doaj-2fde6ce264d3403ea3ff5ea536437927
record_format Article
spelling doaj-2fde6ce264d3403ea3ff5ea536437927 | 2021-05-09T11:40:50Z | eng | BMC | BMC Medical Informatics and Decision Making | 1472-6947 | 2021-05-01 | 21 | 1 | 1 | 17 | 10.1186/s12911-021-01495-w
Pilar López-Úbeda (Universidad de Jaén)
Alexandra Pomares-Quimbaya (Pontificia Universidad Javeriana)
Manuel Carlos Díaz-Galiano (Universidad de Jaén)
Stefan Schulz (Medical University of Graz)
collection DOAJ
language English
format Article
sources DOAJ
author Pilar López-Úbeda
Alexandra Pomares-Quimbaya
Manuel Carlos Díaz-Galiano
Stefan Schulz
title Collecting specialty-related medical terms: Development and evaluation of a resource for Spanish
publisher BMC
series BMC Medical Informatics and Decision Making
issn 1472-6947
publishDate 2021-05-01
description Abstract
Background: Controlled vocabularies are fundamental resources for information extraction from clinical texts using natural language processing (NLP). Standard language resources available in the healthcare domain, such as the UMLS Metathesaurus or SNOMED CT, are widely used for this purpose, but with limitations such as lexical ambiguity of clinical terms. However, most such terms are unambiguous within text limited to a given clinical specialty. This is one rationale, among others, for classifying clinical texts by the clinical specialty to which they belong.
Results: This paper addresses this limitation by proposing and applying a method that automatically extracts Spanish medical terms classified and weighted per sub-domain, using Spanish MEDLINE titles and abstracts as input. The hypothesis is that biomedical NLP tasks benefit from collections of domain terms that are specific to clinical sub-domains. We use PubMed queries that generate sub-domain-specific corpora from Spanish titles and abstracts, from which token n-grams are collected and metrics of relevance, discriminatory power, and broadness per sub-domain are computed. The generated term set, called Spanish core vocabulary about clinical specialties (SCOVACLIS), was made available to the scientific community and used in a text classification problem, obtaining an improvement of 6 percentage points in F-measure over the baseline using a Multilayer Perceptron, thus supporting the hypothesis that a specialized term set improves NLP tasks.
Conclusion: The creation and validation of SCOVACLIS support the hypothesis that specific term sets reduce the level of ambiguity when compared to a specialty-independent and broad-scope vocabulary.
topic Natural language processing
Vocabulary
Medical sub-language
Clinical specialty
Medical sub-domain
url https://doi.org/10.1186/s12911-021-01495-w
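
The description mentions collecting token n-grams from sub-domain corpora and computing relevance, discriminatory power, and broadness per sub-domain, but this record does not give the formulas. The following Python sketch is only an illustration under stated assumptions: the function names (`ngrams`, `score_terms`), the toy corpora, and all three metric definitions are hypothetical stand-ins, not the authors' method. Here relevance is taken to be a term's share of all n-gram occurrences in a sub-domain, discriminatory power the share of the term's total occurrences that fall in that sub-domain, and broadness the fraction of the sub-domain's documents containing the term.

```python
import re
from collections import Counter


def ngrams(text, max_n=2):
    """Yield lower-cased token n-grams (n = 1..max_n) from a text."""
    tokens = re.findall(r"\w+", text.lower())
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def score_terms(corpora, max_n=2):
    """Score n-grams per sub-domain.

    corpora maps a sub-domain name to a list of documents (strings).
    The three metrics below are ASSUMED definitions for illustration only.
    """
    term_counts = {}   # sub-domain -> Counter of term occurrences
    doc_counts = {}    # sub-domain -> Counter of documents containing the term
    total = Counter()  # term -> occurrences across all sub-domains
    for name, docs in corpora.items():
        tc, dc = Counter(), Counter()
        for doc in docs:
            grams = list(ngrams(doc, max_n))
            tc.update(grams)
            dc.update(set(grams))  # count each term once per document
        term_counts[name], doc_counts[name] = tc, dc
        total.update(tc)

    scores = {}
    for name, tc in term_counts.items():
        n_terms = sum(tc.values())
        n_docs = len(corpora[name])
        scores[name] = {
            term: {
                "relevance": count / n_terms,
                "discriminatory_power": count / total[term],
                "broadness": doc_counts[name][term] / n_docs,
            }
            for term, count in tc.items()
        }
    return scores


# Toy Spanish corpora (invented for this example, not taken from MEDLINE).
corpora = {
    "cardiology": ["infarto agudo de miocardio", "insuficiencia cardiaca aguda"],
    "neurology": ["infarto cerebral agudo"],
}
scores = score_terms(corpora)
print(scores["cardiology"]["infarto"])
print(scores["neurology"]["infarto cerebral"])
```

In this toy setup, a term such as "infarto" that occurs in both sub-domains receives a lower discriminatory power than "miocardio", which occurs in only one, mirroring the idea that specialty-specific terms are less ambiguous within their specialty.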