A Word Pair Dataset for Semantic Similarity and Relatedness in Korean Medical Vocabulary: Reference Development and Validation

Background: Medical terms require special expertise and are becoming increasingly complex, which makes it difficult to apply natural language processing techniques in medical informatics. Several human-validated reference standards for medical terms have been developed...


Bibliographic Details
Main Authors: Yunjin Yum, Jeong Moon Lee, Moon Joung Jang, Yoojoong Kim, Jong-Ho Kim, Seongtae Kim, Unsub Shin, Sanghoun Song, Hyung Joon Joo
Format: Article
Language: English
Published: JMIR Publications 2021-06-01
Series: JMIR Medical Informatics
Online Access: https://medinform.jmir.org/2021/6/e29667
id doaj-8c1100258ba14e728dea27b8ae4abc79
record_format Article
spelling doaj-8c1100258ba14e728dea27b8ae4abc79 2021-06-24T14:49:22Z eng
JMIR Publications; JMIR Medical Informatics; ISSN 2291-9694; 2021-06-01; 9(6):e29667; DOI 10.2196/29667
A Word Pair Dataset for Semantic Similarity and Relatedness in Korean Medical Vocabulary: Reference Development and Validation
Yunjin Yum https://orcid.org/0000-0003-3070-3615
Jeong Moon Lee https://orcid.org/0000-0003-4020-4561
Moon Joung Jang https://orcid.org/0000-0002-6506-4254
Yoojoong Kim https://orcid.org/0000-0002-6615-9116
Jong-Ho Kim https://orcid.org/0000-0002-1309-0821
Seongtae Kim https://orcid.org/0000-0002-9298-323X
Unsub Shin https://orcid.org/0000-0001-9744-5206
Sanghoun Song https://orcid.org/0000-0002-4234-232X
Hyung Joon Joo https://orcid.org/0000-0003-1846-8464
Background: Medical terms require special expertise and are becoming increasingly complex, which makes it difficult to apply natural language processing techniques in medical informatics. Several human-validated reference standards for medical terms have been developed to evaluate word embedding models using the semantic similarity and relatedness of medical word pairs. However, very few reference standards exist in non-English languages. In addition, because the existing reference standards were developed a long time ago, an updated standard is needed to represent recent findings in medical sciences.
Objective: We propose a new Korean word pair reference set to verify embedding models.
Methods: From January 2010 to December 2020, 518 medical textbooks, 72,844 health information news articles, and 15,698 medical research articles were collected, and the top 10,000 medical terms were selected to develop medical word pairs. Attending physicians (n=16) participated in the verification of the developed set of 607 word pairs.
Results: The proportion of word pairs answered by all participants was 90.8% (551/607) for the similarity task and 86.5% (525/605) for the relatedness task. The similarity and relatedness scores of the word pairs were highly correlated (ρ=0.70, P<.001). The intraclass correlation coefficients assessing interrater agreement were 0.47 for the similarity task and 0.53 for the relatedness task. The final reference standard comprised 604 word pairs for the similarity task and 599 word pairs for the relatedness task, after excluding word pairs with outlier answers and word pairs answered by fewer than 50% of the respondents. When FastText models were applied to the final reference standard word pair sets, the embedding models trained on medical documents showed a higher correlation between the calculated cosine similarity scores and the human-judged similarity and relatedness scores (similarity task: namu, ρ=0.12 vs with medical text, ρ=0.47; relatedness task: namu, ρ=0.02 vs with medical text, ρ=0.30).
Conclusions: Korean medical word pair reference standard sets for semantic similarity and relatedness were developed based on medical documents from the past 10 years. We expect these word pair reference sets to be actively used in the development of medical and multilingual natural language processing technology.
https://medinform.jmir.org/2021/6/e29667
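The evaluation summarized in the Results (cosine similarity between word vectors, correlated with human-judged scores) can be illustrated with a minimal sketch. It assumes that ρ denotes Spearman's rank correlation; the vectors, term names, and ratings below are hypothetical toy values, not data from the article, and the helper names (`cosine`, `ranks`, `spearman`, `evaluate`) are introduced here only for illustration. The actual study used FastText models, which this sketch does not reproduce.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


def evaluate(vectors, rated_pairs):
    """Correlate model cosine scores with human scores over rated pairs.

    Pairs with a word missing from `vectors` are skipped.
    """
    model_scores, human_scores = [], []
    for w1, w2, score in rated_pairs:
        if w1 in vectors and w2 in vectors:
            model_scores.append(cosine(vectors[w1], vectors[w2]))
            human_scores.append(score)
    return spearman(model_scores, human_scores)


# Hypothetical 2-dimensional "embeddings" and human ratings (toy values).
toy_vectors = {
    "term_a": [1.0, 0.0],
    "term_b": [0.9, 0.1],
    "term_c": [0.0, 1.0],
    "term_d": [0.5, 0.5],
}
toy_pairs = [
    ("term_a", "term_b", 3.8),
    ("term_a", "term_d", 2.1),
    ("term_a", "term_c", 0.4),
]
rho = evaluate(toy_vectors, toy_pairs)
print(f"Spearman rho on toy data: {rho:.2f}")
```

With a real model, `toy_vectors` would be replaced by lookups into the trained embedding and `toy_pairs` by the reference standard; a higher ρ then indicates closer agreement between the model and the physicians' judgments.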
collection DOAJ
language English
format Article
sources DOAJ
author Yunjin Yum
Jeong Moon Lee
Moon Joung Jang
Yoojoong Kim
Jong-Ho Kim
Seongtae Kim
Unsub Shin
Sanghoun Song
Hyung Joon Joo
title A Word Pair Dataset for Semantic Similarity and Relatedness in Korean Medical Vocabulary: Reference Development and Validation
publisher JMIR Publications
series JMIR Medical Informatics
issn 2291-9694
publishDate 2021-06-01
url https://medinform.jmir.org/2021/6/e29667