What Disease Does This Patient Have? A Large-Scale Open Domain Question Answering Dataset from Medical Exams

Bibliographic Details
Main Authors: Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, Peter Szolovits
Format: Article
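Language: English
Published: MDPI AG 2021-07-01
Series: Applied Sciences
Subjects: natural language processing; open-domain question answering; multi-choice question answering; clinical question answering
Online Access: https://www.mdpi.com/2076-3417/11/14/6421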
id doaj-11ebc944223d456caec3066e6152ebd1
record_format Article
spelling doaj-11ebc944223d456caec3066e6152ebd1
2021-07-23T13:29:34Z
eng
MDPI AG
Applied Sciences
2076-3417
2021-07-01
11
6421
6421
10.3390/app11146421
What Disease Does This Patient Have? A Large-Scale Open Domain Question Answering Dataset from Medical Exams
Di Jin (0)
Eileen Pan (1)
Nassim Oufattole (2)
Wei-Hung Weng (3)
Hanyi Fang (4)
Peter Szolovits (5)
(0) Computer Science and Artificial Intelligence, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
(1) Computer Science and Artificial Intelligence, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
(2) Computer Science and Artificial Intelligence, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
(3) Computer Science and Artificial Intelligence, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
(4) Tongji Medical College, Huazhong University of Science and Technology, Wuhan 430074, China
(5) Computer Science and Artificial Intelligence, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
Open domain question answering (OpenQA) tasks have been recently attracting more and more attention from the natural language processing (NLP) community. In this work, we present the first free-form multiple-choice OpenQA dataset for solving medical problems, MedQA, collected from the professional medical board exams. It covers three languages: English, simplified Chinese, and traditional Chinese, and contains 12,723, 34,251, and 14,123 questions for the three languages, respectively. We implement both rule-based and popular neural methods by sequentially combining a document retriever and a machine comprehension model. Through experiments, we find that even the current best method can only achieve 36.7%, 42.0%, and 70.1% of test accuracy on the English, traditional Chinese, and simplified Chinese questions, respectively. We expect MedQA to present great challenges to existing OpenQA systems and hope that it can serve as a platform to promote much stronger OpenQA models from the NLP community in the future.
https://www.mdpi.com/2076-3417/11/14/6421
natural language processing
open-domain question answering
multi-choice question answering
clinical question answering
collection DOAJ
language English
format Article
sources DOAJ
author Di Jin
Eileen Pan
Nassim Oufattole
Wei-Hung Weng
Hanyi Fang
Peter Szolovits
spellingShingle Di Jin
Eileen Pan
Nassim Oufattole
Wei-Hung Weng
Hanyi Fang
Peter Szolovits
What Disease Does This Patient Have? A Large-Scale Open Domain Question Answering Dataset from Medical Exams
Applied Sciences
natural language processing
open-domain question answering
multi-choice question answering
clinical question answering
author_facet Di Jin
Eileen Pan
Nassim Oufattole
Wei-Hung Weng
Hanyi Fang
Peter Szolovits
author_sort Di Jin
title What Disease Does This Patient Have? A Large-Scale Open Domain Question Answering Dataset from Medical Exams
title_short What Disease Does This Patient Have? A Large-Scale Open Domain Question Answering Dataset from Medical Exams
title_full What Disease Does This Patient Have? A Large-Scale Open Domain Question Answering Dataset from Medical Exams
title_fullStr What Disease Does This Patient Have? A Large-Scale Open Domain Question Answering Dataset from Medical Exams
title_full_unstemmed What Disease Does This Patient Have? A Large-Scale Open Domain Question Answering Dataset from Medical Exams
title_sort what disease does this patient have? a large-scale open domain question answering dataset from medical exams
publisher MDPI AG
series Applied Sciences
issn 2076-3417
publishDate 2021-07-01
description Open domain question answering (OpenQA) tasks have been recently attracting more and more attention from the natural language processing (NLP) community. In this work, we present the first free-form multiple-choice OpenQA dataset for solving medical problems, MedQA, collected from the professional medical board exams. It covers three languages: English, simplified Chinese, and traditional Chinese, and contains 12,723, 34,251, and 14,123 questions for the three languages, respectively. We implement both rule-based and popular neural methods by sequentially combining a document retriever and a machine comprehension model. Through experiments, we find that even the current best method can only achieve 36.7%, 42.0%, and 70.1% of test accuracy on the English, traditional Chinese, and simplified Chinese questions, respectively. We expect MedQA to present great challenges to existing OpenQA systems and hope that it can serve as a platform to promote much stronger OpenQA models from the NLP community in the future.
topic natural language processing
open-domain question answering
multi-choice question answering
clinical question answering
url https://www.mdpi.com/2076-3417/11/14/6421
work_keys_str_mv AT dijin whatdiseasedoesthispatienthavealargescaleopendomainquestionansweringdatasetfrommedicalexams
AT eileenpan whatdiseasedoesthispatienthavealargescaleopendomainquestionansweringdatasetfrommedicalexams
AT nassimoufattole whatdiseasedoesthispatienthavealargescaleopendomainquestionansweringdatasetfrommedicalexams
AT weihungweng whatdiseasedoesthispatienthavealargescaleopendomainquestionansweringdatasetfrommedicalexams
AT hanyifang whatdiseasedoesthispatienthavealargescaleopendomainquestionansweringdatasetfrommedicalexams
AT peterszolovits whatdiseasedoesthispatienthavealargescaleopendomainquestionansweringdatasetfrommedicalexams
_version_ 1721289495070900224