New directions in biomedical text annotation: definitions, guidelines and corpus construction

Abstract. Background: While biomedical text mining is emerging as an important research area, practical results have proven difficult to achieve. We believe that an important first step towards more accurate text-mining lies in the ability to identify and...

Bibliographic Details
Main Authors: Rzhetsky Andrey, Wilbur W John, Shatkay Hagit
Format: Article
Language: English
Published: BMC 2006-07-01
Series: BMC Bioinformatics
Online Access: http://www.biomedcentral.com/1471-2105/7/356
id doaj-969b3cf7fab045bc869d68b45ed9e259
record_format Article
spelling doaj-969b3cf7fab045bc869d68b45ed9e259; 2020-11-25T02:51:56Z; eng; BMC; BMC Bioinformatics; 1471-2105; 2006-07-01; 7; 1; 356; 10.1186/1471-2105-7-356; New directions in biomedical text annotation: definitions, guidelines and corpus construction; Rzhetsky Andrey; Wilbur W John; Shatkay Hagit; http://www.biomedcentral.com/1471-2105/7/356
collection DOAJ
language English
format Article
sources DOAJ
author Rzhetsky Andrey
Wilbur W John
Shatkay Hagit
title New directions in biomedical text annotation: definitions, guidelines and corpus construction
publisher BMC
series BMC Bioinformatics
issn 1471-2105
publishDate 2006-07-01
description Abstract
Background: While biomedical text mining is emerging as an important research area, practical results have proven difficult to achieve. We believe that an important first step towards more accurate text-mining lies in the ability to identify and characterize text that satisfies various types of information needs. We report here the results of our inquiry into properties of scientific text that have sufficient generality to transcend the confines of a narrow subject area, while supporting practical mining of text for factual information. Our ultimate goal is to annotate a significant corpus of biomedical text and train machine learning methods to automatically categorize such text along certain dimensions that we have defined.
Results: We have identified five qualitative dimensions that we believe characterize a broad range of scientific sentences, and are therefore useful for supporting a general approach to text-mining: focus, polarity, certainty, evidence, and directionality. We define these dimensions and describe the guidelines we have developed for annotating text with regard to them.
To examine the effectiveness of the guidelines, twelve annotators independently annotated the same set of 101 sentences that were randomly selected from current biomedical periodicals. Analysis of these annotations shows 70–80% inter-annotator agreement, suggesting that our guidelines indeed present a well-defined, executable and reproducible task.
Conclusion: We present our guidelines defining a text annotation task, along with annotation results from multiple independently produced annotations, demonstrating the feasibility of the task. The annotation of a very large corpus of documents along these guidelines is currently ongoing. These annotations form the basis for the categorization of text along multiple dimensions, to support viable text mining for experimental results, methodology statements, and other forms of information. We are currently developing machine learning methods, to be trained and tested on the annotated corpus, that would allow for the automatic categorization of biomedical text along the general dimensions that we have presented. The guidelines in full detail, along with annotated examples, are publicly available.
url http://www.biomedcentral.com/1471-2105/7/356
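
The abstract reports 70–80% inter-annotator agreement among twelve annotators but does not say how that figure was computed. As an illustration only, the sketch below uses mean pairwise percent agreement, one common choice and an assumption here, not necessarily the authors' measure. The function name, annotator names, and the "P"/"N" labels are hypothetical placeholders; the actual label sets are defined in the published guidelines.

```python
from itertools import combinations


def pairwise_agreement(labels_by_annotator):
    """Return the mean fraction of items on which two annotators agree,
    averaged over every pair of annotators.

    labels_by_annotator maps an annotator name to a list of labels,
    one label per sentence, in the same sentence order for everyone.
    """
    annotators = list(labels_by_annotator)
    if len(annotators) < 2:
        raise ValueError("need at least two annotators")

    pair_scores = []
    for a, b in combinations(annotators, 2):
        labels_a = labels_by_annotator[a]
        labels_b = labels_by_annotator[b]
        if not labels_a or len(labels_a) != len(labels_b):
            raise ValueError("annotators must label the same non-empty item set")
        # Count the sentences where this pair chose the same label.
        matches = sum(x == y for x, y in zip(labels_a, labels_b))
        pair_scores.append(matches / len(labels_a))

    return sum(pair_scores) / len(pair_scores)


# Toy data (hypothetical): three annotators label four sentences on one
# dimension, here called polarity, with placeholder labels "P" and "N".
toy_polarity = {
    "annotator_1": ["P", "P", "N", "P"],
    "annotator_2": ["P", "N", "N", "P"],
    "annotator_3": ["P", "P", "N", "N"],
}

if __name__ == "__main__":
    print(f"mean pairwise agreement: {pairwise_agreement(toy_polarity):.2f}")
```

Raw percent agreement does not correct for agreement expected by chance, so a chance-corrected statistic such as Cohen's or Fleiss' kappa may be preferable when label distributions are skewed.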