The biomedical discourse relation bank

Abstract. Background: Identification of discourse relations, such as causal and contrastive relations, between situations mentioned in text is an important task for biomedical text-mining. A biomedical text corpus annotated with discourse relations would...

Bibliographic Details
Main Authors:	Joshi Aravind, Frid Nadya, McRoy Susan, Prasad Rashmi, Yu Hong
Format:	Article
Language:	English
Published:	BMC 2011-05-01
Series:	BMC Bioinformatics
Online Access:	http://www.biomedcentral.com/1471-2105/12/188

id	doaj-9823b6beab1d49c48967e281184e78ed
record_format	Article
spelling	doaj-9823b6beab1d49c48967e281184e78ed 2020-11-24T21:41:41Z eng BMC BMC Bioinformatics 1471-2105 2011-05-01 12 1 188 10.1186/1471-2105-12-188 The biomedical discourse relation bank Joshi Aravind, Frid Nadya, McRoy Susan, Prasad Rashmi, Yu Hong http://www.biomedcentral.com/1471-2105/12/188
collection	DOAJ
language	English
format	Article
sources	DOAJ
author	Joshi Aravind Frid Nadya McRoy Susan Prasad Rashmi Yu Hong
spellingShingle	Joshi Aravind Frid Nadya McRoy Susan Prasad Rashmi Yu Hong The biomedical discourse relation bank BMC Bioinformatics
author_facet	Joshi Aravind Frid Nadya McRoy Susan Prasad Rashmi Yu Hong
author_sort	Joshi Aravind
title	The biomedical discourse relation bank
title_short	The biomedical discourse relation bank
title_full	The biomedical discourse relation bank
title_fullStr	The biomedical discourse relation bank
title_full_unstemmed	The biomedical discourse relation bank
title_sort	biomedical discourse relation bank
publisher	BMC
series	BMC Bioinformatics
issn	1471-2105
publishDate	2011-05-01
description	Abstract. Background: Identification of discourse relations, such as causal and contrastive relations, between situations mentioned in text is an important task for biomedical text-mining. A biomedical text corpus annotated with discourse relations would be very useful for developing and evaluating methods for biomedical discourse processing. However, little effort has been made to develop such an annotated resource. Results: We have developed the Biomedical Discourse Relation Bank (BioDRB), in which we have annotated explicit and implicit discourse relations in 24 open-access full-text biomedical articles from the GENIA corpus. Guidelines for the annotation were adapted from the Penn Discourse TreeBank (PDTB), which has discourse relations annotated over open-domain news articles. We introduced new conventions and modifications to the sense classification. We report reliable inter-annotator agreement of over 80% for all sub-tasks. Experiments for identifying the sense of explicit discourse connectives show the connective itself as a highly reliable indicator for coarse sense classification (accuracy 90.9% and F1 score 0.89). These results are comparable to results obtained with the same classifier on the PDTB data. With more refined sense classification, there is degradation in performance (accuracy 69.2% and F1 score 0.28), mainly due to sparsity in the data. The size of the corpus was found to be sufficient for identifying the sense of explicit connectives, with classifier performance stabilizing at about 1900 training instances. Finally, the classifier performs poorly when trained on PDTB and tested on BioDRB (accuracy 54.5% and F1 score 0.57). Conclusion: Our work shows that discourse relations can be reliably annotated in biomedical text. Coarse sense disambiguation of explicit connectives can be done with high reliability by using just the connective as a feature, but more refined sense classification requires either richer features or more annotated data. The poor performance of a classifier trained in the open domain and tested in the biomedical domain suggests significant differences in the semantic usage of connectives across these domains, and provides robust evidence for a biomedical sublanguage for discourse and the need to develop a specialized biomedical discourse annotated corpus. The results of our cross-domain experiments are consistent with related work on identifying connectives in BioDRB.
url	http://www.biomedcentral.com/1471-2105/12/188
work_keys_str_mv	AT joshiaravind thebiomedicaldiscourserelationbank AT fridnadya thebiomedicaldiscourserelationbank AT mcroysusan thebiomedicaldiscourserelationbank AT prasadrashmi thebiomedicaldiscourserelationbank AT yuhong thebiomedicaldiscourserelationbank AT joshiaravind biomedicaldiscourserelationbank AT fridnadya biomedicaldiscourserelationbank AT mcroysusan biomedicaldiscourserelationbank AT prasadrashmi biomedicaldiscourserelationbank AT yuhong biomedicaldiscourserelationbank
_version_	1725920470228795392

Similar Items