The Searchbench - Combining Sentence-semantic, Full-text and Bibliographic Search in Digital Libraries

We describe a novel approach to precise searching in the full content of digital libraries. The Searchbench (for search workbench) is based on sentence-wise syntactic and semantic natural language processing (NLP) of both born-digital and scanned publications in PDF format. The term born-digital me...

Bibliographic Details
Main Authors: Ulrich Schäfer, Bernd Kiefer, Christian Spurk, Jörg Steffen, Rui Wang, Benjamin Weitz, Magdalena Wolska
Format: Article
Language: English
Published: openjournals.nl 2013-02-01
Series: Liber Quarterly: The Journal of European Research Libraries
Subjects: sentence-semantic search, natural language processing, citation browser
Online Access: https://test.openjournals.nl/liberquarterly/article/view/10685
id doaj-e2751875c0214e3e86787260cb6127c1
record_format Article
spelling doaj-e2751875c0214e3e86787260cb6127c1 2021-09-30T14:16:45Z eng openjournals.nl Liber Quarterly: The Journal of European Research Libraries 2213-056X 2013-02-01 224
The Searchbench - Combining Sentence-semantic, Full-text and Bibliographic Search in Digital Libraries
Ulrich Schäfer, Bernd Kiefer, Christian Spurk, Jörg Steffen, Rui Wang, Benjamin Weitz: German Research Center for Artificial Intelligence (DFKI), Saarbrücken
Magdalena Wolska: Computational Linguistics, Saarland University, Saarbrücken, Germany
https://test.openjournals.nl/liberquarterly/article/view/10685
sentence-semantic search; natural language processing; citation browser
collection DOAJ
language English
format Article
sources DOAJ
author Ulrich Schäfer
Bernd Kiefer
Christian Spurk
Jörg Steffen
Rui Wang
Benjamin Weitz
Magdalena Wolska
title The Searchbench - Combining Sentence-semantic, Full-text and Bibliographic Search in Digital Libraries
publisher openjournals.nl
series Liber Quarterly: The Journal of European Research Libraries
issn 2213-056X
publishDate 2013-02-01
description We describe a novel approach to precise searching in the full content of digital libraries. The Searchbench (for search workbench) is based on sentence-wise syntactic and semantic natural language processing (NLP) of both born-digital and scanned publications in PDF format. The term born-digital means natively digital, i.e. prepared electronically using typesetting systems such as LaTeX, OpenOffice, and the like. In the Searchbench, queries can be formulated as (possibly underspecified) statements consisting of simple subject-predicate-object constructs such as ‘algorithm improves word alignment’. This reduces the number of false hits in large document collections, which arise when the search words happen to appear close to each other but are not semantically related. The method also abstracts from passive voice and predicate synonyms. Moreover, negated statements can be excluded from the search results, and negated antonym predicates again count as synonyms (e.g. not include = exclude).
In the Searchbench, a sentence-semantic search can be combined with search filters for classical full-text search, bibliographic metadata and automatically computed domain terms. Auto-suggest fields facilitate text input. Queries can be bookmarked or emailed. Furthermore, a novel citation browser in the Searchbench allows graphical navigation in citation networks, which have been extracted automatically from metadata and paper texts. The citation browser displays short phrases from citation sentences at the edges of the citation graph and thus allows students and researchers to quickly browse publications and immerse themselves in a new research field. Clicking on a citation edge shows the original citation sentence in context, and optionally also in the original PDF layout. To showcase the usefulness of our research, we have applied it to a collection of currently approx. 25,000 open access research papers in the field of computational linguistics and language technology, the ACL Anthology (http://aclweb.org/anthology). The Searchbench user interface is a web application that runs in every modern, JavaScript-enabled web browser, including those on smartphones and tablet computers. The system is a free and public service at http://aclasb.dfki.de. Because the NLP technology is domain-independent, it could also be applied to newspaper texts, technical documentation, or scientific publications from other disciplines. The aim of this paper is to make the benefits of this new, language-technology-based approach known in library research and related fields. This article summarises 9 peer-reviewed publications from the past three years that have been published in international conferences and workshops in the area of computational linguistics, and tries to present them in an appropriate way to the LIBER audience. The original papers contain more details and are freely available from the author’s homepage[1] or via the Searchbench[2].
topic sentence-semantic search
natural language processing
citation browser
url https://test.openjournals.nl/liberquarterly/article/view/10685
_version_ 1716863124269694976
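
The matching behaviour described in the abstract (underspecified subject-predicate-object queries, predicate synonyms, exclusion of negated statements, and negated antonyms counting as synonyms) can be illustrated with a small toy sketch. This is not the Searchbench implementation: the names Statement, Query, canonical and matches, as well as the tiny synonym and antonym tables, are hypothetical and chosen only for illustration. The sketch also assumes that statements have already been extracted from sentences by an upstream NLP step, with passive constructions normalised to active subject-predicate-object order.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical toy lexicon for illustration only; not taken from the Searchbench.
SYNONYMS = {"enhance": "improve", "omit": "exclude"}
ANTONYMS = {"include": "exclude", "exclude": "include"}


@dataclass(frozen=True)
class Statement:
    """A subject-predicate-object statement assumed to be extracted from one sentence."""
    subject: str
    predicate: str
    obj: str
    negated: bool = False


@dataclass(frozen=True)
class Query:
    """A possibly underspecified query; None means 'match anything'."""
    subject: Optional[str] = None
    predicate: Optional[str] = None
    obj: Optional[str] = None


def canonical(predicate: str, negated: bool) -> Tuple[str, bool]:
    """Map a predicate to its canonical synonym; fold 'not X' into the antonym of X."""
    pred = SYNONYMS.get(predicate, predicate)
    if negated and pred in ANTONYMS:
        return ANTONYMS[pred], False
    return pred, negated


def matches(query: Query, stmt: Statement) -> bool:
    """Return True if the statement satisfies the query."""
    pred, negated = canonical(stmt.predicate, stmt.negated)
    if negated:
        return False  # statements that remain negated are excluded from the hits
    if query.predicate is not None:
        q_pred, _ = canonical(query.predicate, False)
        if q_pred != pred:
            return False
    if query.subject is not None and query.subject != stmt.subject:
        return False
    if query.obj is not None and query.obj != stmt.obj:
        return False
    return True


statements = [
    Statement("algorithm", "improve", "word alignment"),
    Statement("algorithm", "enhance", "word alignment"),
    Statement("algorithm", "improve", "word alignment", negated=True),
    Statement("parser", "improve", "word alignment"),
    Statement("model", "include", "stop words", negated=True),
]

query = Query("algorithm", "improve", "word alignment")
hits = [s for s in statements if matches(query, s)]
```

With these toy tables, the query matches the first two statements (the second through the synonym entry) and skips the negated one, while a query such as Query(predicate="exclude") matches the negated "include" statement, mirroring the "not include = exclude" example in the abstract.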