Korpusbasierte Wörterbucharbeit mit den Daten des Projekts Deutscher Wortschatz

The corpus project Deutscher Wortschatz (German Vocabulary) at Leipzig University has been collecting and processing textual data for 15 years. It now comprises approx. 2 billion running words in 160 million sentences. The dictionary is available online at www.wortschatz.uni-leipzig.de and, moreover, co...

Bibliographic Details
Main Author: Quasthoff, Uwe
Format: Article
Language: deu
Published: Bern Open Publishing 2009-01-01
Series: Linguistik Online
Online Access: http://www.linguistik-online.de/39_09/quasthoff.pdf
id doaj-903af5660a9d4677aeeb91ace385d9cb
record_format Article
spelling doaj-903af5660a9d4677aeeb91ace385d9cb | 2021-07-02T01:52:59Z | deu | Bern Open Publishing | Linguistik Online | 1615-3014 | 2009-01-01 | 39 | 3 | 151 | 162 | Korpusbasierte Wörterbucharbeit mit den Daten des Projekts Deutscher Wortschatz | Quasthoff, Uwe | The corpus project Deutscher Wortschatz (German Vocabulary) at Leipzig University has been collecting and processing textual data for 15 years. It now comprises approx. 2 billion running words in 160 million sentences. The dictionary is available online at www.wortschatz.uni-leipzig.de and, moreover, contains word co-occurrence data. The pre-processing of the data relied mainly on language-independent methods, which were also used for corpora in other languages. The paper describes the production process for three dictionaries for which these corpus data were used: a thesaurus, a dictionary of neologisms, and a collocation dictionary. In all cases, the raw data for the dictionary entries were produced automatically, and the final entries were written using only these pre-selections. In the case of the thesaurus, the preprocessing consisted of a corpus-based detection of semantically similar words. For the neologism dictionary, the yearly frequency information was used, and for the collocation dictionary, word co-occurrences and part-of-speech information were combined. | http://www.linguistik-online.de/39_09/quasthoff.pdf
collection DOAJ
language deu
format Article
sources DOAJ
author Quasthoff, Uwe
spellingShingle Quasthoff, Uwe
Korpusbasierte Wörterbucharbeit mit den Daten des Projekts Deutscher Wortschatz
Linguistik Online
author_facet Quasthoff, Uwe
author_sort Quasthoff, Uwe
title Korpusbasierte Wörterbucharbeit mit den Daten des Projekts Deutscher Wortschatz
title_short Korpusbasierte Wörterbucharbeit mit den Daten des Projekts Deutscher Wortschatz
title_full Korpusbasierte Wörterbucharbeit mit den Daten des Projekts Deutscher Wortschatz
title_fullStr Korpusbasierte Wörterbucharbeit mit den Daten des Projekts Deutscher Wortschatz
title_full_unstemmed Korpusbasierte Wörterbucharbeit mit den Daten des Projekts Deutscher Wortschatz
title_sort korpusbasierte wörterbucharbeit mit den daten des projekts deutscher wortschatz
publisher Bern Open Publishing
series Linguistik Online
issn 1615-3014
publishDate 2009-01-01
description The corpus project Deutscher Wortschatz (German Vocabulary) at Leipzig University has been collecting and processing textual data for 15 years. It now comprises approx. 2 billion running words in 160 million sentences. The dictionary is available online at www.wortschatz.uni-leipzig.de and, moreover, contains word co-occurrence data. The pre-processing of the data relied mainly on language-independent methods, which were also used for corpora in other languages. The paper describes the production process for three dictionaries for which these corpus data were used: a thesaurus, a dictionary of neologisms, and a collocation dictionary. In all cases, the raw data for the dictionary entries were produced automatically, and the final entries were written using only these pre-selections. In the case of the thesaurus, the preprocessing consisted of a corpus-based detection of semantically similar words. For the neologism dictionary, the yearly frequency information was used, and for the collocation dictionary, word co-occurrences and part-of-speech information were combined.
url http://www.linguistik-online.de/39_09/quasthoff.pdf
work_keys_str_mv AT quasthoffuwe korpusbasierteworterbucharbeitmitdendatendesprojektsdeutscherwortschatz
_version_ 1721344181166669824