Semi-automatic morphosyntactic annotation of Russian texts
Part-of-speech and morphosyntactic annotation clearly enhances the value of corpus texts as a tool for linguistic research: it lets the user search not only for strings of letters (or regular expressions over them), but also for morphosyntactic categories (or expressions over them), which is often the only way to isolate interesting syntactic phenomena. Nowadays this annotation (‘tagging’) is mostly achieved with stochastic methods: starting from a manually annotated training corpus, an automatic tagger ‘learns’ the probability of each tag for a given word as well as the local contexts in which that tag may appear. For Slavic languages, the biggest obstacle to this approach is the large number of tags needed to encode all relevant morphosyntactic categories. When developing a tagging strategy for the automatic annotation of the Tübingen Russian Corpora (a large Russian text corpus searchable via the internet), we tried to reduce the difficulty of the task by relying on lexical resources, namely a large independent lexicon of form-tag combinations derived from Zaliznjak’s (1977) Grammatical Dictionary. The choice of tagset is motivated mainly by linguistic needs and only to a lesser extent by efficiency considerations. The present article describes the tagset and the steps taken to tag the Tübingen Russian Corpora in some detail, discusses the current results, and points to some promising new approaches that could help to reduce error rates.
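
The general idea of combining a stochastic tagger with an independent form-tag lexicon can be illustrated with a minimal Python sketch. It is not the procedure used for the Tübingen Russian Corpora: the tag names (`N.nom`, `V.past`, and so on), the toy training data, the toy lexicon, and the greedy left-to-right decoding with add-one smoothing are all assumptions chosen for brevity. The sketch only shows how a lexicon can narrow the set of candidate tags before corpus statistics pick among them.

```python
from collections import Counter, defaultdict

# Toy training corpus with hypothetical tags (not the article's tagset).
TRAIN = [
    [("мама", "N.nom"), ("мыла", "V.past"), ("раму", "N.acc")],
    [("папа", "N.nom"), ("читал", "V.past"), ("книгу", "N.acc")],
    [("нет", "PRED"), ("мыла", "N.gen")],
]

# Toy stand-in for an independent lexicon of form-tag combinations.
LEXICON = {
    "мыла": {"V.past", "N.gen"},
    "сестра": {"N.nom"},
    "окно": {"N.nom", "N.acc"},
}


def train(tagged_sentences):
    """Count word/tag and previous-tag/tag co-occurrences."""
    emission = defaultdict(Counter)    # word form -> counts of its tags
    transition = defaultdict(Counter)  # previous tag -> counts of next tags
    for sentence in tagged_sentences:
        prev = "<s>"  # sentence-start marker
        for word, tag in sentence:
            emission[word.lower()][tag] += 1
            transition[prev][tag] += 1
            prev = tag
    return emission, transition


def tag(words, emission, transition, lexicon):
    """Greedy tagging; the lexicon restricts which tags are considered."""
    all_tags = sorted({t for counts in emission.values() for t in counts})
    prev, result = "<s>", []
    for word in words:
        form = word.lower()
        # Candidate tags: lexicon entry first, then tags seen in training,
        # and only as a last resort every tag known from training.
        candidates = sorted(lexicon.get(form) or emission.get(form) or all_tags)

        def score(t):
            # Add-one smoothing so unseen combinations are not ruled out.
            return (emission[form][t] + 1) * (transition[prev][t] + 1)

        best = max(candidates, key=score)
        result.append((word, best))
        prev = best
    return result


EMISSION, TRANSITION = train(TRAIN)

if __name__ == "__main__":
    print(tag(["сестра", "мыла", "окно"], EMISSION, TRANSITION, LEXICON))
    print(tag(["нет", "мыла"], EMISSION, TRANSITION, LEXICON))
```

In this toy setting the ambiguous form мыла (past tense of "to wash" or genitive of "soap") is resolved by the preceding tag, while the unseen forms сестра and окно receive tags only because the lexicon lists them. A real system would need a far larger tagset and proper sequence decoding rather than a greedy choice.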