
Semi-automatic morphosyntactic annotation of Russian texts (Halbautomatische morphosyntaktische Annotation russischer Texte)

Roland Meyer

Part-of-speech and morphosyntactic annotation clearly enhances the value of corpus texts as a tool for linguistic research: it lets the user search not only for strings of letters (or regular expressions over them), but also for morphosyntactic categories (or expressions over them), which is often the only way to isolate interesting syntactic phenomena. Nowadays this annotation (‘tagging’) is mostly achieved by stochastic methods: starting from a manually annotated training corpus, an automatic tagger ‘learns’ the probabilities of the individual tags for a given word, as well as the local contexts in which each tag may appear. In Slavic languages, the biggest obstacle to this approach is the large number of tags needed to encode all relevant morphosyntactic categories. When developing a tagging strategy for the automatic annotation of the Tübingen Russian Corpora (a large Russian text corpus searchable via the internet), we tried to reduce the difficulty of the task by relying on lexical resources, i.e. a large independent lexicon of form-tag combinations derived from Zaliznjak’s (1977) Grammatical Dictionary. The choice of tagset is motivated mainly by linguistic needs and only to a lesser extent by efficiency considerations. The present article explains the tagset and the steps taken for tagging the Tübingen Russian Corpora in some detail, discusses the present outcome, and points to some promising new approaches which could help to reduce error rates.
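The idea of combining a stochastic tagger with a form-tag lexicon can be illustrated with a minimal sketch. It is not the procedure used for the Tübingen Russian Corpora: the training sentences, the coarse tags (PRON, VERB, NOUN), the lexicon entries and the function names `train` and `tag_sentence` are all hypothetical, and the greedy bigram scoring with add-one smoothing is merely one simple stand-in for "probabilities of tags for a word and its local context". The point shown is only that the lexicon narrows the candidate tags before the statistics choose among them.

```python
from collections import Counter, defaultdict

# Hypothetical toy training corpus (illustrative forms and tags only).
TRAIN = [
    [("я", "PRON"), ("вижу", "VERB"), ("стекло", "NOUN")],
    [("молоко", "NOUN"), ("стекло", "VERB")],
    [("она", "PRON"), ("видит", "VERB"), ("молоко", "NOUN")],
]

# Hypothetical form-tag lexicon: the tags each word form admits.
LEXICON = {
    "стекло": {"NOUN", "VERB"},  # 'glass' vs. 'flowed down'
    "окно": {"NOUN"},            # not in TRAIN, known only from the lexicon
    "вижу": {"VERB"},
    "я": {"PRON"},
}


def train(sentences):
    """Count how often each form carries each tag and each tag follows another."""
    emit = defaultdict(Counter)   # emit[form][tag]
    trans = defaultdict(Counter)  # trans[previous_tag][tag]
    for sentence in sentences:
        prev = "<s>"              # sentence-start marker
        for form, tag in sentence:
            emit[form][tag] += 1
            trans[prev][tag] += 1
            prev = tag
    return emit, trans


def tag_sentence(forms, emit, trans, lexicon):
    """Greedy left-to-right tagging, restricted to lexicon-licensed tags."""
    all_tags = {t for counts in emit.values() for t in counts}
    prev, result = "<s>", []
    for form in forms:
        seen = emit.get(form, {})
        # Candidate tags: lexicon first, then tags seen in training,
        # and only as a last resort every known tag.
        candidates = lexicon.get(form) or set(seen) or all_tags
        context = trans.get(prev, {})
        # Add-one smoothed product of word-tag and tag-bigram counts;
        # sorting makes ties resolve deterministically.
        best = max(
            sorted(candidates),
            key=lambda t: (seen.get(t, 0) + 1) * (context.get(t, 0) + 1),
        )
        result.append((form, best))
        prev = best
    return result


EMIT, TRANS = train(TRAIN)
print(tag_sentence(["я", "вижу", "стекло"], EMIT, TRANS, LEXICON))
print(tag_sentence(["молоко", "стекло"], EMIT, TRANS, LEXICON))
print(tag_sentence(["я", "вижу", "окно"], EMIT, TRANS, LEXICON))
```

In this toy setting the ambiguous form "стекло" comes out as a noun after a verb and as a verb after a noun, while "окно", which never occurs in the training data, still receives its single lexicon tag. With a realistic Slavic tagset, in which tags also encode case, number, gender and so on, the pruning done by the lexicon matters far more than it does here.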

In: Linguistische Beiträge zur Slavistik. X. JungslavistInnen-Treffen Berlin 2001. Ed. Robert Hammel and Ljudmila Geist. München: Sagner 2003 (= Specimina Philologiae Slavicae 139), pp. 192–205.