Graph-based annotation engineering: towards a gold corpus for role and reference grammar

  • This paper describes the application of annotation engineering techniques for the construction of a corpus for Role and Reference Grammar (RRG). RRG is a semantics-oriented formalism for natural language syntax popular in comparative linguistics and linguistic typology, and predominantly applied for the description of non-European languages which are less-resourced in terms of natural language processing. Because of its cross-linguistic applicability and its conjoint treatment of syntax and semantics, RRG also represents a promising framework for research challenges within natural language processing. At the moment, however, these have not been explored as no RRG corpus data is publicly available. While RRG annotations cannot be easily derived from any single treebank in existence, we suggest that they can be reliably inferred from the intersection of syntactic and semantic annotations as represented by, for example, the Universal Dependencies (UD) and PropBank (PB), and we demonstrateThis paper describes the application of annotation engineering techniques for the construction of a corpus for Role and Reference Grammar (RRG). RRG is a semantics-oriented formalism for natural language syntax popular in comparative linguistics and linguistic typology, and predominantly applied for the description of non-European languages which are less-resourced in terms of natural language processing. Because of its cross-linguistic applicability and its conjoint treatment of syntax and semantics, RRG also represents a promising framework for research challenges within natural language processing. At the moment, however, these have not been explored as no RRG corpus data is publicly available. While RRG annotations cannot be easily derived from any single treebank in existence, we suggest that they can be reliably inferred from the intersection of syntactic and semantic annotations as represented by, for example, the Universal Dependencies (UD) and PropBank (PB), and we demonstrate this for the English Web Treebank, a 250,000 token corpus of various genres of English internet text. The resulting corpus is a gold corpus for future experiments in natural language processing in the sense that it is built on existing annotations which have been created manually. A technical challenge in this context is to align UD and PB annotations, to integrate them in a coherent manner, and to distribute and to combine their information on RRG constituent and operator projections. For this purpose, we describe a framework for flexible and scalable annotation engineering based on flexible, unconstrained graph transformations of sentence graphs by means of SPARQL Update.show moreshow less

Download full text files

Export metadata

Statistics

Number of document requests

Additional Services

Share in Twitter Search Google Scholar
Metadaten
Author:Christian ChiarcosORCiDGND, Christian Fäth
URN:urn:nbn:de:bvb:384-opus4-1040868
Frontdoor URLhttps://opus.bibliothek.uni-augsburg.de/opus4/104086
ISBN:978-3-95977-105-4OPAC
ISSN:1868-8969OPAC
Parent Title (English):2nd Conference on Language, Data and Knowledge, LDK 2019, May 20–23, 2019, Leipzig, Germany
Publisher:Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Place of publication:Saarbrücken
Editor:Maria Eskevich, Gerard de Melo, Christian Fäth, John P. McCrae, Paul Buitelaar, Christian ChiarcosORCiDGND, Bettina Klimek, Milan Dojchinovski
Type:Conference Proceeding
Language:English
Year of first Publication:2019
Publishing Institution:Universität Augsburg
Release Date:2023/05/15
First Page:9:1
Last Page:9:11
Series:OpenAccess Series in Informatics (OASIcs) ; 70
DOI:https://doi.org/10.4230/OASIcs.LDK.2019.9
Institutes:Philologisch-Historische Fakultät
Philologisch-Historische Fakultät / Angewandte Computerlinguistik
Philologisch-Historische Fakultät / Angewandte Computerlinguistik / Lehrstuhl für Angewandte Computerlinguistik (ACoLi)
Dewey Decimal Classification:4 Sprache / 40 Sprache / 400 Sprache
Licence (German):CC-BY 3.0: Creative Commons - Namensnennung (mit Print on Demand)