An ontology for CoNLL-RDF: formal data structures for TSV formats in language technology

In language technology and the language sciences, tab-separated values (TSV) are a frequently used formalism for representing linguistically annotated natural language, often referred to as "CoNLL formats". A large number of such formats exist, and although they share a number of common features, they are not interoperable, because different pieces of information are encoded differently in these dialects. CoNLL-RDF refers to a programming library and its associated data model, introduced to facilitate processing and transforming such TSV formats in a serialization-independent way. CoNLL-RDF represents CoNLL data by means of RDF graphs and SPARQL update operations, but so far without machine-readable semantics: annotation properties are created dynamically on the basis of a user-defined mapping from columns to labels.
Current applications of CoNLL-RDF include linking between corpora and dictionaries [Mambrini and Passarotti, 2019] and knowledge graphs [Tamper et al., 2018], syntactic parsing of historical languages [Chiarcos et al., 2018; Chiarcos et al., 2018], the consolidation of syntactic and semantic annotations [Chiarcos and Fäth, 2019], a bridge between RDF corpora and a traditional corpus query language [Ionov et al., 2020], and language contact studies [Chiarcos et al., 2018]. We describe a novel extension of CoNLL-RDF that introduces a formal data model, formalized as an ontology. The ontology is a basis for linking RDF corpora with other Semantic Web resources, but more importantly, its application to transformation between different TSV formats is a major step towards interoperability between CoNLL formats.
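To illustrate the idea of creating annotation properties from a user-defined mapping of columns to labels, the following is a minimal sketch in plain Python. It is not the CoNLL-RDF library or its API: the function name `tsv_to_triples`, the namespace `http://example.org/conll#`, the base URI, and the token naming scheme `s<sentence>_<token>` are all assumptions made for illustration.

```python
from typing import List, Tuple

# Placeholder namespace and base URI (assumptions, not those of CoNLL-RDF itself).
CONLL = "http://example.org/conll#"
BASE = "http://example.org/corpus#"

Triple = Tuple[str, str, str]


def tsv_to_triples(tsv: str, columns: List[str]) -> List[Triple]:
    """Turn CoNLL-style TSV into (subject, predicate, object) triples.

    Each non-empty, non-comment line is one token; a blank line ends a
    sentence. The predicate for each cell is derived from the label that
    the caller assigned to that column.
    """
    triples: List[Triple] = []
    sent_no, tok_no = 1, 0
    for line in tsv.splitlines():
        if not line.strip():
            # Blank line: sentence boundary (ignored if no token was seen yet).
            if tok_no > 0:
                sent_no += 1
                tok_no = 0
            continue
        if line.startswith("#"):
            # Comment line, as used in some CoNLL dialects.
            continue
        tok_no += 1
        subject = f"{BASE}s{sent_no}_{tok_no}"
        for label, value in zip(columns, line.split("\t")):
            if value != "_":  # "_" conventionally marks an empty cell
                triples.append((subject, CONLL + label, value))
    return triples


sample = "1\tThe\tDET\n2\tdog\tNOUN\n\n1\tBarks\tVERB\n"
triples = tsv_to_triples(sample, ["ID", "WORD", "POS"])
for triple in triples:
    print(triple)
```

Because the predicates are derived only from the caller's labels, two dialects that store the same information in different columns yield the same property names once their columns are labelled consistently. The labels themselves carry no formal meaning, however, which is the gap the ontology described in the abstract is meant to address.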

Metadata
Authors: Christian Chiarcos, Maxim Ionov, Luis Glaser, Christian Fäth
URN: urn:nbn:de:bvb:384-opus4-1040010
Frontdoor URL: https://opus.bibliothek.uni-augsburg.de/opus4/104001
URL: https://drops.dagstuhl.de/opus/portals/oasics/index.php?semnr=16205
ISBN: 978-3-95977-199-3
ISSN: 2190-6807
Title of parent work (English): 3rd Conference on Language, Data and Knowledge (LDK 2021), September 1–3, 2021, Zaragoza, Spain
Publisher: Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Place of publication: Saarbrücken
Editors: Dagmar Gromann, Gilles Sérasset, Thierry Declerck, John P. McCrae, Jorge Gracia, Julia Bosque-Gil, Fernando Bobillo, Barbara Heinisch
Type: Conference publication
Language: English
Date of creation: 24.04.2023
Year of first publication: 2021
Publishing institution: Universität Augsburg
Date of release in OPUS: 16.05.2023
First page: 20:1
Last page: 20:14
Series: OASIcs ; 93
University units: Philologisch-Historische Fakultät
Philologisch-Historische Fakultät / Angewandte Computerlinguistik
Philologisch-Historische Fakultät / Angewandte Computerlinguistik / Lehrstuhl für Angewandte Computerlinguistik (ACoLi)
DDC classification: 4 Language / 40 Language / 400 Language
License: CC-BY 4.0: Creative Commons: Attribution (with Print on Demand)