CoNLL-Merge: efficient harmonization of concurrent tokenization and textual variation

The proper detection of tokens in running text represents the initial processing step in modular NLP pipelines. But strategies for defining these minimal units can differ, and conflicting analyses of the same text seriously limit the integration of subsequent linguistic annotations into a shared representation. As a solution, we introduce CoNLL Merge, a practical tool for harmonizing TSV-related data models, as they occur, e.g., in multi-layer corpora with non-sequential, concurrent tokenizations, but also in ensemble combinations in Natural Language Processing. CoNLL Merge works unsupervised, requires no manual intervention or external data sources, and comes with a flexible API for fully automated merging routines, validity and sanity checks. Users can choose from several merging strategies: preserve a reference tokenization (with possible losses of annotation granularity), create a common tokenization layer consisting of minimal shared subtokens (loss-less in terms of annotation granularity, destructive against a reference tokenization), or present tokenization clashes (loss-less and non-destructive, but introducing empty tokens as place-holders for unaligned elements). We demonstrate the applicability of the tool on two use cases from natural language processing and computational philology.
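The second strategy (minimal shared subtokens) can be illustrated with a small Python sketch: splitting the text at the union of all token boundaries yields subtokens such that every original token is a concatenation of consecutive subtokens, so no annotation granularity is lost. This is an illustrative reconstruction under simplifying assumptions (tokens preserve the surface text), not the actual CoNLL Merge implementation or API; all function names here are hypothetical.

    # Sketch only: NOT the CoNLL-Merge API. Aligns two concurrent
    # tokenizations of the same text into minimal shared subtokens.

    def subtoken_boundaries(text, tokenizations):
        """Collect every token boundary (character offset) from all tokenizations."""
        boundaries = {0, len(text)}
        for tokens in tokenizations:
            pos = 0
            for tok in tokens:
                start = text.index(tok, pos)  # assumes tokens occur verbatim in the text
                end = start + len(tok)
                boundaries.update((start, end))
                pos = end
        return sorted(boundaries)

    def minimal_shared_subtokens(text, tokenizations):
        """Split the text at the union of all boundaries; whitespace-only
        segments are dropped. Each original token is then a concatenation
        of consecutive subtokens (loss-less, but destructive against any
        single reference tokenization)."""
        bounds = subtoken_boundaries(text, tokenizations)
        return [text[a:b] for a, b in zip(bounds, bounds[1:]) if text[a:b].strip()]

    # Example: two tokenizers disagree on clitic splitting.
    text = "don't stop"
    t1 = ["don't", "stop"]      # tokenizer A keeps the contraction whole
    t2 = ["do", "n't", "stop"]  # tokenizer B splits it
    print(minimal_shared_subtokens(text, [t1, t2]))
    # -> ['do', "n't", 'stop']

Under the first strategy, annotations on t2's "do" and "n't" would instead be collapsed onto the reference token "don't" (with possible loss of granularity); under the third, both segmentations would be kept side by side, with empty tokens as place-holders where one layer has no counterpart.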

Metadata
Author:Christian Chiarcos, Nico Schenk
URN:urn:nbn:de:bvb:384-opus4-1040852
Frontdoor URL:https://opus.bibliothek.uni-augsburg.de/opus4/104085
ISBN:978-3-95977-105-4
ISSN:2190-6807
Parent Title (English):2nd Conference on Language, Data and Knowledge, LDK 2019, May 20–23, 2019, Leipzig, Germany
Publisher:Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Place of publication:Saarbrücken
Editor:Maria Eskevich, Gerard de Melo, Christian Fäth, John P. McCrae, Paul Buitelaar, Christian Chiarcos, Bettina Klimek, Milan Dojchinovski
Type:Conference Proceeding
Language:English
Year of first Publication:2019
Publishing Institution:Universität Augsburg
Release Date:2023/05/16
First Page:7:1
Last Page:7:14
Series:OASIcs ; 70
DOI:https://doi.org/10.4230/OASIcs.LDK.2019.7
Institutes:Philologisch-Historische Fakultät
Philologisch-Historische Fakultät / Angewandte Computerlinguistik
Philologisch-Historische Fakultät / Angewandte Computerlinguistik / Lehrstuhl für Angewandte Computerlinguistik (ACoLi)
Dewey Decimal Classification:4 Language / 40 Language / 400 Language
Licence:CC-BY 3.0: Creative Commons Attribution (with Print on Demand)