The ACoLi CoNLL libraries: beyond tab-separated values
- We introduce the ACoLi CoNLL libraries, a set of Java archives to facilitate advanced manipulations of corpora annotated in TSV formats, including all members of the CoNLL format family. In particular, we provide means for (i) rule-based re-write operations, (ii) visualization and manual annotation, (iii) merging CoNLL files, and (iv) data base support. The ACoLi CoNLL libraries provide command-line interface to these functionalities. The following aspects are technologically innovative and exceed beyond the state of the art: We support every OWPL (one word per line) corpus format with tab-separated columns, whereas most existing tools are specific to one particular CoNLL dialect. We employ established W3C standards for rule-based graph rewriting operations on CoNLL sentences. We provide means for the heuristic, but fully automated merging of CoNLL annotations of the same textual content, in particular for resolving conflicting tokenizations. We demonstrate the usefulness andWe introduce the ACoLi CoNLL libraries, a set of Java archives to facilitate advanced manipulations of corpora annotated in TSV formats, including all members of the CoNLL format family. In particular, we provide means for (i) rule-based re-write operations, (ii) visualization and manual annotation, (iii) merging CoNLL files, and (iv) data base support. The ACoLi CoNLL libraries provide command-line interface to these functionalities. The following aspects are technologically innovative and exceed beyond the state of the art: We support every OWPL (one word per line) corpus format with tab-separated columns, whereas most existing tools are specific to one particular CoNLL dialect. We employ established W3C standards for rule-based graph rewriting operations on CoNLL sentences. We provide means for the heuristic, but fully automated merging of CoNLL annotations of the same textual content, in particular for resolving conflicting tokenizations. We demonstrate the usefulness and practicability of our proposed CoNLL libraries on well-established data sets of the Universal Dependency corpus and the Penn Treebank.…
Author: | Christian ChiarcosORCiDGND, Nico Schenk |
---|---|
URN: | urn:nbn:de:bvb:384-opus4-1040959 |
Frontdoor URL | https://opus.bibliothek.uni-augsburg.de/opus4/104095 |
URL: | https://www.aclweb.org/anthology/L18-1090 |
ISBN: | 979-10-95546-00-9OPAC |
Parent Title (English): | Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC-2018), May 7–12, 2018, Miyazaki, Japan |
Publisher: | European Language Resources Association |
Place of publication: | Paris |
Editor: | Nicoletta Calzolari, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Koiti Hasida, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis, Takenobu Tokunaga |
Type: | Conference Proceeding |
Language: | English |
Year of first Publication: | 2018 |
Publishing Institution: | Universität Augsburg |
Release Date: | 2023/05/15 |
First Page: | 571 |
Last Page: | 576 |
Institutes: | Philologisch-Historische Fakultät |
Philologisch-Historische Fakultät / Angewandte Computerlinguistik | |
Philologisch-Historische Fakultät / Angewandte Computerlinguistik / Lehrstuhl für Angewandte Computerlinguistik (ACoLi) | |
Dewey Decimal Classification: | 4 Sprache / 40 Sprache / 400 Sprache |
Licence (German): | CC-BY-NC 4.0: Creative Commons: Namensnennung - Nicht kommerziell (mit Print on Demand) |