
Can BERT dig it? Named entity recognition for information retrieval in the archaeology domain

The amount of archaeological literature is growing rapidly. Until recently, these data were only accessible through metadata search. We implemented a text retrieval engine for a large archaeological text collection (~658 million words). In archaeological IR, domain-specific entities such as locations, time periods and artefacts play a central role. This motivated the development of a named entity recognition (NER) model to annotate the full collection with archaeological named entities. In this article, we present ArcheoBERTje, a BERT (Bidirectional Encoder Representations from Transformers) model pre-trained on Dutch archaeological texts. We compare the model's quality and output on an NER task to a generic multilingual model and a generic Dutch model. We also investigate ensemble methods for combining multiple BERT models, and combining the best BERT model with a domain thesaurus using conditional random fields. We find that ArcheoBERTje significantly outperforms both the multilingual and the Dutch model, with a smaller standard deviation between runs, reaching an average F1 score of 0.735. The model also outperforms ensemble methods combining the three models. Combining ArcheoBERTje predictions and explicit domain knowledge from the thesaurus did not increase the F1 score. We quantitatively and qualitatively analyse the differences between the vocabulary and output of the BERT models on the full collection and provide some valuable insights into the effect of fine-tuning for specific domains. Our results indicate that for a highly specific text domain such as archaeology, further pre-training on domain-specific data increases the model's quality on NER by a much larger margin than shown for other domains in the literature, and that domain-specific pre-training makes the addition of domain knowledge from a thesaurus unnecessary.
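The record does not include the paper's training setup, but the modelling step the abstract describes (token-level NER on top of a Dutch BERT) can be sketched with the HuggingFace transformers library. This is a minimal sketch, not the authors' implementation: "GroNLP/bert-base-dutch-cased" is the public generic Dutch BERTje checkpoint the paper compares against, while the entity label set and example sentence below are illustrative assumptions, not the paper's actual annotation scheme.

# Minimal sketch of setting up a Dutch BERT for archaeological NER,
# assuming the HuggingFace transformers library. The label set below
# (locations, time periods, artefacts) is an illustrative assumption.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O", "B-LOC", "I-LOC", "B-TIM", "I-TIM", "B-ART", "I-ART"]

tokenizer = AutoTokenizer.from_pretrained("GroNLP/bert-base-dutch-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "GroNLP/bert-base-dutch-cased",       # generic Dutch BERTje baseline
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)

# Forward pass on an example sentence (Dutch, archaeology-flavoured).
# Note: the token-classification head is randomly initialised here, so
# predictions are meaningless until the model is fine-tuned on
# annotated data.
sentence = "De opgraving bij Nijmegen leverde Romeins aardewerk op."
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

predictions = logits.argmax(dim=-1)[0]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, pred in zip(tokens, predictions):
    print(token, model.config.id2label[int(pred)])

Fine-tuning would then proceed with a standard token-classification training loop (e.g. transformers' Trainer) over annotated archaeological reports; the further pre-training step that distinguishes ArcheoBERTje corresponds to continued masked-language-model training on the raw domain text before this NER stage.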

Metadata
Author:Alex Brandsen, Suzan Verberne, Karsten Lambers, Milco Wansleeben
URN:urn:nbn:de:bvb:384-opus4-1242035
Frontdoor URL:https://opus.bibliothek.uni-augsburg.de/opus4/124203
ISSN:1556-4673
Parent Title (English):Journal on Computing and Cultural Heritage
Publisher:Association for Computing Machinery (ACM)
Place of publication:New York, NY
Type:Article
Language:English
Year of first Publication:2022
Publishing Institution:Universität Augsburg
Release Date:2025/08/01
Volume:15
Issue:3
First Page:51
DOI:https://doi.org/10.1145/3497842
Institutes:Philologisch-Historische Fakultät
Philologisch-Historische Fakultät / Digital Humanities
Philologisch-Historische Fakultät / Digital Humanities / Lehrstuhl für Image Processing and Visualization in Digital Humanities
Dewey Decimal Classification:0 Computer science, information & general works / 00 Computer science, knowledge & systems / 004 Data processing; computer science
Licence:CC-BY 4.0: Creative Commons: Attribution (with Print on Demand)