Generating realistic test datasets for duplicate detection at scale using historical voter data

The detection of duplicates is an essential task in data cleaning and integration and has steadily gained importance, especially for researchers and practitioners who need to process and integrate large volumes of potentially unclean data on a daily basis. To evaluate the quality and performance of duplicate detection algorithms, labeled test data are required that provide information on the contained duplicates. Current approaches for generating test data, however, are either not scalable (and therefore limited to small datasets) or not able to generate realistic data values and errors, especially outdated values. In this paper, we propose a scheme for generating test datasets that addresses both of these issues and present a test dataset generated with it. Our approach relies on historical data from the North Carolina voter register, which (1) is realistic, as it contains actual voter data, and (2) facilitates generating realistic duplicates because current data values were collected at every election through manually filled-out applications. The generated test dataset comprises more than 120 million records with up to 90 attribute values each. To the best of our knowledge, we are the first to provide realistic test data for duplicate detection at this scale.
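The idea in the abstract, that repeated collection of the same person's data yields naturally occurring duplicates, can be illustrated with a minimal sketch. It is not the authors' implementation: the field name `voter_id`, the sample records, and both function names are hypothetical, and the sketch assumes that every snapshot carries a stable per-person identifier that can serve as the duplicate label.

```python
from collections import defaultdict


def build_duplicate_clusters(snapshots, id_field="voter_id"):
    """Group records from several snapshots by a stable identifier.

    `id_field` is a hypothetical name, not taken from the paper.
    Records of one person that differ in at least one value are kept
    as separate cluster members; exact repeats are dropped.
    """
    clusters = defaultdict(list)
    seen = set()
    for snapshot in snapshots:
        for record in snapshot:
            # Fingerprint the record by all of its values except the identifier.
            values = tuple(sorted(
                (k, v) for k, v in record.items() if k != id_field))
            key = (record[id_field], values)
            if key in seen:
                continue  # identical record already present in an earlier snapshot
            seen.add(key)
            clusters[record[id_field]].append(record)
    return dict(clusters)


def count_duplicate_pairs(clusters):
    """Number of labeled duplicate pairs implied by the clusters."""
    return sum(len(c) * (len(c) - 1) // 2 for c in clusters.values())


# Invented sample data: two snapshots taken at different points in time.
snapshot_a = [
    {"voter_id": "A1", "last_name": "Smith", "city": "Raleigh"},
    {"voter_id": "B2", "last_name": "Jones", "city": "Durham"},
]
snapshot_b = [
    {"voter_id": "A1", "last_name": "Smith", "city": "Cary"},    # outdated value in snapshot_a
    {"voter_id": "B2", "last_name": "Jones", "city": "Durham"},  # unchanged, dropped as exact repeat
]

clusters = build_duplicate_clusters([snapshot_a, snapshot_b])
pairs = count_duplicate_pairs(clusters)
```

In this toy input, the person `A1` ends up with two differing records (one labeled duplicate pair), while `B2` contributes a single record. Dropping exact repeats is a design choice of the sketch; whether and how the published dataset treats them is not stated in the abstract.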

Metadata
Author:Fabian Panse, André Düjon, Wolfram Wingerath, Benjamin Wollmer
URN:urn:nbn:de:bvb:384-opus4-1172654
Frontdoor URL:https://opus.bibliothek.uni-augsburg.de/opus4/117265
ISBN:978-3-89318-084-4
ISSN:2367-2005
Parent Title (English):EDBT 2021, 24th International Conference on Extending Database Technology, Nicosia, Cyprus, March 23-26, proceedings
Publisher:OpenProceedings
Place of publication:Konstanz
Editor:Yannis Velegrakis, Demetris Zeinalipour, Panos K. Chrysanthis, Francesco Guerra
Type:Conference Proceeding
Language:English
Year of first Publication:2021
Publishing Institution:Universität Augsburg
Release Date:2024/12/03
First Page:570
Last Page:581
DOI:https://doi.org/10.5441/002/edbt.2021.67
Institutes:Fakultät für Angewandte Informatik
Fakultät für Angewandte Informatik / Institut für Informatik
Fakultät für Angewandte Informatik / Institut für Informatik / Lehrstuhl für Data Engineering
Dewey Decimal Classification:0 Computer science, information and general works / 00 Computer science, knowledge and systems / 004 Data processing; computer science
Licence:CC-BY-NC-ND 4.0: Creative Commons: Attribution - NonCommercial - NoDerivatives (with Print on Demand)