Generating realistic test datasets for duplicate detection at scale using historical voter data
The detection of duplicates is an essential task in data cleaning
and integration and has steadily gained importance, especially for
researchers and practitioners who need to process and integrate
large volumes of potentially unclean data on a daily basis. To
evaluate the quality and performance of duplicate detection algorithms, labeled test data are required that provide information
on the contained duplicates. Current approaches for generating
test data, however, are either not scalable (and therefore limited
to small datasets) or not able to generate realistic data values
and errors, especially outdated values. In this paper, we propose
a scheme for generating test datasets that addresses both these
issues and present a test dataset generated with it. Our approach
relies on using historical data from the North Carolina voter
register, which (1) is realistic as it contains actual voter data and
(2) facilitates generating realistic duplicates through the fact that
current data values were collected at every election through manually filled out applications. The generated test dataset comprises
more than 120 million records with up to 90 attribute values
each. To the best of our knowledge, we are the first to provide realistic test data for duplicate detection at this scale.…
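
The idea of deriving labeled duplicates from historical snapshots can be illustrated with a minimal sketch. It is not the generation scheme of the paper; it assumes that every snapshot record carries a stable person identifier (the hypothetical field `voter_id`, which the abstract does not name) and that differing versions of the same person's record across snapshots count as duplicates of each other. All function names and sample values are invented for illustration. The sketch also shows one common way such labels might be used, namely pair-based precision and recall.

```python
from collections import defaultdict
from itertools import combinations


def build_clusters(snapshots, id_field="voter_id"):
    """Group distinct record versions of the same person across snapshots.

    `id_field` is an assumed stable identifier, not a name from the paper.
    """
    clusters = defaultdict(list)
    for snapshot in snapshots:
        for record in snapshot:
            key = record[id_field]
            # Compare without the identifier so only the data values matter.
            values = {k: v for k, v in record.items() if k != id_field}
            # Keep a version only if its values differ from those already seen.
            if values not in clusters[key]:
                clusters[key].append(values)
    return dict(clusters)


def gold_pairs(clusters):
    """Enumerate labeled duplicate pairs as (id, i, j) index triples."""
    pairs = set()
    for key, versions in clusters.items():
        for i, j in combinations(range(len(versions)), 2):
            pairs.add((key, i, j))
    return pairs


def precision_recall(predicted, gold):
    """Pair-based precision and recall of predicted duplicate pairs."""
    true_pos = len(predicted & gold)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    return precision, recall


# Three made-up snapshots; person A1 changes city and later name spelling.
snapshots = [
    [{"voter_id": "A1", "last_name": "Smith", "city": "Raleigh"},
     {"voter_id": "B2", "last_name": "Jones", "city": "Durham"}],
    [{"voter_id": "A1", "last_name": "Smith", "city": "Cary"},
     {"voter_id": "B2", "last_name": "Jones", "city": "Durham"}],
    [{"voter_id": "A1", "last_name": "Smyth", "city": "Cary"}],
]

clusters = build_clusters(snapshots)
gold = gold_pairs(clusters)

# A hypothetical detector that finds two of the three true pairs.
predicted = {("A1", 0, 1), ("A1", 1, 2)}
precision, recall = precision_recall(predicted, gold)
```

In this sketch, unchanged records (person B2) collapse into a single version and contribute no duplicate pairs, whereas outdated values (the old city of person A1) and spelling variants yield labeled pairs without any synthetic error injection.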