Frost: a platform for benchmarking and exploring data matching results

  • “Bad” data has a direct impact on 88% of companies, with the average company losing 12% of its revenue due to it. Duplicates – multiple but different representations of the same real-world entities – are among the main reasons for poor data quality, so finding and configuring the right deduplication solution is essential. Existing data matching benchmarks focus on the quality of matching results and neglect other important factors, such as business requirements. Additionally, they often do not support the exploration of data matching results. To address this gap between the mere counting of record pairs vs. a comprehensive means to evaluate data matching solutions, we present the Frost platform. It combines existing benchmarks, established quality metrics, cost and effort metrics, and exploration techniques, making it the first platform to allow systematic exploration to understand matching results. Frost is implemented and published in the open-source application Snowman, which“Bad” data has a direct impact on 88% of companies, with the average company losing 12% of its revenue due to it. Duplicates – multiple but different representations of the same real-world entities – are among the main reasons for poor data quality, so finding and configuring the right deduplication solution is essential. Existing data matching benchmarks focus on the quality of matching results and neglect other important factors, such as business requirements. Additionally, they often do not support the exploration of data matching results. To address this gap between the mere counting of record pairs vs. a comprehensive means to evaluate data matching solutions, we present the Frost platform. It combines existing benchmarks, established quality metrics, cost and effort metrics, and exploration techniques, making it the first platform to allow systematic exploration to understand matching results. Frost is implemented and published in the open-source application Snowman, which includes the visual exploration of matching results, as shown in Figure 1.show moreshow less

Download full text files

Export metadata

Statistics

Number of document requests

Additional Services

Share in Twitter Search Google Scholar
Metadaten
Author:Martin Graf, Lukas Laskowski, Florian Papsdorf, Florian Sold, Roland Gremmelspacher, Felix Naumann, Fabian PanseGND
URN:urn:nbn:de:bvb:384-opus4-1172627
Frontdoor URLhttps://opus.bibliothek.uni-augsburg.de/opus4/117262
ISSN:2150-8097OPAC
Parent Title (English):Proceedings of the VLDB Endowment
Publisher:Association for Computing Machinery (ACM)
Place of publication:New York, NY
Type:Article
Language:English
Year of first Publication:2022
Publishing Institution:Universität Augsburg
Release Date:2024/12/03
Volume:15
Issue:12
First Page:3292
Last Page:3305
DOI:https://doi.org/10.14778/3554821.3554823
Institutes:Fakultät für Angewandte Informatik
Fakultät für Angewandte Informatik / Institut für Informatik
Fakultät für Angewandte Informatik / Institut für Informatik / Lehrstuhl für Data Engineering
Dewey Decimal Classification:0 Informatik, Informationswissenschaft, allgemeine Werke / 00 Informatik, Wissen, Systeme / 004 Datenverarbeitung; Informatik
Licence (German):CC-BY-NC-ND 4.0: Creative Commons: Namensnennung - Nicht kommerziell - Keine Bearbeitung (mit Print on Demand)