Duplicate detection in probabilistic relational databases

Many applications such as OCR systems or sensor networks have to deal with uncertain information. One trend in current database research is to accept uncertainty as a ‘fact of life’ and hence to incorporate it into such applications’ results by producing probabilistic output data. To meaningfully integrate probabilistic data from multiple heterogeneous sources or to clean a single probabilistic database, duplicate database entities need to be identified. Duplicate detection has been extensively studied in the past, but conventional duplicate detection approaches are designed for matching database entities that are described by certain values and certainly belong to the considered universe of discourse. In probabilistic databases, however, each database entity can have several alternative values per attribute and its membership to the considered universe can be questionable.
As a consequence, conventional duplicate detection approaches cannot be used for probabilistic databases without adaptation. In this thesis, we consider the challenge of duplicate detection in probabilistic relational databases. The central research aspect of this thesis is to develop a generic approach that enables detection of probabilistic duplicates in highly diverse application domains by allowing an adjustment to individual needs. The benefit of using a probabilistic database for modeling deduplication results is that we do not necessarily need to resolve uncertainty on duplicate decisions, but instead can incorporate emerging decision uncertainty into the output database. Nevertheless, many commonly used probabilistic representation systems, such as tuple-independent probabilistic databases, are not powerful enough to model uncertainty on duplicate decisions. For that reason, we distinguish between deterministic duplicate detection approaches that completely resolve uncertainty on duplicate decisions by producing a single duplicate clustering as a result and indeterministic duplicate detection approaches that provide a set of possible duplicate clusterings as output. We identify two meaningful strategies for adapting conventional duplicate detection approaches to the uncertainty that is inherent in probabilistic data. According to these strategies, we propose two generic approaches to deterministic duplicate detection in probabilistic databases and present several techniques for reducing their computational complexity. In this context, we develop a similarity measure for discrete probability distributions that can be used as a fast alternative to the Earth Mover’s Distance. 
Additionally, we formalize the concept of indeterministic duplicate detection, propose approaches for representing an indeterministic deduplication result within a probabilistic database, discuss possible ways to meaningfully process indeterministic deduplication results, and present a clustering approach that can be used to efficiently compute a set of possible clusterings. Moreover, we discuss the meaning of detection quality in the presence of uncertain duplicate decisions, present measures for rating this meaning by numbers, and propose methods to compute these measures in an efficient way. Finally, we present a prototypical implementation and the results of a set of experiments we conducted on several test databases in order to prove the concepts of our proposed approaches.
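The fast alternative measure developed in the thesis is not specified in this abstract, but the baseline it is compared against, the Earth Mover's Distance, can be illustrated concretely. For two discrete probability distributions over the same ordered, unit-spaced support (the one-dimensional case), the EMD reduces to the L1 distance between their cumulative distribution functions. The function name `emd_1d` below is a hypothetical sketch of this special case, not code from the thesis:

```python
def emd_1d(p, q):
    """Earth Mover's Distance between two discrete distributions
    given as probability vectors over the same ordered, unit-spaced
    support: the L1 distance between their CDFs."""
    cum = 0.0    # running difference of the two CDFs
    total = 0.0  # accumulated transport cost
    for pi, qi in zip(p, q):
        cum += pi - qi
        total += abs(cum)
    return total

# Moving all mass two positions to the right costs 2:
print(emd_1d([1.0, 0.0, 0.0], [0.0, 0.0, 1.0]))  # 2.0
```

In higher dimensions or on non-uniform supports, computing the EMD requires solving a transportation problem, which is what motivates faster substitute measures such as the one proposed in the thesis.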

Metadata
Author: Fabian Panse
Frontdoor URL: https://opus.bibliothek.uni-augsburg.de/opus4/117269
URL: https://nbn-resolving.org/urn:nbn:de:gbv:18-74307
Publisher: Universität Hamburg
Place of publication: Hamburg
Type: Book
Language: English
Year of first publication: 2014
Release date: 2024/12/04
Number of pages: 669
Note: Dissertation, Universität Hamburg, 2014
Institutes: Fakultät für Angewandte Informatik
Fakultät für Angewandte Informatik / Institut für Informatik
Fakultät für Angewandte Informatik / Institut für Informatik / Lehrstuhl für Data Engineering