Identifying document similarity using a fast estimation of the Levenshtein Distance based on compression and signatures

  • Identifying document similarity has many applications, e.g., source code analysis or plagiarism detection. However, identifying similarities is not trivial and can be computationally expensive. For instance, the Levenshtein Distance is a common metric for the similarity between two documents, but its quadratic runtime makes it impractical for large documents, where "large" starts at a few hundred kilobytes. In this paper, we present a novel concept for estimating the Levenshtein Distance: the algorithm first compresses documents to signatures (similar to hash values) using a user-defined compression ratio. Signatures can then be compared against each other (some constraints apply), and the outcome is the estimated Levenshtein Distance. Our evaluation shows promising results in terms of runtime efficiency and accuracy. In addition, we introduce a significance score that allows examiners to set a threshold and identify related documents.
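
To make the idea in the abstract concrete, the following is a minimal sketch and not the authors' implementation. `levenshtein` is the standard dynamic program, whose nested loops show where the quadratic runtime comes from. `signature` and `estimate_distance` are hypothetical helpers that assume a simple scheme: hash every sliding window with CRC32 and emit one signature character whenever the hash is divisible by the compression ratio. The names `ratio`, `window`, and `ALPHABET`, and the choice of CRC32, are assumptions made for illustration only.

```python
import string
import zlib

# Hypothetical signature alphabet; any fixed set of symbols would do.
ALPHABET = string.ascii_letters + string.digits


def levenshtein(a: str, b: str) -> int:
    """Exact Levenshtein Distance; runtime is O(len(a) * len(b))."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def signature(text: str, ratio: int = 8, window: int = 4) -> str:
    """Assumed compression step: keep roughly one symbol per `ratio` positions."""
    symbols = []
    for i in range(len(text) - window + 1):
        h = zlib.crc32(text[i:i + window].encode("utf-8"))
        if h % ratio == 0:
            # Derive the emitted symbol from the same window hash.
            symbols.append(ALPHABET[(h // ratio) % len(ALPHABET)])
    return "".join(symbols)


def estimate_distance(a: str, b: str, ratio: int = 8, window: int = 4) -> int:
    """Assumed estimate: distance between signatures, scaled by the ratio."""
    return levenshtein(signature(a, ratio, window),
                       signature(b, ratio, window)) * ratio


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))  # 3
    doc = "the quick brown fox jumps over the lazy dog " * 50
    edited = doc.replace("lazy", "sleepy", 5)
    print(levenshtein(doc, edited), estimate_distance(doc, edited))
```

Because the signatures are roughly `ratio` times shorter than the documents, the quadratic comparison runs on far fewer symbols. How close the scaled result is to the true distance depends on the data and on the parameters, so this sketch makes no accuracy claim.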

Metadata
Author:Peter Coates, Frank Breitinger
URN:urn:nbn:de:bvb:384-opus4-1177118
Frontdoor URL:https://opus.bibliothek.uni-augsburg.de/opus4/117711
URL:https://dfrws.org/presentation/identifying-document-similarity-using-a-fast-estimation-of-the-levenshtein-distance-based-on-compression-and-signatures/
Parent Title (English):Proceedings of the Digital Forensics Research Conference Europe (DFRWS EU) 2022, March 29 - April 1, 2022, Oxford, UK, hybrid
Publisher:arXiv
Type:Conference Proceeding
Language:English
Date of Publication (online):2024/12/18
Year of first Publication:2022
Publishing Institution:Universität Augsburg
Release Date:2024/12/18
First Page:arXiv:2307.11496
DOI:https://doi.org/10.48550/arXiv.2307.11496
Institutes:Fakultät für Angewandte Informatik
Fakultät für Angewandte Informatik / Institut für Informatik
Fakultät für Angewandte Informatik / Institut für Informatik / Lehrstuhl für Cybersicherheit
Dewey Decimal Classification:0 Computer science, information, general works / 00 Computer science, knowledge, systems / 004 Data processing; computer science
Licence (German):Deutsches Urheberrecht (German copyright law)