On the database lookup problem of approximate matching

Breitinger, Frank; Baier, Harald; White, Douglas

doi:10.1016/j.diin.2014.03.001

Investigating seized devices in digital forensics is becoming increasingly difficult because of the growing amount of data. A common procedure therefore uses automated file identification, which reduces the amount of data an investigator has to examine by hand. Besides identifying exact duplicates, which is mostly solved using cryptographic hash functions, it is also helpful to detect similar data by applying approximate matching. Let x denote the number of digests in a database; the lookup for a single similarity digest then has a complexity of O(x). In other words, the digest has to be compared against all digests in the database. In contrast, cryptographic hash values are stored in binary trees or hash tables, and hence the lookup complexity of a single digest is O(log2(x)) or O(1), respectively. In this paper we present and evaluate a concept to extend existing approximate matching algorithms, which reduces the lookup complexity from O(x) to O(1). Instead of using multiple small Bloom filters (which is the common procedure), we demonstrate that a single, huge Bloom filter performs far better. Our evaluation demonstrates that current approximate matching algorithms are too slow (e.g., over 21 min to compare 4457 digests of a common file corpus against each other), while the improved version solves this challenge within seconds. Studying the precision and recall rates shows that our approach works as reliably as the original implementations. This benefit comes at the cost of accuracy: the comparison is now a file-against-set comparison, and thus it is not possible to see which file in the database is matched.
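
The file-against-set idea from the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `BloomFilter`, `features`, `build_database`, and `match_score`, the filter size, the number of hash functions, and the fixed-size chunking used to derive features are all assumptions made for illustration, and real approximate matching algorithms select features differently. What the sketch shows is that all features of all known files go into one filter, so a query costs a constant number of bit lookups per feature regardless of how many files the database holds, and that the result says only whether the query resembles something in the set, not which file.

```python
import hashlib


class BloomFilter:
    """A single large Bloom filter holding the features of every known file."""

    def __init__(self, num_bits=1 << 20, num_hashes=5):
        # SHA-256 yields 32 bytes, i.e. at most eight 4-byte slices.
        assert 1 <= num_hashes <= 8
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, feature):
        # Derive the bit positions from slices of one SHA-256 digest.
        digest = hashlib.sha256(feature).digest()
        for i in range(self.num_hashes):
            chunk = digest[4 * i:4 * i + 4]
            yield int.from_bytes(chunk, "big") % self.num_bits

    def add(self, feature):
        for pos in self._positions(feature):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, feature):
        return all(self.bits[pos >> 3] & (1 << (pos & 7))
                   for pos in self._positions(feature))


def features(data, size=64):
    # Assumption: fixed-size, non-overlapping chunks stand in for the
    # feature selection of a real approximate matching algorithm.
    return [data[i:i + size] for i in range(0, len(data) - size + 1, size)]


def build_database(files):
    # Insert the features of all files into ONE filter (the "set").
    bloom = BloomFilter()
    for data in files:
        for feature in features(data):
            bloom.add(feature)
    return bloom


def match_score(bloom, data):
    # Fraction of the query's features found in the set. The cost depends
    # on the query size only, not on the number of files in the database.
    feats = features(data)
    if not feats:
        return 0.0
    return sum(1 for f in feats if f in bloom) / len(feats)
```

A caller would build the filter once with `build_database(known_files)` and then flag any query whose `match_score` exceeds a chosen threshold. Because the filter stores no file identifiers, a high score cannot be traced back to a specific database entry, which is the trade-off the abstract describes.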

Author:	Frank Breitinger, Harald Baier, Douglas White
URN:	urn:nbn:de:bvb:384-opus4-1176092
Frontdoor URL:	https://opus.bibliothek.uni-augsburg.de/opus4/117609
ISSN:	1742-2876
Parent Title (English):	Digital Investigation
Publisher:	Elsevier BV
Type:	Article
Language:	English
Year of first Publication:	2014
Publishing Institution:	Universität Augsburg
Release Date:	2024/12/16
Volume:	11
Issue:	Supplement 1
First Page:	S1
Last Page:	S9
DOI:	https://doi.org/10.1016/j.diin.2014.03.001
Institutes:	Fakultät für Angewandte Informatik
	Fakultät für Angewandte Informatik / Institut für Informatik
	Fakultät für Angewandte Informatik / Institut für Informatik / Lehrstuhl für Cybersicherheit
Dewey Decimal Classification:	0 Informatik, Informationswissenschaft, allgemeine Werke / 00 Informatik, Wissen, Systeme / 004 Datenverarbeitung; Informatik
Licence (German):	CC-BY-NC-ND 3.0: Creative Commons - Namensnennung - Nicht kommerziell - Keine Bearbeitung

Open Access
