How To Reduce The Search Space Of Entity Resolution: With Blocking Or Nearest Neighbor Search?
2022 · George Papadakis, Marco Fisichella, Franziska Schoger, et al.
Abstract
Entity Resolution suffers from quadratic time complexity. To increase its time efficiency, three kinds of filtering techniques are typically used for restricting its search space: (i) blocking workflows, which group together entity profiles with identical or similar signatures, (ii) string similarity join algorithms, which quickly detect entities more similar than a threshold, and (iii) nearest-neighbor methods, which convert every entity profile into a vector and quickly detect the closest entities according to the specified distance function. Numerous methods have been proposed for each type, but the literature lacks a comparative analysis of their relative performance. As we show in this work, this is a non-trivial task, due to the significant impact of configuration parameters on the performance of each filtering technique. We perform the first systematic experimental study that investigates the relative performance of the main methods per type over 10 real-world datasets. For each
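The three kinds of filtering named in the abstract can be illustrated with a minimal sketch. It assumes lower-cased whitespace tokens as signatures and Jaccard similarity as the similarity function; the function names and the toy profiles are hypothetical and do not come from the paper. The join and nearest-neighbor functions below are brute force for clarity, whereas the techniques the paper evaluates rely on indexes to avoid comparing all pairs.

```python
from collections import defaultdict
from itertools import combinations


def tokens(profile: str) -> frozenset:
    """Lower-cased whitespace tokens act as the profile's signatures (assumption)."""
    return frozenset(profile.lower().split())


def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity of two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def token_blocking(profiles: dict) -> set:
    """(i) Blocking: profiles that share at least one token become candidates."""
    blocks = defaultdict(set)
    for pid, text in profiles.items():
        for tok in tokens(text):
            blocks[tok].add(pid)
    candidates = set()
    for ids in blocks.values():
        candidates.update(combinations(sorted(ids), 2))
    return candidates


def similarity_join(profiles: dict, threshold: float) -> set:
    """(ii) Naive threshold join: keep pairs with Jaccard >= threshold."""
    sigs = {pid: tokens(text) for pid, text in profiles.items()}
    return {
        (p, q)
        for p, q in combinations(sorted(sigs), 2)
        if jaccard(sigs[p], sigs[q]) >= threshold
    }


def nearest_neighbors(profiles: dict, k: int) -> set:
    """(iii) Brute-force k-nearest neighbors under Jaccard similarity."""
    sigs = {pid: tokens(text) for pid, text in profiles.items()}
    candidates = set()
    for p in sigs:
        # Rank all other profiles by descending similarity, ties broken by id.
        ranked = sorted(
            (q for q in sigs if q != p),
            key=lambda q: (-jaccard(sigs[p], sigs[q]), q),
        )
        for q in ranked[:k]:
            candidates.add(tuple(sorted((p, q))))
    return candidates


# Hypothetical toy profiles: "a"/"b" and "c"/"d" describe the same products.
profiles = {
    "a": "Apple iPhone 13 128GB",
    "b": "iPhone 13 Apple 128GB black",
    "c": "Samsung Galaxy S21",
    "d": "Galaxy S21 by Samsung",
}

all_pairs = set(combinations(sorted(profiles), 2))  # quadratic search space
blocked = token_blocking(profiles)
joined = similarity_join(profiles, threshold=0.7)
knn = nearest_neighbors(profiles, k=1)
print(len(all_pairs), sorted(blocked), sorted(joined), sorted(knn))
```

On this toy input, each filter reduces the six possible pairs to the same two candidates. On real data the three approaches generally return different candidate sets, and the result depends heavily on parameters such as the threshold or `k`.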
Related papers
- Dimensionality-reduction Techniques For Approximate Nearest Neighbor Search: A Survey And Evaluation (2024) 0.00
- DeepER -- Deep Entity Resolution (2017) 16.53
- Experimental Analysis Of Locality Sensitive Hashing Techniques For High-dimensional Approximate Nearest Neighbor Searches (2020) 6.34
- Associative Memories To Accelerate Approximate Nearest Neighbor Search (2016) 6.34
- A Scalable Solution To The Nearest Neighbor Search Problem Through Local-search Methods On Neighbor Graphs (2017) 3.58
- Worst-case Performance Of Popular Approximate Nearest Neighbor Search Implementations: Guarantees And Limitations (2023) 5.84
- A Framework For Similarity Search With Space-time Tradeoffs Using Locality-sensitive Filtering (2016) 8.35
- An Algorithm For Reducing Approximate Nearest Neighbor To Approximate Near Neighbor With O(log n) Query Time (2018) 3.58