Fast Search With Poor OCR
2019 · Taivanbat Badamdorj, Adiel Ben-Shalom, Nachum Dershowitz, et al.
Abstract
The indexing and searching of historical documents have garnered attention in recent years due to massive digitization efforts of important collections worldwide. Pure textual search in these corpora is problematic because optical character recognition (OCR) is known to perform poorly on such historical material, which often suffers from poor preservation. We propose a novel text-based method for searching through noisy text. Our system represents words as vectors, projects queries and candidates obtained from the OCR into a common space, and ranks the candidates using a metric suited to nearest-neighbor search. We demonstrate the practicality of our method on typewritten German documents from the WWII era.
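The pipeline the abstract describes (words as vectors, a shared space for queries and OCR candidates, ranking by a nearest-neighbor-friendly metric) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how the vectors or the projection are obtained, so the sketch assumes plain character-bigram count vectors as a stand-in, uses the same embedding for queries and candidates as the "common space", and ranks by cosine similarity. The function names (`char_ngrams`, `embed`, `cosine`, `rank_candidates`) and the sample OCR words are hypothetical.

```python
import math
from collections import Counter


def char_ngrams(word, n=2):
    """Return the character n-grams of a word, padded with '#' boundary markers."""
    padded = f"#{word.lower()}#"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def embed(word, n=2):
    """Map a word to an L2-normalized sparse vector of n-gram counts.

    Assumption: this stands in for whatever word representation and
    projection the paper actually uses.
    """
    counts = Counter(char_ngrams(word, n))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {gram: c / norm for gram, c in counts.items()}


def cosine(u, v):
    """Cosine similarity of two unit-length sparse vectors (a dot product)."""
    return sum(weight * v.get(gram, 0.0) for gram, weight in u.items())


def rank_candidates(query, candidates, n=2):
    """Order OCR candidates from most to least similar to the query."""
    q = embed(query, n)
    scored = [(cosine(q, embed(c, n)), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored]


# Hypothetical noisy OCR output; the first two are corrupted forms of "Berlin".
ocr_words = ["Bcrlin", "Berlln", "Bericht", "Befehl", "Munchen"]
ranking = rank_candidates("Berlin", ocr_words)
print(ranking)
```

Because the vectors are unit length, cosine similarity and Euclidean distance give the same ordering, so the same representation could be handed to an off-the-shelf nearest-neighbor index instead of the linear scan used here.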
Related papers
- VectorSearch: Enhancing Document Retrieval With Semantic Embeddings And Optimized Search (2024)
- Fetch-a-set: A Large-scale OCR-free Benchmark For Historical Document Retrieval (2024)
- Pattern Spotting And Image Retrieval In Historical Documents Using Deep Hashing (2022)
- Semantic Vector Encoding And Similarity Search Using Fulltext Search Engines (2017)
- A Fast Text Similarity Measure For Large Document Collections Using Multi-reference Cosine And Genetic Algorithm (2018)
- CFIR: Fast And Effective Long-text To Image Retrieval For Large Corpora (2024)
- ACORN: Performant And Predicate-agnostic Search Over Vector Embeddings And Structured Data (2024)
- Thinking Fast And Slow: Efficient Text-to-visual Retrieval With Transformers (2021)