A Fast Text Similarity Measure For Large Document Collections Using Multi-reference Cosine And Genetic Algorithm
2018 · Hamid Mohammadi, Seyed Hossein Khasteh
Abstract
One of the important factors that make a search engine fast and accurate is a concise, duplicate-free index. To remove duplicate and near-duplicate documents from the index, a search engine needs a swift and reliable system for detecting duplicate and near-duplicate text documents. Traditional approaches to this problem, such as brute-force comparisons or simple hash-based algorithms, are not suitable: they do not scale and cannot detect near-duplicate documents effectively. In this paper, a new signature-based approach to text similarity detection is introduced which is fast, scalable, and reliable, and which needs less storage space. The proposed method is examined on popular text document datasets such as CiteseerX, Enron, and the Gold Set of Near-duplicate News Articles. The results are promising and comparable with the best cutting-edge algorithms in terms of accuracy and performance. The proposed method is based on the idea of using reference texts to genera
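The abstract is cut off before it finishes describing the method, so the following is only a minimal sketch of one plausible reading of "multi-reference cosine": each document is reduced to a short signature holding its cosine similarity to a few fixed reference texts, and documents are then compared through their signatures rather than their full text. The names `REFERENCES`, `term_vector`, `signature`, and `signature_distance`, the whitespace tokenizer, and the max-difference comparison are all assumptions made for illustration, not details taken from the paper. The genetic-algorithm component named in the title is not described in the text available here and is not sketched.

```python
import math
from collections import Counter

# Hypothetical reference texts; how the paper chooses its references
# is not stated in the abstract text shown above.
REFERENCES = [
    "the quick brown fox jumps over the lazy dog",
    "search engines index documents to answer queries quickly",
    "genetic algorithms evolve candidate solutions over generations",
]


def term_vector(text):
    """Bag-of-words term counts (lowercased, whitespace tokens)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def signature(text, references=REFERENCES):
    """One cosine value per reference text: a short, fixed-length signature."""
    vec = term_vector(text)
    return [cosine(vec, term_vector(ref)) for ref in references]


def signature_distance(sig_a, sig_b):
    """Largest per-reference difference (an assumed comparison rule)."""
    return max(abs(x - y) for x, y in zip(sig_a, sig_b))
```

Under these assumptions, a signature holds one number per reference text regardless of document length, which is the kind of compact representation the abstract's storage claim suggests. Two documents whose signatures differ by less than some chosen threshold would be flagged as near-duplicate candidates.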
Related papers
- Cos-mix: Cosine Similarity And Distance Fusion For Improved Information Retrieval (2024) 0.00
- Group Testing For Accurate And Efficient Range-based Near Neighbor Search For Plagiarism Detection (2023) 2.26
- A PSO Strategy Of Finding Relevant Web Documents Using A New Similarity Measure (2021) 2.26
- Automatic Construction Of Evaluation Sets And Evaluation Of Document Similarity Models In Large Scholarly Retrieval Systems (2016) 0.00
- Variational Deep Semantic Hashing For Text Documents (2017) 12.25
- Fast Search With Poor OCR (2019) 0.00
- Fast And Scalable Gene Embedding Search: A Comparative Study Of FAISS And Scann (2025) 2.26
- Advancing Similarity Search With Genai: A Retrieval Augmented Generation Approach (2024) 0.00