Optimizing Legal Document Retrieval In Vietnamese With Semi-hard Negative Mining
2025 · Van-Hoang Le, Duc-Vu Nguyen, Kiet Van Nguyen, et al.
Abstract
Large Language Models (LLMs) face significant challenges in specialized domains like law, where precision and domain-specific knowledge are critical. This paper presents a streamlined two-stage framework consisting of Retrieval and Re-ranking to enhance legal document retrieval efficiency and accuracy. Our approach employs a fine-tuned Bi-Encoder for rapid candidate retrieval, followed by a Cross-Encoder for precise re-ranking, both optimized through strategic negative example mining. Key innovations include the introduction of the Exist@m metric to evaluate retrieval effectiveness and the use of semi-hard negatives to mitigate training bias, which significantly improved re-ranking performance. Evaluated on the SoICT Hackathon 2024 for Legal Document Retrieval, our team, 4Huiter, achieved a top-three position. While top-performing teams employed ensemble models and iterative self-training on large bge-m3 architectures, our lightweight, single-pass approach offered a competitive alterna
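The abstract names two ideas that a small sketch can make concrete: the Exist@m metric and semi-hard negative selection. The abstract does not define either one, so the code below rests on stated assumptions rather than the authors' implementation. It assumes Exist@m is the fraction of queries for which at least one relevant document appears among the top m retrieved candidates, and that "semi-hard" negatives are non-relevant documents taken from just below the very top of the Bi-Encoder ranking. The function names and the `skip_top` and `num_negatives` parameters are hypothetical.

```python
from typing import Dict, List, Sequence, Set


def exist_at_m(
    ranked: Dict[str, Sequence[str]],
    relevant: Dict[str, Set[str]],
    m: int,
) -> float:
    """Assumed definition: share of queries with >= 1 relevant doc in the top m."""
    if not ranked:
        return 0.0
    hits = 0
    for query_id, doc_ids in ranked.items():
        gold = relevant.get(query_id, set())
        # A query counts as a hit if any of its top-m candidates is relevant.
        if any(doc_id in gold for doc_id in doc_ids[:m]):
            hits += 1
    return hits / len(ranked)


def semi_hard_negatives(
    ranked_docs: Sequence[str],
    positives: Set[str],
    skip_top: int,
    num_negatives: int,
) -> List[str]:
    """Assumed selection rule: drop the `skip_top` highest-ranked non-relevant
    documents (the hardest ones, which may be unlabeled positives) and keep
    the next `num_negatives` as training negatives."""
    non_relevant = [d for d in ranked_docs if d not in positives]
    return non_relevant[skip_top : skip_top + num_negatives]


# Toy rankings, as a first-stage retriever might return them (illustrative data).
ranked = {
    "q1": ["d3", "d1", "d7", "d2", "d9"],
    "q2": ["d4", "d8", "d5", "d6", "d0"],
}
relevant = {"q1": {"d1"}, "q2": {"d6"}}

coverage_at_2 = exist_at_m(ranked, relevant, m=2)
coverage_at_4 = exist_at_m(ranked, relevant, m=4)
negatives_q1 = semi_hard_negatives(ranked["q1"], relevant["q1"], skip_top=1, num_negatives=2)
```

Under these assumptions, a metric of this kind indicates how large m must be before the re-ranker is likely to see a relevant candidate at all, and the rank-band rule is one simple way to avoid training on the very hardest negatives.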
Related papers
- NV-Retriever: Improving Text Embedding Models With Effective Hard-negative Mining (2024)
- LLM-augmented Retrieval: Enhancing Retrieval Models Through Language Models And Doc-level Embedding (2024)
- LEMUR: A Corpus For Robust Fine-tuning Of Multilingual Law Embedding Models For Retrieval (2026)
- Enhancing Retrieval Performance: An Ensemble Approach For Hard Negative Mining (2024)
- MM-Embed: Universal Multimodal Retrieval With Multimodal LLMs (2024)
- BiCA: Effective Biomedical Dense Retrieval With Citation-aware Hard Negatives (2025)
- Hard Negatives, Hard Lessons: Revisiting Training Data Quality For Robust Information Retrieval With LLMs (2025)
- Don't Retrieve, Generate: Prompting LLMs For Synthetic Training Data In Dense Retrieval (2025)