LEMUR: A Corpus For Robust Fine-tuning Of Multilingual Law Embedding Models For Retrieval
2026 · Narges Baba Ahmadi, Jan Strich, Martin Semmann, et al.
Abstract
Large language models (LLMs) are increasingly used to access legal information. Yet, their deployment in multilingual legal settings is constrained by unreliable retrieval and the lack of domain-adapted, open-embedding models. In particular, existing multilingual legal corpora are not designed for semantic retrieval, and PDF-based legislative sources introduce substantial noise due to imperfect text extraction. To address these challenges, we introduce LEMUR, a large-scale multilingual corpus of EU environmental legislation constructed from 24,953 official EUR-Lex PDF documents covering 25 languages. We quantify the fidelity of PDF-to-text conversion by measuring lexical consistency against authoritative HTML versions using the Lexical Content Score (LCS). Building on LEMUR, we fine-tune three state-of-the-art multilingual embedding models using contrastive objectives in both monolingual and bilingual settings, reflecting realistic legal-retrieval scenarios. Experiments across low- and
Related papers
- LLM-augmented Retrieval: Enhancing Retrieval Models Through Language Models And Doc-level Embedding (2024) 0.00
- Evaluating LLM-based Approaches To Legal Citation Prediction: Domain-specific Pre-training, Fine-tuning, Or RAG? A Benchmark And An Australian Law Case Study (2024) 0.00
- Transforming LLMs Into Cross-modal And Cross-lingual Retrieval Systems (2024) 4.52
- LUSIFER: Language Universal Space Integration For Enhanced Multilingual Embeddings With Large Language Models (2025) 0.00
- Optimizing Legal Document Retrieval In Vietnamese With Semi-hard Negative Mining (2025) 0.00
- LMAR: Language Model Augmented Retriever For Domain-specific Knowledge Indexing (2025) 1.57
- LexSemBridge: Fine-grained Dense Representation Enhancement Through Token-aware Embedding Augmentation (2025) 2.35
- MM-Embed: Universal Multimodal Retrieval With Multimodal LLMs (2024) 0.00