CGPT: Cluster-guided Partial Tables With LLM-generated Supervision For Table Retrieval
2026 · Tsung-Hsiang Chou, Chen-Jui Yu, Shui-Hsiang Hsu, et al.
Abstract
General-purpose embedding models have demonstrated strong performance in text retrieval but remain suboptimal for table retrieval, where highly structured content leads to semantic compression and query-table mismatch. Recent LLM-based retrieval augmentation methods mitigate this issue by generating synthetic queries, yet they often rely on heuristic partial-table selection and seldom leverage these synthetic queries as supervision to improve the embedding model. We introduce CGPT, a training framework that enhances table retrieval through LLM-generated supervision. CGPT constructs semantically diverse partial tables by clustering table instances using K-means and sampling across clusters to broaden semantic coverage. An LLM then generates synthetic queries for these partial tables, which are used in hard-negative contrastive fine-tuning to refine the embedding model. Experiments across four public benchmarks (MimoTable, OTTQA, FetaQA, and E2E-WTQ) show that CGPT consistently outperfor
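The two steps the abstract describes, cluster-guided partial-table construction and contrastive fine-tuning with hard negatives, can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `embed_row` is a hypothetical toy stand-in for a real embedding model, `kmeans` is a plain Lloyd's-algorithm loop, and the choices of one row per cluster, an InfoNCE-style loss, and a temperature of 0.05 are all assumptions that the abstract does not specify. The LLM query-generation step is omitted.

```python
import math
import random


def embed_row(row, dim=16):
    """Toy stand-in (assumption) for a real embedding model: hashed bag of tokens."""
    vec = [0.0] * dim
    for cell in row:
        for token in str(cell).lower().split():
            # Deterministic bucket; Python's built-in hash() is salted per process.
            vec[sum(ord(ch) for ch in token) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the previous centroid if a cluster empties out
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def build_partial_tables(header, rows, k=3, n_partial=2, seed=0):
    """Cluster the rows, then draw one row per non-empty cluster for each partial table."""
    k = min(k, len(rows))
    labels = kmeans([embed_row(r) for r in rows], k, seed=seed)
    clusters = {}
    for row, lab in zip(rows, labels):
        clusters.setdefault(lab, []).append(row)
    rng = random.Random(seed)
    partials = []
    for _ in range(n_partial):
        sampled = [rng.choice(members) for _, members in sorted(clusters.items())]
        partials.append({"header": header, "rows": sampled})
    return partials


def info_nce(query, positive, hard_negatives, temperature=0.05):
    """InfoNCE-style loss for one query: negative log-softmax of the positive's score."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    logits = [dot(query, positive) / temperature]
    logits += [dot(query, neg) / temperature for neg in hard_negatives]
    top = max(logits)  # subtract the max for numerical stability
    log_z = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_z - logits[0]


# Illustrative data only; not taken from any benchmark in the paper.
header = ["city", "country", "population"]
rows = [
    ["Taipei", "Taiwan", "2.6 million"],
    ["Kaohsiung", "Taiwan", "2.7 million"],
    ["Osaka", "Japan", "2.7 million"],
    ["Kyoto", "Japan", "1.5 million"],
    ["Lyon", "France", "0.5 million"],
    ["Nice", "France", "0.3 million"],
]
partials = build_partial_tables(header, rows, k=3, n_partial=2)
```

In a full pipeline, each partial table would be serialized and passed to an LLM to produce synthetic queries, and the resulting (query, table) pairs, together with mined hard negatives, would feed a loss such as `info_nce` during fine-tuning. Sampling across clusters rather than taking the first few rows is what gives each partial table broader semantic coverage.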
Related papers
- GRAPE: Let GPRO Supervise Query Rewriting By Ranking For Retrieval (2025)
- Efficient Fine-tuning Methodology Of Text Embedding Models For Information Retrieval: Contrastive Learning Penalty (CLP) (2024)
- LMAR: Language Model Augmented Retriever For Domain-specific Knowledge Indexing (2025)
- LLM-augmented Retrieval: Enhancing Retrieval Models Through Language Models And Doc-level Embedding (2024)
- Mixed-modality Representation Learning And Pre-training For Joint Table-and-text Retrieval In OpenQA (2022)
- GPL: Generative Pseudo Labeling For Unsupervised Domain Adaptation Of Dense Retrieval (2021)
- Don't Retrieve, Generate: Prompting LLMs For Synthetic Training Data In Dense Retrieval (2025)
- Evaluating The Effectiveness And Scalability Of LLM-based Data Augmentation For Retrieval (2025)