F2RVLM: Boosting Fine-grained Fragment Retrieval For Multi-modal Long-form Dialogue With Vision Language Model
2025 · Hanbo Bi, Zhiqiang Yuan, Zexi Jia, et al.
Abstract
Traditional dialogue retrieval aims to select the most appropriate utterance or image from recent dialogue history. However, such methods often fail to meet users' actual needs for revisiting semantically coherent content scattered across long-form conversations. To fill this gap, we define the Fine-grained Fragment Retrieval (FFR) task, requiring models to locate query-relevant fragments, comprising both utterances and images, from multimodal long-form dialogues. As a foundation for FFR, we construct MLDR, the longest-turn multimodal dialogue retrieval dataset to date, averaging 25.45 turns per dialogue, with each dialogue naturally spanning three distinct topics. To evaluate generalization in real-world scenarios, we curate and annotate a WeChat-based test set comprising real-world multimodal dialogues with an average of 75.38 turns. Building on these resources, we explore existing generation-based Vision-Language Models (VLMs) on FFR and observe that they often retrieve incoherent utterance-image fr
Related papers
- MMMORRF: Multimodal Multilingual Modularized Reciprocal Rank Fusion (2025) 2.26
- Recqr: Incorporating Conversational Query Rewriting To Improve Multimodal Image Retrieval (2026) 0.00
- A Little More Like This: Text-to-image Retrieval With Vision-language Models Using Relevance Feedback (2025) 0.00
- Fine-grained Late-interaction Multi-modal Retrieval For Retrieval Augmented Visual Question Answering (2023) 5.24
- Imageref-vl: Enabling Contextual Image Referencing In Vision-language Models (2025) 1.91
- Flickr30k-cfq: A Compact And Fragmented Query Dataset For Text-image Retrieval (2024) 3.58
- Vision-deepresearch Benchmark: Rethinking Visual And Textual Search For Multimodal Large Language Models (2026) 7.27
- RAVEN: Multitask Retrieval Augmented Vision-language Learning (2024) 0.00