CBVS: A Large-scale Chinese Image-text Benchmark For Real-world Short Video Search Scenarios
2024 · Xiangshuo Qiao, Xianxin Li, Xiaozhe Qu, et al.
Abstract
Vision-language models pre-trained on large-scale image-text datasets have shown superior performance on downstream tasks such as image retrieval. Most pre-training images depict open-domain, common-sense visual elements. In contrast, video covers in short video search scenarios are user-originated content that provides an important visual summary of each video. In addition, some video covers carry manually designed cover text that complements the image semantically. To fill the gap in short video cover data, we establish the first large-scale cover-text benchmark for Chinese short video search scenarios. Specifically, we release two large-scale datasets, CBVS-5M and CBVS-10M, which provide short video covers, and a manually fine-labeled dataset, CBVS-20K, which provides real user queries and serves as an image-text benchmark in the Chinese short video search field. To integrate the semantics of cover text in the case of modality mis