← all datasets

AVCaps

Emerging
3 papers using it
First seen: 2024

AVCaps is a dataset that contains audio, visual, and audio-visual captions for video clips, used to evaluate the semantic alignment of language, audio, and visual modalities in multimodal representation learning.
