AVCaps
Emerging (3 papers using it)
First seen: 2024
AVCaps is a dataset that contains audio, visual, and audio-visual captions for video clips, used to evaluate the semantic alignment of language, audio, and visual modalities in multimodal representation learning.