ActivityNet-Caption
The ActivityNet Captions dataset connects videos to a series of temporally annotated sentence descriptions. Each sentence covers a unique segment of the video, describing multiple events that occur. These events may occur over very long or short periods of time and are not limited in any capacity, allowing them to co-occur. On average, each of the 20k videos contains 3.65 temporally localized sentences, resulting in a total of 100k sentences. We find that the number of sentences per video follows a relatively normal distribution. Furthermore, as the video duration increases, the number of sentences also increases. Each sentence has an average length of 13.48 words, which is also normally distributed. You can find more details of the dataset under the ActivityNet Captions Dataset section, and under supplementary materials in the paper.
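The per-video and per-sentence statistics described above could be recomputed from an annotation file roughly as follows. This is only a minimal sketch: it assumes the annotations form a mapping from video ID to a record with `duration`, `timestamps` (one `[start, end]` pair per sentence), and `sentences`. That layout, the sample records, and the function names `caption_stats` and `overlapping_pairs` are illustrative assumptions, not something stated on this page.

```python
from statistics import mean

# Hypothetical sample records mirroring the ASSUMED annotation layout:
# video id -> duration in seconds, one [start, end] pair per sentence, sentences.
annotations = {
    "v_example_1": {
        "duration": 60.0,
        "timestamps": [[0.0, 20.0], [15.0, 60.0]],
        "sentences": [
            "A man walks onto a stage.",
            "He plays the guitar while the crowd claps.",
        ],
    },
    "v_example_2": {
        "duration": 12.5,
        "timestamps": [[0.0, 12.5]],
        "sentences": ["A dog runs across the yard."],
    },
}


def caption_stats(annotations):
    """Summarize sentence counts and sentence lengths (whitespace-split words)."""
    sentences_per_video = [len(v["sentences"]) for v in annotations.values()]
    words_per_sentence = [
        len(sentence.split())
        for v in annotations.values()
        for sentence in v["sentences"]
    ]
    return {
        "num_videos": len(annotations),
        "num_sentences": sum(sentences_per_video),
        "mean_sentences_per_video": mean(sentences_per_video),
        "mean_words_per_sentence": mean(words_per_sentence),
    }


def overlapping_pairs(timestamps):
    """Return index pairs (i, j) of segments that overlap in time (co-occurring events)."""
    pairs = []
    for i in range(len(timestamps)):
        for j in range(i + 1, len(timestamps)):
            (start_i, end_i), (start_j, end_j) = timestamps[i], timestamps[j]
            # Two segments overlap when the later start precedes the earlier end.
            if max(start_i, start_j) < min(end_i, end_j):
                pairs.append((i, j))
    return pairs


stats = caption_stats(annotations)
co_occurring = overlapping_pairs(annotations["v_example_1"]["timestamps"])
```

Run on the full annotation set, `caption_stats` should give numbers comparable to the averages quoted above, although word counts depend on the tokenization used. `overlapping_pairs` gives a quick way to see which events in a video co-occur.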
Papers using ActivityNet-Caption (6)
- Attend And Interact: Higher-order Object Interactions For Video Understanding
- Grounded Objects And Interactions For Video Captioning
- SAVCHOI: Detecting Suspicious Activities Using Dense Video Captioning With Human Object Interactions
- Weakly Supervised Dense Video Captioning via Jointly Usage of Knowledge Distillation and Cross-modal Matching
- Live Video Captioning
- ConTra: (Con)text (Tra)nsformer for Cross-Modal Video Retrieval