← all datasets

ActivityNet-Caption

Emerging
6papers using it
169HF downloads
3HF likes
2017first seen

The ActivityNet Captions dataset connects videos to a series of temporally annotated sentence descriptions. Each sentence covers an unique segment of the video, describing multiple events that occur. These events may occur over very long or short periods of time and are not limited in any capacity, allowing them to co-occur. On average, each of the 20k videos contains 3.65 temporally localized sentences, resulting in a total of 100k sentences. We find that the number of sentences per video follows a relatively normal distribution. Furthermore, as the video duration increases, the number of sentences also increases. Each sentence has an average length of 13.48 words, which is also normally distributed. You can find more details of the dataset under the ActivityNet Captions Dataset section, and under supplementary materials in the paper.

Papers using ActivityNet-Caption (6)

ActivityNet-Caption — datasets — computer-vision