Datasets - Awesome Time Series
335 datasets & benchmarks: 15 canonical foundations plus emerging datasets mined from recent papers. Each links to the papers that use it.
Dataset Card for MATH-500 This dataset contains a subset of 500 problems from the MATH benchmark that OpenAI created in their Let's Verify Step by Step paper. See their GitHub repo for the source file: https://github.com/openai/prm800k/tree/main?tab=readme-ov-file#math-splits
GAIA dataset GAIA is a benchmark that aims to evaluate next-generation LLMs (LLMs with augmented capabilities due to added tooling, efficient prompting, access to search, etc.). We added gating to prevent bots from scraping the dataset. Please do not reshare the validation or test set in a crawlable format. Data and leaderboard GAIA is made of more than 450 non-trivial questions, each with an unambiguous answer, requiring different levels of tooling and autonomy to… See the full description on the dataset page: https://huggingface.co/datasets/gaia-benchmark/GAIA.
This dataset was created using LeRobot. Dataset Structure meta/info.json: { "codebase_version": "v3.0", "robot_type": "panda", "total_episodes": 1693, "total_frames": 273465, "total_tasks": 40, "chunks_size": 1000, "fps": 10.0, "splits": { "train": "0:1693" }, "data_path": "data/chunk-{chunk_index:03d}/file-{file_index:03d}.parquet", "video_path": "videos/{video_key}/chunk-{chunk_index:03d}/file-{file_index:03d}.mp4"… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceVLA/libero.
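The path templates in the meta/info.json excerpt above use ordinary Python format fields, so they can be filled in with str.format. The sketch below is a minimal illustration built only from the fields visible in the truncated excerpt; it is not the LeRobot loader. The helper names and the video key "camera_example" are hypothetical, and the excerpt does not show how episodes map to chunk and file indices, so the indices here are arbitrary.

```python
# Fields copied from the truncated meta/info.json excerpt; nothing beyond
# what the excerpt shows is assumed about the dataset.
info = {
    "total_episodes": 1693,
    "total_frames": 273465,
    "fps": 10.0,
    "data_path": "data/chunk-{chunk_index:03d}/file-{file_index:03d}.parquet",
    "video_path": "videos/{video_key}/chunk-{chunk_index:03d}/file-{file_index:03d}.mp4",
}


def resolve_data_path(chunk_index: int, file_index: int) -> str:
    """Hypothetical helper: fill the data_path template (zero-padded to 3 digits)."""
    return info["data_path"].format(chunk_index=chunk_index, file_index=file_index)


def resolve_video_path(video_key: str, chunk_index: int, file_index: int) -> str:
    """Hypothetical helper: fill the video_path template for one video stream."""
    return info["video_path"].format(
        video_key=video_key, chunk_index=chunk_index, file_index=file_index
    )


# Simple statistics derived from the listed totals.
total_seconds = info["total_frames"] / info["fps"]
mean_frames_per_episode = info["total_frames"] / info["total_episodes"]

print(resolve_data_path(0, 0))
print(resolve_video_path("camera_example", 1, 12))  # "camera_example" is a made-up key
print(f"{total_seconds / 3600:.1f} hours of data at {info['fps']} fps")
print(f"{mean_frames_per_episode:.1f} frames per episode on average")
```

From the listed totals, 273465 frames at 10 fps come to about 7.6 hours of recorded data, or roughly 160 frames (16 seconds) per episode on average.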
AIME 24 American Invitational Mathematics Examination (AIME) 2024 Citation If you use the AIME24 dataset in your research, please consider citing it as follows: @misc{aime24, title={American Invitational Mathematics Examination (AIME) 2024}, author={Zhang, Yifan and Math-AI, Team}, year={2024}, }
⚠️ This dataset is deprecated. It is replaced by longmemeval-cleaned (https://huggingface.co/datasets/xiaowu0162/longmemeval-cleaned), which removes noisy history sessions that interfere with answer correctness.
HumanEval-X is a benchmark for the evaluation of the multilingual ability of code generative models. It consists of 820 high-quality human-crafted data samples (each with test cases) in Python, C++, Java, JavaScript, and Go, and can be used for various tasks.
Dataset Card for BEIR Benchmark hotpotqa is one of the datasets from the Question Answering task within BEIR, measuring Wikipedia article retrieval for a given multi-hop query. Dataset Summary BEIR is a heterogeneous benchmark built from 18 diverse datasets representing 9 information retrieval tasks. Fact-checking: FEVER, Climate-FEVER, SciFact; Question-Answering: NQ, HotpotQA, FiQA-2018; Bio-Medical IR: TREC-COVID, BioASQ, NFCorpus; News Retrieval: TREC-NEWS, Robust04… See the full description on the dataset page: https://huggingface.co/datasets/BeIR/hotpotqa.
Dataset Card for afrixnli Dataset Summary AFRIXNLI is an evaluation dataset comprising translations of a subset of the XNLI dataset into 16 African languages. It includes both validation and test sets across all 18 languages: the 16 African languages plus the English and French subsets retained from the original XNLI dataset. Languages There are 18 languages available. Dataset Structure Data Instances The examples look like this for English: from datasets import… See the full description on the dataset page: https://huggingface.co/datasets/masakhane/afrixnli.
Warning: The leaderboard above is generated by Hugging Face eval-results and may be incomplete until evaluation_framework: benchflow is accepted and deployed. The audited SkillsBench v1.1 result archive is https://huggingface.co/datasets/benchflow/skillsbench-leaderboard, with compact official exports under leaderboard/skillsbench/v1.1/. Warning: The dataset is a read-only mirror. The primary source for this benchmark is on GitHub: https://github.com/benchflow-ai/skillsbench. Open issues and… See the full description on the dataset page: https://huggingface.co/datasets/benchflow/skillsbench.
Dataset Card for "QM9" More Information needed