
VoxCeleb2

Canonical

20 papers using it

First seen in 2021

VoxCeleb2 Dataset

This is the VoxCeleb2 dataset, a large-scale speaker identification dataset.

Dataset Description

VoxCeleb2 contains over 1 million utterances for 6,112 celebrities, extracted from videos uploaded to YouTube.

Files

vox2_dev_mp4_part*: Multipart archive containing MP4 video files

vox2_dev_txt: Text file
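Because vox2_dev_mp4_part* is described as a multipart archive, the parts presumably have to be joined into one file before extraction. The following is a minimal sketch of that step, not the dataset's official procedure. It assumes the parts are plain byte-wise splits of a single archive and that sorting the file names lexically gives the right order. The function name join_parts and the demo file names are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def join_parts(parts_dir, prefix, out_path):
    """Concatenate every file named <prefix>* in parts_dir into out_path.

    Assumption: the parts are byte-wise splits of one archive, and lexical
    order of the file names matches the original split order.
    """
    out_path = Path(out_path)
    parts = sorted(
        p for p in Path(parts_dir).glob(prefix + "*")
        if p.is_file() and p.resolve() != out_path.resolve()
    )
    if not parts:
        raise FileNotFoundError(f"no files matching {prefix}* in {parts_dir}")
    with open(out_path, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                # Stream in chunks so large parts never sit fully in memory.
                shutil.copyfileobj(src, out)
    return parts


if __name__ == "__main__":
    # Demo on dummy data; real part files would be far larger.
    with tempfile.TemporaryDirectory() as tmp:
        tmp_dir = Path(tmp)
        (tmp_dir / "demo_partaa").write_bytes(b"first-")
        (tmp_dir / "demo_partab").write_bytes(b"second")
        joined = tmp_dir / "demo_joined.bin"
        used = join_parts(tmp_dir, "demo_part", joined)
        print([p.name for p in used], joined.stat().st_size)
```

For the real data, the call would look like join_parts(download_dir, "vox2_dev_mp4_part", output_file), where both paths are placeholders. The format of the joined archive is not stated here, so check it before choosing an extraction tool.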


Papers using VoxCeleb2 (20)

Echo: A Joint-Embedding Predictive Architecture for Speaker Diarization and Speech Recognition in a Shared Latent Space · 2026

CL-UZH submission to the NIST SRE 2024 Speaker Recognition Evaluation · 2025

Few-Shot Speaker Identification Using Lightweight Prototypical Network with Feature Grouping and Interaction · 2023 · 14 cites

Multi-View Self-Attention Based Transformer for Speaker Recognition · 2021 · 3 cites

Can large-scale vocoded spoofed data improve speech spoofing countermeasure with a self-supervised front end? · 2023 · 2 cites

Self-Supervised Training of Speaker Encoder with Multi-Modal Diverse Positive Pairs · 2022 · 1 cite

Speaker Recognition Using Isomorphic Graph Attention Network Based Pooling on Self-Supervised Representation · 2023 · 1 cite

Seeing Through the Conversation: Audio-Visual Speech Separation based on Diffusion Model · 2023 · 1 cite

Target Speech Diarization with Multimodal Prompts · 2024 · 1 cite

Mechanisms of Multimodal Synchronization: Insights from Decoder-Based Video-Text-to-Speech Synthesis · 2024 · 1 cite

Multi-task Voice Activated Framework using Self-supervised Learning · 2021

More than Words: In-the-Wild Visually-Driven Prosody for Text-to-Speech · 2021

LipSound2: Self-Supervised Pre-Training for Lip-to-Speech Reconstruction and Lip Reading · 2021

Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels · 2023

A vector quantized masked autoencoder for speech emotion recognition · 2023

One-Step Knowledge Distillation and Fine-Tuning in Using Large Pre-Trained Self-Supervised Learning Models for Speaker Verification · 2023

Speaker verification using attentive multi-scale convolutional recurrent network · 2023

Target Speech Extraction with Pre-trained AV-HuBERT and Mask-And-Recover Strategy · 2024

Multilingual Audio-Visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer · 2024

Libri2Vox Dataset: Target Speaker Extraction with Diverse Speaker Conditions and Synthetic Data · 2024
