Awesome Papers

Papers

Rethinking Memory as Continuously Evolving Connectivity (2026)
Jizhan Fang et al.
13.31
LongAV-Compass: Towards Unified Evaluation of Minute-Scale Audio-Visual Generation Across T2AV, I2AV, and V2AV (2026)
Tengfei Liu et al.
13.04
PianoMotion10M: Dataset and Benchmark for Hand Motion Generation in Piano Performance (2025)
Qijun Gan et al.
8.70
Squeezing Capacity from Multimodal Large Language Models for Subject-driven Generation (2026)
Shuhong Zheng et al.
7.39
TANDEM: Temporal-Aware Neural Detection for Multimodal Hate Speech (2026)
Girish A. Koushik et al.
1.96
From Talking to Singing: A New Challenge for Audio-Visual Deepfake Detection (2026)
Ke Liu et al.
0.00
MTAVG-Bench 2.0: Diagnosing Failure Modes of Cinematic Expressiveness in Multi-Talker Audio-Video Generation (2026)
Haitian Li et al.
0.00
Unified Synthesis of Compositional Speech and Sound from Free-Form Text Prompts (2026)
Yuyue Wang et al.
0.00
EigeNet: Geometry-Informed Multi-Modal Learning for Few-shot Novel View RIR Prediction (2026)
Chong Jing et al.
0.00
Deep CNN Framework for Audio Event Recognition using Weakly Labeled Web Data (2022)
Anurag Kumar et al.
—
Audio-Visual Speech Enhancement Using Multimodal Deep Convolutional Neural Networks (2022)
Jen-Cheng Hou et al.
—
Neural Style Transfer for Audio Spectograms (2024)
Prateek Verma et al.
—
A Lightweight Music Texture Transfer System (2021)
Xutan Peng et al.
—
Scene-Aware Audio Rendering via Deep Acoustic Analysis (2021)
Zhenyu Tang et al.
—
Time-Domain Audio Source Separation Based on Wave-U-Net Combined with Discrete Wavelet Transform (2022)
Tomohiko Nakamura and Hiroshi Saruwatari
—
ASMD: an automatic framework for compiling multimodal datasets with audio and scores (2021)
Federico Simonetta et al.
—
MDCNN-SID: Multi-scale Dilated Convolution Network for Singer Identification (2022)
Xulong Zhang et al.
—
End-to-End Lip Synchronisation Based on Pattern Classification (2021)
You Jin Kim et al.
—
Towards Multimodal MIR: Predicting individual differences from music-induced movement (2021)
Yudhik Agrawal et al.
—
MusiCoder: A Universal Music-Acoustic Encoder Based on Transformers (2021)
Yilun Zhao et al.
—
Music SketchNet: Controllable Music Generation via Factorized Representations of Pitch and Rhythm (2021)
Ke Chen et al.
—
Speech Driven Talking Face Generation from a Single Image and an Emotion Condition (2021)
Sefik Emre Eskimez et al.
—
Detection of AI-Synthesized Speech Using Cepstral & Bispectral Statistics (2021)
Arun Kumar Singh et al.
—
Ensemble Chinese End-to-End Spoken Language Understanding for Abnormal Event Detection from audio stream (2021)
Haoran Wei et al.
—
GSEP: A robust vocal and accompaniment separation system using gated CBHG module and loudness normalization (2021)
Soochul Park and Ben Sangbae Chon
—
Melody Harmonization Using Orderless NADE, Chord Balancing, and Blocked Gibbs Sampling (2021)
Chung-En Sun et al.
—
MusicTM-Dataset for Joint Representation Learning among Sheet Music, Lyrics, and Musical Audio (2021)
Donghuo Zeng et al.
—
Multi-Classifier Interactive Learning for Ambiguous Speech Emotion Recognition (2023)
Ying Zhou et al.
—
Piano Skills Assessment (2021)
Paritosh Parmar et al.
—
Melon Playlist Dataset: a public dataset for audio-based playlist generation and music tagging (2021)
Andres Ferraro et al.
—
Neural Network architectures to classify emotions in Indian Classical Music (2021)
Uddalok Sarkar et al.
—
Downbeat Tracking with Tempo-Invariant Convolutional Neural Networks (2021)
Bruno Di Giorgi et al.
—
Low Bit-Rate Wideband Speech Coding: A Deep Generative Model based Approach (2021)
Gang Min et al.
—
Signal Representations for Synthesizing Audio Textures with Generative Adversarial Networks (2022)
Chitralekha Gupta et al.
—
Audio Transformers (2025)
Prateek Verma and Jonathan Berger
—
MuseMorphose: Full-Song and Fine-Grained Piano Music Style Transfer with One Transformer VAE (2022)
Shih-Lun Wu et al.
—
BERT-like Pre-training for Symbolic Piano Music Classification Tasks (2024)
Yi-Hui Chou et al.
—
Speech2AffectiveGestures: Synthesizing Co-Speech Gestures with Generative Adversarial Affective Expression Learning (2024)
Uttaran Bhattacharya, Elizabeth Childs, Nicholas Rewkowski, and Dinesh Manocha
—
FakeAVCeleb: A Novel Audio-Video Multimodal Deepfake Dataset (2022)
Hasam Khalid, Shahroz Tariq, Minha Kim, and Simon S. Woo
—
Multimodal analysis of the predictability of hand-gesture properties (2022)
Taras Kucherenko et al.
—
Attention is All You Need? Good Embeddings with Statistics are enough: Large Scale Audio Understanding without Transformers/ Convolutions/ BERTs/ Mixers/ Attention/ RNNs or .... (2022)
Prateek Verma
—
Multimodal Approach for Assessing Neuromotor Coordination in Schizophrenia Using Convolutional Neural Networks (2023)
Yashish M. Siriwardena et al.
—
Singer separation for karaoke content generation (2024)
Hsuan-Yu Lin et al.
—
SingGAN: Generative Adversarial Network For High-Fidelity Singing Voice Generation (2022)
Rongjie Huang et al.
—
Speech Pattern based Black-box Model Watermarking for Automatic Speech Recognition (2022)
Haozhe Chen et al.
—
Theme Transformer: Symbolic Music Generation with Theme-Conditioned Transformer (2022)
Yi-Jen Shih et al.
—
AVA-AVD: Audio-Visual Speaker Diarization in the Wild (2022)
Eric Zhongcong Xu et al.
—
Embedding-based Music Emotion Recognition Using Composite Loss (2023)
Naoki Takashima et al.
—
Zero-shot Audio Source Separation through Query-based Learning from Weakly-labeled Data (2022)
Ke Chen et al.
—
Attribute Inference Attack of Speech Emotion Recognition in Federated Learning Settings (2022)
Tiantian Feng, Hanieh Hashemi, Rajat Hebbar, Murali Annavaram, and Shrikanth S. Narayanan
—