Getting The Subtext Without The Text: Scalable Multimodal Sentiment Classification From Visual And Acoustic Modalities
2018 · Nathaniel Blanchard, Daniel Moreira, Aparna Bharati, et al.
Abstract
In the last decade, video blogs (vlogs) have become an extremely popular method through which people express sentiment. The ubiquitousness of these videos has increased the importance of multimodal fusion models, which incorporate video and audio features with traditional text features for automatic sentiment detection. Multimodal fusion offers a unique opportunity to build models that learn from the full depth of expression available to human viewers. In the detection of sentiment in these videos, acoustic and video features provide clarity to otherwise ambiguous transcripts. In this paper, we present a multimodal fusion model that exclusively uses high-level video and audio features to analyze spoken sentences for sentiment. We discard traditional transcription features in order to minimize human intervention and to maximize the deployability of our model on at-scale real-world data. We select high-level features for our model that have been successful in nonaffect domains in order t
Related papers
- ScaleVLAD: Improving Multimodal Sentiment Analysis Via Multi-scale Fusion Of Locally Descriptors (2021) · 0.00
- Video-based Cross-modal Auxiliary Network For Multimodal Sentiment Analysis (2022) · 11.76
- Audio-guided Fusion Techniques For Multimodal Emotion Analysis (2024) · 4.52
- Semantic Matters: Multimodal Features For Affective Analysis (2025) · 0.00
- Effectively Obtaining Acoustic, Visual And Textual Data From Videos (2025) · 0.00
- Enhancing Multimodal Sentiment Analysis For Missing Modality Through Self-distillation And Unified Modality Cross-attention (2024) · 6.71
- Multi-modal Emotion Recognition By Text, Speech And Video Using Pretrained Transformers (2024) · 0.00
- Multimodal Emotion Recognition And Sentiment Analysis In Multi-party Conversation Contexts (2025) · 0.00