Guang Shen, Riwei Lai, Rui Chen, Yu Zhang, Kejia Zhang, Qilong Han, and Hongtao Song. WISE: Word-Level Interaction-based Multimodal Fusion for Speech Emotion Recognition. In: Proceedings of the 21st Annual Conference of the International Speech Communication Association (Interspeech), Shanghai, China, 2020.


Description

While it has numerous real-world applications, speech emotion recognition remains a technically challenging problem. Effectively leveraging the multiple modalities inherent in speech data (e.g., audio and text) is key to accurate classification. Existing studies typically fuse multimodal features at the utterance level and largely neglect the dynamic interplay of features from different modalities at a fine-grained level over time. In this paper, we explicitly model dynamic interactions between audio and text at the word level via interaction units placed between two long short-term memory (LSTM) networks that represent audio and text, respectively. We also devise a hierarchical representation of audio information at the frame, phoneme, and word levels, which substantially improves the expressiveness of the resulting audio features. We finally propose WISE, a novel word-level interaction-based multimodal fusion framework for speech emotion recognition, to accommodate the aforementioned components. We evaluate WISE on the public benchmark IEMOCAP corpus and demonstrate that it outperforms state-of-the-art methods.
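
To make the word-level interaction idea above concrete, the following PyTorch sketch couples an audio LSTM and a text LSTM through a simple gated interaction unit applied after each word step. This is a minimal illustration under stated assumptions, not the authors' implementation: the class names (`WordLevelInteraction`, `WiseSketch`), the gating formulation, and all dimensions are hypothetical.

```python
import torch
import torch.nn as nn

class WordLevelInteraction(nn.Module):
    """Illustrative interaction unit: each modality's hidden state absorbs
    information from the other via a learned gate (an assumption, not the
    paper's exact formulation)."""
    def __init__(self, hidden_dim):
        super().__init__()
        self.gate_audio = nn.Linear(2 * hidden_dim, hidden_dim)
        self.gate_text = nn.Linear(2 * hidden_dim, hidden_dim)

    def forward(self, h_audio, h_text):
        joint = torch.cat([h_audio, h_text], dim=-1)
        g_a = torch.sigmoid(self.gate_audio(joint))
        g_t = torch.sigmoid(self.gate_text(joint))
        # Gated mixing of the two modality states at the current word step.
        return g_a * h_audio + (1 - g_a) * h_text, g_t * h_text + (1 - g_t) * h_audio

class WiseSketch(nn.Module):
    """Two LSTM streams (audio, text) that interact after every word and are
    fused for utterance-level emotion classification."""
    def __init__(self, audio_dim, text_dim, hidden_dim, num_classes):
        super().__init__()
        self.audio_cell = nn.LSTMCell(audio_dim, hidden_dim)
        self.text_cell = nn.LSTMCell(text_dim, hidden_dim)
        self.interact = WordLevelInteraction(hidden_dim)
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, audio_words, text_words):
        # audio_words, text_words: (batch, num_words, feature_dim), aligned per word.
        batch, num_words, _ = audio_words.shape
        zeros = audio_words.new_zeros(batch, self.audio_cell.hidden_size)
        h_a, c_a = zeros.clone(), zeros.clone()
        h_t, c_t = zeros.clone(), zeros.clone()
        for w in range(num_words):
            h_a, c_a = self.audio_cell(audio_words[:, w], (h_a, c_a))
            h_t, c_t = self.text_cell(text_words[:, w], (h_t, c_t))
            # Interaction unit couples the two streams at the word level.
            h_a, h_t = self.interact(h_a, h_t)
        return self.classifier(torch.cat([h_a, h_t], dim=-1))

# Example usage with made-up dimensions (40-d word-level audio features,
# 300-d word embeddings, 4 emotion classes).
model = WiseSketch(audio_dim=40, text_dim=300, hidden_dim=128, num_classes=4)
logits = model(torch.randn(2, 12, 40), torch.randn(2, 12, 300))
print(logits.shape)  # torch.Size([2, 4])
```

In the paper, the word-level audio features would themselves come from the hierarchical frame-phoneme-word representation described above; here they are stand-in random tensors.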
