The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences
Articles | Volume XLVIII-2/W3-2023
https://doi.org/10.5194/isprs-archives-XLVIII-2-W3-2023-89-2023
12 May 2023

IMPROVED AUTOMATIC LIP-READING BASED ON THE EVALUATION OF INTENSITY LEVEL OF SPEAKER’S EMOTION

D. Ivanko, E. Ryumina, and D. Ryumin

Keywords: Automatic lip-reading, Emotion recognition, Intelligent Video Analytics, Computer Vision, Human-Machine Interaction

Abstract. Automatic audio-visual speech recognition (AVSR) systems have recently achieved tremendous success; in limited-vocabulary tasks they far surpass human speech recognition abilities, especially in acoustically noisy conditions. Speech recognition systems that process both audio and video information are being actively researched and developed worldwide. However, no scientific studies have yet analyzed the influence of the speaker's emotional state (anger, disgust, fear, happy, neutral, and sad) and, most importantly, of the intensity level of the emotion (low - LO, medium - MD, high - HI) on automatic lip-reading. The relevance of this research topic therefore cannot be overestimated, and it requires detailed study. In this paper, we present a novel approach to emotional speech lip-reading that includes evaluation of a speaker's emotion and its intensity level. The proposed approach uses visual speech data to detect a person's emotion type and intensity level and, based on this information, assigns the input to one of the trained emotional lip-reading models. This essentially resolves the multi-emotional lip-reading issue associated with most real-life scenarios. By taking the intensity of the pronounced audio-visual speech into account, the proposed approach improves on state-of-the-art results by up to 8.2% in terms of accuracy. The current research is a first step toward the creation of emotion-robust speech recognition systems and leaves open a wide field for further research.
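The routing idea described in the abstract (classify the speaker's emotion and its intensity from visual speech, then hand the input to the lip-reading model trained for that combination) can be sketched as follows. This is a minimal illustration only; the function and model names are hypothetical and do not reflect the authors' actual implementation.

```python
# Hypothetical sketch: route visual speech to a per-emotion, per-intensity
# lip-reading model. Names (emotion_classifier, lip_models) are illustrative.
EMOTIONS = ["anger", "disgust", "fear", "happy", "neutral", "sad"]
INTENSITIES = ["LO", "MD", "HI"]  # low, medium, high

def route_lip_reading(video_frames, emotion_classifier, lip_models):
    """Detect (emotion, intensity) from the video, then apply the
    lip-reading model trained for that emotion/intensity pair."""
    emotion, intensity = emotion_classifier(video_frames)
    if emotion not in EMOTIONS or intensity not in INTENSITIES:
        raise ValueError(f"unexpected label: {emotion}, {intensity}")
    model = lip_models[(emotion, intensity)]  # one model per pair (6 x 3 = 18)
    return model(video_frames)
```

In this scheme the emotion/intensity classifier acts as a gating stage in front of a bank of 18 specialized recognizers, which is one plausible reading of "assigns it to one of the trained emotional lip-reading models."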