Design and Development of a Natural Language Understanding System for Emotional Intelligence in Voice

Beatriz Castellanos Rodrigo. (2025). Design and Development of a Natural Language Understanding System for Emotional Intelligence in Voice. Final Career Project (TFG). Universidad Politécnica de Madrid, ETSI Telecomunicación.

Abstract:

In recent years, AI systems have become increasingly relevant in our society, enhancing the human-machine relationship. A key aspect of these advances is the ability of such systems to interpret, manipulate, and understand human language. In this context, emotion and sentiment analysis systems are growing in importance: they enable machines to go beyond the textual content of a message and extract the speaker’s intentions and emotional state.

The aim of this project is to design and develop a system that classifies emotions in voice recordings using natural language understanding and machine learning techniques. To accomplish this, state-of-the-art deep learning models are leveraged, focusing on audio-based emotion classification with transformer-based architectures. Wav2Vec2-based models were selected for their strong performance in speech-related tasks. Development proceeds in several stages, starting with the selection and pre-processing of audio datasets. Two datasets, one in Spanish and one in English, were used to train the model and compare its performance across both languages. A comparative study of pre-trained Wav2Vec models is then carried out to select the best one for feature extraction.

The selected model is fine-tuned for the emotion classification task. The training setup uses the Hugging Face Trainer with its Optuna integration, and strategies such as weight decay are applied to maximize performance. Additionally, Whisper performs speech-to-text transcription and translation, and PySentimiento is applied to analyze sentiment using the Whisper transcriptions as reference.

Evaluation results for the proposed approach show that fine-tuning and optimized training, driven by the hyperparameter configuration, significantly improved performance on the specific classification task.
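
The abstract names weight decay as one of the strategies applied during training. As a minimal illustration of that idea only, and not the project's actual training code, the sketch below applies one plain gradient-descent step with weight decay to a list of weights. The function name sgd_step and all numeric values are assumptions chosen for the example; the thesis itself relies on the Hugging Face Trainer, whose optimizer handles this internally.

```python
from typing import List


def sgd_step(weights: List[float], grads: List[float],
             lr: float, weight_decay: float = 0.0) -> List[float]:
    """Apply one gradient-descent update with weight decay.

    Each weight moves against its gradient and is additionally
    shrunk toward zero in proportion to its own value:
        w <- w - lr * (g + weight_decay * w)
    """
    return [w - lr * (g + weight_decay * w) for w, g in zip(weights, grads)]


if __name__ == "__main__":
    # Hypothetical values: a zero gradient isolates the decay effect.
    weights = [1.0, -2.0]
    grads = [0.0, 0.0]
    print(sgd_step(weights, grads, lr=0.1, weight_decay=0.5))
```

With a zero gradient, every weight is simply scaled by (1 - lr * weight_decay), which shows why the term acts as a regularizer: large weights are pulled toward zero unless the loss gradient pushes back.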

BibTeX:

@mastersthesis{audio2025castellanos,
author = "Castellanos Rodrigo, Beatriz",
abstract = "In recent years, AI systems have become increasingly relevant in our society, enhancing the human-machine relationship. A key aspect of these advances is the ability of such systems to interpret, manipulate, and understand human language. In this context, emotion and sentiment analysis systems are growing in importance: they enable machines to go beyond the textual content of a message and extract the speaker’s intentions and emotional state.
The aim of this project is to design and develop a system that classifies emotions in voice recordings using natural language understanding and machine learning techniques. To accomplish this, state-of-the-art deep learning models are leveraged, focusing on audio-based emotion classification with transformer-based architectures. Wav2Vec2-based models were selected for their strong performance in speech-related tasks. Development proceeds in several stages, starting with the selection and pre-processing of audio datasets. Two datasets, one in Spanish and one in English, were used to train the model and compare its performance across both languages. A comparative study of pre-trained Wav2Vec models is then carried out to select the best one for feature extraction.
The selected model is fine-tuned for the emotion classification task. The training setup uses the Hugging Face Trainer with its Optuna integration, and strategies such as weight decay are applied to maximize performance. Additionally, Whisper performs speech-to-text transcription and translation, and PySentimiento is applied to analyze sentiment using the Whisper transcriptions as reference.
Evaluation results for the proposed approach show that fine-tuning and optimized training, driven by the hyperparameter configuration, significantly improved performance on the specific classification task.
",
address = "ETSI Telecomunicaci{\'o}n",
institution = "Universidad Polit{\'e}cnica de Madrid",
keywords = "transformers;emotion analysis;audio processing",
month = "July",
title = "{D}esign and {D}evelopment of a {N}atural {L}anguage {U}nderstanding {S}ystem for {E}motional {I}ntelligence in {V}oice",
type = "TFG",
year = "2025",
}