Development of an Automatic System for Disinformation Detection in the Health Domain using Natural Language Processing Techniques

Henar Iglesias López. (2025). Development of an Automatic System for Disinformation Detection in the Health Domain using Natural Language Processing Techniques. Trabajo Fin de Titulación (TFG). Universidad Politécnica de Madrid, ETSI Telecomunicación.

Abstract:
This bachelor’s thesis presents the development of an automated tool for identifying medical misinformation in audiovisual content shared on TikTok. In recent years, social networks have taken on a crucial role in disseminating health information, which has led to a proliferation of content that is inaccurate or potentially harmful to public health. Medical misinformation, particularly on highly visual, high-impact platforms such as TikTok, poses an urgent challenge for preventive medicine, digital education, and trust in health institutions. The main goal of the project was to develop a modular platform that combines natural language processing (NLP) tools, biomedical language models such as BioBERT, and lexical analysis and visualization methods to automatically determine whether a video contains health-related misinformation. The pipeline covers downloading the videos, automatically transcribing the audio, translating and normalizing the text, automatically classifying it, and interpreting the results. During development, methods such as statistical term analysis, word clouds, readability measures, and interactive visualizations such as Scattertext were used. Together, these made it possible to identify lexical and stylistic differences between verified and false content, including the use of emotional, confusing, or pseudoscientific language in the misinformative videos. In addition, a BioBERT-based model was developed for binary text classification; it achieved good performance despite the limited amount of labeled data in Spanish. The final system produces interpretable, exportable reports that are ready for review by specialists. The findings show the potential of Artificial Intelligence and language models specific to the medical field for fighting misinformation on emerging social networks. They also lay the groundwork for more robust future applications that incorporate visual components, multiple languages, and human validation.
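
As an illustration of the kind of statistical term analysis mentioned in the abstract, the following minimal sketch ranks terms by how strongly they lean toward one class of transcripts over the other. It is not the thesis's implementation: the abstract does not say which statistic was used, so the smoothed log-odds score here is an assumption, the function names tokenize and log_odds are hypothetical, and the toy transcripts are invented for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens (a deliberately simple normalizer)."""
    return re.findall(r"[a-záéíóúñü]+", text.lower())


def log_odds(reliable_docs, misinfo_docs, smoothing=1.0):
    """Smoothed log-odds of each term in misinformation vs. reliable transcripts.

    Positive scores lean toward the misinformation class, negative scores
    toward the reliable class. The choice of statistic is an assumption.
    """
    rel = Counter(t for d in reliable_docs for t in tokenize(d))
    mis = Counter(t for d in misinfo_docs for t in tokenize(d))
    vocab = set(rel) | set(mis)
    # Add-one style smoothing so unseen terms do not produce log(0).
    rel_total = sum(rel.values()) + smoothing * len(vocab)
    mis_total = sum(mis.values()) + smoothing * len(vocab)
    return {
        term: math.log((mis[term] + smoothing) / mis_total)
        - math.log((rel[term] + smoothing) / rel_total)
        for term in vocab
    }


# Toy transcripts invented for illustration only; not data from the thesis.
reliable = [
    "vaccines are tested in clinical trials",
    "consult a doctor before taking medication",
]
misinfo = [
    "this miracle cure detox secret doctors hide",
    "miracle detox cures everything",
]

scores = log_odds(reliable, misinfo)
top_misinfo_terms = sorted(scores, key=scores.get, reverse=True)[:3]
```

On real data, the highest- and lowest-scoring terms would be the natural candidates for a word cloud or a Scattertext-style plot, and the same token counts could feed a readability measure.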