Development of a Fake News Detection System using Machine Learning and Natural Language Processing Techniques

Beatriz Hernandez-Fonta Codesido. (2024). Development of a Fake News Detection System using Machine Learning and Natural Language Processing Techniques. Final Career Project (TFG). Universidad Politécnica de Madrid.

Abstract:
In recent years, technological innovation has greatly increased the number of information sources. In today’s highly connected society, news spreads quickly and anybody can publish a news article. This ease has facilitated the growth of so-called “fake news”, to the point where, when confronted with a piece of news, we doubt its authenticity. Publishing fake news leads to misinformation and poor decision making, and it can be seriously harmful, especially when it concerns topics such as health, politics or religion, or when it spreads bad press about a particular company or person. This is also known as disinformation, and large-scale disinformation campaigns have become a major challenge for Europe; as a result, the European Commission has developed numerous initiatives to tackle them. The main objective of this project is to explore the use of machine learning models to detect “fake news”. Such models can help identify “fake news” and stop it from spreading, improving the digital environment by making it a more reliable source of information and a safer place to research and acquire knowledge. To make this possible, data from different sources containing news already labelled as fake or not fake was collected. The gathered data was first preprocessed using natural language processing techniques such as stop-word and punctuation removal, lemmatization and tokenization. Next, different data representations were produced and, finally, various models were built and trained using different machine learning techniques in order to achieve the best possible accuracy. These procedures were carried out in Python with several of its libraries, mainly Scikit-Learn. Scikit-Learn is a library that provides multiple machine learning algorithms; it is built upon SciPy and uses NumPy for arrays, while pandas was used for data analysis.
In addition, Hugging Face was used to implement the Transformer-based model BERT.
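
As an illustration of the pipeline outlined in the abstract (preprocessing, data representation, model training), the following is a minimal sketch in Python with Scikit-Learn. It is not the project's actual code. The four example headlines and their labels are invented for illustration, TF-IDF and logistic regression are assumed here as one possible representation and classifier, and lemmatization is omitted because it would require an additional NLP library.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy, invented examples (hypothetical data, not from the project).
# Label 1 = fake, label 0 = real.
texts = [
    "Miracle pill cures every disease overnight, doctors stunned!",
    "Secret memo proves the election was decided by aliens",
    "City council approves budget for new public library",
    "Central bank keeps interest rates unchanged this quarter",
]
labels = [1, 1, 0, 0]

# TfidfVectorizer lowercases the text, tokenizes it (its default token
# pattern drops punctuation) and removes English stop words before
# building a TF-IDF representation. Lemmatization is NOT performed here.
model = make_pipeline(
    TfidfVectorizer(lowercase=True, stop_words="english"),
    LogisticRegression(),
)
model.fit(texts, labels)

# Classify an unseen headline; the result is an array with one label.
new_headline = ["Aliens stunned doctors with a miracle pill"]
predictions = model.predict(new_headline)
probabilities = model.predict_proba(new_headline)
print(predictions, probabilities)
```

With a real labelled corpus, the same pipeline object could be evaluated on a held-out test split to compare representations and classifiers by accuracy; a dataset of four sentences is only enough to show the mechanics.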