Design and Development of a Classification and Detection System of Malicious Emails with NLP

Guillermo Canete Riaza. (2021). Design and Development of a Classification and Detection System of Malicious Emails with NLP. Final Career Project. Universidad Politécnica de Madrid.

Abstract:
In recent years, Machine Learning has grown enormously, and Natural Language Processing (NLP) has become one of its most important fields. NLP is the part of machine learning that enables computers to process and interpret human language, which makes it especially useful for applications such as chatbots or sentiment analysis on social networks. The goal of this project is to build a program that classifies emails and determines whether they are spam or have malicious intent. This will be achieved with NLP by examining the source, destination, subject and main contents of each email.

Procedure:
First, the source and destination addresses will be compared against a publicly available database of untrusted email addresses. Afterwards, the subject and main contents will be analyzed using NLP. The process will be the following:
1. Data collection and labeling
2. Text preprocessing: removing punctuation, stemming and lemmatizing, removing stop words, tokenizing
3. Vectorizing
4. Clustering algorithms
5. Classification algorithms

Technologies:
The project will be programmed in Python, using PyCharm as the development environment and Anaconda to manage the libraries. These include, for instance, NLTK (Natural Language Toolkit) and Pandas.
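
Example:
The procedure described in the abstract can be illustrated with a minimal sketch. It is not the project's implementation: it uses only the Python standard library so that it runs without extra downloads, whereas the project itself relies on NLTK and Pandas. The blocklist entry, the stop-word list, the suffix-stripping stemmer, the training emails and all function and class names are hypothetical and chosen only for illustration. Naive Bayes is one possible classification algorithm, since the abstract does not name a specific one, and the clustering step is left out for brevity.

```python
import math
import string
from collections import Counter

# Hypothetical blocklist entry; the project consults a public database instead.
BLOCKED_DOMAINS = {"bad-domain.example"}

# Tiny illustrative stop-word list; NLTK provides a much fuller one.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and",
              "you", "your", "for", "at", "in", "on"}


def is_blocked(address):
    """Source/destination check: compare the address domain with the blocklist."""
    domain = address.rsplit("@", 1)[-1].lower()
    return domain in BLOCKED_DOMAINS


def stem(token):
    """Crude suffix stripping, a stand-in for a real stemmer or lemmatizer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Step 2: remove punctuation, tokenize, drop stop words, stem."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [stem(tok) for tok in cleaned.split() if tok not in STOP_WORDS]


def vectorize(tokens):
    """Step 3: bag-of-words vector stored as a token -> count mapping."""
    return Counter(tokens)


class NaiveBayes:
    """Step 5: multinomial Naive Bayes with add-one smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = {label: Counter() for label in self.class_counts}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(vectorize(preprocess(text)))
        self.vocab = set().union(*self.word_counts.values())
        return self

    def predict(self, text):
        vector = vectorize(preprocess(text))
        total_docs = sum(self.class_counts.values())
        scores = {}
        for label, counts in self.word_counts.items():
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(self.class_counts[label] / total_docs)
            for token, n in vector.items():
                score += n * math.log((counts[token] + 1) / denom)
            scores[label] = score
        return max(scores, key=scores.get)


def classify_email(model, sender, subject, body):
    """Blocklist check first, then text classification of subject and body."""
    if is_blocked(sender):
        return "blocked-sender"
    return model.predict(subject + " " + body)


# Step 1: a made-up, hand-labeled training set (illustration only).
TRAIN = [
    ("Win a free prize now, click the link", "spam"),
    ("Claim your free money, click here", "spam"),
    ("Meeting agenda for Monday attached", "legitimate"),
    ("Project report and meeting notes", "legitimate"),
]
model = NaiveBayes().fit([text for text, _ in TRAIN],
                         [label for _, label in TRAIN])
```

A call such as classify_email(model, "colleague@university.example", "Meeting notes", "for the project") runs the blocklist check first and only then classifies the text. In a fuller implementation, the stemmer, stop-word list and tokenizer would normally come from NLTK, and the labeled emails would be loaded into a Pandas DataFrame.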