Design and Development of a Machine Learning Classifier of ISIS Terrorist Texts

Álvarez de Sotomayor Vergara, M. (2018). Design and Development of a Machine Learning Classifier of ISIS Terrorist Texts. Final Degree Project (TFG). Universidad Politécnica de Madrid, ETSIT, Madrid.

Abstract:

Terrorism is currently one of society’s main concerns. One of the main difficulties encountered by organizations that try to prevent the formation of terrorist cells is controlling a means of communication as extensive and open as the Internet. In addition, interaction on social networks has been increasing exponentially in recent years, producing an unmanageable amount of data that can only be processed automatically. The main objective of this project is therefore to use technology as an enabling tool to detect and classify the texts cited by ISIS in its publications on social networks. This can speed up the detection of currents of thought within the group, future forms of action, and possible recruiting techniques.

To achieve this objective, we developed a Machine Learning approach trained on a compilation of 2,685 religious texts cited by ISIS over a 3-year period. This dataset collects all the religious and ideological texts used in ISIS English-language magazines. In addition, the model was tested on a second, similar problem: ISIS Twitter posts with radical and non-radical content. To implement the classifier, we developed a software system in Python that uses natural language processing (NLP) techniques. For feature extraction, we studied the features that provided relevant information to the model given our type of input data: part-of-speech (PoS) tags, LDA, word embeddings, embedding-based similarity, and domain-specific word selection. The proposed methods were evaluated extensively using cross-validation.

In conclusion, the scores obtained reached 81% and 92% for the two problems analyzed during the project. The difference arises because the problems differ in complexity. The first had a small dataset (2250, 2) and five possible classes, so complex algorithms overfitted and linear classifiers were the ones that best fit the input data. The second had a larger dataset (34708, 2) and a binary classification, which allowed highly complex algorithms to perform better.
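
The cross-validated evaluation mentioned in the abstract can be illustrated with a minimal sketch. This is not the thesis code: the toy texts, the labels, and the helper names (`k_fold_indices`, `train_word_counts`, `predict`, `cross_validate`) are all hypothetical, and the word-count scorer is only a stand-in for a linear classifier over bag-of-words features. It uses the Python standard library only.

```python
from collections import Counter

def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def train_word_counts(texts, labels):
    """Accumulate bag-of-words counts per class (a toy linear model)."""
    counts = {}
    for text, label in zip(texts, labels):
        counts.setdefault(label, Counter()).update(text.lower().split())
    return counts

def predict(counts, text):
    """Score each class by summing its counts for the words in the text."""
    words = text.lower().split()
    # sorted() makes tie-breaking deterministic
    return max(sorted(counts), key=lambda c: sum(counts[c][w] for w in words))

def cross_validate(texts, labels, k=5):
    """Return one accuracy score per fold, training on the other k-1 folds."""
    scores = []
    for test_idx in k_fold_indices(len(texts), k):
        held_out = set(test_idx)
        train_texts = [t for i, t in enumerate(texts) if i not in held_out]
        train_labels = [y for i, y in enumerate(labels) if i not in held_out]
        model = train_word_counts(train_texts, train_labels)
        correct = sum(predict(model, texts[i]) == labels[i] for i in test_idx)
        scores.append(correct / len(test_idx))
    return scores

# Hypothetical placeholder data, unrelated to the thesis datasets.
texts = [
    "heavy rain and strong wind tonight",
    "the team won the final match",
    "sunny skies with a light wind",
    "the match ended in a draw",
    "rain showers expected tomorrow",
    "the team signed a new player",
    "cold wind and snow in the north",
    "the player scored in the match",
    "cloudy skies and rain by evening",
    "the coach praised the team",
]
labels = ["weather", "sport"] * 5

scores = cross_validate(texts, labels, k=5)
print("per-fold accuracy:", scores)
print("mean accuracy:", sum(scores) / len(scores))
```

Each sample is held out exactly once, so the mean score estimates performance on unseen text rather than on the training data. With scikit-learn, which the record lists among its keywords, the same pattern is usually written with `sklearn.model_selection.cross_val_score`.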

BibTeX:

@mastersthesis{design-gsi-maria-masterthesis-2018,
author = "{\'A}lvarez de Sotomayor Vergara, Mar{\'i}a",
abstract = "Terrorism is currently one of society’s main concerns. One of the main difficulties
encountered by organizations that try to prevent the formation of terrorist cells is controlling
a means of communication as extensive and open as the Internet. In addition, interaction on
social networks has been increasing exponentially in recent years, producing an unmanageable
amount of data that can only be processed automatically. The main objective of this project is
therefore to use technology as an enabling tool to detect and classify the texts cited by ISIS
in its publications on social networks. This can speed up the detection of currents of thought
within the group, future forms of action, and possible recruiting techniques.
To achieve this objective, we developed a Machine Learning approach trained on a compilation
of 2,685 religious texts cited by ISIS over a 3-year period. This dataset collects all the
religious and ideological texts used in ISIS English-language magazines. In addition, the model
was tested on a second, similar problem: ISIS Twitter posts with radical and non-radical
content. To implement the classifier, we developed a software system in Python that uses
natural language processing (NLP) techniques. For feature extraction, we studied the features
that provided relevant information to the model given our type of input data: part-of-speech
(PoS) tags, LDA, word embeddings, embedding-based similarity, and domain-specific word
selection. The proposed methods were evaluated extensively using cross-validation.
In conclusion, the scores obtained reached 81% and 92% for the two problems analyzed during
the project. The difference arises because the problems differ in complexity. The first had a
small dataset (2250, 2) and five possible classes, so complex algorithms overfitted and linear
classifiers were the ones that best fit the input data. The second had a larger dataset
(34708, 2) and a binary classification, which allowed highly complex algorithms to perform
better.",
address = "ETSIT, Madrid",
institution = "Universidad Polit{\'e}cnica de Madrid",
keywords = "isis;scikit-learn;machine-learning",
month = "July",
title = "{D}esign and {D}evelopment of a {M}achine {L}earning {C}lassifier of {ISIS} {T}errorist {T}exts",
type = "TFG",
year = "2018",
}