Abstract:
Terrorism is currently one of society’s main concerns. One of the main difficulties
faced by organizations that try to prevent the formation of terrorist cells is
controlling a means of communication as extensive and open as the Internet. Additionally,
interaction between people on social networks has grown rapidly in recent years, producing
a volume of data so large that it can only be processed automatically. Therefore, the
main objective of this project is to use technology as an enabling tool to detect
and classify the texts cited by ISIS in its publications on social networks. This should
help speed up the detection of the group’s currents of thought, future forms of
action, and possible recruiting techniques.
To achieve this objective, we developed a machine learning
approach trained on a compilation of 2,685 religious texts cited by ISIS over a
three-year period. This dataset collects all the religious and ideological texts used in ISIS
English-language magazines. In addition, the model was tested on a second, similar problem:
ISIS Twitter posts with radical and non-radical content. To implement this classifier,
a software system based on natural language processing (NLP) techniques was developed
in the Python programming language. For feature extraction, we
studied the features that provided relevant information to the model given the type
of input data: part-of-speech (PoS) tags, LDA, word embeddings, embedding-based
similarity, and domain-specific word selection. To evaluate the proposed
methods properly, an extensive evaluation based on cross-validation was carried out.
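The cross-validated evaluation described above can be sketched as follows. This is a minimal, standard-library-only illustration and not the system developed in this project. The nearest-centroid bag-of-words classifier is a stand-in for the actual models, the function names (`stratified_folds`, `train_centroids`, `cross_validate`, and so on) are hypothetical, and the six toy sentences are invented, neutral placeholders for the real corpus.

```python
import math
from collections import Counter, defaultdict

# Invented toy data (assumption): neutral placeholder texts, not the real corpus.
TEXTS = [
    "rain and wind tomorrow",
    "sunny sky and warm wind",
    "cold rain tonight",
    "team wins the final match",
    "the match ended in a draw",
    "team scores late goal",
]
LABELS = ["weather", "weather", "weather", "sport", "sport", "sport"]


def bag_of_words(text):
    """Lowercase whitespace tokenization into a sparse term-count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def stratified_folds(labels, k):
    """Split sample indices into k folds, round-robin within each class,
    so every fold keeps roughly the overall class proportions."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for j, i in enumerate(indices):
            folds[j % k].append(i)
    return folds


def train_centroids(texts, labels):
    """Stand-in 'model': one summed bag-of-words vector per class."""
    centroids = defaultdict(Counter)
    for text, y in zip(texts, labels):
        centroids[y].update(bag_of_words(text))
    return centroids


def predict(centroids, text):
    """Return the class whose centroid is most similar to the text."""
    vec = bag_of_words(text)
    return max(sorted(centroids), key=lambda y: cosine(vec, centroids[y]))


def cross_validate(texts, labels, k=3):
    """Train on k-1 folds, score accuracy on the held-out fold, repeat k times."""
    scores = []
    for held_out in stratified_folds(labels, k):
        test_idx = set(held_out)
        train_x = [t for i, t in enumerate(texts) if i not in test_idx]
        train_y = [y for i, y in enumerate(labels) if i not in test_idx]
        model = train_centroids(train_x, train_y)
        correct = sum(predict(model, texts[i]) == labels[i] for i in held_out)
        scores.append(correct / len(held_out))
    return scores


if __name__ == "__main__":
    fold_scores = cross_validate(TEXTS, LABELS, k=3)
    print("per-fold accuracy:", fold_scores)
    print("mean accuracy:", sum(fold_scores) / len(fold_scores))
```

Because every sample is held out exactly once, the mean of the per-fold accuracies estimates performance on unseen texts. Stratifying the folds matters most when classes are few or unevenly represented, as in a small multi-class dataset.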
In conclusion, the scores obtained reached 81% and 92% for the two problems
analyzed during the project. The difference arises because the two cases differ
in complexity. The first had a small dataset (2250, 2) and five possible
classes. As a result, complex algorithms overfitted,
and linear classifiers were the ones that best fit these input data. The
second problem, by contrast, had a larger dataset (34708, 2) and a binary classification
task, which allowed highly complex algorithms to perform better.