Publication - Development of an Event Detector in Twitter Streams based on Mention-Anomaly Detection for the City of Madrid

Luis Cristóbal López García. (2019). Development of an Event Detector in Twitter Streams based on Mention-Anomaly Detection for the City of Madrid. Bachelor's thesis (Trabajo Fin de Grado, TFG). Universidad Politécnica de Madrid, ETSI Telecomunicación.

Abstract:

Event detection was a field of research long before social networks reached the impact they have today. Events were tracked from traditional news websites, blogs and other information channels. However, the emergence of microblogging as a form of social media changed this landscape.

In this project we have developed a system capable of detecting the most important events that occur in a city by analyzing data published on social networks. To do so, we have adapted and improved an existing clustering approach named MABED, which relies on the number of interactions between users to measure impact. Our main contributions to this model have been to improve the accuracy of its impact algorithm and to provide a new definition of redundancy that leads to better performance on duplicated events.

The social network our detector reads is Twitter, considered a valuable source of what is known as Social Data. Information is provided by short documents posted by users, called tweets. These publications are collected by our Streamer, which gathers posts that have just been published in the city of Madrid.

In addition to the clustering component, we have also developed an architecture that turns our project into a complete system. The Streamer is in charge of collecting the data that we feed to our detector. However, the data first passes through a preprocessing module, which filters out spam and lemmatizes the text in order to achieve better performance. Once the detection task is finished, the results are saved in a persistence subsystem. These results are finally visualized in a dashboard, which interacts with the user and facilitates the cognitive process of the performed analysis. The whole data flow is supervised by an orchestrator, which ensures the correct interaction between modules.

The process just described is repeated every half hour, showing the three events with the highest impact that took place in the city of Madrid in the last 24 hours.
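The final step of the abstract (reporting the three highest-impact events of the last 24 hours) can be illustrated with a minimal sketch. This is not the thesis implementation: the `Event` record, its field names, the `top_events` helper and the sample data are all hypothetical, and the impact scores are assumed to have been computed already by the detector.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Event:
    # Hypothetical record; field names are illustrative, not taken from the thesis.
    label: str
    impact: float          # assumed to be produced by the detector
    detected_at: datetime


def top_events(events, now, window=timedelta(hours=24), k=3):
    """Return the k highest-impact events detected within `window` before `now`."""
    cutoff = now - window
    recent = [e for e in events if cutoff <= e.detected_at <= now]
    return sorted(recent, key=lambda e: e.impact, reverse=True)[:k]


# Made-up sample data, for illustration only.
now = datetime(2019, 6, 1, 12, 0)
events = [
    Event("concert", 8.5, now - timedelta(hours=2)),
    Event("football match", 9.1, now - timedelta(hours=5)),
    Event("demonstration", 7.2, now - timedelta(hours=20)),
    Event("festival", 9.9, now - timedelta(hours=30)),   # outside the 24-hour window
    Event("traffic jam", 3.0, now - timedelta(hours=1)),
]

for event in top_events(events, now):
    print(event.label, event.impact)
```

In a setup like the one described, an orchestrator could call such a function once per half-hour cycle, after the detector has stored its results.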

BibTeX:

@mastersthesis{development-gsi-mastersthesis-20187,
author = "L{\'o}pez Garc{\'i}a, Luis Crist{\'o}bal",
abstract = "Event detection was a field of research long before social networks reached the
impact they have today. Events were tracked from traditional news websites, blogs and
other information channels. However, the emergence of microblogging as a form of social
media changed this landscape.
In this project we have developed a system capable of detecting the most important
events that occur in a city by analyzing data published on social networks. To do so, we have
adapted and improved an existing clustering approach named MABED, which relies
on the number of interactions between users to measure impact. Our main contributions
to this model have been to improve the accuracy of its impact algorithm and to provide a new
definition of redundancy that leads to better performance on duplicated events.
The social network our detector reads is Twitter, considered a valuable source of what is
known as Social Data. Information is provided by short documents posted by users,
called tweets. These publications are collected by our Streamer, which gathers posts that have
just been published in the city of Madrid.
In addition to the clustering component, we have also developed an architecture that turns
our project into a complete system. The Streamer is in charge of collecting the data that we
feed to our detector. However, the data first passes through a preprocessing module, which
filters out spam and lemmatizes the text in order to achieve better performance. Once the
detection task is finished, the results are saved in a persistence subsystem. These results are
finally visualized in a dashboard, which interacts with the user and facilitates the cognitive
process of the performed analysis. The whole data flow is supervised by an orchestrator, which
ensures the correct interaction between modules.
The process just described is repeated every half hour, showing the three events with
the highest impact that took place in the city of Madrid in the last 24 hours.",
address = "ETSI Telecomunicaci{\'o}n",
school = "Universidad Polit{\'e}cnica de Madrid",
keywords = "event detection;twitter;machine learning;natural language processing",
month = "June",
title = "{D}evelopment of an {E}vent {D}etector in {T}witter {S}treams based on {M}ention-{A}nomaly {D}etection for the {C}ity of {M}adrid",
type = "TFG",
year = "2019",
}