PhD Thesis Final Defense to be held on October 3, 2018 at 13:00


The examination is open to anyone who wishes to attend (Central Library of NTUA, Room 0.2).

Thesis Title: Text Classification Using the N-Gram Graph Representation Model Over High Frequency Data Streams and applications in Social Media.


A prominent challenge in our information age is classification over high-frequency data streams. In this research, we propose an innovative and highly accurate text stream classification model that is designed in an elastic, distributed way and is capable of serving text loads whose frequency fluctuates. In this classification model, text is represented as N-Gram Graphs, and classification is performed using text preprocessing, graph similarity and feature classification techniques, following the supervised machine learning approach.
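
As an illustration of the representation, the following minimal sketch builds a character N-Gram Graph and compares two graphs. The abstract does not fix these details, so everything here is an assumption: the n-gram length, the window size, the two similarity formulas (containment and value similarity, as commonly defined in the n-gram graph literature), and all function names. The final step simply picks the most similar class graph, which stands in for the feature classification stage and is not the thesis's classifier.

```python
from collections import Counter


def ngram_graph(text, n=3, window=3):
    """Map each directed edge (gram, later_gram) to its co-occurrence count.

    Assumption: character n-grams, each linked to the next `window` n-grams.
    """
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    edges = Counter()
    for i, gram in enumerate(grams):
        for neighbour in grams[i + 1:i + 1 + window]:
            edges[(gram, neighbour)] += 1
    return edges


def containment_similarity(g1, g2):
    """Share of common edges, relative to the smaller graph."""
    smaller = min(len(g1), len(g2))
    if smaller == 0:
        return 0.0
    shared = sum(1 for edge in g1 if edge in g2)
    return shared / smaller


def value_similarity(g1, g2):
    """Like containment, but each common edge counts by its weight ratio."""
    larger = max(len(g1), len(g2))
    if larger == 0:
        return 0.0
    total = sum(
        min(weight, g2[edge]) / max(weight, g2[edge])
        for edge, weight in g1.items()
        if edge in g2
    )
    return total / larger


def merge_graphs(graphs):
    """Build a class graph by summing the edge weights of its training graphs."""
    merged = Counter()
    for graph in graphs:
        merged.update(graph)
    return merged


def classify(text, class_graphs):
    """Hypothetical stand-in classifier: return the most similar class label."""
    graph = ngram_graph(text)
    return max(class_graphs, key=lambda label: value_similarity(graph, class_graphs[label]))


class_graphs = {
    "sports": merge_graphs([ngram_graph("the match ended with a late goal")]),
    "tech": merge_graphs([ngram_graph("the server crashed after the update")]),
}
label = classify("a late goal won the match", class_graphs)
```

In a fuller setup, the similarities between a document graph and every class graph would form a feature vector passed to a trained classifier, rather than being reduced to a single maximum.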

The work involves the analysis of many variations of the proposed model and its parameters, such as various representations of text as N-Gram Graphs, graph comparison metrics and classification methods, in order to arrive at the most accurate setup. To deal with scalability, availability and timely response under high-frequency text, we employ the Beam programming model. Under the Beam programming model, the classification process occurs as a sequence of distinct tasks, which facilitates the distributed implementation of the most computationally demanding tasks of the inference stage. The proposed model and the various parameters that constitute it are evaluated experimentally, and the high-frequency stream is emulated using many datasets that are commonly used in the literature for text classification.
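
The idea of classification as a sequence of distinct tasks can be sketched as follows. This is not Beam code: it uses only the Python standard library, with a thread pool standing in for distributed workers, and it replaces graph similarity with a plain trigram-set overlap to stay short. The stage names, the profiles and the scoring rule are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def preprocess(text):
    """Stage 1: lowercase and collapse whitespace."""
    return " ".join(text.lower().split())


def trigram_set(text):
    """Stage 2: turn text into a set of character trigrams (simplified features)."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def score(trigrams, profiles):
    """Stage 3: Jaccard overlap between the text and each class profile."""
    def jaccard(a, b):
        union = a | b
        return len(a & b) / len(union) if union else 0.0

    return {label: jaccard(trigrams, profile) for label, profile in profiles.items()}


def decide(scores):
    """Stage 4: pick the label with the highest score."""
    return max(scores, key=scores.get)


def classify_stream(texts, profiles, workers=4):
    """Run each stage over all items; the pool stands in for distributed workers."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        cleaned = pool.map(preprocess, texts)
        features = pool.map(trigram_set, cleaned)
        scores = pool.map(lambda t: score(t, profiles), features)
        return [decide(s) for s in scores]


profiles = {
    "sports": trigram_set("late goal in the match"),
    "tech": trigram_set("server update crashed"),
}
labels = classify_stream(["Late GOAL  wins the match", "the server crashed"], profiles)
```

Because each stage is a pure function of its input, any stage can be moved to more workers without changing the others, which is the property a dataflow model of this kind relies on.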

The model we propose touches on many research fields, and it is worth mentioning how each of them relates to our work. Text categorisation is a research topic that lies in the scientific fields of machine learning and natural language processing, while high-frequency data streams belong to the field of big data. To serve big data efficiently and effectively, we need the computing infrastructures proposed by the scientific field of cloud computing. Finally, the text categorisation applications are used to address challenges in the discipline of social network analysis.

We discuss how natural language processing techniques are used to categorise, cluster and retrieve texts. The techniques are presented in chronological order to show the evolution of researchers' approaches and how each proposed technique solves problems of, or improves on, the previous ones. We present the properties that a categorisation or clustering must satisfy to be considered good, as well as a set of metrics that quantify the accuracy of a categorisation according to these properties. We also present a method of conducting categorisation experiments that applies these metrics. This is the evaluation method used in all the experimental sets presented in the following sections.
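
The abstract does not name the metrics it uses. As an assumed example, the sketch below computes per-class precision, recall and F1, plus accuracy and macro-averaged F1, which are standard choices for evaluating text categorisation. The function names and the toy labels are illustrative only.

```python
def per_class_scores(y_true, y_pred):
    """Return {label: (precision, recall, f1)} for every label seen."""
    pairs = list(zip(y_true, y_pred))
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1)
    return scores


def accuracy(y_true, y_pred):
    """Fraction of items whose predicted label matches the true label."""
    if not y_true:
        return 0.0
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_scores(y_true, y_pred)
    if not scores:
        return 0.0
    return sum(f1 for _, _, f1 in scores.values()) / len(scores)


y_true = ["a", "a", "b", "b"]
y_pred = ["a", "b", "b", "b"]
scores = per_class_scores(y_true, y_pred)
```

Macro averaging weights every class equally, so it exposes poor performance on rare classes that overall accuracy can hide.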

A text categorisation method and a text clustering method that use the N-Gram Graph representation model are presented in two separate sections. A number of social networking topics are presented, and we propose that the text categorisation model that uses the N-Gram Graph representation model provides efficient solutions to them. We evaluate our model experimentally and find that it often outperforms other state-of-the-art methods. The social networking applications to which the proposed model is applied are topic community detection, event detection, sentiment analysis and recommendation systems.

Supervisor: Varvarigou Theodora, Professor

PhD student: Violos John