PhD thesis defense to be held on December 12, 2022, at 12:00 (Virtually in MS Teams)

Picture Credit: Panagiotis Kouris

Thesis title: Automatic Text Summarization: Machine Learning and Semantic Techniques

Abstract: The constantly growing amount of textual information has led to the development of automatic text summarization, which constitutes an important research area in natural language processing. Current research in this field focuses mainly on developing machine learning approaches, in most cases without considering the combination of machine learning models with other natural language processing techniques that could contribute to further improvement. In view of this research gap, the present dissertation focuses on abstractive text summarization of single documents, examining deep learning architectures and presenting new methodologies that combine machine learning and semantic-based techniques in order to improve automatic text summarization.

The contributions of the dissertation include: (i) the investigation of deep learning architectures for automatic text summarization, (ii) a novel methodology based on semantic content transformations and machine learning that addresses the problem of handling new content that is insufficiently represented in the training set of a machine learning model, (iii) a new framework that combines a methodology for semantic content representation with deep learning, towards the production of summaries with semantic content relevance, and (iv) a set of metrics that provides a qualitative assessment of the generated summaries.

The first part covers the investigation of a range of deep learning architectures for estimating a sequence of words that composes the summary of an original text. These architectures include encoder-decoder recurrent neural networks, reinforcement learning, transformer-based architectures, and pre-trained neural language models.

The second part presents a novel framework that is based on semantic content transformations along with machine learning predictions. The proposed framework is capable of dealing with the problem of out-of-vocabulary or rare words, improving the performance of the deep learning models. The framework is composed of three components: a pre-processing task, a machine learning methodology, and a post-processing task. The pre-processing task is based on a well-defined theoretical model of semantic-based content generalization, which utilizes ontological knowledge resources, word-sense disambiguation, and named-entity recognition to transform ordinary text into a generalized form. A range of deep learning models is trained on a generalized version of text-summary pairs, learning to predict summaries in a generalized form. The post-processing task utilizes knowledge resources, word embeddings, word-sense disambiguation, and heuristic algorithms based on text similarity methods to transform the generalized version of a predicted summary into a final, human-readable form.
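The generalize-then-restore idea described above can be illustrated with a minimal sketch. It is not the dissertation's implementation: the toy taxonomy HYPERNYMS, the functions generalize and specialize, the <concept> placeholder format, and the frequency threshold are all hypothetical, and a simple dictionary lookup stands in for the ontology, word-sense disambiguation, and named-entity recognition steps.

```python
from collections import Counter

# Hypothetical toy taxonomy (word -> more general concept). A real system
# would consult an ontological resource after word-sense disambiguation
# and named-entity recognition; a plain dictionary stands in for that here.
HYPERNYMS = {"lasagna": "food", "risotto": "food", "oslo": "city", "lima": "city"}


def generalize(tokens, freq, min_count=2):
    """Pre-processing: replace rare words by a placeholder for their concept."""
    out = []
    for tok in tokens:
        if freq[tok] < min_count and tok in HYPERNYMS:
            out.append("<" + HYPERNYMS[tok] + ">")
        else:
            out.append(tok)
    return out


def specialize(summary_tokens, source_tokens):
    """Post-processing: map each placeholder back to a matching source word."""
    # Collect, per concept, the source words that were generalized to it.
    candidates = {}
    for tok in source_tokens:
        concept = HYPERNYMS.get(tok)
        if concept is not None:
            candidates.setdefault("<" + concept + ">", []).append(tok)
    out = []
    for tok in summary_tokens:
        if candidates.get(tok):
            # Assumed heuristic: reuse source words in their original order.
            out.append(candidates[tok].pop(0))
        else:
            out.append(tok)
    return out


# Word counts from an assumed training corpus; "lasagna" and "oslo" are unseen.
freq = Counter("she cooked dinner in the city and she cooked again".split())
source = "she cooked lasagna in oslo".split()

generalized = generalize(source, freq)
restored = specialize(["she", "cooked", "<food>"], source)
```

In this sketch the summarization model would be trained and queried on text such as generalized, so it only ever sees the placeholder rather than the rare word; specialize then turns a predicted generalized summary back into readable text.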

The third part presents a novel approach that combines semantic graph representations with deep learning predictions to generate abstractive summaries of single documents, in an effort to utilize a semantic representation of the unstructured textual content in a machine-readable, structured, and concise manner. The main contributions of this approach include: the graph-to-summary formulation of the problem of abstractive text summarization using deep learning techniques, the examination of a range of deep learning models, and the investigation of semantic graph-based representation schemes. The overall framework is based on a well-defined methodology for performing semantic graph parsing, graph construction, graph transformations for machine learning models, and deep learning predictions. This approach organizes unstructured textual information through a semantic representation of the content, in an effort to improve machine learning predictions and provide semantically relevant summaries.
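As a rough illustration of the graph construction and graph transformation steps, the sketch below builds a small semantic graph from relation triples and linearizes it into a token sequence that a sequence-to-sequence model could consume. Everything in it is an assumption made for illustration: the triple format, the functions build_graph and linearize, and the bracketed output notation are hypothetical and are not taken from the dissertation, which may use a different representation scheme.

```python
from collections import defaultdict


def build_graph(triples):
    """Build an adjacency list from (head, relation, tail) triples."""
    graph = defaultdict(list)
    for head, relation, tail in triples:
        graph[head].append((relation, tail))
    return graph


def linearize(graph, root, seen=None):
    """Depth-first linearization of the graph into a flat token list.

    A node that was already visited is emitted by name only, so shared
    nodes and cycles do not cause repeated expansion.
    """
    seen = set() if seen is None else seen
    if root in seen:
        return [root]
    seen.add(root)
    tokens = ["(", root]
    for relation, tail in graph.get(root, []):
        tokens.append(":" + relation)
        tokens.extend(linearize(graph, tail, seen))
    tokens.append(")")
    return tokens


# Hypothetical output of a semantic parser for one sentence.
triples = [
    ("announce", "agent", "company"),
    ("announce", "theme", "merger"),
    ("merger", "partner", "company"),
]
graph = build_graph(triples)
sequence = " ".join(linearize(graph, "announce"))
```

A flat sequence like this is one simple way to feed a graph to a standard encoder-decoder model; models that operate on the graph structure directly are another option.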

Another important contribution is the introduction of a set of metrics for assessing the factual consistency of the generated summaries, in an effort to provide a qualitative evaluation. These metrics provide a weighted evaluation value, according to the length of the original text and of the system summary, by determining the semantic overlap between the information contained in the generated summary and the original text. The new set of metrics can contribute to the evaluation and improvement of automatic text summarization systems.

The approaches presented were theoretically defined, implemented, and experimentally investigated. Taken together, the novel methodologies, the positive results, and the conclusions drawn may contribute to further improvement of intelligent systems in the field of automatic text summarization.

Supervisor: Professor Giorgos Stamou

PhD Student: Panagiotis Kouris