PhD Thesis Final Defense to be held on December 19, 2019, at 15:00

Photo Credit: Antigoni Tsiami

The examination is open to anyone who wishes to attend (Room 1.1.1, Old ECE-NTUA Building).

Thesis Title: Audiovisual saliency modeling and multisensory auditory scene understanding

Abstract: The scope of this work is the investigation and development of a 2D computational audiovisual saliency model, based on behavioral findings, that can produce accurate human fixation predictions in a 2D audiovisual scene, i.e. in videos. The investigation is carried out in two different ways: with signal processing techniques and with deep learning techniques. Regarding the former, several fusion schemes between visual and auditory saliency models are investigated, and the resulting models are behaviorally validated through comparisons with results from behavioral experiments and evaluated with audiovisual human eye-tracking data and fMRI data. Results from both behavioral and eye-tracking experiments indicate that audiovisual saliency modeling indeed improves saliency estimation performance.
Regarding deep learning techniques, a new spatio-temporal audiovisual saliency network is developed that includes a visual saliency network, an audio representation network, a sound localization module, and an audiovisual saliency fusion module. All modules are integrated into a single network that is trained end-to-end. The network's performance is evaluated on several eye-tracking databases, and comparisons with other methods highlight the effectiveness of the presented network, which opens the way for estimating saliency "in the wild".

In parallel, research has been carried out in the direction of auditory scene understanding. Specifically, a speaker localization system has been developed, as well as a baseline distant speech recognition system in Greek and English and a speech understanding/dialog system. These systems have been adapted and applied to a smart home environment and/or to a multisensory human/child-robot interaction application. They are also evaluated through experiments on appropriate databases.

Finally, in addition to the development and evaluation of new algorithmic methods that address the above problems, an important contribution of this thesis is the collection of two new databases: an audiovisual human eye-tracking database with 20 subjects and 37 videos, and a multi-channel speech database in Greek with data from 20 speakers.

Supervisor: Petros Maragos, Professor

PhD student: Antigoni Tsiami