PhD thesis defense to be held on June 14, 2023, at 12:00 (Multimedia Amphitheater, NTUA Central Library)

Picture Credit: Alkis Kalavasis

Thesis title: Algorithm Design for Reliable Machine Learning

Abstract: In this thesis, we theoretically study questions in the area of Reliable Machine Learning in order to design algorithms that are robust to bias and noise (Robust Machine Learning) and satisfy societal desiderata such as privacy and reproducibility (Responsible Machine Learning).

In the area of Robust Machine Learning, we design computationally efficient algorithms for problems in the fields of Truncated Statistics, Censored Statistics and Robust Statistics. In particular, we provide the first known efficient methods for truncated distribution learning in discrete settings and perfect data sampling from truncated data. Next, we study the fundamental problem of learning from partial/coarse labels. Our main algorithmic result is that essentially any problem learnable from fine grained labels can also be learned efficiently when the coarse data are sufficiently informative. We obtain our result through a generic reduction for answering Statistical Queries (SQ) over fine grained labels given only coarse labels. We also study the central problem in Censored Statistics of Gaussian mean estimation from coarse data. Finally, we consider the problem of learning linear sorting functions in the presence of bounded noise, a problem that generalizes the problem of learning halfspaces with Massart noise.

In the area of Responsible Machine Learning, we study the notion of replicability as an algorithmic property and introduce the notion of replicable policies in the context of stochastic bandits, one of the canonical problems in interactive learning. We show that not only do replicable policies exist, but also they achieve almost the same optimal (non-replicable) regret bounds in terms of the time horizon. Lastly, we establish information-theoretic equivalences between notions of algorithmic stability such as replicability and approximate differential privacy. We do so by focusing on the following question: When two different parties use the same learning rule on their own data, how can we test whether the distributions of the two outcomes are similar? We study the similarity of outcomes of learning rules through the lens of the Total Variation (TV) distance of distributions. We say that a learning rule is TV indistinguishable if the expected TV distance between the posterior distributions of its outputs, executed on two training data sets drawn independently from the same distribution, is small. We first investigate the learnability of hypothesis classes using TV indistinguishable learners. Our main results are information-theoretic equivalences between TV indistinguishability and existing algorithmic stability notions such as replicability and approximate differential privacy.

Supervisor: Professor Dimitris Fotakis

PhD Student: Alkis Kalavasis