Instituto de Investigación
en Matemáticas

Seminario de Doctorado
Seminario de Doctorado

Fair Learning: an optimal transport based approach

Paula Gordaliza (Universidad de Valladolid y Université Paul Sabatier de Toulouse)

Fecha: 04/09/2020 12:50
Lugar: Presentación virtual webex

The aim of this thesis is two-fold. On the one hand, optimal transportation methods are studied for statistical inference purposes. On the other hand, the recent problem of fair learning is addressed through the prism of optimal transport theory. The generalization of applications based on machine learning models in the everyday life and the professional world has been accompanied by concerns about the ethical issues that may arise from the adoption of these technologies. In the first part of the thesis, we motivate the fairness problem by presenting some comprehensive results from the study of the statistical parity criterion through the analysis of the disparate impact index on the real and well-known Adult Income dataset. Importantly, we show that trying to make fair machine learning models may be a particularly challenging task, especially when the training observations contain bias. Then a review of Mathematics for fairness in machine learning is given in a general setting, with some novel contributions in the analysis of the price for fairness in regression and classification. In the latter, we finish this first part by recasting the links between fairness and predictability in terms of probability metrics. We analyze repair methods based on mapping conditional distributions to the Wasserstein barycenter. Finally, we propose a random repair which yields a tradeoff between minimal information loss and a certain amount of fairness. The second part is devoted to the asymptotic theory of the empirical transportation cost. We provide a Central Limit Theorem for the Monge-Kantorovich distance between two empirical distributions with different sizes $n$ and $m$, $W_p(P_n, Q_m) , p \geq 1,$ for observations on $\mathbb{R}$. In the case $p>1$ our assumptions are sharp in terms of moments and smoothness. We prove results dealing with the choice of centering constants. We provide a consistent estimate of the asymptotic variance which enables to build two sample tests and confidence intervals to certify the similarity between two distributions. These are then used to assess a new criterion of data set fairness in classification. Additionally, we provide a moderate deviation principle for the empirical transportation cost in general dimension. Finally, Wasserstein barycenters and variance-like criterion using Wasserstein distance are used in many problems to analyze the homogeneity of collections of distributions and structural relationships between the observations. We propose the estimation of the quantiles of the empirical process of the Wasserstein’s variation using a bootstrap procedure. Then we use these results for statistical inference on a distribution registration model for general deformation functions. The tests are based on the variance of the distributions with respect to their Wasserstein’s barycenters for which we prove central limit theorems, including bootstrap versions.