Trustworthy Machine Learning with Deep Generative Models

Date

2024-09-23

Advisor

Yu, Yaoliang
Sun, Sun

Publisher

University of Waterloo

Abstract

The past decade has witnessed remarkable progress in machine learning (ML) across diverse domains, especially in deep generative models (DGMs) trained on high-dimensional data. However, as ML systems become integral to critical and sensitive applications, ensuring their trustworthiness, beyond mere performance, becomes paramount. Trustworthy machine learning encompasses a suite of principles and practices aimed at enhancing the reliability and safety of ML systems. This thesis investigates the intersection of trustworthiness and deep generative models, focusing on two pivotal aspects: out-of-distribution (OOD) detection and privacy preservation.

Generative models serve purposes beyond generating realistic samples. Likelihood-based DGMs, e.g., flow generative models, can additionally compute the likelihood of input data, which yields an unsupervised OOD detector by thresholding the likelihood values. However, such models have been found to occasionally assign higher likelihoods to OOD data than to in-distribution (InD) data, raising concerns about their reliability in OOD detection. This thesis presents new insights demonstrating that flow generative models can reliably detect OOD data by exploiting their bijectivity. The proposed approach compares two high-dimensional distributions in the latent space; to manage the high dimensionality, we extend a univariate statistical test, e.g., the Kolmogorov-Smirnov (KS) test, to higher dimensions using random projections. Our method demonstrates robustness across various datasets and data modalities.

The second focus of this thesis is privacy preservation in DGMs. Generative models can also be seen as proxies for publishing their training data, and there is growing interest in ensuring privacy preservation, beyond generation fidelity, in modern generative models. Differentially private generative models (DPGMs) offer a solution: they protect the privacy of individual records in the training set. Existing methods either apply the workhorse algorithm of differentially private deep learning, differentially private stochastic gradient descent (DP-SGD), to DGMs, or use kernel methods that make the maximum mean discrepancy (MMD) objective differentially private. However, DP-SGD methods suffer from high training costs and scale poorly to stronger differential privacy (DP) guarantees, while kernel methods are prone to mode collapse in generation. We propose novel methods that sidestep these challenges for each family. To reduce the training cost and improve scalability to small privacy budgets under DP-SGD, we train a flow generative model in a lower-dimensional latent space, which significantly reduces the model size and thereby avoids unnecessary computation in the full pixel space. This design is more resilient to large noise perturbations and makes ours the first work to present high-resolution image generation under DP constraints. To improve the utility of MMD methods, we make the MMD objective differentially private without truncating the reproducing kernel Hilbert space (RKHS). To do so, we extend Rényi differential privacy (RDP) from vectors to functions, and then provide the building blocks needed to use it in practice, e.g., a Gaussian mechanism, composition theorems, and a post-processing theorem.
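As a rough illustration of this functional mechanism, the Python sketch below perturbs the empirical kernel mean embedding of private data, evaluated at a set of query points, with Gaussian-process noise whose covariance is the kernel itself. The function names, the fixed noise scale sigma, and the RBF kernel choice are illustrative assumptions; the thesis's exact construction, sensitivity analysis, and RDP calibration of sigma are not reproduced here.

    import numpy as np

    def rbf_kernel(A, B, gamma=1.0):
        # Gaussian (RBF) kernel matrix: k(a_i, b_j) = exp(-gamma * ||a_i - b_j||^2)
        sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
        return np.exp(-gamma * sq)

    def private_mean_embedding(X, queries, sigma, gamma=1.0, rng=None):
        # Empirical kernel mean embedding of the private data X, evaluated at
        # `queries`, plus Gaussian-process noise with covariance sigma^2 * k.
        # In a real mechanism, sigma is calibrated from the RDP budget and the
        # embedding's sensitivity; that calibration is assumed done elsewhere.
        rng = np.random.default_rng(0) if rng is None else rng
        mu = rbf_kernel(X, queries, gamma).mean(axis=0)       # \hat{mu}(q_j)
        cov = sigma**2 * rbf_kernel(queries, queries, gamma)  # GP covariance
        cov += 1e-9 * np.eye(len(queries))                    # numerical jitter
        return mu + rng.multivariate_normal(np.zeros(len(queries)), cov)

Because a noisy embedding released this way can be reused, subsequent generator updates that match it under an MMD loss are post-processing and consume no additional privacy budget.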
The resulting method consistently outperforms state-of-the-art methods by a large margin, especially under stronger DP guarantees. This thesis is expected to provide new insights into the application of flow generative models to OOD detection, highlight practical challenges in training generative models with DP-SGD on high-dimensional datasets, bridge the gap between RDP and functional mechanisms, and expand the family of DPGMs.
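To make the OOD procedure concrete, below is a minimal sketch of the random-projection idea: project a batch of flow latents onto random unit directions, KS-test each one-dimensional projection against the flow's standard-normal base distribution, and take the largest statistic as the batch's OOD score. The one-sample test against N(0, 1), the number of projections, and the max aggregation are simplifying assumptions of this sketch, not necessarily the thesis's exact statistic.

    import numpy as np
    from scipy.stats import kstest

    def projected_ks_score(Z, n_proj=256, rng=None):
        # Z: (n, d) latent codes produced by the flow for a test batch.
        # For InD data, any unit projection of z ~ N(0, I_d) is N(0, 1),
        # so a large KS statistic along some direction signals OOD.
        rng = np.random.default_rng(0) if rng is None else rng
        V = rng.standard_normal((Z.shape[1], n_proj))
        V /= np.linalg.norm(V, axis=0, keepdims=True)   # random unit directions
        proj = Z @ V                                    # (n, n_proj) projections
        return max(kstest(proj[:, j], "norm").statistic for j in range(n_proj))

A threshold calibrated on held-out InD batches then turns this score into a detector. Similarly, the single DP-SGD update below illustrates why training in a smaller latent space helps: per-example gradients are clipped, averaged, and perturbed with Gaussian noise whose scale grows with the clipping norm, so a smaller model spreads the same noise over far fewer coordinates. This is the textbook DP-SGD step, not the thesis's full training pipeline; clip_norm and noise_mult are the usual hyperparameters.

    import numpy as np

    def dp_sgd_update(params, per_example_grads, clip_norm, noise_mult, lr, rng=None):
        # per_example_grads: (batch, n_params) flattened per-example gradients.
        rng = np.random.default_rng(0) if rng is None else rng
        norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
        clipped = per_example_grads * np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
        noise = rng.standard_normal(params.shape) * noise_mult * clip_norm
        return params - lr * (clipped.mean(axis=0) + noise / per_example_grads.shape[0])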

Keywords

deep generative models, out-of-distribution detection, differential privacy, flow generative models, kernel methods
