Towards Honest, Practicable and Efficient Private Learning

Advisor

He, Xi

Publisher

University of Waterloo

Abstract

Protecting our personal information is a major challenge in today's data-driven world. When scientists and companies analyze large datasets, they need a way to ensure that individual privacy is not compromised. This thesis focuses on Differential Privacy, a rigorous mathematical guarantee that places a strict, verifiable limit on how much personal information can be leaked, even against a worst-case attacker. Researchers have developed sophisticated algorithms that accomplish useful tasks, such as building machine learning models or generating realistic synthetic data, while maintaining Differential Privacy. Crucially, these operations must be conducted within a predetermined, strict limit, often referred to as the "privacy budget." This budget mathematically quantifies the total acceptable loss of privacy for the entire process, enforcing a fundamental trade-off between data utility and individual protection. All routine procedures of the machine learning pipeline, including data cleaning, hyperparameter tuning, and model training, must be performed within this budget. Several tools can perform these tasks in isolation when the dataset is not private. However, these tools do not translate easily to the differentially private setting and often do not account for cumulative privacy costs. In this thesis, we explore pragmatic problems that a data science practitioner may face when deploying a differentially private learning framework, from data collection to model training. In particular, we are interested in real-world data quality problems, such as missing data, inconsistent data, and incorrectly labeled data, as well as machine learning pipeline requirements such as hyperparameter tuning.
We envision building a general-purpose private learning framework that can take real data as input and can be used for learning tasks such as training a highly accurate private machine learning model or creating a synthetic version of the dataset, with end-to-end differential privacy guarantees. We hope this work will make differentially private learning more accessible to data science practitioners and easily deployable in day-to-day applications.
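The "privacy budget" idea described above can be illustrated with the classic Laplace mechanism and basic sequential composition: each query spends some epsilon, and the total spent across all queries may not exceed the budget. The following is a minimal illustrative sketch, not code from the thesis; the `PrivacyBudget` class and `private_count` function are hypothetical names chosen for this example.

```python
import math
import random


def laplace_noise(scale):
    # Sample from Laplace(0, scale) via the inverse-CDF method.
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))


class PrivacyBudget:
    """Tracks cumulative epsilon spent under basic sequential composition."""

    def __init__(self, total_epsilon):
        self.total = total_epsilon
        self.spent = 0.0

    def spend(self, epsilon):
        # Refuse any query that would push total spending past the budget.
        if self.spent + epsilon > self.total:
            raise RuntimeError("privacy budget exhausted")
        self.spent += epsilon


def private_count(data, predicate, budget, epsilon):
    """Release a count with epsilon-DP via the Laplace mechanism.

    A counting query has sensitivity 1 (adding or removing one person
    changes the result by at most 1), so the noise scale is 1 / epsilon.
    """
    budget.spend(epsilon)
    true_count = sum(1 for x in data if predicate(x))
    return true_count + laplace_noise(1.0 / epsilon)


# Usage: three queries of epsilon = 0.5 each fit within a total budget
# of 2.0; a later query that would overspend the budget is refused.
budget = PrivacyBudget(total_epsilon=2.0)
ages = [23, 35, 41, 29, 62, 18, 50]
noisy_count = private_count(ages, lambda a: a >= 30, budget, epsilon=0.5)
```

The sketch uses the simplest composition rule (epsilons add up); real deployments typically use tighter accountants, but the enforcement pattern, every pipeline step drawing from one shared budget, is the same one the thesis argues must cover cleaning, tuning, and training end to end.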
