Efficiently Training Deep Learning Models on Elastic and Heterogeneous Cloud Resources

Guo, Runsheng

dc.contributor.author	Guo, Runsheng
dc.date.accessioned	2026-05-12T17:58:44Z
dc.date.available	2026-05-12T17:58:44Z
dc.date.issued	2026-05-12
dc.date.submitted	2026-05-08
dc.description.abstract	Deep Neural Networks (DNNs) have demonstrated remarkable success across diverse domains, but their training requires substantial computational resources and is typically parallelized across large GPU clusters. Such clusters are prohibitively expensive for most organizations to own and manage, so organizations often rent clusters on cloud platforms to meet their training needs. While cloud environments offer elastic scalability and heterogeneous hardware options, they also introduce significant challenges for efficient distributed DNN training. Specifically, existing training frameworks lack support for dynamic reconfiguration during training, limiting the exploitation of cloud elasticity. Additionally, most systems assume homogeneous clusters, an assumption that rarely holds for the heterogeneous GPU clusters that organizations commonly use due to hardware availability constraints. Furthermore, heterogeneous network conditions in cloud environments create communication bottlenecks that limit the scalability of existing approaches. This thesis presents three systems that collectively address these limitations to enable efficient distributed DNN training on elastic and heterogeneous cloud resources. First, Hydrozoa leverages cloud elasticity through serverless containers, enabling dynamic scaling and configuration changes during training without the traditional pitfalls of serverless computing. By combining data and model parallelism with fine-grained resource provisioning, Hydrozoa achieves cost-effective training while eliminating cluster management overhead. Second, Cephalo addresses heterogeneous GPU clusters by independently balancing compute and memory resources across GPUs with different capabilities. Unlike existing approaches that tie workload assignment to computational speed, Cephalo separately optimizes compute distribution through proportional batch sizing and memory utilization through intelligent partitioning of training state, activation checkpointing, and gradient accumulation strategies. Third, Zorse tackles heterogeneous network conditions, which are particularly common in heterogeneous clusters, by efficiently combining memory-efficient data parallelism with pipeline parallelism. Through interleaved pipelining, parameter and activation offloading, and heterogeneous pipeline parallelism configurations, Zorse achieves both communication and memory efficiency for training large DNN models across diverse network topologies. The experimental evaluation demonstrates that these systems significantly improve training efficiency and resource utilization compared to existing approaches. Hydrozoa reduces training costs while providing seamless scalability, Cephalo simultaneously achieves high compute and memory utilization in heterogeneous clusters, and Zorse maintains high throughput under varying network conditions. Together, these contributions make distributed DNN training more accessible, cost-effective, and efficient in modern cloud environments, advancing the state of the art in large-scale machine learning infrastructure.
dc.identifier.uri	https://hdl.handle.net/10012/23293
dc.language.iso	en
dc.pending	false
dc.publisher	University of Waterloo	en
dc.subject	distributed DNN training
dc.subject	large language models
dc.subject	cloud computing
dc.subject	heterogeneous computing
dc.subject	deep learning
dc.subject	deep neural networks
dc.title	Efficiently Training Deep Learning Models on Elastic and Heterogeneous Cloud Resources
dc.type	Doctoral Thesis
uws-etd.degree	Doctor of Philosophy
uws-etd.degree.department	David R. Cheriton School of Computer Science
uws-etd.degree.discipline	Computer Science
uws-etd.degree.grantor	University of Waterloo	en
uws-etd.embargo.terms	0
uws.contributor.advisor	Daudjee, Khuzaima
uws.contributor.affiliation1	Faculty of Mathematics
uws.peerReviewStatus	Unreviewed	en
uws.published.city	Waterloo	en
uws.published.country	Canada	en
uws.published.province	Ontario	en
uws.scholarLevel	Graduate	en
uws.typeOfResource	Text	en

Files

Original bundle

Name: Guo_Runsheng.pdf
Size: 2.59 MB
Format: Adobe Portable Document Format

Collections

Theses