Efficiently Training Deep Learning Models on Elastic and Heterogeneous Cloud Resources

dc.contributor.author: Guo, Runsheng
dc.date.accessioned: 2026-05-12T17:58:44Z
dc.date.available: 2026-05-12T17:58:44Z
dc.date.issued: 2026-05-12
dc.date.submitted: 2026-05-08
dc.description.abstract: Deep Neural Networks (DNNs) have demonstrated remarkable success across diverse domains, but training them requires substantial computational resources and is typically parallelized across large GPU clusters. Such clusters are prohibitively expensive for most organizations to own and manage, so organizations often rent clusters on cloud platforms instead. While cloud environments offer elastic scalability and heterogeneous hardware options, they also introduce significant challenges for efficient distributed DNN training. First, existing training frameworks lack support for dynamic reconfiguration during training, which limits how much of the cloud's elasticity they can exploit. Second, most systems assume homogeneous clusters, whereas organizations commonly end up with heterogeneous GPU clusters because of hardware availability constraints. Third, heterogeneous network conditions in cloud environments create communication bottlenecks that limit the scalability of existing approaches. This thesis presents three systems that collectively address these limitations to enable efficient distributed DNN training on elastic and heterogeneous cloud resources. First, Hydrozoa leverages cloud elasticity through serverless containers, enabling dynamic scaling and configuration changes during training without the traditional pitfalls of serverless computing. By combining data and model parallelism with fine-grained resource provisioning, Hydrozoa achieves cost-effective training while eliminating cluster management overhead. Second, Cephalo addresses heterogeneous GPU clusters by independently balancing compute and memory resources across GPUs with different capabilities. Unlike existing approaches that tie workload assignment to computational speed, Cephalo optimizes the two separately: it balances compute through proportional batch sizing, and it balances memory through partitioning of training state, activation checkpointing, and gradient accumulation. Third, Zorse tackles heterogeneous network conditions, which are particularly common in heterogeneous clusters, by efficiently combining memory-efficient data parallelism with pipeline parallelism. Through interleaved pipelining, parameter and activation offloading, and heterogeneous pipeline parallelism configurations, Zorse achieves both communication and memory efficiency when training large DNN models across diverse network topologies. The experimental evaluation demonstrates that these systems significantly improve training efficiency and resource utilization compared to existing approaches. Hydrozoa reduces training costs while providing seamless scalability, Cephalo simultaneously achieves high compute and memory utilization in heterogeneous clusters, and Zorse maintains high throughput under varying network conditions. Together, these contributions make distributed DNN training more accessible, cost-effective, and efficient in modern cloud environments, advancing the state of the art in large-scale machine learning infrastructure.
dc.identifier.uri: https://hdl.handle.net/10012/23293
dc.language.iso: en
dc.pending: false
dc.publisher: University of Waterloo [en]
dc.subject: distributed DNN training
dc.subject: large language models
dc.subject: cloud computing
dc.subject: heterogeneous computing
dc.subject: deep learning
dc.subject: deep neural networks
dc.title: Efficiently Training Deep Learning Models on Elastic and Heterogeneous Cloud Resources
dc.type: Doctoral Thesis
uws-etd.degree: Doctor of Philosophy
uws-etd.degree.department: David R. Cheriton School of Computer Science
uws-etd.degree.discipline: Computer Science
uws-etd.degree.grantor: University of Waterloo [en]
uws-etd.embargo.terms: 0
uws.contributor.advisor: Daudjee, Khuzaima
uws.contributor.affiliation1: Faculty of Mathematics
uws.peerReviewStatus: Unreviewed [en]
uws.published.city: Waterloo [en]
uws.published.country: Canada [en]
uws.published.province: Ontario [en]
uws.scholarLevel: Graduate [en]
uws.typeOfResource: Text [en]
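
The abstract states that Cephalo balances compute through proportional batch sizing and handles memory separately, with gradient accumulation among its memory techniques. The following is a minimal sketch of that general idea only, not the thesis's actual algorithm. The function names, the throughput figures, and the per-GPU micro-batch limits are all hypothetical and chosen for illustration.

```python
from math import ceil, floor


def proportional_batch_sizes(global_batch, throughputs):
    """Split global_batch across GPUs in proportion to measured throughput."""
    total = sum(throughputs)
    exact = [global_batch * t / total for t in throughputs]
    sizes = [floor(x) for x in exact]
    # Hand out the samples lost to rounding, largest fractional part first,
    # so the per-GPU sizes always sum to the global batch size.
    leftover = global_batch - sum(sizes)
    order = sorted(range(len(exact)), key=lambda i: exact[i] - sizes[i], reverse=True)
    for i in order[:leftover]:
        sizes[i] += 1
    return sizes


def accumulation_steps(batch_size, max_microbatch):
    """Micro-batches needed when memory caps the micro-batch size."""
    return ceil(batch_size / max_microbatch)


# Hypothetical cluster: one fast GPU and two slower ones (samples/sec),
# each with its own memory-limited micro-batch size.
throughputs = [300.0, 100.0, 100.0]
max_microbatch = [16, 32, 8]

sizes = proportional_batch_sizes(128, throughputs)
steps = [accumulation_steps(b, m) for b, m in zip(sizes, max_microbatch)]
print("per-GPU batch sizes:", sizes)
print("gradient accumulation steps:", steps)
```

In this sketch the batch share depends only on throughput, while the number of accumulation steps depends only on each GPU's memory limit, so a fast GPU with little memory still receives a large share and simply processes it in more micro-batches.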

Files

Original bundle

Name: Guo_Runsheng.pdf
Size: 2.59 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 6.4 KB
Format: Item-specific license agreed upon to submission