Improving Cluster Scheduling Resiliency to Network Faults
Date
2023-05-31
Authors
Qunaibi, Sara
Advisor
Al-Kiswany, Samer
Publisher
University of Waterloo
Abstract
We present a comprehensive empirical study of the impact partial network partitions have on cluster managers in data analysis frameworks. Our study shows that modern scheduling approaches are vulnerable to partial network partitions. Partial partitions can lead to a complete cluster pause or a significant loss of performance.
To overcome the shortcomings of state-of-the-art schedulers, we design the topology-aware scheduler (TAS). TAS incorporates current network connectivity information into each scheduling decision so that it allocates fully connected nodes to a given application. TAS effectively hides partial partitions from applications. Our evaluation of a TAS prototype shows that it tolerates partial network partitions and eliminates application halting and significant loss of performance.
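As a rough illustration of what allocating "fully connected nodes" could mean, the sketch below picks a set of nodes in which every pair can communicate directly. It is not the TAS algorithm from the thesis: the function names (`fully_connected`, `pick_nodes`), the node names, and the representation of connectivity as a set of node pairs are all assumptions made for this example, and the exhaustive search is only practical for small clusters.

```python
from itertools import combinations


def fully_connected(nodes, links):
    """Return True if every pair of nodes has a direct working link."""
    return all(frozenset(pair) in links for pair in combinations(nodes, 2))


def pick_nodes(candidates, links, k):
    """Return the first k-node subset that is fully connected, or None.

    Hypothetical helper: tries subsets in sorted order, so the cost grows
    combinatorially with the number of candidates.
    """
    for group in combinations(sorted(candidates), k):
        if fully_connected(group, links):
            return list(group)
    return None


# Hypothetical 4-node cluster with a partial partition:
# n1 and n4 cannot reach each other, but both can reach n2 and n3.
cluster = ["n1", "n2", "n3", "n4"]
links = {frozenset(p) for p in combinations(cluster, 2)}
links.discard(frozenset(("n1", "n4")))

three_nodes = pick_nodes(cluster, links, 3)  # avoids pairing n1 with n4
four_nodes = pick_nodes(cluster, links, 4)   # no fully connected set of 4 exists
```

Under these assumptions, a request for three nodes is satisfied without placing n1 and n4 together, while a request for four nodes returns `None` because the partition leaves no fully connected set of that size. A scheduler could use such a result to queue or resize the request instead of placing the application across the partition.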
Keywords
fault tolerance, network partitions, computer networks, cloud computing