While machine learning frameworks (such as TensorFlow, PyTorch, or Caffe) ease the development of DNNs and training jobs, they do not assist users in provisioning and sharing cloud resources, or in integrating DNN training workloads into existing datacenters.
In fact, most users need to try different configurations of a job (e.g., number of server/worker nodes, mini-batch size, network capacity) to check the resulting training performance (e.g., throughput measured as examples/second).
At scale, when resources must be shared among hundreds of jobs, this trial-and-error approach quickly becomes infeasible. At an even larger scale, when multiple datacenters must manage deep learning workloads, differing degrees of affinity among their resources create economic incentives to collaborate, as in cloud federations.
We will develop models to predict the metrics (such as training throughput) needed to guide the allocation of job resources.
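As a toy illustration of what such a predictive model looks like, the sketch below estimates the throughput of synchronous data-parallel SGD from a job's configuration. The functional form and all parameter names (`t_compute_per_example`, `t_comm_per_worker`) are hypothetical assumptions for exposition, not the project's actual model.

```python
# Illustrative sketch only: a toy analytical throughput model for
# synchronous data-parallel SGD. The cost parameters are hypothetical,
# not the project's actual model.

def predict_throughput(batch_size, num_workers,
                       t_compute_per_example, t_comm_per_worker):
    """Predict training throughput in examples/second.

    Each step processes batch_size examples on each of num_workers
    workers; the step time is per-worker compute time plus a
    communication cost that grows with the number of workers.
    """
    step_time = (batch_size * t_compute_per_example
                 + num_workers * t_comm_per_worker)
    return batch_size * num_workers / step_time

# Example: 4 workers, per-worker mini-batch of 32 examples.
print(predict_throughput(32, 4, 0.001, 0.01))
```

A model of this shape captures the diminishing returns of adding workers: throughput per worker falls as communication cost grows, which is exactly the trade-off a resource allocator must weigh.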
Our project plans to design scheduling algorithms for parallel jobs (such as deep learning training jobs) and to evaluate them in the field, running Kubernetes on clusters at USC and in the cloud.
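To make the scheduling problem concrete, here is a minimal greedy allocator: it repeatedly assigns the next worker node to the job with the largest marginal gain in predicted throughput. This is a generic textbook heuristic offered as a sketch, not the project's actual algorithm; the job speedup curves below are made up.

```python
# Hypothetical sketch: greedy allocation of worker nodes across jobs,
# driven by predicted throughput. Not the project's actual scheduler.
import math

def greedy_allocate(jobs, total_workers):
    """jobs: dict mapping job id -> throughput function f(num_workers).
    Returns a dict mapping job id -> number of allocated workers."""
    alloc = {job: 0 for job in jobs}
    for _ in range(total_workers):
        # Give the next worker to the job whose predicted throughput
        # improves the most from one additional worker.
        best = max(jobs,
                   key=lambda j: jobs[j](alloc[j] + 1) - jobs[j](alloc[j]))
        alloc[best] += 1
    return alloc

# Two jobs with concave (diminishing-returns) speedup curves.
jobs = {"a": lambda n: 100 * math.sqrt(n),
        "b": lambda n: 50 * math.sqrt(n)}
print(greedy_allocate(jobs, 4))  # → {'a': 3, 'b': 1}
```

Because the speedup curves are concave, the greedy rule spreads workers across jobs rather than giving everything to the fastest job, which is the basic tension any multi-job scheduler must resolve.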
Workload characterization and strategic models of individual datacenters will allow us to evaluate the incentives for their cooperation.
This project is committed to diversity in research and education, involving undergraduate and graduate students, coupled with an existing extensive K-12 outreach effort. Our experimental testbed will support both research and education at USC.
- Dec. 18, 2019: Our work on throughput prediction of distributed SGD in TensorFlow was accepted at ICPE.
- Sep. 28, 2018: Our work on performance analysis and scheduling of asynchronous SGD was presented at MASCOTS.
Project supported by the NSF CCF-1816887 award.