Deconstructing Distributed Deep Learning
Our project develops performance models and scheduling algorithms for distributed deep learning jobs. We analyze the incentives for federations of datacenters whose resources have different affinities for this type of job. Key research results are put into practice through Kubernetes and TensorFlow.

Motivation

While machine learning frameworks (such as TensorFlow, PyTorch, or Caffe) ease the development of DNNs and training jobs, they do not assist the user in provisioning and sharing cloud resources, or in integrating DNN training workloads into existing datacenters.

In fact, most users need to try different configurations of a job (e.g., number of server/worker nodes, mini-batch size, network capacity) to check the resulting training performance (e.g., throughput measured as examples/second).
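
As a purely illustrative sketch (not part of any framework's API), this trial-and-error process might look like the following, where run_trial is a hypothetical placeholder for launching a short training run with a given configuration and measuring its throughput:

```python
import itertools
import time

def run_trial(num_workers, batch_size, steps=50):
    """Hypothetical placeholder: launch a short distributed training run
    with the given configuration and return the measured throughput
    (examples/second). A real version would submit a TensorFlow job."""
    start = time.perf_counter()
    examples = 0
    for _ in range(steps):
        # ... one training step across `num_workers` workers ...
        examples += batch_size
    return examples / (time.perf_counter() - start)

# Sweep candidate configurations and keep the best-performing one.
configs = itertools.product([1, 2, 4, 8], [64, 128, 256])  # (workers, batch size)
throughput = {cfg: run_trial(*cfg) for cfg in configs}
best = max(throughput, key=throughput.get)
print("best (workers, batch):", best, "->", round(throughput[best]), "examples/s")
```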

At scale, when resources must be shared among hundreds of jobs, this trial-and-error approach quickly becomes infeasible. At an even larger scale, when multiple datacenters need to manage deep learning workloads, differences in how well their resources suit these jobs create economic incentives to collaborate, as in cloud federations.

Research Plan

Our approach is to develop models that predict the metrics (such as training throughput) needed to guide the allocation of resources to a job.
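
For illustration only (this is not the project's model), a first-order performance model might bound per-worker throughput by the slower of computation and gradient communication; every parameter name below is an assumption made for the sketch:

```python
def predicted_throughput(num_workers, batch_size, step_time_s,
                         grad_bytes, bandwidth_bytes_per_s):
    """Toy model (assumed form, not the project's): each worker processes
    batch_size examples per step, and a step is limited by the slower of
    the compute time and the time to exchange grad_bytes of gradients."""
    compute_rate = batch_size / step_time_s                        # examples/s
    comm_rate = batch_size / (grad_bytes / bandwidth_bytes_per_s)  # examples/s
    return num_workers * min(compute_rate, comm_rate)

# Example: 4 workers, batch 128, 0.25 s/step, 200 MB of gradients, 1 GB/s links.
print(predicted_throughput(4, 128, 0.25, 200e6, 1e9))  # -> 2048.0 examples/s
```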

Our project plans to design scheduling algorithms for parallel jobs (such as deep learning training jobs) and to evaluate them in the field, running Kubernetes on clusters at USC and in the cloud.
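
As a deliberately simplified illustration of the kind of allocation decision such a scheduler makes (not the algorithms under development), the sketch below greedily hands out GPUs to the job with the largest marginal gain in predicted throughput; the job interface and scaling curves are hypothetical:

```python
def greedy_schedule(jobs, total_gpus):
    """Greedy sketch: repeatedly give one more GPU to the job whose
    predicted throughput improves the most, until GPUs run out.
    `jobs` maps a job name to a function g -> predicted throughput
    with g GPUs (an assumed interface for this illustration)."""
    alloc = {name: 0 for name in jobs}
    for _ in range(total_gpus):
        # Marginal throughput gain of one extra GPU for each job.
        gain = {name: f(alloc[name] + 1) - f(alloc[name]) for name, f in jobs.items()}
        winner = max(gain, key=gain.get)
        alloc[winner] += 1
    return alloc

# Hypothetical jobs with diminishing-returns scaling curves.
jobs = {
    "resnet": lambda g: 900 * (1 - 0.8 ** g),  # examples/s with g GPUs
    "bert":   lambda g: 400 * (1 - 0.6 ** g),
}
print(greedy_schedule(jobs, 8))  # -> {'resnet': 6, 'bert': 2}
```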

Workload characterization and strategic models of individual datacenters will allow us to evaluate the incentives for their cooperation.

This project is committed to diversity in research and education, involving undergraduate and graduate students and building on an existing, extensive K-12 outreach effort. Our experimental testbed will support both research and education at USC.

Project supported by the NSF CCF-1816887 award.