[Papers] ZeRO-Offload: Democratizing Billion-Scale Model Training

2021. 4. 3. 19:19 · Papers

[Link to Paper] ZeRO-Offload

PAPER SUMMARY

PROBLEM

  • Training large models requires enough GPU devices that their combined GPU memory can hold the model states (even with pipeline parallelism, model parallelism, etc.)
  • Using that many GPUs is prohibitively expensive, which puts large-model training out of reach for most people

SOLUTION

  • Democratize large model training with ZeRO-Offload, which exploits both CPU memory and CPU compute for offloading
  • Offload gradients, optimizer states, and the optimizer computation to the CPU; keep the parameters and the forward & backward computation on the GPU

 

 


BACKGROUND

⭐️ How memory is consumed in large model training, some methods of handling it efficiently, and how ZeRO works, as background.

Memory Consumption in Large model training

Memory in DNN model training can be classified into two parts:

  1. Model States: parameters, gradients, and optimizer states
  2. Residual States: activations, buffers, fragmented memory, etc.

Among these two, Model States are the main bottleneck for large model training because:

  • for each parameter, mixed-precision training keeps two copies: one in fp16 (2 bytes) and one in fp32 (4 bytes)
  • the corresponding gradient for each parameter is kept in fp16 (2 bytes)
  • for each parameter, the Adam optimizer keeps the momentum in fp32 (4 bytes) and the variance in fp32 (4 bytes)

 

$\therefore$ for a model with $M$ parameters, it requires $16M$ bytes of memory
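A quick sanity check of this accounting (the per-parameter byte counts come from the list above; the 1.5B-parameter example corresponds to GPT-2, which the paper notes needs roughly 24 GB just for its model states):

```python
def model_state_bytes(num_params: int) -> int:
    """Mixed-precision Adam: fp16 params + fp16 grads + fp32 params
    + fp32 momentum + fp32 variance = 2 + 2 + 4 + 4 + 4 = 16 bytes/param."""
    return num_params * (2 + 2 + 4 + 4 + 4)

# GPT-2 (1.5B parameters): ~24 GB of model states alone,
# before counting activations, buffers, or fragmentation.
print(model_state_bytes(1_500_000_000) / 1e9)  # ≈ 24.0 GB
```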

$\exists$ two categories of approaches for fitting these model and residual states onto GPUs:

  • Scale-out training
  • Scale-up training

 

Scale-out large model training

Uses the aggregate memory of multiple GPUs to satisfy the memory requirement:

  • Model Parallelism: partitions model vertically (intra-layer partitioning)
  • Pipeline Parallelism: partitions model horizontally (inter-layer partitioning)
  • ZeRO: splits the training batch across GPUs (like data parallelism), but partitions the model states across GPUs instead of replicating them, and communicates to gather individual parameters when they are needed

 

However, all three require enough GPUs that their aggregate memory covers the memory requirement.

ZeRO-Offload offloads model states to CPU memory instead

 

 

Scale-up large model training

How existing work scales up the model size on a single GPU:

  • Recomputing activations from checkpoints $\therefore$ no need to save all activation states
  • Using low or mixed precision
  • Using external memory such as CPU memory $\Leftarrow$ approach of ZeRO-Offload

 

ZeRO powered data parallel training

Three stages of ZeRO:

  1. ZeRO-1: partitions the optimizer states
  2. ZeRO-2: partitions the optimizer states and gradients
  3. ZeRO-3: partitions all model states

ZeRO-Offload works with ZeRO-2:

  • Each GPU stores a replica of all parameters, but updates only a mutually exclusive portion of them at the end of each training step $\Rightarrow$ it only needs to store the optimizer states and gradients for that portion
  • Forward pass: each GPU computes the loss with respect to a different mini-batch
  • Backward pass: each gradient is averaged using a reduce at the GPU that owns the corresponding partition
  • After the backward pass, each GPU updates its portion of the parameters and optimizer states using the averaged gradients
  • Perform an all-gather so that every replica ends up with the fully updated parameters (see the sketch below)
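A minimal single-process sketch of this data-parallel flow, simulating the per-GPU gradient averaging, partitioned update, and all-gather with NumPy (the world size, model size, and the plain SGD update standing in for Adam are all illustrative assumptions; a real setup would use collective communication between GPUs):

```python
import numpy as np

WORLD, M, LR = 4, 16, 0.1                        # assumed: 4 "GPUs", 16 parameters
params = np.random.randn(M).astype(np.float32)   # every rank keeps a full parameter replica
shards = np.array_split(np.arange(M), WORLD)     # rank i owns optimizer state for shards[i]

def train_step(per_rank_grads):
    # the backward pass already produced one gradient per rank (one mini-batch each)
    avg_grad = per_rank_grads.mean(axis=0)       # "reduce": average the gradients
    pieces = []
    for rank in range(WORLD):
        idx = shards[rank]
        # each rank updates only its own partition (SGD stands in for the Adam update)
        pieces.append(params[idx] - LR * avg_grad[idx])
    # "all-gather": every rank receives the updated pieces and rebuilds the full replica
    params[np.concatenate(shards)] = np.concatenate(pieces)

train_step(np.random.randn(WORLD, M).astype(np.float32))
```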

 

 


Offload Strategy

⭐️ How ZeRO-Offload handles the training graph for offloading

  • ZeRO-Offload models training as a data-flow graph and partitions this graph for efficient offloading (the analysis is specific to mixed-precision training with the Adam optimizer)

DL Training as a data flow graph

The DL training workload can be represented as a data-flow graph (see the figure in the paper).

  • The offload strategy is a two-way partitioning of this graph:
    • each computation is executed, and each data node is stored, on the device (CPU or GPU) that owns its partition

 

  • Graph partitioning is done based on:
    1. CPU computation overhead
    2. Communication overhead
    3. Memory savings

 

1. Limiting CPU computation

  • CPU computation throughput is much slower than GPU computation throughput $\Rightarrow$ avoid offloading compute-intensive components to the CPU
  • The overall complexity of DL training per iteration is $O(MB)$, where $M$ is the model size and $B$ is the effective batch size $\Rightarrow$ this part should stay on the GPU
  • Norm calculations and weight updates have complexity $O(M)$ $\Rightarrow$ these are offloaded to the CPU (rough scaling sketch below)
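A rough back-of-envelope sketch of why this split makes sense (the constants and the batch size are assumptions; only the $O(MB)$ vs. $O(M)$ scaling matters):

```python
M = 1_000_000_000       # model parameters (assumed)
B = 512                 # effective batch size (assumed)

fwd_bwd_ops   = 6 * M * B   # forward + backward grow with both M and B: O(MB)
optimizer_ops = 12 * M      # Adam touches each parameter a constant number of times: O(M)

# The compute-intensive O(MB) part stays on the GPU; the O(M) part is
# orders of magnitude cheaper and can run on the CPU.
print(fwd_bwd_ops / optimizer_ops)   # the ratio grows linearly with the batch size
```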

 

2. Minimizing Communication Volume

  • PCI-E bandwidth between CPU and GPU $\ll$ CPU memory bandwidth $\ll$ GPU memory bandwidth
  • $\therefore$ minimize communication between CPU and GPU!!
  • Create an fp32 super-node: keep all fp32 model states in one partition (a node named Update Super)
  • The fp16 parameters (parameter16) are co-located with the FWD-BWD Super node on the GPU

 

3. Maximizing Memory savings

  • the maximum memory saving is achieved by offloading both gradient16 and Update Super to the CPU

 

⭐️ GPU: FWD-BWD Super node, parameter16
⭐️ CPU: Update Super node (the fp32 model states), gradient16

 

 


ZeRO-Offload Schedule

⭐️ Computation & communication schedule for implementing ZeRO-Offload on a single-GPU system, with further explanation of how it scales to multiple GPUs

 

Single GPU Schedule

  • Forward pass: compute the loss. The fp16 parameters are already on the GPU, so no CPU communication is needed
  • Backward pass: the gradients of the parameters are computed at different points in the backward graph
  • ZeRO-Offload transfers these gradients to CPU memory as they are produced (the transfer can overlap with backpropagation on the remaining backward graph) $\Rightarrow$ possible to hide the communication cost
  • Update the fp32 parameters and the remaining optimizer states on the CPU
  • Copy the updated fp32 parameters from the CPU back to the fp16 parameters on the GPU (see the sketch below)
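A minimal PyTorch-flavored sketch of one training step under this schedule (the function and argument names are placeholders; the real implementation streams each gradient to pinned CPU memory as soon as it is produced and overlaps the copy with the rest of the backward pass via hooks and CUDA streams, which is simplified away here):

```python
import torch

def zero_offload_step(model_fp16, params_fp32, cpu_optimizer, batch):
    # forward pass on the GPU: fp16 parameters are already resident, no CPU traffic
    loss = model_fp16(batch).float().mean()

    # backward pass on the GPU; here the fp16 gradients are copied to the CPU
    # only after backward() finishes (ZeRO-Offload overlaps this with backward)
    loss.backward()
    cpu_grads = [p.grad.detach().to("cpu", dtype=torch.float32)
                 for p in model_fp16.parameters()]

    # optimizer step on the CPU: update the fp32 parameters and optimizer states
    for p32, g in zip(params_fp32, cpu_grads):
        p32.grad = g
    cpu_optimizer.step()
    cpu_optimizer.zero_grad()

    # copy the updated fp32 parameters back into the fp16 replicas on the GPU
    with torch.no_grad():
        for p16, p32 in zip(model_fp16.parameters(), params_fp32):
            p16.copy_(p32)      # copy_ handles the fp32 -> fp16 cast and the host-to-device transfer
            p16.grad = None
    return loss.item()
```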

 

Scaling to Multi-GPUs

Gradients are computed with ZeRO-2's data-parallel scheme, while each GPU offloads its partition of the gradients, the optimizer states, and the corresponding parameter updates to the CPU.

 

 

 


Optimized CPU Execution

⭐️ How the CPU execution of the parameter updates is sped up

 

1. Implementing the CPU Optimizer

A faster implementation of the Adam optimizer on the CPU, using SIMD vector instructions, loop unrolling, and OMP multithreading
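The paper's optimizer is written as a fused C++ kernel with AVX vector instructions, loop unrolling, and OpenMP threads; as a stand-in in the same spirit (one vectorized pass over flat fp32 arrays instead of a per-element Python loop), a NumPy sketch of the update it performs:

```python
import numpy as np

def cpu_adam_step(p32, g32, m, v, step, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam step over flat float32 arrays, updated in place."""
    m *= b1; m += (1 - b1) * g32            # first moment (momentum)
    v *= b2; v += (1 - b2) * g32 * g32      # second moment (variance)
    m_hat = m / (1 - b1 ** step)            # bias correction
    v_hat = v / (1 - b2 ** step)
    p32 -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return p32.astype(np.float16)           # fp16 copy of the parameters to send back to the GPU
```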

 

2. One-step Delayed Parameter Update

CPU computation may become the bottleneck when the GPU compute time per step $\leq$ the CPU compute time per step

$\Rightarrow$ one-step delayed parameter update (DPU): overlap CPU and GPU compute

 

Delayed parameter update during the training process:

  • First $N-1$ steps: train without DPU to avoid destabilizing training in the early stages, where gradients change rapidly
  • Step $N$: obtain the gradients from the GPU, but skip the CPU optimizer step and do not update the fp16 parameters on the GPU
  • Step $N+1$: compute the parameter update on the CPU using the gradients from step $N$, while in parallel the GPU computes the forward and backward pass using the parameters updated at step $N-1$

 

$\therefore$ the model at step $(i+1)$ is trained with parameters updated using gradients from step $(i-1)$, which overlaps CPU compute with GPU compute (see the sketch below)
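A minimal sketch of this one-step delay, with a background thread standing in for the CPU optimizer running concurrently with the next GPU forward/backward pass (gpu_fwd_bwd, cpu_update, and copy_params_to_gpu are hypothetical callables, not part of the actual DeepSpeed API):

```python
from concurrent.futures import ThreadPoolExecutor

def train_with_dpu(num_steps, gpu_fwd_bwd, cpu_update, copy_params_to_gpu, warmup=40):
    """gpu_fwd_bwd(step) -> fp16 grads; cpu_update(grads) -> updated fp32 params."""
    pool, pending = ThreadPoolExecutor(max_workers=1), None
    for step in range(num_steps):
        grads = gpu_fwd_bwd(step)            # GPU step uses the currently installed parameters
        if pending is not None:
            # the CPU update launched from the previous step's gradients ran in
            # parallel with the GPU compute above; install its result now (one-step delay)
            copy_params_to_gpu(pending.result())
            pending = None
        if step < warmup:
            # first N-1 steps: no DPU, update synchronously while gradients change rapidly
            copy_params_to_gpu(cpu_update(grads))
        else:
            # DPU: kick off the CPU optimizer step asynchronously; the next GPU
            # forward/backward overlaps with it, still using the old parameters
            pending = pool.submit(cpu_update, grads)
    if pending is not None:                  # drain the last in-flight update
        copy_params_to_gpu(pending.result())
    pool.shutdown()
```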