Papers(2)
[Papers] Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
[Link to Paper] Megatron-LM
PAPER SUMMARY
PROBLEM: Current large NLP models require additional memory management techniques.
SOLUTION: A model-parallel approach based on intra-layer model parallelism (a minimal sketch follows this entry).
BACKGROUND:
Data Parallelism in Deep Learning: the training minibatch is split across multiple workers. Problem: the model must fit entirely on one worker.
Model Parallelism in Deep Learning: memory usage and computation ..
2021.04.10
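Below is a minimal NumPy sketch of the intra-layer (tensor) model-parallel MLP split that the summary refers to: the first weight matrix is partitioned column-wise and the second row-wise, so each worker computes an independent partial output and a single sum (standing in for the all-reduce across GPUs) recovers the serial result. The toy sizes, the two-shard setup, and the use of NumPy in place of per-GPU kernels are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

hidden, ffn, n_shards = 8, 32, 2            # assumed toy dimensions
rng = np.random.default_rng(0)

x = rng.standard_normal((4, hidden))        # one minibatch of activations
A = rng.standard_normal((hidden, ffn))      # first linear layer (h -> 4h)
B = rng.standard_normal((ffn, hidden))      # second linear layer (4h -> h)

# Shard A column-wise and B row-wise, one shard per "GPU".
A_shards = np.split(A, n_shards, axis=1)
B_shards = np.split(B, n_shards, axis=0)

def gelu(z):
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (z + 0.044715 * z**3)))

# Each worker computes its partial output independently; the nonlinearity is
# applied locally because the column split of A keeps gelu(x @ A_i) well defined.
partials = [gelu(x @ A_i) @ B_i for A_i, B_i in zip(A_shards, B_shards)]

# Summing the partials plays the role of the single all-reduce at the end.
y_parallel = sum(partials)
y_serial = gelu(x @ A) @ B
print(np.allclose(y_parallel, y_serial))    # True: matches the unsharded MLP
```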
[Papers] ZeRO-Offload: Democratizing Billion-Scale Model Training
[Link to Paper] ZeRO-Offload
PAPER SUMMARY
PROBLEM: Training large models requires enough GPU devices that their combined GPU memory can hold the model states (even with pipeline parallelism, model parallelism, ...). Using many GPUs is costly, which makes it difficult for most people to attempt such training.
SOLUTION: Democratize large-model training with ZeRO-Offload, which exploits both CPU memory .. (a minimal sketch follows this entry).
2021.04.03
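A minimal PyTorch sketch of the offload idea described above, under assumptions of my own (a single Linear layer, a toy squared loss, a hand-rolled copy loop): the forward and backward passes stay on the GPU, while fp32 copies of the parameters and the Adam optimizer states live in CPU memory; each step the gradients are moved to the CPU, the update runs there, and the updated parameters are copied back. This only illustrates the offload pattern, not the paper's actual CPU-Adam kernel or scheduling.

```python
import torch

gpu = "cuda" if torch.cuda.is_available() else "cpu"  # CPU fallback so the demo still runs

model = torch.nn.Linear(256, 256).to(gpu)

# fp32 master copies of the parameters plus the Adam moments live in CPU memory.
cpu_params = [p.detach().clone().to("cpu", torch.float32) for p in model.parameters()]
cpu_opt = torch.optim.Adam(cpu_params, lr=1e-3)

for step in range(3):
    x = torch.randn(32, 256, device=gpu)
    loss = model(x).pow(2).mean()            # forward + backward stay on the GPU
    model.zero_grad()
    loss.backward()

    # Offload gradients to CPU memory and run the optimizer step there.
    for p, cp in zip(model.parameters(), cpu_params):
        cp.grad = p.grad.detach().to("cpu", torch.float32)
    cpu_opt.step()

    # Copy the updated parameters back to the GPU for the next forward pass.
    with torch.no_grad():
        for p, cp in zip(model.parameters(), cpu_params):
            p.copy_(cp.to(p.device, p.dtype))
    print(f"step {step}: loss {loss.item():.4f}")
```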