[Notes] List of Papers
2021. 4. 3. 16:47 · Notes
Systems
- Parallax: Sparsity-aware Data Parallel Training of Deep Neural Networks
- ZeRO-Offload: Democratizing Billion-Scale Model Training
- Bandwidth Efficient All-reduce Operation on Tree Topologies
- Efficient Barrier and AllReduce on InfiniBand Clusters using Hardware Multicast and Adaptive Algorithms
- Scaling Distributed Machine Learning with the Parameter Server
- Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
- GPipe: Easy Scaling with Micro-Batch Pipeline Parallelism
- ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
- Parallelized Stochastic Gradient Descent
- Measuring the Effects of Data Parallelism on Neural Network Training
- PipeDream: Fast and Efficient Pipeline Parallel DNN Training
- Training Deep Nets with Sublinear Memory Cost
Frameworks
- TensorFlow: A System for Large-Scale Machine Learning
Models & Deep Learning
- Attention Is All You Need
- Improving Language Understanding by Generative Pre-Training
- Language Models are Unsupervised Multitask Learners
- Language Models are Few-Shot Learners
- Adam: A Method for Stochastic Optimization
- Gaussian Error Linear Units (GELUs)