Distributed Machine Learning
See this overview of the different parallelism paradigms: https://colossalai.org/docs/concepts/paradigms_of_parallelism/
Resources
- Data-Parallel Distributed Training of Deep Learning Models
- Pipeline-Parallelism: Distributed Training via Model Partitioning
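The core idea of data parallelism can be shown in a few lines: every worker holds a full copy of the model, computes gradients on its own shard of the batch, and an all-reduce averages the gradients so all replicas stay in sync. A minimal sketch, assuming a toy linear model `y = w * x` with squared-error loss (all names here are hypothetical, for illustration only):

```python
def grad(w, x, y):
    # d/dw of 0.5 * (w*x - y)^2 for the toy model y_hat = w * x
    return (w * x - y) * x

def data_parallel_step(w, batch, n_workers, lr=0.1):
    # Split the global batch into one shard per worker.
    shards = [batch[i::n_workers] for i in range(n_workers)]
    # Each worker computes the mean gradient over its own shard.
    local_grads = [
        sum(grad(w, x, y) for x, y in shard) / len(shard)
        for shard in shards
    ]
    # All-reduce: average the per-worker gradients.
    g = sum(local_grads) / n_workers
    # Every replica applies the identical update, so weights stay in sync.
    return w - lr * g

# Fit y = 2x from four examples, "distributed" over two workers.
batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
w = 0.0
for _ in range(50):
    w = data_parallel_step(w, batch, n_workers=2)
print(round(w, 3))  # converges toward w = 2.0
```

Note the invariant that makes this work: since every replica starts from the same weights and applies the same averaged gradient, the replicas never diverge.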
A really good resource from the JAX team (the scaling book):
- https://jax-ml.github.io/scaling-book/training/#tensor-parallelism
- I first found it via https://fleetwood.dev/posts/domain-specific-architectures, which I mention in AI Inference
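The idea behind tensor parallelism is that a single layer's weight matrix is sharded across devices. A minimal sketch, assuming a single linear layer `y = x @ W` split column-wise across two "devices": each device computes its slice of the output, and a concatenation (an all-gather in a real system) recovers the full result. Pure-Python lists stand in for device-local arrays; the helper names are hypothetical:

```python
def matmul(x, W):
    # x: length-k vector, W: k x n matrix (list of rows) -> length-n vector
    n = len(W[0])
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(n)]

def split_cols(W, parts):
    # Split W into contiguous column blocks, one per "device".
    n = len(W[0])
    step = n // parts
    return [[row[p * step:(p + 1) * step] for row in W] for p in range(parts)]

x = [1.0, 2.0]
W = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]

shards = split_cols(W, parts=2)               # each "device" holds 2 columns
partials = [matmul(x, Wp) for Wp in shards]   # device-local matmuls
y = [v for part in partials for v in part]    # all-gather along columns

print(y)             # same result as the unsharded matmul
print(matmul(x, W))
```

Column-wise splitting needs no communication during the matmul itself, only the gather at the end; a row-wise split would instead need a sum (all-reduce) of partial outputs.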
There are two basic ways to parallelize: data parallelism and model parallelism, which can also be mixed in one job: https://docs.oneflow.org/en/v0.4.0/extended_topics/model_mixed_parallel.html
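To contrast with data parallelism above: in model parallelism the parameters themselves are partitioned, so no device holds the whole model. A minimal sketch, assuming a toy 2-layer network where "device 0" owns layer 1 and "device 1" owns layer 2, and activations flow between them (names and layers are hypothetical):

```python
def layer(scale, offset):
    # A toy affine "layer": f(x) = scale * x + offset
    return lambda x: scale * x + offset

# Each "device" owns one layer's parameters; neither has the full model.
device0 = layer(2.0, 1.0)   # lives on device 0
device1 = layer(3.0, -1.0)  # lives on device 1

def forward(x):
    h = device0(x)      # computed on device 0
    # ...activation h is sent over the interconnect to device 1...
    return device1(h)   # computed on device 1

print(forward(1.0))  # 3.0 * (2.0 * 1.0 + 1.0) - 1.0 = 8.0
```

Run naively, device 0 sits idle while device 1 computes; pipeline parallelism (see the resource above) fixes this by streaming microbatches through the stages.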