🛠️ Steven Gong


Aug 11, 2025, 1 min read

Distributed Machine Learning

See this overview of the different parallelism paradigms: https://colossalai.org/docs/concepts/paradigms_of_parallelism/

This looks like a great set of walkthroughs: https://uvadlc-notebooks.readthedocs.io/en/latest/tutorial_notebooks/scaling/JAX/single_gpu_techniques.html

Resources

  • Data-Parallel Distributed Training of Deep Learning Models
  • Pipeline-Parallelism: Distributed Training via Model Partitioning
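The data-parallel idea from the first resource can be sketched in a few lines. This is an illustrative numpy toy (the worker split, model, and learning rate are my own assumptions, not from the linked posts): each "worker" holds a shard of the batch, computes gradients on its shard, and the gradients are averaged (the all-reduce step) before a shared update.

```python
import numpy as np

# Toy data-parallel SGD for a linear model y = X @ w.
# With equal-sized shards, averaging per-shard gradients
# reproduces the full-batch gradient exactly.

def local_grad(w, X_shard, y_shard):
    # Mean-squared-error gradient on this worker's shard.
    err = X_shard @ w - y_shard
    return 2 * X_shard.T @ err / len(X_shard)

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true

w = np.zeros(3)
num_workers = 2
for step in range(500):
    shards = zip(np.array_split(X, num_workers), np.array_split(y, num_workers))
    grads = [local_grad(w, Xs, ys) for Xs, ys in shards]
    g = np.mean(grads, axis=0)  # "all-reduce": average gradients across workers
    w -= 0.1 * g                # every worker applies the same update

print(np.round(w, 3))  # converges toward w_true
```

Because every worker ends up applying the identical averaged gradient, all replicas stay in sync without ever sharing the raw data shards.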

Really good blog post from the JAX team:

  • https://jax-ml.github.io/scaling-book/training/#tensor-parallelism
  • I first found it via https://fleetwood.dev/posts/domain-specific-architectures, which I mention in AI Inference
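The tensor-parallelism section of that book splits a single layer's weight matrix across devices. A minimal numpy sketch of the column-split scheme (the shapes and two-way split here are illustrative assumptions): each device holds half the columns of W, computes its slice of the output, and concatenating the slices recovers the full matmul.

```python
import numpy as np

# Tensor (intra-layer) parallelism for one linear layer:
# split W column-wise across two "devices".

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 6))      # activations: (batch, d_in), replicated on both devices
W = rng.normal(size=(6, 8))      # weights: (d_in, d_out)

W0, W1 = np.split(W, 2, axis=1)  # each device holds half the output columns
y0 = x @ W0                      # computed on device 0
y1 = x @ W1                      # computed on device 1
y = np.concatenate([y0, y1], axis=1)  # "all-gather" along the d_out axis

assert np.allclose(y, x @ W)     # matches the unsharded layer
```

The appeal is that neither device ever materializes the full weight matrix, at the cost of a collective communication per layer.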

There are two main ways to parallelize, data parallelism and model parallelism: https://docs.oneflow.org/en/v0.4.0/extended_topics/model_mixed_parallel.html
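The model-parallel half of that split can be sketched the same way: instead of sharding the data, put different layers on different devices and send activations between them. A toy numpy version (the two-layer MLP and shapes are my own illustrative assumptions):

```python
import numpy as np

# Model (inter-layer) parallelism: device 0 holds layer 1,
# device 1 holds layer 2; the activation h is what crosses
# the device boundary (pipeline parallelism then tries to
# keep both devices busy on different microbatches).

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 6))
W1 = rng.normal(size=(6, 5))   # lives on device 0
W2 = rng.normal(size=(5, 3))   # lives on device 1

h = np.maximum(x @ W1, 0)      # device 0: layer 1 + ReLU
out = h @ W2                   # device 1: layer 2 (after h is sent over)

full = np.maximum(x @ W1, 0) @ W2  # same model on a single device
assert np.allclose(out, full)
```

Contrast with the data-parallel case: here the communication is activations flowing forward (and gradients flowing back), not gradient averaging.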

