Mixture-of-Experts (MoE)

I’m lacking a good mental model for how this works.

I think the Stanford CS336 lecture on MoE is probably a good place to start.

Some interesting ideas that I need to wrap my head around:

  • Fine-grained experts vs coarse experts
    • if you’re restricted to the same number of parameters, perhaps it is better to split them across more, smaller experts rather than fewer experts with larger FFNs
      • because the outputs of the selected experts are summed at the end, so maybe combining more small experts can represent richer information?
    • How does this relate to using more shared experts?
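
To make the "same parameters, more splits" idea concrete for myself, here is a rough sketch of the arithmetic. All the numbers are made up for illustration (`d_model`, `d_hidden`, 16 vs 64 experts, top-2 vs top-8), and the helpers `ffn_params` and `moe_stats` are hypothetical, not from any library. It assumes each expert is a plain two-matrix FFN with no biases or gating, and it ignores shared experts and router parameters. The point is that splitting each expert into 4 smaller ones, and routing to 4x as many, keeps both total and active parameters fixed while the number of possible expert subsets per token grows enormously.

```python
from math import comb


def ffn_params(d_model: int, d_hidden: int) -> int:
    # Assumption: one up-projection and one down-projection, no biases.
    return 2 * d_model * d_hidden


def moe_stats(d_model: int, d_hidden: int, n_experts: int, top_k: int) -> dict:
    per_expert = ffn_params(d_model, d_hidden)
    return {
        "total_params": n_experts * per_expert,   # stored in the layer
        "active_params": top_k * per_expert,      # used per token
        "combinations": comb(n_experts, top_k),   # distinct expert subsets
    }


# Coarse: 16 large experts, each token routed to 2 of them.
coarse = moe_stats(d_model=1024, d_hidden=4096, n_experts=16, top_k=2)

# Fine-grained: every expert split into 4 (hidden size / 4),
# and 4x as many experts selected per token.
fine = moe_stats(d_model=1024, d_hidden=1024, n_experts=64, top_k=8)

print(coarse)
print(fine)
```

Both configurations come out with identical total and active parameter counts, but the coarse setup has 120 possible expert subsets per token while the fine-grained one has 4,426,165,368. This says nothing about whether the extra combinations actually get used well in training; it only shows that the budget is unchanged while the space of mixtures being summed is much larger.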

Resources