Mixture-of-Experts (MoE)
I’m lacking a good mental model for how this works.
I think the Stanford CS336 lecture on MoE is probably good.
Some interesting ideas that I need to wrap my head around:
- Fine-grained experts vs coarse experts
- if you’re restricted to the same number of parameters, perhaps it is better to split them into more, smaller experts rather than fewer experts with larger FFNs
- because the outputs of the selected experts are summed at the end (weighted by the router’s gate values), so maybe having more, smaller pieces to combine is a way to represent richer information?
- Relation to using a larger number of shared experts?
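To make the fine-grained vs coarse question concrete for myself, here is a rough sketch of the bookkeeping. It assumes each expert is a plain two-matrix FFN (no biases, no gating matrix), and that "fine-grained" means splitting every expert's hidden size by some factor while multiplying both the number of experts and the top-k by that same factor. The sizes (`d_model`, `d_ff`, 16 experts, top-2, split factor 4) and the helper names are made up for illustration, not taken from any particular model.

```python
from math import comb


def ffn_params(d_model: int, d_hidden: int) -> int:
    """Parameter count of one expert, assumed to be a two-matrix FFN.

    Counts the up-projection (d_model x d_hidden) and the
    down-projection (d_hidden x d_model); biases are ignored.
    """
    return 2 * d_model * d_hidden


def moe_stats(d_model: int, d_hidden: int, n_experts: int, top_k: int) -> dict:
    """Summarise one routed MoE layer (router weights not counted)."""
    per_expert = ffn_params(d_model, d_hidden)
    return {
        # Parameters stored across all experts.
        "total_params": n_experts * per_expert,
        # Parameters actually used for a single token.
        "active_params": top_k * per_expert,
        # Number of distinct expert subsets the router can pick.
        "combinations": comb(n_experts, top_k),
    }


# Hypothetical sizes, chosen only so the arithmetic is easy to follow.
d_model, d_ff = 1024, 4096
split = 4  # each coarse expert is cut into 4 smaller ones

coarse = moe_stats(d_model, d_ff, n_experts=16, top_k=2)
fine = moe_stats(d_model, d_ff // split, n_experts=16 * split, top_k=2 * split)

print("coarse:", coarse)
print("fine:  ", fine)
```

Under these assumptions both setups store the same number of parameters and use the same number per token, but the fine-grained one can choose among far more expert subsets (120 vs 4,426,165,368 here). That is one possible reading of "richer information": the summed output is built from many more possible combinations of smaller pieces. It says nothing about whether training actually exploits that.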
Resources