Muon is Scalable for LLM Training https://kellerjordan.github.io/posts/muon/#why-is-it-good-to-orthogonalize-the-update