Why Does the Official Muon Include an Extra max(1, ⋅) Compared to the MuP Version?
The official (KellerJordan) Muon optimizer adds a max(1,⋅) truncation to its update scaling factor. The truncation stabilizes updates early in training, when inputs are roughly isotropic. The MuP scaling factor, which omits the truncation, aligns better with steepest-descent theory in later stages, as features become anisotropic. Practitioners should prefer the MuP version, or use a dynamic decay schedule that transitions from the KellerJordan scaling to the MuP scaling.
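Reason for selection: the max(1,⋅) in the KellerJordan version of Muon comes from an RMS approximation derived for the case d_in > d_out with isotropic inputs.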
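
To make the difference concrete, here is a minimal sketch of the two scaling factors. It assumes the KellerJordan factor is sqrt(max(1, d_out/d_in)) and the MuP factor is sqrt(d_out/d_in); these forms are not spelled out in the text above and should be checked against the implementation you use. The `blended_scale` helper is purely hypothetical: it illustrates one possible transition schedule (a linear interpolation over training steps), not a method prescribed by the source.

```python
import math


def keller_jordan_scale(d_in: int, d_out: int) -> float:
    # Assumed form of the KellerJordan factor: the max(1, .) truncation
    # keeps the scale from dropping below 1 when d_in > d_out.
    return math.sqrt(max(1.0, d_out / d_in))


def mup_scale(d_in: int, d_out: int) -> float:
    # Assumed form of the MuP factor: no truncation, so the scale
    # shrinks below 1 when d_in > d_out.
    return math.sqrt(d_out / d_in)


def blended_scale(d_in: int, d_out: int, step: int, total_steps: int) -> float:
    # Hypothetical schedule: linearly interpolate from the KellerJordan
    # factor (at step 0) to the MuP factor (at total_steps).
    t = min(max(step / total_steps, 0.0), 1.0)
    return (1.0 - t) * keller_jordan_scale(d_in, d_out) + t * mup_scale(d_in, d_out)


if __name__ == "__main__":
    # The two factors differ only when d_in > d_out.
    for d_in, d_out in [(1024, 256), (256, 1024)]:
        print(d_in, d_out, keller_jordan_scale(d_in, d_out), mup_scale(d_in, d_out))
```

With d_in = 1024 and d_out = 256, the two factors differ (1 versus 0.5 under these assumptions); with d_in = 256 and d_out = 1024 they coincide, so the choice between the versions only matters for layers that reduce dimension.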