
Better testing for muon #30

Open
segyges wants to merge 2 commits into microsoft:main from segyges:swap-tests-against-f64-in-cpu

Conversation

@segyges commented Mar 7, 2026

Test Newton-Schulz kernels against fp64 reference instead of cuBLAS

The previous tests only checked how close the Triton Newton-Schulz result was to cuBLAS, which broke on my machine. The new tests run both Triton and cuBLAS against a numpy fp64 reference; this reveals that the Triton kernels are at least as accurate as cuBLAS for bf16/f16, and only marginally worse for f32.
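The comparison pattern described above can be sketched in plain numpy. This is a minimal illustration, not the PR's actual test code: `err_triton`, `err_cublas`, and the `SLACK` multiplier are hypothetical names, and both "implementations" here are stand-in fp32 matmuls since the real kernels need a GPU.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal((64, 64))
b = rng.standard_normal((64, 64))

ref = a @ b  # fp64 ground truth

def max_err(result):
    """Max absolute error against the fp64 reference, computed in fp64."""
    return np.abs(result.astype(np.float64) - ref).max()

# Stand-ins for the two implementations under test. In the real tests these
# would be the Triton kernel and cuBLAS outputs copied back to the host.
err_triton = max_err(a.astype(np.float32) @ b.astype(np.float32))
err_cublas = max_err(a.astype(np.float32) @ b.astype(np.float32))

# Hypothetical multiplier accounting for a reduction-tree gap; for bf16/f16
# the PR's rule is "match or beat", i.e. a multiplier of 1.
SLACK = 1.1
assert err_triton <= SLACK * err_cublas
```

Testing both implementations against an independent fp64 reference means a regression in either one is caught, rather than only a divergence between the two.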

Notably, the Triton output is bit-exact with torch.bmm on batched inputs, suggesting the two use the same reduction order, while unbatched torch takes a different reduction path.
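Why bit-exactness implies a shared reduction order: floating-point addition is not associative, so two matmuls that accumulate partial products in different orders will generally round differently. A minimal demonstration with three constants:

```python
# Summing the same three values in two different orders gives two
# different floating-point results.
left_to_right = (0.1 + 0.2) + 0.3   # 0.6000000000000001
right_to_left = 0.1 + (0.2 + 0.3)   # 0.6
print(left_to_right == right_to_left)  # False
```

The same effect, scaled up to thousands of partial products per output element, is why matching bits across implementations is strong evidence of a matching reduction tree.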

Changes

  • Triton kernels (ns_line_1, ns_line_2): Add an INPUT_PRECISION parameter: "ieee" for f32 inputs, "tf32" otherwise. This avoids silent mantissa truncation when the kernels are used standalone with f32 data. It is a no-op for bf16, where these compilation paths are simply never hit.
  • Tests: Rewritten to compare Triton and cuBLAS against a numpy fp64 ground truth. For bf16/f16 the Triton kernels must match or beat cuBLAS. For f32, empirically-determined multipliers account for the reduction-tree gap. f16 coverage added.
  • End-to-end test: Tolerance tightened from 0.1 to 0.02 (empirically observed max error ~7.8e-3).
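The motivation for the "ieee" setting in the first bullet can be shown without a GPU by simulating TF32's reduced mantissa on the CPU. This is an illustrative sketch, not the kernel code: `to_tf32` is a hypothetical helper that truncates fp32 values to TF32's 10 explicit mantissa bits (fp32 has 23).

```python
import numpy as np

def to_tf32(x):
    # Simulate TF32 input handling by zeroing the low 13 mantissa bits of
    # each fp32 value, leaving 10 explicit mantissa bits.
    bits = x.astype(np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFFE000)).view(np.float32)

rng = np.random.default_rng(0)
a = rng.standard_normal((128, 128)).astype(np.float32)
b = rng.standard_normal((128, 128)).astype(np.float32)

ref = a.astype(np.float64) @ b.astype(np.float64)  # fp64 ground truth

err_ieee = np.abs((a @ b) - ref).max()
err_tf32 = np.abs((to_tf32(a) @ to_tf32(b)) - ref).max()

print(err_tf32 > err_ieee)  # tf32-truncated inputs lose accuracy silently
```

For bf16/f16 inputs this distinction never arises, since those dtypes already carry fewer mantissa bits than TF32; that is why the parameter is a no-op on the bf16 paths.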

@segyges (Author) commented Mar 7, 2026

@microsoft-github-policy-service agree company="Overworld"

