@microsoft-github-policy-service agree company="Overworld"
Test Newton-Schulz kernels against fp64 reference instead of cuBLAS
The previous tests only checked how close the Triton Newton-Schulz result was to cuBLAS, and that comparison broke on my machine. The new tests run both Triton and cuBLAS against a NumPy fp64 reference, which reveals that the Triton kernels are at least as accurate as cuBLAS for bf16/f16, and only marginally worse for f32.
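As a rough illustration of the testing approach described above, the sketch below measures two implementations against a shared fp64 reference instead of against each other. It is NumPy-only and runs on CPU: the two fp16 matmuls are stand-ins for the Triton and cuBLAS paths, and `rel_error` and `gram_fp64` are hypothetical helper names, not functions from this PR.

```python
import numpy as np

def rel_error(result, reference):
    """Relative Frobenius-norm error of `result` against an fp64 `reference`."""
    reference = np.asarray(reference, dtype=np.float64)
    diff = np.asarray(result, dtype=np.float64) - reference
    return float(np.linalg.norm(diff) / np.linalg.norm(reference))

def gram_fp64(x):
    """fp64 reference for X @ X^T, computed from the already-rounded inputs."""
    x64 = np.asarray(x, dtype=np.float64)
    return x64 @ x64.T

rng = np.random.default_rng(0)
x = rng.standard_normal((64, 128)).astype(np.float16)

reference = gram_fp64(x)

# Stand-in "implementation A": accumulate in fp32, round the result to fp16.
x32 = x.astype(np.float32)
impl_a = (x32 @ x32.T).astype(np.float16)

# Stand-in "implementation B": NumPy's native fp16 matmul.
impl_b = x @ x.T

err_a = rel_error(impl_a, reference)
err_b = rel_error(impl_b, reference)
print(f"A vs fp64: {err_a:.3e}   B vs fp64: {err_b:.3e}")
```

Comparing each implementation to the reference separately means a test can assert that one is no worse than the other by some chosen margin, without requiring the two low-precision results to agree with each other.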
Notably, the Triton output is bit-exact with `torch.bmm` on batched inputs, so they use the same reduction order and unbatched torch takes a different reduction path.

Changes
`ns_line_1`, `ns_line_2`: Add an `INPUT_PRECISION` parameter that uses `"ieee"` for f32 inputs and `"tf32"` otherwise. This avoids silent mantissa truncation when the kernels are used standalone with f32 data. It is a no-op when running bf16, since these compilation paths will simply never be hit.
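To make the mantissa-truncation concern concrete, here is a small CPU-only sketch. It emulates TF32 inputs by zeroing the low 13 mantissa bits of fp32 values (an assumption: real hardware may round rather than truncate) and compares the resulting matmul error with a plain fp32 matmul, both measured against fp64. `pick_input_precision` and `truncate_to_tf32` are hypothetical helpers that mirror the rule described above; they are not code from this PR.

```python
import numpy as np

def pick_input_precision(dtype_name: str) -> str:
    """Hypothetical helper: "ieee" for f32 inputs, "tf32" otherwise."""
    return "ieee" if dtype_name == "float32" else "tf32"

def truncate_to_tf32(x):
    """Emulate TF32 inputs by zeroing the low 13 of fp32's 23 mantissa bits."""
    bits = np.ascontiguousarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFFE000)).view(np.float32)

def rel_error(result, reference):
    """Relative Frobenius-norm error against an fp64 reference."""
    diff = result.astype(np.float64) - reference
    return float(np.linalg.norm(diff) / np.linalg.norm(reference))

rng = np.random.default_rng(0)
a = rng.standard_normal((64, 64)).astype(np.float32)

a64 = a.astype(np.float64)
reference = a64 @ a64.T

# Full fp32 inputs (what "ieee" is meant to preserve).
err_ieee = rel_error(a @ a.T, reference)

# Inputs truncated to a 10-bit mantissa before the matmul (emulated TF32).
a_tf32 = truncate_to_tf32(a)
err_tf32 = rel_error(a_tf32 @ a_tf32.T, reference)

print(pick_input_precision("float32"), f"{err_ieee:.3e}")
print(pick_input_precision("bfloat16"), f"{err_tf32:.3e}")
```

bf16 and f16 inputs already carry fewer mantissa bits than TF32 keeps, which is consistent with the precision choice only mattering for f32 data.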