Conversation
Force-pushed from dad4c68 to aff53ff
Force-pushed from 4881d1b to ac81c85
Greptile Overview

Greptile Summary

This PR implements grouped dbias computation: a new kernel computes dbias separately for each tensor in a group and writes the results into a grouped dbias output with one row per tensor.

Key Changes:
Important Notes:
Confidence Score: 4/5
Important Files Changed
Sequence Diagram

sequenceDiagram
participant API as nvte_group_quantize_dbias
participant GQ as group_quantize
participant Kernel as group_quantize_mxfp8_kernel
participant Workspace as Workspace (float32)
participant Reduce as grouped_reduce_dbias
participant DBias as GroupedTensor dbias
API->>GQ: input, output, dbias, workspace
GQ->>GQ: Validate dbias shape [num_tensors, K]
GQ->>GQ: Allocate workspace [M/128, K] if needed
GQ->>Kernel: Launch quantization kernel
Kernel->>Workspace: Write partial reductions (per 128-row chunks)
Kernel-->>GQ: Return
GQ->>Reduce: group_reduce_dbias_kernel
Note over Reduce: For each tensor_id (blockIdx.y)
Reduce->>Reduce: Compute workspace offset
Reduce->>Workspace: Read partial sums for tensor
Reduce->>Reduce: Sum across rows
Reduce->>DBias: Write [tensor_id, :] result
Reduce-->>GQ: Return
GQ-->>API: Complete
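For illustration, here is a minimal CUDA sketch of the second stage this flow describes: one block row per tensor sums the per-128-row partial reductions from the float32 workspace into the grouped dbias of shape [num_tensors, K]. The kernel name, parameter names, and the assumption that per-tensor workspace offsets are given in chunk rows are illustrative, not the actual TE implementation:

```cpp
#include <cstddef>

// Sketch only: each tensor's partial column sums occupy chunk_counts[t] rows of length K
// in the float32 workspace, starting at row chunk_offsets[t]; blockIdx.y selects the tensor.
__global__ void group_reduce_dbias_sketch(const float *__restrict__ workspace,
                                          float *__restrict__ dbias,  // [num_tensors, K]
                                          const size_t *__restrict__ chunk_offsets,
                                          const size_t *__restrict__ chunk_counts,
                                          const size_t K) {
  const size_t tensor_id = blockIdx.y;
  const size_t col = static_cast<size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
  if (col >= K) return;

  const float *partials = workspace + chunk_offsets[tensor_id] * K;  // this tensor's chunk rows
  float sum = 0.0f;
  for (size_t chunk = 0; chunk < chunk_counts[tensor_id]; ++chunk) {
    sum += partials[chunk * K + col];  // accumulate the 128-row partial sums column-wise
  }
  dbias[tensor_id * K + col] = sum;  // one dbias row per tensor
}
```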
Last reviewed commit: 2eeb836

/te-ci
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
for more information, see https://pre-commit.ci
Force-pushed from 3c30d6c to 53e99c3
for more information, see https://pre-commit.ci
      ? (first_logical_dim / num_tensors)
      : first_dims_ptr[tensor_id];

const size_t rows = tensor_rows / chunk_dim_Y;
Verify that tensor_rows is always divisible by chunk_dim_Y (128), otherwise this division silently truncates and skips tail row reduction.
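A possible guard in the same spirit as the existing NVTE_DEVICE_ERROR check below; the placement and message are illustrative, not the actual patch:

```cpp
// Sketch: fail loudly instead of silently dropping a tail of rows via integer division.
if (tensor_rows % chunk_dim_Y != 0) {
  NVTE_DEVICE_ERROR("Each tensor's first dimension must be divisible by the 128-row chunk size.");
}
const size_t rows = tensor_rows / chunk_dim_Y;
```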
if (global_dim_X % CHUNK_DIM_X != 0) {
  NVTE_DEVICE_ERROR(
      "The grouped tensor must be divisible by 128x128 tiles without a tail tile.");
}
Let's see the performance impact of having this here.
On B300, the difference is within measurement noise. Over 3 runs, nsys shows ~59.69 µs with the check vs. ~59.62 µs without.
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
const size_t dbias_in_offset_Y = (shape_rep == ShapeRepresentation::SAME_BOTH_DIMS)
                                     ? (tensor_id * (tensor_rows / chunk_dim_Y))
                                     : (offsets_ptr[tensor_id] / cols / chunk_dim_Y);
For VARYING_FIRST_DIM, the offset computation offsets_ptr[tensor_id] / cols / chunk_dim_Y assumes the data offset is divisible by cols * chunk_dim_Y. However, when tensors have varying first dimensions, the cumulative offset offsets_ptr[tensor_id] equals the sum of M_i * K for all previous tensors. If any M_i % chunk_dim_Y != 0, this division will truncate and compute an incorrect workspace offset, causing data corruption.
The kernel in group_quantize_mxfp8.cuh:109-111 validates that each tensor's first dimension is divisible by 128, which ensures M_i % chunk_dim_Y == 0, but the workspace offset depends on the sum of all previous tensor sizes being correctly aligned. Verify this is always satisfied for the VARYING_FIRST_DIM case.
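To make that concrete: offsets_ptr[tensor_id] is the sum of M_j * K over all previous tensors, so if every M_j is a multiple of chunk_dim_Y, the sum is a multiple of cols * chunk_dim_Y and the division leaves no remainder. A host-side sketch of that verification (the helper name and the assumption that offsets are element counts rather than bytes are mine):

```cpp
#include <cassert>
#include <cstddef>

// Sketch: confirm every cumulative offset is a whole number of 128-row chunks, so
// offsets_ptr[tensor_id] / cols / chunk_dim_Y in the kernel cannot truncate.
inline void check_varying_first_dim_offsets(const size_t *first_dims, const size_t *offsets,
                                            const size_t num_tensors, const size_t cols,
                                            const size_t chunk_dim_Y) {
  size_t expected = 0;
  for (size_t i = 0; i < num_tensors; ++i) {
    assert(offsets[i] == expected);                  // offsets are cumulative sums of M_j * cols
    assert(first_dims[i] % chunk_dim_Y == 0);        // each M_j is a whole number of chunks
    assert(offsets[i] % (cols * chunk_dim_Y) == 0);  // hence the kernel's division is exact
    expected += first_dims[i] * cols;
  }
}
```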
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
/te-ci
Description
This PR adds a new kernel that computes `dbias` separately for each tensor in a group and outputs a grouped `dbias` tensor containing per-tensor `dbias` values.

Fixes # (issue)
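For reference, a host-side sketch of the intended semantics: one bias-gradient row per tensor, i.e. a column-wise sum over that tensor's rows. The function and parameter names are illustrative, not part of the API:

```cpp
#include <cstddef>
#include <vector>

// Reference-only sketch: dbias has shape [num_tensors, K]; row i is the column-wise sum
// of the i-th tensor, which has rows_per_tensor[i] rows starting at row_offsets[i] in the
// packed input of row width K.
std::vector<float> reference_grouped_dbias(const float *input, const size_t *row_offsets,
                                           const size_t *rows_per_tensor,
                                           const size_t num_tensors, const size_t K) {
  std::vector<float> dbias(num_tensors * K, 0.0f);
  for (size_t i = 0; i < num_tensors; ++i) {
    for (size_t m = 0; m < rows_per_tensor[i]; ++m) {
      for (size_t k = 0; k < K; ++k) {
        dbias[i * K + k] += input[(row_offsets[i] + m) * K + k];
      }
    }
  }
  return dbias;
}
```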
Type of change
Changes
Checklist: