
[Common] MOE Split dBias #2674

Open

Oleg-Goncharov wants to merge 6 commits into NVIDIA:main from Oleg-Goncharov:pr_split_dbias

Conversation

@Oleg-Goncharov (Collaborator)

Description

This PR adds a new kernel that computes dbias separately for each tensor in a group and outputs a grouped dbias tensor containing per-tensor dbias values.
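
For intuition, a minimal sketch of the pattern follows. This is illustrative only, not the PR's actual group_reduce_dbias_kernel: the equal-sized-tensor layout and all parameter names (chunks_per_tensor, workspace, dbias) are assumptions for the example.

// Illustrative stand-in for a grouped dbias reduction. Assumes every tensor
// in the group owns an equal-sized, contiguous slab of a float32 workspace
// of partial sums; the real kernel also handles varying first dimensions.
__global__ void group_reduce_dbias_sketch(const float *workspace, float *dbias,
                                          int chunks_per_tensor, int cols) {
  const int tensor_id = blockIdx.y;                      // one grid row per tensor
  const int col = blockIdx.x * blockDim.x + threadIdx.x;
  if (col >= cols) return;

  // Each tensor's slab of the workspace is [chunks_per_tensor, cols].
  const float *src = workspace + (size_t)tensor_id * chunks_per_tensor * cols;

  float sum = 0.0f;
  for (int r = 0; r < chunks_per_tensor; ++r) {
    sum += src[(size_t)r * cols + col];                  // reduce down the rows
  }
  // Grouped output layout [num_tensors, cols]: one dbias row per tensor.
  dbias[(size_t)tensor_id * cols + col] = sum;
}

Launched with gridDim.y equal to the number of tensors, this mirrors the per-tensor-reduction-into-a-grouped-output idea the PR implements.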

Fixes # (issue)

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

  • Added the grouped dbias kernel

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

greptile-apps bot (Contributor) commented Feb 11, 2026

Greptile Overview

Greptile Summary

This PR implements grouped dbias computation for MXFP8 quantization, allowing per-tensor bias gradients to be computed separately and stored in a grouped tensor format.

Key Changes:

  • Added group_reduce_dbias_kernel in common.cuh to reduce workspace data separately for each tensor in a group
  • Modified group_quantize to accept GroupedTensor* dbias instead of single Tensor*
  • Updated all API signatures across activation functions (GeLU, ReLU, SiLU, etc.) to use grouped dbias
  • Tests now validate per-tensor dbias outputs and include additional test cases

Important Notes:

  • Grouped dbias is only supported for tensors with constant last dimension (SAME_BOTH_DIMS and VARYING_FIRST_DIM shape representations)
  • The workspace offset computation for VARYING_FIRST_DIM relies on the validation that each tensor's first dimension is divisible by 128 (enforced in get_tensor_rows_num)
  • Expected dbias output shape is now [num_tensors, last_logical_dim] instead of [last_logical_dim]
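
A worked size example may help; all numbers here are assumed for illustration, not taken from the PR.

// Illustrative shape arithmetic (values assumed).
// Group of 4 tensors, each [256, 1024]; every first dim is divisible by 128.
// Total rows M = 4 * 256 = 1024, constant last dim K = 1024.
// Partial-sum workspace: [M / 128, K] = [8, 1024] float32 values.
// Grouped dbias output:  [num_tensors, K] = [4, 1024].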

Confidence Score: 4/5

  • This PR is safe to merge with minor considerations about the workspace offset computation
  • The implementation is well-structured with proper validation, updated tests, and clear documentation. The workspace offset computation for VARYING_FIRST_DIM is correct due to the 128-divisibility validation, but relies on this invariant being maintained. Test coverage includes multiple shape configurations and the API changes are consistent across all activation functions.
  • Pay close attention to transformer_engine/common/cast/core/common.cuh (workspace offset computation logic)

Important Files Changed

  • transformer_engine/common/cast/core/common.cuh: Added group_reduce_dbias_kernel to reduce dbias separately for each tensor in a group, outputting a grouped dbias tensor
  • transformer_engine/common/cast/mxfp8/group_quantize_mxfp8.cuh: Modified group_quantize to accept a grouped dbias tensor and call grouped_reduce_dbias instead of the single-tensor reduction
  • transformer_engine/common/include/transformer_engine/cast.h: Updated API signatures to accept an NVTEGroupedTensor dbias parameter and added documentation about grouped dbias limitations
  • tests/cpp/operator/test_cast_mxfp8_grouped.cu: Updated tests to use grouped dbias tensors, compute per-tensor dbias references, and added new test cases

Sequence Diagram

sequenceDiagram
    participant API as nvte_group_quantize_dbias
    participant GQ as group_quantize
    participant Kernel as group_quantize_mxfp8_kernel
    participant Workspace as Workspace (float32)
    participant Reduce as grouped_reduce_dbias
    participant DBias as GroupedTensor dbias

    API->>GQ: input, output, dbias, workspace
    GQ->>GQ: Validate dbias shape [num_tensors, K]
    GQ->>GQ: Allocate workspace [M/128, K] if needed
    GQ->>Kernel: Launch quantization kernel
    Kernel->>Workspace: Write partial reductions (per 128-row chunks)
    Kernel-->>GQ: Return
    GQ->>Reduce: group_reduce_dbias_kernel
    Note over Reduce: For each tensor_id (blockIdx.y)
    Reduce->>Reduce: Compute workspace offset
    Reduce->>Workspace: Read partial sums for tensor
    Reduce->>Reduce: Sum across rows
    Reduce->>DBias: Write [tensor_id, :] result
    Reduce-->>GQ: Return
    GQ-->>API: Complete
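
A hedged host-side sketch of the entry point the diagram starts from; the argument list and order here are assumed from the diagram, not copied from cast.h.

// Hypothetical invocation; the authoritative signature is in
// transformer_engine/common/include/transformer_engine/cast.h.
nvte_group_quantize_dbias(input,      // NVTEGroupedTensor: tensors to quantize
                          output,     // grouped MXFP8 output
                          dbias,      // NVTEGroupedTensor, shape [num_tensors, K]
                          workspace,  // float32 partial sums, [M / 128, K]
                          stream);    // CUDA stream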

Last reviewed commit: 2eeb836

greptile-apps bot (Contributor) left a comment

9 files reviewed, 1 comment

@Oleg-Goncharov (Collaborator, Author)

/te-ci

Oleg-Goncharov and others added 2 commits February 12, 2026 14:57
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
greptile-apps bot (Contributor) left a comment

9 files reviewed, 3 comments

const size_t tensor_rows = (shape_rep == ShapeRepresentation::SAME_BOTH_DIMS)
                               ? (first_logical_dim / num_tensors)
                               : first_dims_ptr[tensor_id];

const size_t rows = tensor_rows / chunk_dim_Y;
greptile-apps bot (Contributor):

Verify that tensor_rows is always divisible by chunk_dim_Y (128), otherwise this division silently truncates and skips tail row reduction.
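
A minimal sketch of the kind of guard the comment asks about; names are taken from the excerpt above, and the PR already enforces 128-divisibility upstream (in get_tensor_rows_num), so this is illustrative rather than a needed fix.

// Illustrative guard (not from the PR): fail loudly instead of silently
// truncating when a tensor's row count is not a multiple of the chunk size.
if (tensor_rows % chunk_dim_Y != 0) {
  NVTE_DEVICE_ERROR("tensor_rows must be divisible by chunk_dim_Y (128).");
}
const size_t rows = tensor_rows / chunk_dim_Y;  // now exact, no dropped tail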

Comment on lines 147 to 150
if (global_dim_X % CHUNK_DIM_X != 0) {
  NVTE_DEVICE_ERROR(
      "The grouped tensor must be divisible by 128x128 tiles without a tail tile.");
}
Member:

Let's see the performance impact of having this here.

Collaborator (Author):

On B300, the difference is within measurement noise. Over 3 runs, nsys shows ~59.69 µs with the check vs. ~59.62 µs without.

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
greptile-apps bot (Contributor) left a comment

9 files reviewed, 1 comment


const size_t dbias_in_offset_Y = (shape_rep == ShapeRepresentation::SAME_BOTH_DIMS)
                                     ? (tensor_id * (tensor_rows / chunk_dim_Y))
                                     : (offsets_ptr[tensor_id] / cols / chunk_dim_Y);
greptile-apps bot (Contributor):

For VARYING_FIRST_DIM, the offset computation offsets_ptr[tensor_id] / cols / chunk_dim_Y assumes the data offset is divisible by cols * chunk_dim_Y. However, when tensors have varying first dimensions, the cumulative offset offsets_ptr[tensor_id] equals the sum of M_i * K for all previous tensors. If any M_i % chunk_dim_Y != 0, this division will truncate and compute an incorrect workspace offset, causing data corruption.

The kernel in group_quantize_mxfp8.cuh:109-111 validates each tensor's first dimension is divisible by 128, which ensures M_i % chunk_dim_Y == 0, but the workspace offset depends on the sum of all previous tensor sizes being correctly aligned. Verify this is always satisfied for VARYING_FIRST_DIM case.
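
To make the offset arithmetic concrete, a worked example with assumed sizes (not from the PR):

// Illustrative offset arithmetic (shapes assumed).
// Three tensors with cols = K = 512 and first dims M = {256, 384, 128},
// each divisible by chunk_dim_Y = 128.
// offsets_ptr[2] = (256 + 384) * 512 = 327680 elements.
// Workspace row offset for tensor 2:
//   327680 / 512 / 128 = 640 / 128 = 5  -- exact, because each M_i % 128 == 0.
// If some M_i were, say, 200, the running row sum would not be a multiple
// of 128 and the division would truncate, as the comment warns.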

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
greptile-apps bot (Contributor) left a comment

10 files reviewed, no comments


@Oleg-Goncharov (Collaborator, Author)

/te-ci
