
[Common] MOE Split dBias #2674

Open

Oleg-Goncharov wants to merge 6 commits into NVIDIA:main from Oleg-Goncharov:pr_split_dbias

Conversation

@Oleg-Goncharov (Collaborator)

Description

This PR adds a new kernel that computes dbias separately for each tensor in a group and outputs a grouped dbias tensor containing per-tensor dbias values.
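
For intuition, a minimal sketch of the pattern follows. This is illustrative only, not the PR's actual group_reduce_dbias_kernel: the equal-sized-tensor layout and all parameter names (chunks_per_tensor, workspace, dbias) are assumptions for the example.

// Illustrative stand-in for a grouped dbias reduction. Assumes every tensor
// in the group owns an equal-sized, contiguous slab of a float32 workspace
// of partial sums; the real kernel also handles varying first dimensions.
__global__ void group_reduce_dbias_sketch(const float *workspace, float *dbias,
                                          int chunks_per_tensor, int cols) {
  const int tensor_id = blockIdx.y;                      // one grid row per tensor
  const int col = blockIdx.x * blockDim.x + threadIdx.x;
  if (col >= cols) return;

  // Each tensor's slab of the workspace is [chunks_per_tensor, cols].
  const float *src = workspace + (size_t)tensor_id * chunks_per_tensor * cols;

  float sum = 0.0f;
  for (int r = 0; r < chunks_per_tensor; ++r) {
    sum += src[(size_t)r * cols + col];                  // reduce down the rows
  }
  // Grouped output layout [num_tensors, cols]: one dbias row per tensor.
  dbias[(size_t)tensor_id * cols + col] = sum;
}

Launched with gridDim.y equal to the number of tensors, this mirrors the per-tensor-reduction-into-a-grouped-output idea the PR implements.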

Fixes # (issue)

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

  • Added the grouped dbias kernel

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

greptile-apps bot (Contributor) commented Feb 11, 2026

Greptile Overview

Greptile Summary

This PR implements grouped dbias computation for MXFP8 quantization, allowing per-tensor bias gradients to be computed separately and stored in a grouped tensor format.

Key Changes:

  • Added group_reduce_dbias_kernel in common.cuh to reduce workspace data separately for each tensor in a group
  • Modified group_quantize to accept GroupedTensor* dbias instead of single Tensor*
  • Updated all API signatures across activation functions (GeLU, ReLU, SiLU, etc.) to use grouped dbias
  • Tests now validate per-tensor dbias outputs and include additional test cases

Important Notes:

  • Grouped dbias is only supported for tensors with constant last dimension (SAME_BOTH_DIMS and VARYING_FIRST_DIM shape representations)
  • The workspace offset computation for VARYING_FIRST_DIM relies on the validation that each tensor's first dimension is divisible by 128 (enforced in get_tensor_rows_num)
  • Expected dbias output shape is now [num_tensors, last_logical_dim] instead of [last_logical_dim]
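
A worked size example may help; all numbers here are assumed for illustration, not taken from the PR.

// Illustrative shape arithmetic (values assumed).
// Group of 4 tensors, each [256, 1024]; every first dim is divisible by 128.
// Total rows M = 4 * 256 = 1024, constant last dim K = 1024.
// Partial-sum workspace: [M / 128, K] = [8, 1024] float32 values.
// Grouped dbias output:  [num_tensors, K] = [4, 1024].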

Confidence Score: 4/5

  • This PR is safe to merge with minor considerations about the workspace offset computation
  • The implementation is well-structured with proper validation, updated tests, and clear documentation. The workspace offset computation for VARYING_FIRST_DIM is correct due to the 128-divisibility validation, but relies on this invariant being maintained. Test coverage includes multiple shape configurations and the API changes are consistent across all activation functions.
  • Pay close attention to transformer_engine/common/cast/core/common.cuh (workspace offset computation logic)

Important Files Changed

  • transformer_engine/common/cast/core/common.cuh: Added group_reduce_dbias_kernel to reduce dbias separately for each tensor in a group, outputting a grouped dbias tensor
  • transformer_engine/common/cast/mxfp8/group_quantize_mxfp8.cuh: Modified group_quantize to accept a grouped dbias tensor and call grouped_reduce_dbias instead of the single-tensor reduction
  • transformer_engine/common/include/transformer_engine/cast.h: Updated API signatures to accept an NVTEGroupedTensor dbias parameter and added documentation about grouped dbias limitations
  • tests/cpp/operator/test_cast_mxfp8_grouped.cu: Updated tests to use grouped dbias tensors, compute per-tensor dbias references, and added new test cases

Sequence Diagram

sequenceDiagram
    participant API as nvte_group_quantize_dbias
    participant GQ as group_quantize
    participant Kernel as group_quantize_mxfp8_kernel
    participant Workspace as Workspace (float32)
    participant Reduce as grouped_reduce_dbias
    participant DBias as GroupedTensor dbias

    API->>GQ: input, output, dbias, workspace
    GQ->>GQ: Validate dbias shape [num_tensors, K]
    GQ->>GQ: Allocate workspace [M/128, K] if needed
    GQ->>Kernel: Launch quantization kernel
    Kernel->>Workspace: Write partial reductions (per 128-row chunks)
    Kernel-->>GQ: Return
    GQ->>Reduce: group_reduce_dbias_kernel
    Note over Reduce: For each tensor_id (blockIdx.y)
    Reduce->>Reduce: Compute workspace offset
    Reduce->>Workspace: Read partial sums for tensor
    Reduce->>Reduce: Sum across rows
    Reduce->>DBias: Write [tensor_id, :] result
    Reduce-->>GQ: Return
    GQ-->>API: Complete
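
A hedged host-side sketch of the entry point the diagram starts from; the argument list and order here are assumed from the diagram, not copied from cast.h.

// Hypothetical invocation; the authoritative signature is in
// transformer_engine/common/include/transformer_engine/cast.h.
nvte_group_quantize_dbias(input,      // NVTEGroupedTensor: tensors to quantize
                          output,     // grouped MXFP8 output
                          dbias,      // NVTEGroupedTensor, shape [num_tensors, K]
                          workspace,  // float32 partial sums, [M / 128, K]
                          stream);    // CUDA stream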

Last reviewed commit: 2eeb836

greptile-apps bot (Contributor) left a comment

9 files reviewed, 1 comment

@Oleg-Goncharov (Collaborator, Author)

/te-ci

Oleg-Goncharov and others added 2 commits February 12, 2026 14:57
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
greptile-apps bot (Contributor) left a comment

9 files reviewed, 3 comments

const size_t tensor_rows = (shape_rep == ShapeRepresentation::SAME_BOTH_DIMS)
                               ? (first_logical_dim / num_tensors)
                               : first_dims_ptr[tensor_id];

const size_t rows = tensor_rows / chunk_dim_Y;
greptile-apps bot (Contributor):

Verify that tensor_rows is always divisible by chunk_dim_Y (128), otherwise this division silently truncates and skips tail row reduction.
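
A minimal sketch of the kind of guard the comment asks about; names are taken from the excerpt above, and the PR already enforces 128-divisibility upstream (in get_tensor_rows_num), so this is illustrative rather than a needed fix.

// Illustrative guard (not from the PR): fail loudly instead of silently
// truncating when a tensor's row count is not a multiple of the chunk size.
if (tensor_rows % chunk_dim_Y != 0) {
  NVTE_DEVICE_ERROR("tensor_rows must be divisible by chunk_dim_Y (128).");
}
const size_t rows = tensor_rows / chunk_dim_Y;  // now exact, no dropped tail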

Comment on lines 147 to 150
if (global_dim_X % CHUNK_DIM_X != 0) {
  NVTE_DEVICE_ERROR(
      "The grouped tensor must be divisible by 128x128 tiles without a tail tile.");
}
Member:

Let's see the performance impact of having this here.

Collaborator (Author):

On B300, the difference is within measurement noise. Over 3 runs, nsys shows ~59.69 µs with the check vs. ~59.62 µs without.

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
greptile-apps bot (Contributor) left a comment

9 files reviewed, 1 comment


const size_t dbias_in_offset_Y = (shape_rep == ShapeRepresentation::SAME_BOTH_DIMS)
                                     ? (tensor_id * (tensor_rows / chunk_dim_Y))
                                     : (offsets_ptr[tensor_id] / cols / chunk_dim_Y);
greptile-apps bot (Contributor):

For VARYING_FIRST_DIM, the offset computation offsets_ptr[tensor_id] / cols / chunk_dim_Y assumes the data offset is divisible by cols * chunk_dim_Y. However, when tensors have varying first dimensions, the cumulative offset offsets_ptr[tensor_id] equals the sum of M_i * K for all previous tensors. If any M_i % chunk_dim_Y != 0, this division will truncate and compute an incorrect workspace offset, causing data corruption.

The kernel in group_quantize_mxfp8.cuh:109-111 validates each tensor's first dimension is divisible by 128, which ensures M_i % chunk_dim_Y == 0, but the workspace offset depends on the sum of all previous tensor sizes being correctly aligned. Verify this is always satisfied for VARYING_FIRST_DIM case.
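
To make the offset arithmetic concrete, a worked example with assumed sizes (not from the PR):

// Illustrative offset arithmetic (shapes assumed).
// Three tensors with cols = K = 512 and first dims M = {256, 384, 128},
// each divisible by chunk_dim_Y = 128.
// offsets_ptr[2] = (256 + 384) * 512 = 327680 elements.
// Workspace row offset for tensor 2:
//   327680 / 512 / 128 = 640 / 128 = 5  -- exact, because each M_i % 128 == 0.
// If some M_i were, say, 200, the running row sum would not be a multiple
// of 128 and the division would truncate, as the comment warns.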

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
greptile-apps bot (Contributor) left a comment

10 files reviewed, no comments


@Oleg-Goncharov (Collaborator, Author)

/te-ci
