Improve handling of the NVTE_CUDA_ARCHS #2665

Open
ptrendx wants to merge 6 commits into NVIDIA:main from ptrendx:pr_get_arch_cmake2

Conversation

@ptrendx ptrendx commented Feb 9, 2026

Description

Improve handling of the NVTE_CUDA_ARCHS env variable:

  • build the arch-specific sources for the regular (generic) architectures as well as the specific ones, so that GPUs in the same family that were not specialized directly still get some support.
  • automatically add sm75 to the build if CMAKE_CUDA_ARCHITECTURES becomes empty (which currently should happen only with cmake < 4.0.2 when sm120 is the only selected architecture)

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Change A
  • Change B

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
@ptrendx ptrendx requested a review from ksivaman February 9, 2026 23:09

greptile-apps bot commented Feb 9, 2026

Greptile Overview

Greptile Summary

This PR refactors the CUDA architecture handling to improve compatibility across GPU families. The key changes are:

Architecture Compilation Strategy:

  • sm_100/101: Now compiles arch-specific sources for both generic (sm_100) and specialized variants (sm_100a, sm_103a). Previously only compiled specialized variants. This enables broader support for GPUs in the Blackwell family.
  • sm_110/120: Intelligently routes to family-specific variants (110f, 120f) when CMake 4.0.2+ supports 'f' suffix natively, otherwise falls back to manual NVTE_GENERIC_ARCHS + NVTE_SPECIFIC_ARCHS handling.
  • Fallback for old CMake: Adds sm_75 when CMAKE_CUDA_ARCHITECTURES becomes empty on CMake < 4.0.2 (only happens when sm_120 is sole selected arch).
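The routing described in these bullets can be sketched as a stand-alone CMake script. This is only an approximation under stated assumptions: the variable names are the ones quoted in this PR, but the exact contents of each list (for example, that "120f" is what lands in NVTE_SPECIFIC_ARCHS on older CMake) and the example architecture selection are assumptions, not the actual CMakeLists.txt.

```cmake
# Hypothetical stand-alone sketch; run with: cmake -P route_archs.cmake
# Approximates the behavior described in the summary, not the real build file.
cmake_minimum_required(VERSION 3.18)

set(CMAKE_CUDA_ARCHITECTURES 90 100 120)  # example user selection (assumption)
set(NVTE_GENERIC_ARCHS "")
set(NVTE_SPECIFIC_ARCHS "")

# sm_100: keep the generic target and additionally request the "a" variant.
if("100" IN_LIST CMAKE_CUDA_ARCHITECTURES)
  list(APPEND NVTE_SPECIFIC_ARCHS "100a")
endif()

# sm_110 / sm_120: route to the family-specific "f" variant.
foreach(arch 110 120)
  if("${arch}" IN_LIST CMAKE_CUDA_ARCHITECTURES)
    list(REMOVE_ITEM CMAKE_CUDA_ARCHITECTURES "${arch}")
    if(CMAKE_VERSION VERSION_GREATER_EQUAL 4.0.2)
      # Newer CMake accepts the 'f' suffix directly.
      list(APPEND CMAKE_CUDA_ARCHITECTURES "${arch}f")
    else()
      # Older CMake: track the targets in the manual lists instead.
      list(APPEND NVTE_GENERIC_ARCHS "${arch}")
      list(APPEND NVTE_SPECIFIC_ARCHS "${arch}f")
    endif()
  endif()
endforeach()

# Fallback for older CMake: never leave the main list empty.
if(CMAKE_VERSION VERSION_LESS 4.0.2)
  if(NOT CMAKE_CUDA_ARCHITECTURES)
    set(CMAKE_CUDA_ARCHITECTURES 75)
  endif()
endif()

message(STATUS "CMAKE_CUDA_ARCHITECTURES = ${CMAKE_CUDA_ARCHITECTURES}")
message(STATUS "NVTE_GENERIC_ARCHS       = ${NVTE_GENERIC_ARCHS}")
message(STATUS "NVTE_SPECIFIC_ARCHS      = ${NVTE_SPECIFIC_ARCHS}")
```

Changing the first `set()` to `120` alone exercises the fallback path on CMake older than 4.0.2.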

Safety Check Changes:

  • NVTE_ARCH_SPECIFIC_TARGETS is now unconditionally set to TRUE (line 104), which defines NVTE_HAS_ARCH_SPECIFIC_TARGETS=1 for all arch-specific sources.
  • This disables the compile-time static_assert in ptx.cuh that catches misuse of arch/family-specific features when compiling for generic architectures.
  • The assert is replaced with a runtime compatibility check that returns false if the arch doesn't match.

Critical Issue:
The unconditional flag on line 104 disables compile-time safety checks even when NVTE_SPECIFIC_ARCHS is empty (e.g., building only sm_70/80/89/90). The flag should be conditional on whether arch-specific targets are actually being compiled.
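A minimal sketch of the conditional behavior requested here, written as a stand-alone script. The helper `resolve_arch_specific_flag` is hypothetical and exists only for illustration. Note also that the thread does not settle whether guarding on NVTE_SPECIFIC_ARCHS alone is sufficient on the CMake 4.0.2+ path, where 110f/120f are placed directly in CMAKE_CUDA_ARCHITECTURES.

```cmake
# Hypothetical stand-alone sketch; run with: cmake -P conditional_flag.cmake
cmake_minimum_required(VERSION 3.18)

# Hypothetical helper (not part of the PR): report TRUE only when at least
# one arch-specific target was requested.
function(resolve_arch_specific_flag specific_archs out_var)
  if(specific_archs)
    set(${out_var} TRUE PARENT_SCOPE)
  else()
    set(${out_var} FALSE PARENT_SCOPE)
  endif()
endfunction()

# Generic-only selection (e.g. 70/80/89/90): no arch-specific targets.
resolve_arch_specific_flag("" generic_only)
# Selection that produced arch-specific targets (example values).
resolve_arch_specific_flag("100a;120f" with_specific)

message(STATUS "generic-only build:  ${generic_only}")
message(STATUS "with specific archs: ${with_specific}")
```

The first call yields FALSE and the second TRUE, so the compile-time check would stay active in generic-only builds under this guard.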

Confidence Score: 3/5

  • Safe to merge with one critical logic issue that disables compile-time safety checks unconditionally
  • The architecture routing logic is correct and well-structured for CMake version compatibility. The sm_100/101 double-compilation is intentional for family compatibility. However, line 104 unconditionally disables compile-time safety checks that prevent misuse of arch-specific features, even when no arch-specific compilation is happening. This reduces build safety for common configurations.
  • Pay close attention to transformer_engine/common/CMakeLists.txt line 104 - the unconditional flag should be conditional

Important Files Changed

| Filename | Overview |
| --- | --- |
| transformer_engine/common/CMakeLists.txt | Improves arch handling for sm_100/110/120 families by keeping generic archs in CMAKE_CUDA_ARCHITECTURES for broader compatibility, but unconditionally disables compile-time safety checks (line 104) |
| transformer_engine/common/util/ptx.cuh | Adds runtime fallback for arch-specific checks when NVTE_HAS_ARCH_SPECIFIC_TARGETS=1, replacing compile-time static_assert with runtime compatibility check |

Sequence Diagram

sequenceDiagram
    participant User as User/CMake
    participant CMakeLists as CMakeLists.txt
    participant Compiler as CUDA Compiler
    participant ArchSpecific as Arch-Specific Sources
    participant Generic as Generic Sources

    User->>CMakeLists: Set CMAKE_CUDA_ARCHITECTURES (e.g., 100, 110, 120)
    
    Note over CMakeLists: Process architecture mappings
    
    alt Arch 100/101 (new behavior)
    CMakeLists->>CMakeLists: Keep "100" in CMAKE_CUDA_ARCHITECTURES
    CMakeLists->>CMakeLists: Add "100a" to NVTE_SPECIFIC_ARCHS
    end
    
    alt Arch 110/120 with CMake >= 4.0.2
    CMakeLists->>CMakeLists: Replace "110" with "110f" in CMAKE_CUDA_ARCHITECTURES
    CMakeLists->>CMakeLists: Replace "120" with "120f" in CMAKE_CUDA_ARCHITECTURES
    end
    
    alt Arch 110/120 with CMake < 4.0.2
    CMakeLists->>CMakeLists: Remove from CMAKE_CUDA_ARCHITECTURES
    CMakeLists->>CMakeLists: Add to NVTE_GENERIC_ARCHS + NVTE_SPECIFIC_ARCHS
    end
    
    CMakeLists->>CMakeLists: Set NVTE_ARCH_SPECIFIC_TARGETS = TRUE
    
    Note over CMakeLists,ArchSpecific: Compilation phase
    
    CMakeLists->>Generic: Compile with CMAKE_CUDA_ARCHITECTURES
    CMakeLists->>Generic: Add --generate-code for NVTE_GENERIC_ARCHS
    
    CMakeLists->>ArchSpecific: Compile with CMAKE_CUDA_ARCHITECTURES (e.g., sm_100)
    CMakeLists->>ArchSpecific: Add --generate-code for NVTE_SPECIFIC_ARCHS (e.g., sm_100a)
    CMakeLists->>ArchSpecific: Define NVTE_HAS_ARCH_SPECIFIC_TARGETS=1
    
    ArchSpecific->>Compiler: Disable static_assert (use runtime check instead)
    
    Compiler->>User: Build artifacts with both generic and specific arch code

@greptile-apps greptile-apps bot left a comment

2 files reviewed, 2 comments

Comment on lines 43 to 56
@@ -54,31 +52,57 @@ endif()
# Check for architecture 101 (if we see this we are in toolkit <= 12.9)
list(FIND CMAKE_CUDA_ARCHITECTURES "101" arch_101_index)
if(NOT arch_101_index EQUAL -1)
  list(REMOVE_ITEM CMAKE_CUDA_ARCHITECTURES "101")
  list(APPEND NVTE_GENERIC_ARCHS "101")
  list(APPEND NVTE_SPECIFIC_ARCHS "101a")
endif()
Contributor

Arch 100/101 not removed

When "100" / "101" are present in CMAKE_CUDA_ARCHITECTURES, this block only appends 100a/101a to NVTE_SPECIFIC_ARCHS but never removes the base arch from CMAKE_CUDA_ARCHITECTURES nor adds it to NVTE_GENERIC_ARCHS. As a result, the build will still compile all sources for sm_100 / sm_101 (via CMAKE_CUDA_ARCHITECTURES) while also compiling arch-specific sources for sm_100a / sm_101a (via --generate-code), which defeats the “generic + specific” split and can cause unexpected extra build work / incorrect targeting. This differs from the 110/120 handling below, which removes the base arch and routes it into the explicit lists.

Member Author

@greptileai This is intentional. Arches 110/120 are routed to the "f" variant (e.g. 120f), which covers the entire family, whereas 100 is routed to "a" variants (e.g. 100a), which do not. To support other arches in the same family, we therefore still need to compile for the regular 100.
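To make the "a" versus "f" distinction concrete, here is a rough sketch of how a list such as NVTE_SPECIFIC_ARCHS could be turned into nvcc `--generate-code` options. The example values and the `arch_specific_flags` variable are assumptions for illustration; the PR's actual flag construction is not shown in this thread.

```cmake
# Hypothetical stand-alone sketch; run with: cmake -P generate_code_flags.cmake
cmake_minimum_required(VERSION 3.18)

# Example values (assumption): one arch-specific and one family-specific target.
set(NVTE_SPECIFIC_ARCHS 100a 120f)

# Build one --generate-code option per requested target.
set(arch_specific_flags "")
foreach(arch IN LISTS NVTE_SPECIFIC_ARCHS)
  list(APPEND arch_specific_flags
       "--generate-code=arch=compute_${arch},code=sm_${arch}")
endforeach()

# Print the resulting options, one per line.
foreach(flag IN LISTS arch_specific_flags)
  message(STATUS "${flag}")
endforeach()
```

Because the "a" target covers a single architecture, the regular 100 target stays in CMAKE_CUDA_ARCHITECTURES alongside it, as the reply above explains.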

Comment on lines +93 to 101
if(CMAKE_VERSION VERSION_LESS 4.0.2)
  if(NOT CMAKE_CUDA_ARCHITECTURES)
    message(WARNING
      "CMAKE_CUDA_ARCHITECTURES is empty after replacing arch-specific targets. "
      "Please upgrade to CMake 4.0.2+ for native 'f' architecture support. "
      "Adding sm_75 target in addition to the specified target to avoid configuration "
      "errors - this will result in longer build time, but does not affect correctness.")
    set(CMAKE_CUDA_ARCHITECTURES 75)
  endif()
Contributor

Fallback wipes user archs

On CMake < 4.0.2, if the arch list becomes empty you set(CMAKE_CUDA_ARCHITECTURES 75) (dropping the user’s originally requested arch entirely). Given the warning text says “add sm_75 target in addition to the specified target”, this looks like a functional mismatch: the fallback should preserve the originally selected arch(es) and append 75, not replace everything with 75.

@ptrendx ptrendx Feb 10, 2026

@greptileai If the list is empty, then setting is the same as appending.
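The equivalence claimed in this reply can be checked with a short stand-alone script. The variable names `archs_set` and `archs_append` are illustrative only.

```cmake
# Hypothetical stand-alone sketch; run with: cmake -P empty_list.cmake
cmake_minimum_required(VERSION 3.18)

# Two lists that both start out empty.
set(archs_set "")
set(archs_append "")

set(archs_set 75)              # overwrite the empty list
list(APPEND archs_append 75)   # append to the empty list

# Both variables should now hold the same single-element list.
if("${archs_set}" STREQUAL "${archs_append}")
  message(STATUS "identical result: '${archs_set}'")
else()
  message(FATAL_ERROR "results differ: '${archs_set}' vs '${archs_append}'")
endif()
```

The two operations differ only when the list is non-empty, which the enclosing `if(NOT CMAKE_CUDA_ARCHITECTURES)` check rules out.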

ptrendx commented Feb 11, 2026

/te-ci

@greptile-apps greptile-apps bot left a comment

2 files reviewed, 1 comment

Comment on lines +104 to 106
set(NVTE_ARCH_SPECIFIC_TARGETS TRUE)

# cuDNN frontend API
Contributor

Always-on arch-specific define

set(NVTE_ARCH_SPECIFIC_TARGETS TRUE) makes NVTE_HAS_ARCH_SPECIFIC_TARGETS=1 get added to all transformer_engine_cuda_arch_specific_sources (lines ~247-255) even when NVTE_SPECIFIC_ARCHS is empty (e.g., user selects only 70/80/89/90). That flips the behavior in util/ptx.cuh to skip the compile-time static_assert that catches “compiled generic but using arch/family-specific features”, so those mistakes will now silently pass in exactly the builds where no arch-specific codegen flags are being injected for these sources.

ptrendx commented Feb 12, 2026

/te-ci

@greptile-apps greptile-apps bot left a comment

2 files reviewed, 1 comment

endif()
endif()

set(NVTE_ARCH_SPECIFIC_TARGETS TRUE)
Contributor

Unconditional flag disables compile-time safety checks. NVTE_ARCH_SPECIFIC_TARGETS is set to TRUE even when NVTE_SPECIFIC_ARCHS is empty (e.g., user builds only sm_70/80/89/90). This causes NVTE_HAS_ARCH_SPECIFIC_TARGETS=1 to be defined for all arch-specific sources (lines 249-258), which disables the static_assert in ptx.cuh lines 34-38 and 56-60. The compile-time check that catches misuse of arch/family-specific features in generic builds is now always off, even when no arch-specific code generation is happening.

Suggested change
- set(NVTE_ARCH_SPECIFIC_TARGETS TRUE)
+ if(NVTE_SPECIFIC_ARCHS)
+   set(NVTE_ARCH_SPECIFIC_TARGETS TRUE)
+ endif()
