Improve handling of the NVTE_CUDA_ARCHS#2665
Conversation
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
Greptile OverviewGreptile SummaryThis PR refactors the CUDA architecture handling to improve compatibility across GPU families. The key changes are: Architecture Compilation Strategy:
Safety Check Changes:
Critical Issue: Confidence Score: 3/5
Important Files Changed
Sequence DiagramsequenceDiagram
participant User as User/CMake
participant CMakeLists as CMakeLists.txt
participant Compiler as CUDA Compiler
participant ArchSpecific as Arch-Specific Sources
participant Generic as Generic Sources
User->>CMakeLists: Set CMAKE_CUDA_ARCHITECTURES (e.g., 100, 110, 120)
Note over CMakeLists: Process architecture mappings
alt Arch 100/101 (new behavior)
CMakeLists->>CMakeLists: Keep "100" in CMAKE_CUDA_ARCHITECTURES
CMakeLists->>CMakeLists: Add "100a" to NVTE_SPECIFIC_ARCHS
end
alt Arch 110/120 with CMake >= 4.0.2
CMakeLists->>CMakeLists: Replace "110" with "110f" in CMAKE_CUDA_ARCHITECTURES
CMakeLists->>CMakeLists: Replace "120" with "120f" in CMAKE_CUDA_ARCHITECTURES
end
alt Arch 110/120 with CMake < 4.0.2
CMakeLists->>CMakeLists: Remove from CMAKE_CUDA_ARCHITECTURES
CMakeLists->>CMakeLists: Add to NVTE_GENERIC_ARCHS + NVTE_SPECIFIC_ARCHS
end
CMakeLists->>CMakeLists: Set NVTE_ARCH_SPECIFIC_TARGETS = TRUE
Note over CMakeLists,ArchSpecific: Compilation phase
CMakeLists->>Generic: Compile with CMAKE_CUDA_ARCHITECTURES
CMakeLists->>Generic: Add --generate-code for NVTE_GENERIC_ARCHS
CMakeLists->>ArchSpecific: Compile with CMAKE_CUDA_ARCHITECTURES (e.g., sm_100)
CMakeLists->>ArchSpecific: Add --generate-code for NVTE_SPECIFIC_ARCHS (e.g., sm_100a)
CMakeLists->>ArchSpecific: Define NVTE_HAS_ARCH_SPECIFIC_TARGETS=1
ArchSpecific->>Compiler: Disable static_assert (use runtime check instead)
Compiler->>User: Build artifacts with both generic and specific arch code
|
| @@ -54,31 +52,57 @@ endif() | |||
| # Check for architecture 101 (if we see this we are in toolkit <= 12.9) | |||
| list(FIND CMAKE_CUDA_ARCHITECTURES "101" arch_101_index) | |||
| if(NOT arch_101_index EQUAL -1) | |||
| list(REMOVE_ITEM CMAKE_CUDA_ARCHITECTURES "101") | |||
| list(APPEND NVTE_GENERIC_ARCHS "101") | |||
| list(APPEND NVTE_SPECIFIC_ARCHS "101a") | |||
| endif() | |||
There was a problem hiding this comment.
Arch 100/101 not removed
When "100" / "101" are present in CMAKE_CUDA_ARCHITECTURES, this block only appends 100a/101a to NVTE_SPECIFIC_ARCHS but never removes the base arch from CMAKE_CUDA_ARCHITECTURES nor adds it to NVTE_GENERIC_ARCHS. As a result, the build will still compile all sources for sm_100 / sm_101 (via CMAKE_CUDA_ARCHITECTURES) while also compiling arch-specific sources for sm_100a / sm_101a (via --generate-code), which defeats the “generic + specific” split and can cause unexpected extra build work / incorrect targeting. This differs from the 110/120 handling below, which removes the base arch and routes it into the explicit lists.
There was a problem hiding this comment.
@greptileai That is because arches 110/120 are routed to the "f" (like 120f) variant where the entire family is covered, whereas in the case of 100, it routes to "a" variants (like 100a) - this is intentional behavior, but in order to support different Arches in the same family, we need to still compile for the regular 100 then.
| if(CMAKE_VERSION VERSION_LESS 4.0.2) | ||
| if(NOT CMAKE_CUDA_ARCHITECTURES) | ||
| message(WARNING | ||
| "CMAKE_CUDA_ARCHITECTURES is empty after replacing arch-specific targets. " | ||
| "Please upgrade to CMake 4.0.2+ for native 'f' architecture support. " | ||
| "Adding sm_75 target in addition to the specified target to avoid configuration " | ||
| "errors - this will result in longer build time, but does not affect correctness.") | ||
| set(CMAKE_CUDA_ARCHITECTURES 75) | ||
| endif() |
There was a problem hiding this comment.
Fallback wipes user archs
On CMake < 4.0.2, if the arch list becomes empty you set(CMAKE_CUDA_ARCHITECTURES 75) (dropping the user’s originally requested arch entirely). Given the warning text says “add sm_75 target in addition to the specified target”, this looks like a functional mismatch: the fallback should preserve the originally selected arch(es) and append 75, not replace everything with 75.
There was a problem hiding this comment.
@greptileai If the list is empty, then setting is the same as appending.
|
/te-ci |
| set(NVTE_ARCH_SPECIFIC_TARGETS TRUE) | ||
|
|
||
| # cuDNN frontend API |
There was a problem hiding this comment.
Always-on arch-specific define
set(NVTE_ARCH_SPECIFIC_TARGETS TRUE) makes NVTE_HAS_ARCH_SPECIFIC_TARGETS=1 get added to all transformer_engine_cuda_arch_specific_sources (lines ~247-255) even when NVTE_SPECIFIC_ARCHS is empty (e.g., user selects only 70/80/89/90). That flips the behavior in util/ptx.cuh to skip the compile-time static_assert that catches “compiled generic but using arch/family-specific features”, so those mistakes will now silently pass in exactly the builds where no arch-specific codegen flags are being injected for these sources.
|
/te-ci |
| endif() | ||
| endif() | ||
|
|
||
| set(NVTE_ARCH_SPECIFIC_TARGETS TRUE) |
There was a problem hiding this comment.
Unconditional flag disables compile-time safety checks. NVTE_ARCH_SPECIFIC_TARGETS is set to TRUE even when NVTE_SPECIFIC_ARCHS is empty (e.g., user builds only sm_70/80/89/90). This causes NVTE_HAS_ARCH_SPECIFIC_TARGETS=1 to be defined for all arch-specific sources (lines 249-258), which disables the static_assert in ptx.cuh lines 34-38 and 56-60. The compile-time check that catches misuse of arch/family-specific features in generic builds is now always off, even when no arch-specific code generation is happening.
| set(NVTE_ARCH_SPECIFIC_TARGETS TRUE) | |
| if(NVTE_SPECIFIC_ARCHS) | |
| set(NVTE_ARCH_SPECIFIC_TARGETS TRUE) | |
| endif() |
Description
Improve handling of the NVTE_CUDA_ARCHS env variable:
Type of change
Changes
Please list the changes introduced in this PR:
Checklist: