
Add challenge 81: INT4 Weight-Only Quantized MatMul (Medium)#216

Open
claude[bot] wants to merge 1 commit into main from add-challenge-81-int4-matmul

Conversation

claude[bot] (Contributor) commented Mar 12, 2026

Summary

  • Adds challenge 81: INT4 Weight-Only Quantized MatMul (W4A16), a medium-difficulty inference kernel challenge
  • Solvers implement the fundamental dequantization + GEMM kernel powering modern LLM inference frameworks (AWQ, GPTQ, llama.cpp, vLLM)
  • The challenge requires: (1) bit manipulation to unpack two INT4 values from each uint8 byte, (2) group-wise float16 scale application, and (3) mixed-precision matrix multiplication (INT4 weights × FP16 activations → FP16 output)
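The byte-level packing described above can be sketched in plain Python. This is an illustrative model of the offset-8 scheme only (the function names are hypothetical, not part of the challenge API):

```python
def pack_int4_pair(w_hi, w_lo):
    """Pack two signed INT4 weights (each in [-8, 7]) into one byte,
    high nibble first, using the offset-8 encoding (nibble = weight + 8)."""
    return ((w_hi + 8) << 4) | (w_lo + 8)

def unpack_int4_pair(byte):
    """Recover the two signed weights: signed = nibble - 8."""
    return (byte >> 4) - 8, (byte & 0x0F) - 8

b = pack_int4_pair(3, -5)
# b == 0xB3: high nibble 0xB encodes 3, low nibble 0x3 encodes -5
```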

What makes this interesting

  • Bit manipulation: high nibble (bits 7:4) holds one weight, low nibble (bits 3:0) holds the next — solvers must correctly unpack and apply the offset-8 encoding (signed = nibble - 8)
  • Group quantization: each block of group_size weights shares a single float16 scale, requiring careful index arithmetic (k // group_size)
  • Mixed precision: activations are FP16, weights are dequantized on-the-fly, accumulation should be FP32 for accuracy
  • Genuinely distinct from existing challenge 32 (INT8×INT8→INT8, per-tensor scale) and challenge 64 (FP32 block-scale weight dequantization, no matmul)
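The three ingredients above (nibble unpacking, group-wise scales, FP32 accumulation) can be combined into a dense PyTorch sketch. The tensor layout and argument names here are assumptions for illustration, not the challenge's actual signature:

```python
import torch

def w4a16_matmul_sketch(x, w_packed, scales, group_size=128):
    # x:        (M, K) float16 activations
    # w_packed: (K // 2, N) uint8, two INT4 weights per byte (high nibble first)
    # scales:   (K // group_size, N) float16 group scales
    # Layout is an assumption for illustration.
    hi = (w_packed >> 4).to(torch.int8) - 8    # bits 7:4, offset-8 encoding
    lo = (w_packed & 0x0F).to(torch.int8) - 8  # bits 3:0
    # interleave hi/lo rows back into the full (K, N) weight matrix
    w = torch.stack((hi, lo), dim=1).reshape(-1, w_packed.shape[1])
    # weight row k uses scale row k // group_size
    k_idx = torch.arange(w.shape[0], device=w.device) // group_size
    w_deq = w.to(torch.float32) * scales.to(torch.float32)[k_idx]
    # accumulate in FP32, emit FP16
    return (x.to(torch.float32) @ w_deq).to(torch.float16)
```

A real kernel would fuse the dequantization into the GEMM rather than materializing `w_deq`, but the numerics are the same.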

Checklist

challenge.html

  • Starts with <p> (problem description)
  • Has <h2> sections for: Implementation Requirements, Example, Constraints
  • First example matches generate_example_test() values
  • Examples use LaTeX \begin{bmatrix} for matrices
  • Constraints includes performance test size bullet
  • SVG visualization included (dark theme, shows packing format and dataflow)

challenge.py

  • class Challenge inherits ChallengeBase
  • __init__ calls super().__init__() with all required fields
  • reference_impl has assertions on shape, dtype, and device
  • All 6 methods present
  • generate_functional_test returns 10 cases covering edge, power-of-2, non-power-of-2, realistic, and zero inputs
  • Performance test (M=4096, N=4096, K=4096, group_size=128) fits 5× in 16 GB VRAM (~72 MB per run, ~360 MB total)
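The ~360 MB figure can be checked with back-of-envelope arithmetic, assuming FP16 activations and output, uint8-packed weights, and FP16 group scales:

```python
M = N = K = 4096
group_size = 128

bytes_per_run = (
    M * K * 2                    # FP16 activations
    + (K // 2) * N               # packed INT4 weights, two per uint8 byte
    + (K // group_size) * N * 2  # FP16 group scales
    + M * N * 2                  # FP16 output
)
mb = bytes_per_run / 2**20
print(f"{mb:.2f} MB per run, {5 * mb:.0f} MB at 5x")
# 72.25 MB per run, 361 MB at 5x
```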

Starter files

  • All 6 files present: .cu, .pytorch.py, .triton.py, .jax.py, .cute.py, .mojo
  • Exactly 1 parameter description comment per file
  • CUDA/Mojo use "device pointers" (no parenthetical — medium challenge)
  • Python frameworks use "tensors on the GPU"; JAX has # return output tensor directly
  • Starters compile/run but intentionally do not produce correct output

General

  • Directory follows 81_int4_matmul convention
  • Linting passes: pre-commit run --all-files
  • Validated with run_challenge.py --action run — solution passes

🤖 Generated with Claude Code

Adds a W4A16 quantized matrix multiplication challenge modelling the
core dequantization + GEMM kernel used in all modern LLM inference
frameworks (AWQ, GPTQ, llama.cpp, vLLM). Solvers must unpack packed
INT4 weights from uint8 bytes, apply group-wise float16 scales, and
compute the mixed-precision matrix product against float16 activations.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
