
Add challenge 81: INT4 Weight-Only Quantized MatMul (Medium)#216

Open
claude[bot] wants to merge 1 commit into main from add-challenge-81-int4-matmul

Conversation

claude[bot] (Contributor) commented Mar 12, 2026

Summary

  • Adds challenge 81: INT4 Weight-Only Quantized MatMul (W4A16), a medium-difficulty inference kernel challenge
  • Solvers implement the fundamental dequantization + GEMM kernel powering modern LLM inference frameworks (AWQ, GPTQ, llama.cpp, vLLM)
  • The challenge requires: (1) bit manipulation to unpack two INT4 values from each uint8 byte, (2) group-wise float16 scale application, and (3) mixed-precision matrix multiplication (INT4 weights × FP16 activations → FP16 output)
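The byte-level packing described above can be sketched in plain Python. This is an illustrative model of the offset-8 scheme only (the function names are hypothetical, not part of the challenge API):

```python
def pack_int4_pair(w_hi, w_lo):
    """Pack two signed INT4 weights (each in [-8, 7]) into one byte,
    high nibble first, using the offset-8 encoding (nibble = weight + 8)."""
    return ((w_hi + 8) << 4) | (w_lo + 8)

def unpack_int4_pair(byte):
    """Recover the two signed weights: signed = nibble - 8."""
    return (byte >> 4) - 8, (byte & 0x0F) - 8

b = pack_int4_pair(3, -5)
# b == 0xB3: high nibble 0xB encodes 3, low nibble 0x3 encodes -5
```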

What makes this interesting

  • Bit manipulation: high nibble (bits 7:4) holds one weight, low nibble (bits 3:0) holds the next — solvers must correctly unpack and apply the offset-8 encoding (signed = nibble - 8)
  • Group quantization: each block of group_size weights shares a single float16 scale, requiring careful index arithmetic (k // group_size)
  • Mixed precision: activations are FP16, weights are dequantized on-the-fly, accumulation should be FP32 for accuracy
  • Genuinely distinct from existing challenge 32 (INT8×INT8→INT8, per-tensor scale) and challenge 64 (FP32 block-scale weight dequantization, no matmul)
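The three ingredients above (nibble unpacking, group-wise scales, FP32 accumulation) can be combined into a dense PyTorch sketch. The tensor layout and argument names here are assumptions for illustration, not the challenge's actual signature:

```python
import torch

def w4a16_matmul_sketch(x, w_packed, scales, group_size=128):
    # x:        (M, K) float16 activations
    # w_packed: (K // 2, N) uint8, two INT4 weights per byte (high nibble first)
    # scales:   (K // group_size, N) float16 group scales
    # Layout is an assumption for illustration.
    hi = (w_packed >> 4).to(torch.int8) - 8    # bits 7:4, offset-8 encoding
    lo = (w_packed & 0x0F).to(torch.int8) - 8  # bits 3:0
    # interleave hi/lo rows back into the full (K, N) weight matrix
    w = torch.stack((hi, lo), dim=1).reshape(-1, w_packed.shape[1])
    # weight row k uses scale row k // group_size
    k_idx = torch.arange(w.shape[0], device=w.device) // group_size
    w_deq = w.to(torch.float32) * scales.to(torch.float32)[k_idx]
    # accumulate in FP32, emit FP16
    return (x.to(torch.float32) @ w_deq).to(torch.float16)
```

A real kernel would fuse the dequantization into the GEMM rather than materializing `w_deq`, but the numerics are the same.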

Checklist

challenge.html

  • Starts with <p> (problem description)
  • Has <h2> sections for: Implementation Requirements, Example, Constraints
  • First example matches generate_example_test() values
  • Examples use LaTeX \begin{bmatrix} for matrices
  • Constraints includes performance test size bullet
  • SVG visualization included (dark theme, shows packing format and dataflow)

challenge.py

  • class Challenge inherits ChallengeBase
  • __init__ calls super().__init__() with all required fields
  • reference_impl has assertions on shape, dtype, and device
  • All 6 methods present
  • generate_functional_test returns 10 cases covering edge, power-of-2, non-power-of-2, realistic, and zero inputs
  • Performance test (M=4096, N=4096, K=4096, group_size=128) fits 5× in 16 GB VRAM (~72 MB per run, ~360 MB total)
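The ~360 MB figure can be checked with back-of-envelope arithmetic, assuming FP16 activations and output, uint8-packed weights, and FP16 group scales:

```python
M = N = K = 4096
group_size = 128

bytes_per_run = (
    M * K * 2                    # FP16 activations
    + (K // 2) * N               # packed INT4 weights, two per uint8 byte
    + (K // group_size) * N * 2  # FP16 group scales
    + M * N * 2                  # FP16 output
)
mb = bytes_per_run / 2**20
print(f"{mb:.2f} MB per run, {5 * mb:.0f} MB at 5x")
# 72.25 MB per run, 361 MB at 5x
```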

Starter files

  • All 6 files present: .cu, .pytorch.py, .triton.py, .jax.py, .cute.py, .mojo
  • Exactly 1 parameter description comment per file
  • CUDA/Mojo use "device pointers" (no parenthetical — medium challenge)
  • Python frameworks use "tensors on the GPU"; JAX has # return output tensor directly
  • Starters compile/run but intentionally do not produce correct output

General

  • Directory follows 81_int4_matmul convention
  • Linting passes: pre-commit run --all-files
  • Validated with run_challenge.py --action run — solution passes

🤖 Generated with Claude Code

Adds a W4A16 quantized matrix multiplication challenge modelling the
core dequantization + GEMM kernel used in all modern LLM inference
frameworks (AWQ, GPTQ, llama.cpp, vLLM). Solvers must unpack packed
INT4 weights from uint8 bytes, apply group-wise float16 scales, and
compute the mixed-precision matrix product against float16 activations.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
