Skip to content

Conversation

@ChingTsai
Copy link
Collaborator

@ChingTsai ChingTsai commented Feb 9, 2026

Description

Changes

  • Added Qwen2 implementation.
    • Implemented Qwen2 layers, these are mostly identical to Qwen3 but include the ability to apply attention bias.
    • Enabled attention bias specifically for QKV (excluding O) when using Qwen2. ref
    • Renamed Qwen3 to Qwen in hf_shape and param_mapping, and adding conversion for attention bias weights
    • Added end-to-end testing scripts.

b/471703114

Tests

Logit Verification

JAX_PLATFORMS=cpu python3 -m tests.utils.forward_pass_logit_checker src/maxtext/configs/base.yml run_name=forward_pass_test_unscanned model_name=qwen2.5-7b tokenizer_path=Qwen/Qwen2.5-7B-Instruct load_parameters_path=${CHECKPOINT_PATH} max_prefill_predict_length=4 max_target_length=4 dataset_type=synthetic scan_layers=true per_device_batch_size=1 skip_jax_distributed_system=True --max_kl_div=0.017 --run_hf_model=True weight_dtype=bfloat16 --hf_model_path=Qwen/Qwen2.5-7B-Instruct

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

@ChingTsai ChingTsai force-pushed the jimmytsai/bringup-qwen2-5 branch 2 times, most recently from 2c556cf to 7f84e0a Compare February 10, 2026 03:04
@github-actions
Copy link

🤖 Hi @ChingTsai, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.

Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📋 Review Summary

This Pull Request introduces the implementation of Qwen2 models, including new decoder layers, weight mappings, and configuration updates. The changes integrate Qwen2 into the existing MaxText framework, extending its model compatibility.

🔍 General Feedback

  • The generalization of Qwen3 mappings and hook functions to a unified Qwen approach in hf_shape.py and param_mapping.py is a good practice, improving code reusability and maintainability.
  • New configuration files for Qwen2.5 models are well-structured and consistent with existing model configurations.
  • Ensure consistent handling of attention biases across the model definition and weight mapping to prevent potential runtime issues.

@RissyRan
Copy link
Collaborator

Thanks for bringing up new models! We usually verify implementation using this script against the HF version. Please let us know if you meet any issues.

@RissyRan
Copy link
Collaborator

cc @parambole who is working on Qwen3 for helping review PRs

@ChingTsai ChingTsai force-pushed the jimmytsai/bringup-qwen2-5 branch from 7f84e0a to dcc4282 Compare February 11, 2026 08:54
@ChingTsai ChingTsai force-pushed the jimmytsai/bringup-qwen2-5 branch from dcc4282 to 88f6034 Compare February 11, 2026 09:05
@ChingTsai
Copy link
Collaborator Author

ChingTsai commented Feb 11, 2026

Thanks for bringing up new models! We usually verify implementation using this script against the HF version. Please let us know if you meet any issues.

Hi @RissyRan, I noticed that the 7b scanned checkpoint has a higher max KL divergence of 0.016245 (see logs). I've updated the threshold (0.015 -> 0.017) to allow this to pass, but please let me know if this level of divergence is a concern.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants