Simplify MTP #470

jlamypoirier · 2026-02-11T21:10:06Z

✨ Description

Refactor MTP models to bring them closer to non-MTP models, i.e. so that the common subset of config parameters, module names, and parameter names match exactly. This avoids many situations where we would otherwise have to take different code paths depending on whether MTP is enabled.

The MTP config is now just a standard LM config, with `prediction_heads` enabling it. This is essentially identical to what it used to be (before #370). The MTP block is configured from the decoder, using the last layer config, which removes a bit of generality but makes things much simpler.
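A minimal sketch of the idea, not the actual implementation: the class names, the `decoder` list, and the rule that a `prediction_heads` value above 1 enables MTP with one extra block per additional head are all assumptions made for illustration. Only the `prediction_heads` field name and the "reuse the last layer config" behavior come from the description above.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class BlockConfig:
    # Hypothetical per-layer settings; a real block config has many more fields.
    hidden_size: int = 1024
    num_heads: int = 16


@dataclass
class LanguageModelConfig:
    # Hypothetical stand-in for the standard LM config.
    decoder: list[BlockConfig] = field(default_factory=lambda: [BlockConfig()])
    # Assumption: 1 means plain next-token prediction, larger values enable MTP.
    prediction_heads: int = 1

    @property
    def mtp_enabled(self) -> bool:
        return self.prediction_heads > 1

    def mtp_block_configs(self) -> list[BlockConfig]:
        # Assumption: one extra block per additional head.
        # Each block reuses the config of the last decoder layer.
        return [self.decoder[-1]] * (self.prediction_heads - 1)


config = LanguageModelConfig(
    decoder=[BlockConfig(hidden_size=512, num_heads=8), BlockConfig()],
    prediction_heads=3,
)
mtp_blocks = config.mtp_block_configs()
```

With this shape, a non-MTP config is the same object with `prediction_heads` left at its default, so no separate MTP config class is needed.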

As for the modules and weights, the next-token-prediction head is standardized to `base_model.head`, while `base_model.multi_token_prediction` optionally contains the MTP-specific modules. This makes it easier to compare weights between MTP and non-MTP models, and to use logit distillation (which should now fully support MTP).
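To illustrate why the shared naming helps, here is a rough sketch of comparing parameter names between an MTP and a non-MTP checkpoint. The `base_model.head` and `base_model.multi_token_prediction` prefixes come from the description above; the helper function and the concrete parameter names (`base_model.decoder.0.weight`, etc.) are hypothetical.

```python
MTP_PREFIX = "base_model.multi_token_prediction."


def shared_parameter_names(names):
    """Return the parameter names that are not specific to MTP."""
    return {name for name in names if not name.startswith(MTP_PREFIX)}


# Hypothetical parameter names for a non-MTP model.
non_mtp_names = [
    "base_model.decoder.0.weight",
    "base_model.head.weight",
]
# Hypothetical MTP model: same names, plus the optional MTP module.
mtp_names = non_mtp_names + [
    "base_model.multi_token_prediction.0.weight",
]

# After dropping the MTP-only entries, both models expose the same names.
names_match = shared_parameter_names(mtp_names) == set(non_mtp_names)
```

Because the common subset matches exactly, weights can be compared or loaded key by key without a renaming step.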

jlamypoirier added 2 commits February 11, 2026 15:50

Simplify MTP

15c0e43

misc

f803e82

jlamypoirier marked this pull request as ready for review February 11, 2026 21:12
