use MTP if model_dir and draft_model_dir are equal#424

Open
suspicious-pineapple wants to merge 1 commit into
theroyallab:mainfrom
suspicious-pineapple:patch-2

@suspicious-pineapple

Why should this feature be added?
This seems to be the minimal set of changes needed to make MTP work on the latest exl3 dev branch.

Examples
MTP is enabled if the main model is the same as the draft model (model_dir and draft_model_dir are equal); otherwise it behaves normally.
Maybe this would more sanely be exposed as a config option?
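
A rough sketch of the check described above, not the actual diff in this PR. The function name `should_use_mtp` is hypothetical, and normalizing both paths before comparing is an assumption about how "equal" should be interpreted:

```python
from pathlib import Path
from typing import Optional, Union

PathLike = Union[str, Path]


def should_use_mtp(model_dir: PathLike, draft_model_dir: Optional[PathLike]) -> bool:
    """Return True when the draft model dir points at the main model dir.

    Hypothetical helper: the name and the path normalization are assumptions
    for illustration, not taken from the tabbyAPI codebase.
    """
    if draft_model_dir is None:
        # No draft model configured, so behave normally (no MTP).
        return False

    # Resolve both paths so "models/foo" and "models/foo/" compare equal.
    return Path(model_dir).resolve() == Path(draft_model_dir).resolve()


if __name__ == "__main__":
    print(should_use_mtp("models/Qwen3.6-27B-exl3", "models/Qwen3.6-27B-exl3/"))
    print(should_use_mtp("models/Qwen3.6-27B-exl3", "models/some-other-draft"))
```

An explicit config option would avoid overloading draft_model_dir with a second meaning, at the cost of one more setting to document.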

Additional context
Tested with https://huggingface.co/turboderp/Qwen3.6-27B-MTP-exl3 (you have to download the safetensors file and put it in the model dir; I assume it will be included by default in future quants, where supported).

@randoentity

Contributor

I can't get this to work yet.

I'm on exllamav3 9c5009efaa2cda8ed341369123bb4acfe18ae300
tabbyAPI 2e50555 + patch-2

Using https://huggingface.co/turboderp/Qwen3.6-27B-MTP-exl3 and UnstableLlama_Qwen3.6-27B-exl3-8.00bpw.
3 draft module layers get loaded, but it raises an error on generation.

AI generated report below:

Bug Report: AttributeError during MTP Draft Model Generation

Description

When initiating a chat completion with Multi-Token Prediction (MTP) enabled via the ExLlamaV3 backend, the generation process crashes. The error indicates that a linear module's inner component (self.inner) is None when attempting to perform a forward pass during draft model iteration.

Steps to Reproduce

  1. Configure TabbyAPI with the ExLlamaV3 backend.
  2. Load a model/architecture that utilizes MTP or a draft model.
  3. Send a chat completion request to trigger streaming generation.
  4. Monitor the server logs.

Expected Behavior

The model should successfully iterate through draft tokens using MTP and stream the completion without crashing.

Actual Behavior

The server raises an AttributeError: 'NoneType' object has no attribute 'forward' and aborts the generation request.

Error Log & Traceback Analysis

Critical Error:

AttributeError: 'NoneType' object has no attribute 'forward'
File "exllamav3/exllamav3/modules/linear.py", line 426, in forward
    x = self.inner.forward(x, params, out_dtype)

Call Stack Highlights:

  1. tabbyAPI/backends/exllamav3/model.py initiates generate_gen.
  2. exllamav3/exllamav3/generator/generator.py calls iterate_draftmodel_mtp_gen.
  3. At generator.py:525, it attempts: batch_logits = self.model.modules[self.model.logit_layer_idx].forward(batch_state, params)
  4. The forward pass enters linear.py:426 where self.inner is unexpectedly None.

Potential Causes

  • The draft model's linear layers were not correctly initialized or loaded from the state dictionary.
  • Architecture mismatch between the loaded model weights and the ExLlamaV3 module definition for MTP layers.
  • Missing or corrupted weight tensors for the specific logit layer index used in MTP drafting.
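
To narrow down which of these causes applies, one option is to scan the loaded modules for linear layers whose `inner` was never populated. The sketch below is self-contained and only assumes what the traceback shows: that modules live in a list and that the failing linear module has an `inner` attribute. The helper name `find_unloaded_modules` and the stand-in classes are hypothetical and are not part of exllamav3:

```python
from typing import Any, List, Sequence, Tuple


def find_unloaded_modules(modules: Sequence[Any]) -> List[Tuple[int, str]]:
    """Return (index, class name) for every module whose `inner` is None.

    Hypothetical diagnostic helper. Modules without an `inner` attribute
    are skipped, since only linear-style modules are of interest here.
    """
    missing = []
    for idx, module in enumerate(modules):
        if hasattr(module, "inner") and module.inner is None:
            missing.append((idx, type(module).__name__))
    return missing


# Stand-in classes so the sketch runs without exllamav3 installed.
class FakeLinear:
    def __init__(self, inner: Any) -> None:
        self.inner = inner


class FakeNorm:
    pass


if __name__ == "__main__":
    fake_modules = [FakeNorm(), FakeLinear(inner=object()), FakeLinear(inner=None)]
    for idx, name in find_unloaded_modules(fake_modules):
        print(f"module {idx} ({name}) has inner=None")
```

Against a real model, the same loop could in principle be run over `model.modules` right after loading; if the module at `logit_layer_idx` shows up in the result, that would point at the weights not being loaded rather than a problem in the generator.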

Environment

  • Python Version: 3.13
  • Backend: ExLlamaV3
  • Application: TabbyAPI
  • Date: 2026-06-10

Note: This bug report was drafted with the assistance of AI based on the provided traceback log.
