Support GLM5 and GLM5.1 in MinT: LoRA training for DSA and MTP

GLM5 support sounded routine at first. It wasn't. What looked like a simple model bring-up turned into comprehensive work across training, serving, and checkpoint conversion.

When we first planned for GLM5 [1] support in MinT, it looked straightforward enough. Add the new architecture, a few runtime patches, bridge the checkpoints, then ship it! That was the theory.

In practice, GLM5 is difficult because too many parts of the stack have to agree on the same model at the same time. It is not just another huge MoE model: it also brings Multi-Head Latent Attention (MLA) [2], DeepSeek Sparse Attention (DSA) [3], and Multi-Token Prediction (MTP) [2]. The training stack, the serving stack, and the checkpoint bridge can each work in isolation and still disagree with each other. That was the actual project.

As a result, we solved three separate problems:

  1. Training had to work the way we expected, especially around DSA and MTP.
  2. Inference had to serve LoRA for the DSA architecture.
  3. The bridge had to convert checkpoints across frameworks correctly.


Figure 1. Training loss curves for the model and MTP of GLM5.1 with LoRA adapters.

Training

Upgrading to transformers 5

Native GLM5 support arrived with the move to transformers 5.2.0, and GLM5's tokenizers need transformers 5 to load. In VeRL PR #5445, we shipped the required patches to model loading and tokenization behavior.

Aligning rollout and training

The first real challenge was that training results no longer aligned with rollout.

The Megatron side needed the DSA fixes in Megatron-LM PR #3026. The indexer RoPE path was changed to split the indexer state into x_pe and x_nope, apply RoPE only to the positional slice, and concatenate back in that same order, matching the DeepSeek indexer implementation. DSA also stopped using fused qk-layernorm-plus-linear projections and instead wires explicit q_layernorm and kv_layernorm with fuse_input_layernorm=False, so the indexer really sees normalized q and k.
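The ordering fix can be sketched in numpy. This is a simplified illustration, assuming a rotate-half RoPE convention and a positional-slice-first layout; the real kernel lives in Megatron-LM and may differ in those details:

```python
import numpy as np

def rope_rotate(x, cos, sin):
    # Rotate-half RoPE (assumed convention; real kernels may interleave).
    x1, x2 = np.split(x, 2, axis=-1)
    return np.concatenate([x1 * cos - x2 * sin, x2 * cos + x1 * sin], axis=-1)

def indexer_rope(x, pe_dim, cos, sin):
    # Split the indexer state into (x_pe, x_nope), apply RoPE only to the
    # positional slice, and concatenate back in that same order.
    x_pe, x_nope = x[..., :pe_dim], x[..., pe_dim:]
    return np.concatenate([rope_rotate(x_pe, cos, sin), x_nope], axis=-1)
```

The key property is that the non-positional slice passes through untouched and the slice order is preserved on the way back.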

However, these fixes were not enough. The train-inference mismatch improved, but it remained high.

We found that small differences between rollout and training are usually just numerical noise with ordinary dense attention, but DSA is less forgiving. A small difference there can change which tokens get selected at all. Once that happens, training is no longer attending to the same tokens that produced the samples.
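A toy example makes the sensitivity concrete. With two near-tied indexer scores, float-level drift between rollout and training kernels flips which token the top-k selection keeps (the numbers here are made up for illustration):

```python
import numpy as np

def topk_ids(scores, k):
    # Indices of the k highest indexer scores.
    return set(np.argsort(-np.asarray(scores))[:k])

# Two score vectors that differ only by float-level numerical drift,
# as can happen between rollout and training kernels.
rollout_scores = [0.5000000, 0.4999999, 0.30]
train_scores   = [0.5000000, 0.5000009, 0.30]
```

Dense attention would barely notice a 1e-6 perturbation; a top-k indexer selects a different token entirely.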

The GLM5 technical report [1] gets specific here: it uses IcePop [4] to mitigate general training-inference mismatch during RL, and in its DSA RL insights it notes that deterministic torch.topk was needed to mitigate the mismatch in DSA indexer token selection, with the indexer frozen by default during RL. In practice, that means the sparse path has to line up much more tightly than people are used to: top-k behavior, indexer behavior, and the exact tokens used during rollout and training.

Unlike replay for MoE routers [5], replaying every DSA indexer decision is expensive in both engineering effort and compute, which is why IcePop ended up mattering. In VeRL PR #5722, we extended the semantics of rollout correction: if the threshold is a float, it still applies truncated importance sampling as before; if both a lower and an upper bound are passed, we switch to IcePop and zero out importance weights outside the trusted range.
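The two behaviors can be sketched in a few lines. This is a simplified stand-in for the VeRL code path, assuming per-token probability ratios as input:

```python
import numpy as np

def rollout_correction(ratio, threshold):
    """Correct importance weights for rollout/training mismatch (sketch).

    ratio: per-token probability ratio pi_training / pi_rollout.
    threshold: a float keeps the old truncated importance sampling;
               a (lower, upper) pair switches to IcePop-style masking,
               zeroing out weights outside the trusted range.
    """
    ratio = np.asarray(ratio, dtype=np.float64)
    if isinstance(threshold, (tuple, list)):
        lo, hi = threshold
        return np.where((ratio >= lo) & (ratio <= hi), ratio, 0.0)
    return np.minimum(ratio, threshold)
```

Truncation caps extreme weights but still trains on them; IcePop drops those tokens from the gradient entirely.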

Making DSA fit long-context training

It was not enough for DSA to run in a toy setup. It had to survive the long-context training configuration we actually use, especially the Token–Head–Dim (THD) packed layout for attention inputs and Context Parallelism (CP) for sequence sharding, both of which save GPU memory and make training more efficient [6].

We shipped Megatron-LM PR #3674 for THD and CP support. We added TileLang [7] fused kernels for the DSA indexer and the sparse MLA path; these kernels require absorbed MLA [8], which was not yet supported. To implement absorbed MLA, we pass position_ids and up_v_weight into DSA core attention and add get_absorb_query_key_value_tensors(), which materializes the effective linear_kv_up_proj weight (including the LoRA delta if the layer is wrapped), splits that weight into up_k_weight and up_v_weight, normalizes the KV latent with the same LayerNorm or RMSNorm semantics as the wrapped projection, gathers sequence-parallel latents when needed, and then rewrites query and key into the absorbed layout. Query becomes content projected into kv_lora_rank plus positional channels; key becomes the KV latent plus positional channels; value is set to None, and DSA receives up_v_weight so it can project values later. A chunked, unfused sparse-attention fallback is also supported in case the TileLang fused kernels are not available.
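To make the weight handling concrete, here is a heavily simplified numpy sketch of materializing the effective up-projection weight and splitting it into key and value parts. Shapes, argument names, and the LoRA-merge step are illustrative assumptions, not the Megatron-LM code:

```python
import numpy as np

def split_kv_up_proj(w_up, num_heads, qk_nope_dim, v_dim, lora_a=None, lora_b=None):
    """Materialize the effective linear_kv_up_proj weight and split it into
    up_k_weight and up_v_weight (sketch; layout is assumed).

    w_up: [num_heads * (qk_nope_dim + v_dim), kv_lora_rank]
    lora_a / lora_b: optional LoRA factors whose product is the delta.
    """
    # Include the LoRA delta if the layer is wrapped.
    w = w_up if lora_a is None else w_up + lora_b @ lora_a
    w = w.reshape(num_heads, qk_nope_dim + v_dim, -1)
    up_k = w[:, :qk_nope_dim, :]  # projects the KV latent into key content
    up_v = w[:, qk_nope_dim:, :]  # handed to DSA to project values later
    return up_k, up_v
```

The point of the split is that attention can run entirely in the kv_lora_rank latent space, with the value projection deferred until after sparse selection.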

The PR also adds fused indexer helpers such as _fused_qk_topk_lighting() and _fused_qk_topk_lighting_with_streaming_sparse_kl(), keeps a dense indexer-loss fallback through FusedDSAIndexerLoss.apply, and takes the TileLang path only when the imports succeed and row-wise bounds are available.
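The guarded-fallback pattern looks roughly like this. Module and function names here are hypothetical; only the structure (fused path when imports succeed, dense fallback otherwise) mirrors the PR:

```python
import numpy as np

# Take the fused TileLang kernel only when its import succeeds;
# otherwise fall back to an unfused dense implementation.
try:
    from tilelang_fused import fused_qk_topk  # hypothetical module name
    _HAVE_FUSED = True
except ImportError:
    _HAVE_FUSED = False

def indexer_topk(q, k, top_k):
    if _HAVE_FUSED:
        return fused_qk_topk(q, k, top_k)
    # Dense fallback: full q @ k^T scores, then per-query top-k indices.
    scores = q @ k.T
    return np.argsort(-scores, axis=-1)[:, :top_k]
```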

There are also a few config-level changes because DeepSeek V3.2 [3] has some implementation differences compared with GLM5.

Stabilizing MTP support

MTP touches model structure, checkpoint conversion, and the training loss path at the same time. That means support can look fine right up until upstream changes one piece and the others quietly fall out of sync.

The first step was VeRL PR #5323, where we enabled Megatron-Bridge-backed MTP models. That moved MTP into the normal training path instead of leaving it as a special-cased feature tied to specific dependency versions.

To match upstream behavior in Megatron-Core MTP loss handling, we further introduced VeRL PR #5587, where we use the upstream process_mtp_loss when available, fall back to the older behavior when that API is missing, and still keep the PPO requirement that the model return logits instead of always collapsing everything into a built-in cross-entropy loss.
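The compatibility logic is essentially feature detection rather than version pinning. A minimal sketch, using a namespace stand-in for the Megatron-Core module (the real import path and signatures are not shown here):

```python
from types import SimpleNamespace

def _legacy_mtp_loss(logits, labels):
    # Stand-in for the older MTP loss path kept as a fallback.
    return ("legacy", logits, labels)

def resolve_mtp_loss(mcore_ns):
    """Pick the upstream MTP loss helper when the installed Megatron-Core
    exposes it; otherwise keep the older behavior. Detecting the feature
    is more robust than comparing version strings."""
    return getattr(mcore_ns, "process_mtp_loss", _legacy_mtp_loss)
```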

We also had to keep the MLA patch aligned with newer Megatron versions. In VeRL PR #6005, once the relevant MLA flash-attention forward fix landed upstream, the local patch only needed to stay active for Megatron-Core versions without the change. Once a model family depends on several fast-moving upstreams, version mismatches start consuming real engineering time.

Inference

On the inference side, vLLM's LoRA path for GLM5 faced unique challenges: MLA has its own projection structure, DSA adds indexer-related modules, and MoE adds grouped expert weights. As a result, even when the LoRA weights are present, loading them can still fail because of wrong runtime assumptions about module type or wrapper layout.

That is what we fixed in vLLM PR #35077, and the fixes are spread across the LoRA stack.

Some of the failures were straightforward. LoRA registration could fail on modules like fused_qkv_a_proj. MLA post-load processing could break because the LoRA wrapper hid quant_method behind base_layer. MoE LoRA could end up on a backend path that rewrote the weight layout into something the adapter path could no longer interpret correctly.

After these fixes, nothing looked especially broken and adapters started loading, but we soon discovered that the outputs did not match the results of inference with the LoRA weights merged in.

We eventually found the root cause: some of the GLM5 LoRA targets implemented in vLLM are not plain linear layers. Modules such as fused_qkv_a_proj and GateLinear carry custom forward() logic, custom kernels, output dtype handling, or tensor-parallel communication behavior. A generic LoRA wrapper can break these modules by routing them through the wrong implementation. In our patch, BaseLinearLayerWithLoRA adds _apply_base_forward(), which preserves the base layer's own forward() and only applies the LoRA delta afterward. MergedColumnParallelLinearWithLoRA uses that route only for effectively unsharded subclasses of MergedColumnParallelLinear, so custom fused modules keep their subclass-specific behavior instead of being forced through the generic merged-column path. ReplicatedLinearWithLoRA makes the same change for ReplicatedLinear subclasses.
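A minimal sketch of the idea, using a toy stand-in for a module with custom forward() behavior; none of this is the actual vLLM code, and the class names are simplified:

```python
import numpy as np

class QuirkyLinear:
    """Stand-in for a module with custom forward() logic, like the fused
    or gated projections in the GLM5 stack (illustrative only)."""
    def __init__(self, weight):
        self.weight = weight
    def forward(self, x):
        # Custom behavior that a generic LoRA matmul would silently drop.
        return np.maximum(x @ self.weight.T, 0.0)

class LinearWithLoRA:
    """Sketch of the fix: route through the base layer's own forward(),
    then add the LoRA delta, instead of re-implementing the base matmul."""
    def __init__(self, base_layer, lora_a, lora_b):
        self.base_layer, self.lora_a, self.lora_b = base_layer, lora_a, lora_b
    def _apply_base_forward(self, x):
        return self.base_layer.forward(x)  # preserve subclass behavior
    def forward(self, x):
        return self._apply_base_forward(x) + (x @ self.lora_a.T) @ self.lora_b.T
```

With a zero delta the wrapped layer reproduces the base layer exactly, which is the invariant the generic wrapper was breaking.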

Bridge

Lining up weight sync

Megatron, Hugging Face, and the vLLM LoRA target module tree can name the same weight in different ways, so we also had to make the DSA-related LoRA mapping consistent. In VeRL PR #5462, we added DSA-specific target modules such as linear_wq_b, linear_wk, and linear_weights_proj, so the indexer modules actually participate in the LoRA mapping.
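For illustration, the change amounts to extending the LoRA target list along these lines. The surrounding config keys are illustrative; only the three indexer module names come from the PR:

```python
# DSA indexer modules added to the LoRA target mapping (sketch;
# the exact config surface depends on the VeRL version).
lora_config = {
    "target_modules": [
        # ... existing attention / MoE projection targets ...
        "linear_wq_b",          # indexer query projection
        "linear_wk",            # indexer key projection
        "linear_weights_proj",  # indexer weights projection
    ],
}
```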

From DeepSeek V3.2 to GLM5

Early on, before native GLM5 support landed in the transformers library, we needed compatibility paths just to get GLM5 booting. That is the approach taken in THUDM/slime PR #1599, where GLM5 was temporarily supported through the DeepSeek V3.2 path.

That is what our GLM5 bridge work in Megatron-Bridge PR #2469 and PR #2913 was about. The key change was not just that GLM5 got a bridge entry: we needed a dedicated glm_moe_dsa bridge/provider path that mapped the core DSA and MLA settings, the MoE structure, and the MTP layers. Megatron-Bridge PR #2644 was also needed to construct the MTP mappings. By storing hf_pretrained and hf_config directly on the bridge instance, downstream bridges can inspect Hugging Face model state early enough to decide what kind of conversion they need to do.

Facilitating grouped experts conversion

Transformers 5 support introduced fused expert layouts [9]: one framework may store a group of experts in fused form, while another expects them split apart. Import and export then become structural operations: slice here, regroup there, concatenate on the way back.
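As a toy illustration of "slice here, regroup there", a fused-to-split round trip. The stacked-on-dim-0 layout is an assumption for the sketch; the real bridges also transpose and rename tensors:

```python
import numpy as np

def split_experts(fused):
    # Fused layout: all experts stacked along dim 0 (assumed convention).
    return [fused[i] for i in range(fused.shape[0])]

def fuse_experts(expert_list):
    # Regroup per-expert tensors back into the fused layout.
    return np.stack(expert_list, axis=0)
```

The invariant the bridge has to preserve is that split followed by fuse (and vice versa) is lossless.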

Grouped LoRA export needed the same kind of fix. In Megatron-Bridge PR #3341, grouped expert export was changed so that adapter weights are merged before the grouped tensors are accumulated and transposed. If the merge happens too late, the exported checkpoint can still look valid while no longer matching the adapter you actually trained, because the adapter weights never made it into the merged weights.
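The ordering constraint can be sketched as follows. The grouped layout and transpose are assumptions for illustration; the point is only that the per-expert merge happens before grouping:

```python
import numpy as np

def export_grouped_experts(base_weights, lora_deltas):
    """Merge each expert's LoRA delta BEFORE accumulating the grouped,
    transposed tensor (sketch). Merging after grouping can produce a
    checkpoint that looks valid but no longer contains the adapter."""
    merged = [w + d for w, d in zip(base_weights, lora_deltas)]
    # Accumulate into the grouped layout, transposed to the target
    # framework's expected shape (layout assumed for illustration).
    return np.stack([m.T for m in merged], axis=0)
```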

Why GLM5.1 was easier

After all of that, GLM5.1 [10] was not the hard part, because it stays in the same GlmMoeDsaForCausalLM architecture. Once training, inference, and checkpoint conversion all understand that family, GLM5.1 is mostly just a new checkpoint.

Closing

By the end of this work, the stack changes for both GLM5 and GLM5.1 were in place in MinT. If you are interested in training or fine-tuning GLM5 on MinT, please get in touch with us at sales@mindlab.ltd.

On the training side, we had to make DSA RL less sensitive to train-inference mismatch and make sparse attention work under THD and CP for long-context training. On the inference side, we had to make vLLM load DSA LoRA correctly. On the bridge side, we had to make checkpoint conversion preserve the actual structure of the model.

Once the stack can move DSA, MLA, MoE, MTP, and LoRA through training, inference, and checkpoint conversion without changing what the model means, future bring-up of new models gets much easier. That is why this work matters beyond one model release.

References

[1] GLM-5: from Vibe Coding to Agentic Engineering (GLM-5-Team et al, 2026)

[2] DeepSeek-V3 Technical Report (DeepSeek-AI et al, 2024)

[3] DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models (DeepSeek-AI et al, 2025)

[4] Every Step Evolves: Scaling Reinforcement Learning for Trillion-Scale Thinking Model (Ling Team et al, 2025)

[5] Router Replay R3: Why It Failed and How We Fixed It (Jiang et al, 2026)

[6] Scalable Training of Mixture-of-Experts Models with Megatron Core (Yan et al, 2026)

[7] TileLang: A Composable Tiled Programming Model for AI Systems (Wang et al, 2025)

[8] FlashMLA: Efficient Multi-head Latent Attention Kernels (DeepSeek-AI et al, 2025)

[9] Transformers v5 MoE Weight Loading (ByteDance-Seed et al, 2026)

[10] GLM-5.1 Model Card (GLM-5-Team et al, 2025)

Author

Mind Lab

Core Contributors

Songlin Jiang, Yiwen Lu, Qihan Liu, Nolan Ho, Andrew Chen, Pony Ma

Team

Andrew Chen, Kaijie Chen, Song Cao, Yuan Cheng, Kaixuan Fan, Huan Feng, Nolan Ho, Chongru Huang, Songlin Jiang, Fancy Kong, Jingdi Lei, Xiang Lei, Alyssa Li, Lucian Li, Rui Li, Tianchen Li, Nan Liu, Qihan Liu, Xiang Liu, Yiwen Lu, Runze Lv, Pony Ma, Wenbin Wang, Rio Yang, Shiro Yang, Jiarui Yao, Ruijian Ye, Salmon Zhan, Anya Zhang, Di Zhang, Ruijia Zhang, Shiqi Zhang, Changhai Zhou, Xinyue Zhu, Yihui Zhuang and Mindverse Team

Names are listed alphabetically within team.

Citation

Please cite this work using the BibTeX citation:

@misc{songlinjiang2026supportglm5inmint,
  author = {Songlin Jiang and Yiwen Lu and Qihan Liu and Nolan Ho and Andrew Chen and Pony Ma and {Mind Lab}},
  title = {Support GLM5 and GLM5.1 in MinT: LoRA training for DSA and MTP},
  year = {2026},
  howpublished = {Mind Lab: A Lab for Experiential Intelligence},
  note = {https://macaron.im/mindlab/research/support-glm5-and-glm51-in-mint-lora-training-for-dsa-and-mtp}
}

Mind Lab © 2025 · contact@mindlab.ltd