Principled RL for Diffusion LLMs Emerges from a Sequence-Level Perspective

1Gaoling School of Artificial Intelligence, Renmin University of China, 2Beijing Key Laboratory of Research on Large Models and Intelligent Governance, 3Engineering Research Center of Next-Generation Intelligent Search and Recommendation, MOE, 4Stanford University, 5Lambda, Inc, 6Tsinghua University
Corresponding authors

TL;DR: We propose ESPO, a principled sequence-level RL framework for diffusion LLMs that uses ELBO as a tractable likelihood proxy, achieving dramatic improvements across math, coding, and planning tasks.

1. The Challenge: Token-Level RL Doesn't Fit dLLMs

⚠️ The Problem: Mainstream RL algorithms (e.g., GRPO) rely on the autoregressive factorization \(\pi(y|x) = \prod_{k=1}^{L} \pi(y^k|x, y^{<k})\) and its token-level importance ratios. But dLLMs generate sequences non-autoregressively, so these per-token conditionals are intractable.

❌ Existing Heuristic Approaches for Approximating \(\log \pi(y^k|x, y^{<k})\)

d1: Mean-field approximation \(\log p(y^k|x)\) ❌ Ignores token context

UniGRPO/Coupled-GRPO: Token-level ELBO \(\mathcal{L}^k(y|x)\) ❌ Breaks ELBO integrity

💡 Our Insight: Token-level decomposition fundamentally doesn't fit diffusion language models. We must adapt the algorithm to respect sequence-level generation.

2. ESPO: Sequence-Level RL with ELBO Approximation

Core Insight: Treat the generation of an entire sequence as a single action, and use the ELBO \(\mathcal{L}(y|x)\) as a tractable proxy for sequence log-likelihood \(\log \pi(y|x)\).
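
To make the proxy concrete, here is a minimal sketch of a single-sample Monte Carlo ELBO estimate for a masked diffusion LLM. It assumes a linear masking schedule and a hypothetical `model(x, y_noised)` call that returns per-token logits; the estimator actually used in the paper (number of samples, schedule, etc.) may differ.

```python
import torch
import torch.nn.functional as F

def elbo_estimate(model, x, y, mask_id, eps=1e-3):
    """Single-sample Monte Carlo estimate of the ELBO L(y|x) for a masked dLLM.

    `model`, `mask_id`, and the linear masking schedule are illustrative assumptions.
    """
    B, L = y.shape
    t = torch.rand(B, 1, device=y.device).clamp_min(eps)         # diffusion time t ~ U(0, 1]
    is_masked = torch.rand(B, L, device=y.device) < t             # mask each token w.p. t
    y_noised = torch.where(is_masked, torch.full_like(y, mask_id), y)
    logits = model(x, y_noised)                                   # (B, L, vocab)
    token_nll = F.cross_entropy(logits.transpose(1, 2), y, reduction="none")  # (B, L)
    # Sum the NLL over masked positions and weight by 1/t; the ELBO is the negative.
    nelbo = (is_masked * token_nll).sum(dim=-1) / t.squeeze(-1)
    return -nelbo                                                 # lower-bounds log pi(y|x)
```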

1️⃣ Sequence-Level Action Space

Use a sequence-level importance ratio with per-token normalization:

\(\rho_{\mathrm{seq}}^{(i)} = \exp \left( \frac{1}{L} (\mathcal{L}_{\theta}(y^{(i)}|x) - \mathcal{L}_{\theta_{\text{old}}}(y^{(i)}|x))\right)\)

Per-token normalization ensures stability across different sequence lengths.
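
As a small sketch (tensor names and shapes are illustrative), the ratio is just a length-normalized ELBO gap passed through an exponential:

```python
import torch

def sequence_importance_ratio(elbo_new, elbo_old, lengths):
    """rho_seq for a batch of sequences, from ELBO estimates under the current
    and old policies (shape (B,)) and the sequence lengths L (shape (B,))."""
    # Normalize the ELBO gap by sequence length before exponentiating, so the
    # ratio stays on a comparable scale for short and long generations.
    return torch.exp((elbo_new - elbo_old) / lengths)
```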

2️⃣ Stable KL-Divergence Estimation

Use the k2 estimator to avoid exponential instability:

\(\widehat{\mathbb{KL}}_{\text{k2}} = \tfrac{1}{2}\left( \mathcal{L}_\theta(y^{(i)}|x) - \mathcal{L}_{\text{ref}}(y^{(i)}|x) \right)^2\)

The polynomial form keeps gradients stable even for long sequences.
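
A sketch of this estimator, with ELBO values standing in for the intractable log-likelihoods (names are illustrative):

```python
import torch

def kl_k2(elbo_theta, elbo_ref):
    """Per-sequence k2 KL estimate, using ELBOs as log-likelihood proxies."""
    # Quadratic in the log-difference: no exponentiation of a potentially large
    # gap, so gradients stay bounded even for long sequences.
    return 0.5 * (elbo_theta - elbo_ref) ** 2
```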

Complete ESPO Objective Function

\[ \begin{aligned} \mathcal{J}_{\mathrm{ESPO}}(\pi_\theta) = \mathbb{E}_{x \sim \mathcal{D}, y^{(1)}, \dots, y^{(G)} \sim \pi_{\theta_{\text{old}}}(\cdot|x)} \Bigg[ &\frac{1}{G} \sum_{i=1}^{G} \min \Big( \rho_{\mathrm{seq}}^{(i)} \hat{A}^{(i)}, \text{clip} ( \rho_{\mathrm{seq}}^{(i)}, 1 - \epsilon, 1 + \epsilon ) \hat{A}^{(i)} \Big) - \beta \cdot \widehat{\mathbb{KL}}_{\text{k2}} \Bigg] \end{aligned} \]
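
Putting the pieces together, below is a minimal PyTorch-style sketch of the ESPO loss for one group of G rollouts. The ELBO values are assumed to be precomputed (e.g., with a Monte Carlo estimator like the one sketched above), the advantages \(\hat{A}^{(i)}\) are assumed to be GRPO-style group-normalized rewards, and all names are illustrative rather than the paper's implementation.

```python
import torch

def espo_loss(elbo_theta, elbo_old, elbo_ref, advantages, lengths,
              eps=0.2, beta=0.01):
    """Negative ESPO objective for one group of rollouts (all tensors of shape (G,)).

    elbo_theta / elbo_old / elbo_ref: ELBO estimates under the current, old, and
    reference policies; advantages: group-normalized rewards; lengths: values of L.
    """
    # Sequence-level importance ratio with per-token normalization.
    ratio = torch.exp((elbo_theta - elbo_old) / lengths)
    # PPO-style clipped surrogate on whole-sequence actions.
    surrogate = torch.minimum(ratio * advantages,
                              torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages)
    # k2 KL penalty toward the reference policy.
    kl = 0.5 * (elbo_theta - elbo_ref) ** 2
    # Maximize the objective, i.e., minimize its negative mean over the group.
    return -(surrogate - beta * kl).mean()
```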

Empirical Validation

Action space and likelihood comparison

Action Space & Likelihood: Sequence-level + ELBO (blue) achieves stable and superior performance.

KL estimator comparison

KL Estimator: The k2 estimator (blue) enables stable learning, while k1/k3 fail.

3. Experimental Results

ESPO consistently outperforms token-level baselines across diverse benchmarks, with particularly strong gains on planning tasks requiring global reasoning and consistency.

📐 Math & Planning Tasks

Math and Planning results

Dramatic improvements on math and planning tasks: ESPO achieves 60+ point gains on Countdown and Sudoku, and consistent improvements on GSM8K and MATH benchmarks, demonstrating the effectiveness of sequence-level optimization for tasks requiring global consistency.

💻 Coding Benchmarks

Coding results

Consistent gains on coding tasks: ESPO outperforms baselines on HumanEval and MBPP benchmarks.

Bibtex

If you find ESPO useful in your research, please cite our paper:

@article{ou2025principledrldiffusionllms,
      title={Principled RL for Diffusion LLMs Emerges from a Sequence-Level Perspective}, 
      author={Jingyang Ou and Jiaqi Han and Minkai Xu and Shaoxuan Xu and Jianwen Xie and Stefano Ermon and Yi Wu and Chongxuan Li},
      journal={arXiv preprint arXiv:2512.03759},
      year={2025},
}