⚠️ The Problem: Mainstream RL algorithms (e.g., GRPO) rely on the autoregressive factorization \(\pi(y|x) = \prod_{k=1}^{L} \pi(y^k|x, y^{<k})\) with token-level importance ratios. But diffusion LLMs (dLLMs) generate sequences non-autoregressively, making these per-token conditionals intractable.
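To make the mismatch concrete, here is roughly what a GRPO-style objective consumes: per-token importance ratios built from \(\log \pi(y^k|x, y^{<k})\) under the current and old policies. This is an illustrative sketch (the tensor names and helper are ours, not from any specific codebase); the point is that every quantity below presupposes tractable per-token conditionals, which a dLLM does not provide.

```python
import torch

def grpo_token_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped per-token surrogate; all inputs have shape (batch, seq_len)."""
    ratio = torch.exp(logp_new - logp_old)               # pi_new / pi_old per token
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    # PPO-style pessimistic bound, averaged over tokens and the batch.
    return torch.minimum(ratio * advantages, clipped * advantages).mean()
```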
❌ Existing Heuristic Approaches for Approximating \(\log \pi(y^k|x, y^{<k})\)
d1: Mean-field \(\log p(y^k|x)\)
❌ Ignores token context: conditions only on the prompt \(x\), never on any other token of \(y\)
UniGRPO/Coupled-GRPO: Per-token ELBO term \(\mathcal{L}^k(y|x)\)
❌ Breaks ELBO integrity: the ELBO lower-bounds \(\log \pi(y|x)\) only as a whole, so its per-token slices are not valid token log-probabilities (both heuristics are sketched in code below)
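A hedged sketch of the two heuristics above, under an assumed masked-dLLM interface: `model(x, y_masked)` returns per-position logits of shape `(seq_len, vocab)`, `y` is a 1-D token tensor, and `mask_id` is a hypothetical mask-token id. Neither function is the papers' exact implementation; they only illustrate what each approximation conditions on.

```python
import torch

def d1_mean_field_logp(model, x, y, mask_id):
    """d1-style mean-field score: one fully masked pass gives log p(y^k | x).
    Nothing in y conditions the prediction, hence 'ignores token context'."""
    fully_masked = torch.full_like(y, mask_id)
    logp = model(x, fully_masked).log_softmax(-1)        # (L, V)
    return logp.gather(-1, y.unsqueeze(-1)).squeeze(-1)  # (L,) per-token scores

def token_elbo_term(model, x, y, k, mask_id):
    """UniGRPO/Coupled-GRPO-style per-token score: mask a random subset that
    includes position k and keep only k's term -- one slice of the sequence
    ELBO, which lower-bounds log pi(y|x) only as a whole."""
    t = torch.rand(()).clamp_min(1e-3)                   # masking level t ~ U(0, 1]
    mask = torch.rand(y.shape) < t
    mask[k] = True                                       # position k must be masked
    y_t = torch.where(mask, torch.full_like(y, mask_id), y)
    logp = model(x, y_t).log_softmax(-1)                 # (L, V)
    return logp[k, y[k]] / t                             # 1/t weight from the ELBO
```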
💡 Our Insight: Token-level decomposition fundamentally doesn't fit diffusion language models. We must adapt the algorithm to respect sequence-level generation.
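One sequence-level alternative consistent with this insight: estimate the full masked-diffusion ELBO, which lower-bounds \(\log \pi(y|x)\) for the whole sequence, and feed that scalar into the RL objective instead of fabricated per-token conditionals. Below is a minimal Monte Carlo sketch under the same assumed model interface as above; it is not the exact estimator used in any of the cited methods.

```python
import torch

def sequence_elbo(model, x, y, mask_id, n_samples=8):
    """Monte Carlo estimate of the sequence-level ELBO:
    log pi(y|x) >= E_t[(1/t) * sum over masked k of log p(y^k | x, y_t)]."""
    estimates = []
    for _ in range(n_samples):
        t = torch.rand(()).clamp_min(1e-3)               # masking level
        mask = torch.rand(y.shape) < t
        y_t = torch.where(mask, torch.full_like(y, mask_id), y)
        logp = model(x, y_t).log_softmax(-1)             # (L, V)
        tok = logp.gather(-1, y.unsqueeze(-1)).squeeze(-1)
        estimates.append(tok[mask].sum() / t)            # all masked terms, one joint bound
    return torch.stack(estimates).mean()                 # scalar surrogate for log pi(y|x)
```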