Normalizing Trajectory Models
Generated by JarvisForResearchers Bot on 2026-05-12
Why we featured this paper
Not yet indexed in S2 — assumed brand-new preprint
TL;DR
Normalizing Trajectory Models (NTM) reframes the diffusion reverse conditional \(p(x_s | x_t)\) as a conditional normalizing flow. This enables exact likelihood training and high-quality image generation even when the generative process is compressed to a few coarse transitions, overcoming the Gaussian-assumption bottleneck inherent in standard diffusion models.
The Problem
Diffusion models fundamentally rely on approximating the reverse conditional probability \(p(x_s | x_t)\) as Gaussian. This assumption is a significant constraint, particularly when the generative process is compressed—for instance, when reducing the number of required sampling steps from hundreds to just a few coarse transitions. When this compression occurs, the true reverse conditional distribution becomes highly non-Gaussian, causing the standard Gaussian approximation to fail and leading to a fundamental degradation in generation quality.
The existing literature has attempted to address this: distillation methods and consistency models bypass the likelihood framework entirely. DDGAN replaces the Gaussian assumption with an implicit distribution learned adversarially, but this approach introduces known issues such as mode-seeking behavior and training instability. Critically, no prior work has successfully combined the requirement for few-step generation with the rigorous framework of an exact likelihood model for the reverse process.
Key Contributions
We introduce a framework that models the non-Gaussian reverse conditional \(p(x_s | x_t)\) by employing an invertible transporter coupled with a Gaussian predictor. This design yields an exact log-likelihood while simultaneously bridging the gap between representation learning and rigorous probabilistic modeling. Furthermore, we provide a specific finetuning recipe that allows initialization from existing pretrained diffusion or flow-matching models using an identity transporter and zero-initialized scale correction, thereby preserving the quality of the prior training. Finally, we demonstrate a score-based trajectory denoising mechanism that leverages the exact likelihood and Markov covariance to jointly correct generated trajectories, which can then be distilled into a learned denoiser \(g_\phi\) capable of achieving high quality in just four steps without requiring additional training data.
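As a concrete illustration of the finetuning recipe, the sketch below shows one way the identity-transporter initialization could be implemented, assuming each coupling block ends in a linear head that predicts a log-scale and a shift; the function and argument names are hypothetical and not taken from the paper.

```python
import torch.nn as nn

def init_for_finetuning(coupling_output_heads):
    """Hypothetical initialization matching the recipe above: zero the final
    projection of every coupling block so each block predicts log_scale = 0
    and shift = 0. The transporter f_T then starts as the identity map, and
    the zero-initialized scale correction leaves the pretrained predictor's
    outputs unchanged at the start of finetuning.
    `coupling_output_heads` is an assumed list of nn.Linear output layers."""
    for head in coupling_output_heads:
        nn.init.zeros_(head.weight)
        nn.init.zeros_(head.bias)
```

With the output heads zeroed, every coupling block acts as the identity at the start of finetuning, so the pretrained diffusion or flow-matching behaviour is reproduced until the transporter begins to learn.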
How It Works
Figure 1: Text-to-image generation with NTM using 4 denoising steps. Samples are shown from models trained from scratch at 256×256 and from models obtained by finetuning pretrained flow-matching checkpoints at 512×512.
NTM operates by mapping both the target state \(x_s\) and the noisy state \(x_t\) into a shared latent \(u\)-space via a single invertible transporter, \(f_T\). Once in this latent space, a stochastic predictor, \(f_P\), generates an estimate \(\hat{u}_s\) conditioned on the noisier representation \(u_t\), a standard Gaussian noise variable \(z \sim \mathcal{N}(0, I)\), and any conditioning information \(y\). By structuring \(f_T\) as a stack of TarFlow blocks with spatial NVP coupling and defining \(f_P\) as an affine map, the distributional distance \(D\) between the true and predicted distributions is recast exactly as the negative log-likelihood, \(L_{NTM} = -\log p(x_s | x_t)\). This formulation permits end-to-end training by minimizing \(L_{NTM}\), which decomposes into a sum over trajectory steps, enabling high-fidelity generation with few steps.
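A minimal sketch of one reverse transition under this scheme, with \(f_T\), its inverse, and \(f_P\) passed in as stand-in callables (the paper's actual modules are described in the subsections below):

```python
import torch

def ntm_step(x_t, y, f_T, f_T_inv, f_P):
    """One reverse transition x_t -> x_s (sketch; f_T, f_T_inv, and f_P are
    stand-ins for the transporter, its inverse, and the predictor)."""
    u_t = f_T(x_t)              # transport the noisy state into u-space
    z = torch.randn_like(u_t)   # fresh standard Gaussian noise
    u_s_hat = f_P(u_t, z, y)    # stochastic prediction in u-space
    return f_T_inv(u_s_hat)     # map the estimate back to data space
```

Repeating this step a handful of times traverses the coarse trajectory from noise to data.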
Shared Transporter \(f_T\)
The Shared Transporter \(f_T\) is an invertible, same-dimensional mapping. It is implemented as a stack of TarFlow blocks utilizing spatial NVP coupling. Its function is to map the input data points, \(x_s\) and \(x_t\), into corresponding latent representations, \(u_s\) and \(u_t\), respectively. This transformation is crucial as it allows the subsequent predictor to operate in a space where the distributional properties are more amenable to tractable likelihood computation.
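The snippet below is a minimal NVP-style coupling layer, included to make the invertibility and log-determinant bookkeeping concrete; it is a simplified stand-in for the paper's attention-based TarFlow blocks, and the layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """Minimal NVP-style coupling layer: one half of the features predicts a
    scale and shift for the other half, so both the forward map and its
    inverse are cheap and the log-determinant is the sum of log-scales."""
    def __init__(self, dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim // 2, hidden), nn.GELU(),
            nn.Linear(hidden, dim),            # -> [log_scale, shift]
        )

    def forward(self, x):
        x_a, x_b = x.chunk(2, dim=-1)
        log_s, t = self.net(x_a).chunk(2, dim=-1)
        u_b = x_b * log_s.exp() + t
        logdet = log_s.sum(dim=-1)             # contributes the log sigma_T terms
        return torch.cat([x_a, u_b], dim=-1), logdet

    def inverse(self, u):
        u_a, u_b = u.chunk(2, dim=-1)
        log_s, t = self.net(u_a).chunk(2, dim=-1)
        x_b = (u_b - t) * (-log_s).exp()
        return torch.cat([u_a, x_b], dim=-1)
```

Stacking several such blocks, with the halves swapped or permuted between them, yields an invertible \(f_T\) whose total log-determinant is the sum of the per-block terms.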
Stochastic Predictor \(f_P\)
The Stochastic Predictor \(f_P\) is responsible for estimating the target latent state. It generates \(\hat{u}_s\) as a function of the noisy latent representation \(u_t\), an independent latent variable \(z\) drawn from a standard normal distribution, and any provided condition \(y\). The functional form is \(\hat{u}_s = f_P(u_t, z, y)\). This predictor acts as the core generative step within the NTM framework.
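A hedged sketch of \(f_P\) as an affine map in \(u\)-space: a small network predicts a mean and log-standard-deviation from \((u_t, y)\), and the prediction is \(\hat{u}_s = \mu + \sigma \odot z\). The architecture and dimensions are illustrative only.

```python
import torch
import torch.nn as nn

class AffinePredictor(nn.Module):
    """Sketch of f_P as an affine map in u-space: the network predicts a
    mean and log-std from (u_t, y); the sample is mean + std * z.
    Module names and shapes are illustrative, not the paper's."""
    def __init__(self, dim, cond_dim, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + cond_dim, hidden), nn.GELU(),
            nn.Linear(hidden, 2 * dim),
        )

    def forward(self, u_t, z, y):
        mu, log_sigma = self.net(torch.cat([u_t, y], dim=-1)).chunk(2, dim=-1)
        return mu + log_sigma.exp() * z        # \hat{u}_s
```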
NTM Loss (\(L_{NTM}\))
The NTM Loss, \(L_{NTM}\), is defined as the exact negative log-likelihood of the trajectory. It is mathematically expressed as: \[ L_{NTM} = \sum_{k=1}^T \left( \frac{1}{2}\|z_k\|^2 + \sum_n \left( \log \sigma_P(k,n) + \sum_\ell \log \sigma_T(k,\ell,n) \right) \right) \] Minimizing this loss drives the model to accurately capture the true conditional probability \(p(x_s | x_t)\) across the trajectory steps.
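Assuming the per-step residuals \(z_k\) and the log-scale terms have already been collected during the forward pass, the loss can be accumulated as in the sketch below; tensor shapes and argument names are assumptions.

```python
import torch

def ntm_loss(zs, log_sigma_P, log_sigma_T):
    """Negative log-likelihood of a trajectory, following the formula above.
    zs:          list (over steps k) of latent residuals z_k, shape [B, N]
    log_sigma_P: list of per-step predictor log-stds,        shape [B, N]
    log_sigma_T: list of per-step transporter log-scales, already summed
                 over coupling layers (the inner sum over l), shape [B, N]
    Shapes and names are assumptions for this sketch."""
    loss = 0.0
    for z_k, lsP_k, lsT_k in zip(zs, log_sigma_P, log_sigma_T):
        loss = loss + 0.5 * (z_k ** 2).sum(dim=-1) \
                    + lsP_k.sum(dim=-1) + lsT_k.sum(dim=-1)
    return loss.mean()   # average over the batch
```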
Learned Denoiser \(g_\phi\)
The Learned Denoiser \(g_\phi\) is a specialized, lightweight Transformer architecture. It is trained to amortize the complex, score-based trajectory denoising process into a single forward pass. It takes the initial latent state \(u_{t_0}\) and the text embeddings \(y\) as input to output an estimated clean sample \(\hat{x}_{den}^0\).
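For inference, the distilled denoiser collapses the trajectory correction into one call; a sketch under the assumption that \(g_\phi\) and \(f_T\) are available as callables, with illustrative argument names:

```python
import torch

@torch.no_grad()
def denoise_once(g_phi, f_T, x_t0, y):
    """Amortized denoising: a single forward pass of the learned denoiser
    g_phi replaces the iterative score-based trajectory correction.
    g_phi and f_T are stand-ins; argument names are illustrative."""
    u_t0 = f_T(x_t0)        # initial latent state
    return g_phi(u_t0, y)   # estimated clean sample \hat{x}^0_den
```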
Results
| Metric | Value | Baseline | Source |
|---|---|---|---|
| GenEval | 0.82 | STARFlow (0.56) | Table 1 |
| DPG-Bench | 79.64 | STARFlow | Table 1 |
| GenEval | 0.76 | N/A | Table 1 |
Why This Matters
NTM provides a significant advancement by achieving state-of-the-art performance in few-step generation while uniquely retaining the mathematical rigor of an exact likelihood over the entire generative trajectory. The framework's support for efficient initialization via the identity transporter allows researchers to leverage substantial prior investment in diffusion or flow-matching models. Furthermore, the ability to distill the multi-step trajectory score denoising into a single forward pass via \(g_\phi\) offers a practical pathway to deploying high-quality generative models with low inference latency.
Limitations & Open Questions
A primary limitation of the standard NTM sampling procedure is that it requires \(T\) sequential predictor steps, each involving autoregressive (AR) decoding, which introduces significant latency into the inference pipeline. Additionally, the initial training phase requires constructing a stochastic forward trajectory, as defined in Equation (2.3), to properly model the joint trajectory distribution, which is a non-trivial data-generation step.
Citation
Paper: 2605.08078