DiT-IC: Aligned Diffusion Transformer for Efficient Image Compression

Junqi Shi, Ming Lu, Xingchen Li, Anle Ke, Ruiqi Zhang, Zhan Ma
📄 Paper 💻 Code
[Figure: DiT-IC teaser. Visual comparison of the reconstruction against the original image.]
Abstract

Diffusion-based image compression has recently shown outstanding perceptual fidelity, yet its practicality is hindered by prohibitive sampling overhead and high memory usage. Most existing diffusion codecs employ UNet architectures, where hierarchical downsampling forces diffusion to operate in shallow latent spaces (typically with only 8× spatial downscaling), resulting in excessive computation. In contrast, conventional VAE-based codecs work in much deeper latent domains (16×–64× downscaled), motivating a key question: Can diffusion operate effectively in such compact latent spaces without compromising reconstruction quality?

To address this, we introduce DiT-IC—an Aligned Diffusion Transformer for Image Compression—which replaces the UNet with a Diffusion Transformer capable of performing diffusion in latent space entirely at 32× downscaled resolution. DiT-IC adapts a pretrained multi-step text-to-image DiT into a single-step reconstruction model through three key alignment mechanisms, detailed in the Method Overview below.

With these designs, DiT-IC achieves state-of-the-art perceptual quality while offering up to 30× faster decoding and drastically lower memory usage than existing diffusion-based codecs. Remarkably, it can reconstruct 2048×2048 images on a 16 GB laptop GPU.

Method Overview

We propose a diffusion-based neural compression framework that couples a learned entropy model with a Diffusion Transformer decoder, performing single-step reconstruction entirely in a deep (32× downscaled) latent space. Three alignment mechanisms make this possible:

  • Variance-Guided Flow Matching: Unlike standard diffusion that starts from Gaussian noise, compression reconstruction begins from a quantized latent $\mathbf{y}_t$ containing structured noise. The local variance $\sigma(\mathbf{y}_t)$ measures spatial uncertainty, which we map to pseudo-timesteps $t = \mathcal{F}(\sigma)$ for spatially adaptive one-step flow matching.
  • Self-Distillation Alignment: DiT-IC distills the multi-step diffusion process into a single forward pass by aligning its denoised latent with the frozen encoder representation, while jointly optimizing the diffusion transformer and decoder. $\mathbf{y}_\theta(\cdot)$ denotes the denoised output, and $\mathcal{L}_{\text{rd}}$ denotes the rate–distortion loss.
  • Latent-Conditioned Guidance: We replace DiT's text-based guidance with a projection of the compressed representation; during training, the projected latent embeddings are aligned with text embeddings, enabling text-free conditioning at inference.
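The variance-guided flow matching above can be sketched in PyTorch as follows. This is a minimal illustration, not the released code: the names `variance_to_timestep` and `one_step_reconstruct` are hypothetical, and the sigmoid stands in for the unspecified monotone mapping $\mathcal{F}$.

```python
import torch

def variance_to_timestep(y_t: torch.Tensor, window: int = 3) -> torch.Tensor:
    """Map local variance sigma(y_t) to per-pixel pseudo-timesteps t = F(sigma).

    Local variance is estimated with average pooling (E[x^2] - E[x]^2);
    higher variance means higher spatial uncertainty, hence a larger t.
    """
    pool = torch.nn.AvgPool2d(window, stride=1, padding=window // 2)
    local_var = pool(y_t ** 2) - pool(y_t) ** 2
    # Squash into (0, 1); a placeholder for the actual mapping F.
    return torch.sigmoid(local_var.mean(dim=1, keepdim=True))  # (B, 1, H, W)

def one_step_reconstruct(y_t: torch.Tensor, velocity_net) -> torch.Tensor:
    """One Euler step of rectified flow: with x_t = x_0 + t * v, so x_0 = x_t - t * v."""
    t = variance_to_timestep(y_t)
    v = velocity_net(y_t, t)  # predicted velocity field, same shape as y_t
    return y_t - t * v        # spatially adaptive single-step denoising
```

Because `t` varies per spatial location, noisier regions of the quantized latent take a larger correction step than already-clean regions.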
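A minimal sketch of the second and third mechanisms, assuming a plain MSE alignment term and a λ-weighted rate–distortion objective (all function names, the loss weighting, and the token count are assumptions, not details from the paper):

```python
import torch
import torch.nn.functional as F

def self_distillation_loss(denoised, frozen_enc_latent, rate, distortion, lmbda=0.01):
    """Align the single-step denoised latent y_theta(.) with the frozen
    encoder representation, plus the rate-distortion loss L_rd."""
    l_align = F.mse_loss(denoised, frozen_enc_latent)
    l_rd = rate + lmbda * distortion  # jointly trains the DiT and the decoder
    return l_align + l_rd

class LatentConditionProjector(torch.nn.Module):
    """Project the compressed latent into the slot that a text-to-image DiT
    normally fills with text embeddings, so no text is needed at inference."""
    def __init__(self, latent_dim: int, embed_dim: int, num_tokens: int = 77):
        super().__init__()
        self.proj = torch.nn.Linear(latent_dim, embed_dim * num_tokens)
        self.num_tokens, self.embed_dim = num_tokens, embed_dim

    def forward(self, y_pooled: torch.Tensor) -> torch.Tensor:
        # y_pooled: (B, latent_dim) globally pooled compressed latent
        e = self.proj(y_pooled)
        return e.view(-1, self.num_tokens, self.embed_dim)
```

During training, the projector's output would be aligned with the paired caption's text embeddings (e.g., via another MSE term), which is what enables text-free conditioning at decode time.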
[Figure: Method illustration.]
BibTeX
@inproceedings{shi2026ditic,
  title={DiT-IC: Aligned Diffusion Transformer for Efficient Image Compression},
  author={Shi, Junqi and Lu, Ming and Li, Xingchen and Ke, Anle and Zhang, Ruiqi and Ma, Zhan},
  booktitle={CVPR},
  year={2026}
}