Neural Video Coding using Multiscale Motion Compensation and Spatiotemporal Context Model


We propose an end-to-end deep neural video coding framework (NVC), which uses variational autoencoders (VAEs) with joint spatial and temporal prior aggregation (PA) to exploit the correlations in intra-frame pixels, inter-frame motions and inter-frame compensation residuals, respectively. Novel features of NVC include: 1) To estimate and compensate motion over a large range of magnitudes, we propose an unsupervised multiscale motion compensation network (MS-MCN) together with a pyramid decoder in the VAE for coding motion features that generates multiscale flow fields, 2) we design a novel adaptive spatiotemporal context model for efficient entropy coding for motion information, 3) we adopt nonlocal attention modules (NLAM) at the bottlenecks of the VAEs for implicit adaptive feature extraction and activation, leveraging its high transformation capacity and unequal weighting with joint global and local information, and 4) we introduce multi-module optimization and a multi-frame training strategy to minimize the temporal error propagation among P-frames. NVC is evaluated for the low-delay causal settings and compared with H.265/HEVC, H.264/AVC and the other learnt video compression methods following the common test conditions, demonstrating consistent gains across all popular test sequences for both PSNR and MS-SSIM distortion metrics.


Our NVC framework is designed for low-delay applications. As with all modern video encoders, the proposed NVC compresses the first frame in each group of pictures as an intra-frame using a VAE-based compression engine (neuro-Intra), and codes the remaining frames using motion-compensated prediction. It uses the VAE compressor (neuro-Motion) to generate the multiscale motion field between the current frame and the reference frame. Then, MS-MCN takes the multiscale compressed flows, warps the multiscale features of the reference frame, and combines these warped features to generate the predicted frame. The prediction residual is then coded using another VAE-based compressor (neuro-Res).
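The closed-loop structure above can be sketched as follows. The functions `neuro_intra`, `neuro_motion`, `ms_mcn`, and `neuro_res` are hypothetical stand-ins for the learned modules (stubbed with trivial behavior so the control flow runs); only the loop structure reflects the framework.

```python
def neuro_intra(frame):
    # VAE-based intra coder: returns (bitstream, reconstruction). Stubbed.
    return b"intra", frame

def neuro_motion(cur, ref):
    # VAE-based motion coder: returns (bitstream, multiscale flows). Stubbed.
    return b"motion", [None]

def ms_mcn(flows, ref):
    # Multiscale motion compensation: would warp and fuse reference
    # features; stubbed here as returning the reference itself.
    return ref

def neuro_res(residual):
    # VAE-based residual coder; stubbed as lossless.
    return b"res", residual

def encode_gop(frames):
    """Low-delay loop: first frame intra, remaining frames as P-frames."""
    streams, ref = [], None
    for i, frame in enumerate(frames):
        if i == 0:
            s, recon = neuro_intra(frame)
            streams.append(("I", s))
        else:
            s_m, flows = neuro_motion(frame, ref)
            pred = ms_mcn(flows, ref)
            s_r, res_hat = neuro_res(frame - pred)
            recon = pred + res_hat
            streams.append(("P", s_m + s_r))
        ref = recon  # closed loop: the decoded frame becomes the next reference
    return streams
```

Note that the reference is always the *reconstructed* frame, not the original, which is what makes multi-frame training against temporal error propagation meaningful.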


Variational Autoencoder with (Spatiotemporal and hyper) Prior Aggregation (PA) Engine

The general architecture of the variational autoencoder (VAE) with prior aggregation. Here, the PA engine aggregates the spatiotemporal and hyper priors into a Gaussian distribution for entropy coding.
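To make the Gaussian entropy model concrete: once the PA engine predicts a mean and standard deviation for a quantized latent, the coding cost of that symbol is the negative log of the Gaussian probability mass over its quantization bin. The sketch below is illustrative, not the paper's code; `gaussian_bits` is a hypothetical helper name.

```python
import math

def gaussian_bits(y_hat, mu, sigma):
    """Estimated bits for a quantized symbol y_hat under N(mu, sigma^2),
    integrating the density over the bin [y_hat - 0.5, y_hat + 0.5]."""
    def cdf(x):
        return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))
    p = max(cdf(y_hat + 0.5) - cdf(y_hat - 0.5), 1e-12)  # guard against log(0)
    return -math.log2(p)

# A symbol close to the predicted mean is cheap to code...
cheap = gaussian_bits(0.0, mu=0.0, sigma=1.0)
# ...while an outlier far from the mean is expensive.
dear = gaussian_bits(6.0, mu=0.0, sigma=1.0)
```

This is why accurate priors matter: the better the aggregated prior predicts each latent, the smaller the bin's surprisal and the lower the bitrate.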


Pyramid Decoder and Multiscale Motion Compensation

We apply the pyramid flow decoder in neuro-Motion to generate the multiscale compressed flows. We then use multiscale motion compensation (coarse-to-fine warping in both the feature and pixel domains) to obtain the predicted frame.
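A minimal numpy sketch of the coarse-to-fine idea, under simplifying assumptions: nearest-neighbour backward warping in the pixel domain only, with each level's flow formed by upsampling the coarser flow and adding that level's decoded residual. The paper additionally warps and fuses *features* with learned layers, which this sketch does not model.

```python
import numpy as np

def warp(img, flow):
    """Backward warp: sample img at (y + dy, x + dx) with nearest-neighbour
    lookup and border clamping. flow has shape (H, W, 2) in (dy, dx) order."""
    H, W = img.shape
    ys, xs = np.mgrid[0:H, 0:W]
    src_y = np.clip(np.round(ys + flow[..., 0]).astype(int), 0, H - 1)
    src_x = np.clip(np.round(xs + flow[..., 1]).astype(int), 0, W - 1)
    return img[src_y, src_x]

def upsample2(flow):
    """Nearest-neighbour 2x upsampling; vectors are doubled because one
    coarse-level pixel spans two fine-level pixels."""
    return 2.0 * flow.repeat(2, axis=0).repeat(2, axis=1)

def coarse_to_fine_predict(ref_pyramid, flow_residuals):
    """ref_pyramid: reference frame at increasing resolutions (coarse to fine).
    flow_residuals: decoded flow refinement per level, in the same order.
    Each level's total flow = upsampled coarser flow + this level's residual."""
    flow = np.zeros_like(flow_residuals[0])
    pred = None
    for level, (ref, dflow) in enumerate(zip(ref_pyramid, flow_residuals)):
        if level > 0:
            flow = upsample2(flow)
        flow = flow + dflow
        pred = warp(ref, flow)
    return pred  # prediction at the finest resolution
```

Starting from the coarsest level keeps each level's residual flow small, which is what lets the motion coder handle a large range of motion magnitudes.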


Adaptive Spatiotemporal Context Modeling for Entropy Coding

The proposed PA engine for neuro-Motion consists of a spatio-temporal-hyper aggregation module (STHAM) and a temporal updating module (TUM). At timestamp $t$, STHAM accumulates all accessible priors and estimates the mean and standard deviation of the assumed Gaussian distribution for each new quantized motion feature. To generate the temporal priors, TUM is applied recurrently to the current quantized features using a standard ConvLSTM.
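The aggregation can be illustrated with a toy raster-scan model: for each position, the mean of the Gaussian entropy model is predicted from the causal spatial neighbours already decoded, the co-located temporal prior, and a hyperprior estimate. The fixed weights and the simple average below are purely illustrative assumptions; in the paper a learned network performs this aggregation (and also predicts the scale).

```python
import numpy as np

def stham_scan(y, y_temporal, hyper_mu, weights=(0.5, 0.3, 0.2)):
    """Toy spatio-temporal-hyper aggregation in raster order.
    y: current quantized feature map (H, W); only causal entries
       (left/top neighbours) are read, mimicking autoregressive decoding.
    y_temporal: temporal prior from the previous frame's features.
    hyper_mu: mean estimate from the hyperprior branch."""
    ws, wt, wh = weights
    H, W = y.shape
    mu = np.zeros_like(y)
    for i in range(H):
        for j in range(W):
            left = y[i, j - 1] if j > 0 else 0.0
            top = y[i - 1, j] if i > 0 else 0.0
            spatial = 0.5 * (left + top)  # causal spatial context
            mu[i, j] = ws * spatial + wt * y_temporal[i, j] + wh * hyper_mu[i, j]
    return mu
```

Because only already-decoded positions feed each prediction, the decoder can reproduce the same means symbol by symbol, which is the property the adaptive context model relies on.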

Rate-Distortion Performance (PSNR & MS-SSIM)

PSNR vs. Rate Performance

MS-SSIM vs. Rate Performance

BD-Rate Reduction (PSNR & MS-SSIM)

Visual Comparison

Code and Model

The code and model will be released gradually.

a) Intra model (baseline).

b) Intra model (Outperform VVC).

c) Inter model (Multi-scale).

d) Inter model (++ Spatiotemporal Context Model).

e) Residual model.

f) C++ implementation of separate encoder and decoder using arithmetic coding.


More interesting work can be found on our NJU Vision LAB homepage.


If you find our paper useful, please cite:


You can contact Haojie Liu by sending mail to