Lingze Personal website for life and research

[PAPER] RePaint: Inpainting using Denoising Diffusion Probabilistic Models

Paper: CVPR 2022 [Link]

This note was generated with Claude Code as I went through the published code to understand the algorithm introduced in the paper.

Thinking problem: how can we transfer this RePaint algorithm to tabular data under the in-context learning paradigm, treating the tabular input as a 2-D image where each pixel is a feature cell?

Notation Reference

| Symbol | Meaning |
|---|---|
| $x^g$ | Clean ground-truth image (no noise) |
| $m$ | Binary mask (1 = keep, 0 = inpaint) |
| $x_t$ | Generated noisy image at timestep $t$ |
| $x_t^g$ | Ground truth with $t$-level noise added |
| $\bar{\alpha}_t$ | Noise-schedule parameter (defines the noise level) |
| $\beta_t$ | Noise variance at step $t$ |
| $\mu_\theta, \sigma_\theta$ | Model-predicted mean and standard deviation |
| $\epsilon$ | Gaussian noise $\sim \mathcal{N}(0, I)$ |

1. Task Definition

Problem: Image Inpainting

Fill missing regions in an image with realistic content.

Input & Output

| Component | Notation | Shape | Description |
|---|---|---|---|
| Ground truth (clean) | $x^g$ | (1, 3, 256, 256) | Original clean RGB image, normalized to [-1, 1] |
| Mask | $m$ | (1, 3, 256, 256) | Binary: 1 = keep (known), 0 = inpaint (unknown) |
| Output | $x_0$ | (1, 3, 256, 256) | Complete image with filled regions |

Example

Input Image (x):      Mask (m):          Output (x_0):
┌─────────────┐    ┌─────────────┐     ┌─────────────┐
│   ╔═══╗     │    │   ╔═══╗     │     │   ╔═══╗     │
│   ║   ║ face│    │   ║ 0 ║ keep│     │   ║ ✓ ║ face│
│   ╚═══╝     │    │   ╚═══╝     │     │   ╚═══╝     │
└─────────────┘    └─────────────┘     └─────────────┘
                    center missing      center filled!

2. Pre-trained Diffusion Model

What is it?

A neural network (UNet) trained to remove noise from images.

Training (already done, we just use the pre-trained model):

  • Take clean images → add noise progressively → train model to denoise

Key Point: The model is unconditional (trained only for denoising, NOT for inpainting)

Model Interface

INPUT:
  - x_t: Noisy image at timestep t, shape (1, 3, 256, 256)
  - t: Timestep (scalar), range [0, 250]
       Higher t = more noise

OUTPUT:
  - 6 channels: (1, 6, 256, 256)
    • Channels 0-2: Mean prediction
    • Channels 3-5: Variance prediction

  These are processed to get:
    μ_θ(x_t, t): Mean of p(x_{t-1} | x_t)
    σ_θ(x_t, t): Std of p(x_{t-1} | x_t)
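A minimal sketch of how the 6-channel output might be processed, assuming a guided-diffusion-style learned variance where the variance channels are an interpolation coefficient $v$ between $\log\beta_t$ and a smaller posterior log-variance $\log\tilde{\beta}_t$ (the function name and this parameterization are assumptions, not taken from the note above):

```python
import numpy as np

def split_model_output(out, beta_t, beta_tilde_t):
    """Split a (1, 6, H, W) UNet output into mean and std.

    Assumption: the second half of the channels is an interpolation
    coefficient v in [-1, 1]; the log-variance is then
    frac * log(beta_t) + (1 - frac) * log(beta_tilde_t).
    """
    mean, v = np.split(out, 2, axis=1)            # (1, 3, H, W) each
    frac = (v + 1) / 2                            # map [-1, 1] -> [0, 1]
    log_var = frac * np.log(beta_t) + (1 - frac) * np.log(beta_tilde_t)
    return mean, np.exp(0.5 * log_var)            # mu_theta, sigma_theta
```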

How to Use It

Sample one denoising step: $x_t \to x_{t-1}$

\[x_{t-1} = \mu_\theta(x_t, t) + \sigma_\theta(x_t, t) \cdot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)\]
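The sampling step above can be sketched in a few lines; the only detail added here (standard DDPM practice, not stated in this note) is that no noise is injected at the final step $t = 0$:

```python
import numpy as np

rng = np.random.default_rng(0)

def reverse_step(mu, sigma, t):
    """One ancestral sampling step x_t -> x_{t-1}.

    mu, sigma are the model outputs mu_theta(x_t, t), sigma_theta(x_t, t).
    By convention no noise is added at the final step (t == 0).
    """
    if t == 0:
        return mu
    return mu + sigma * rng.standard_normal(np.shape(mu))
```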

Noise Schedule: $\bar{\alpha}_t$

Pre-computed values that define noise level at each timestep:

  • At $t=0$: $\bar{\alpha}_0 \approx 1$ → clean image
  • At $t=250$: $\bar{\alpha}_{250} \approx 0$ → pure noise

Usage: Add $t$-level noise to clean ground truth image: \(x_t^g = \sqrt{\bar{\alpha}_t} \cdot x^g + \sqrt{1-\bar{\alpha}_t} \cdot \epsilon\)
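This forward-noising formula is a one-liner; a minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

def q_sample(x_clean, alpha_bar_t):
    """Add t-level noise: x_t^g = sqrt(abar_t) * x^g + sqrt(1 - abar_t) * eps."""
    eps = rng.standard_normal(np.shape(x_clean))
    return np.sqrt(alpha_bar_t) * x_clean + np.sqrt(1.0 - alpha_bar_t) * eps
```

At $\bar{\alpha}_t = 1$ this returns the clean image unchanged; as $\bar{\alpha}_t \to 0$ the output approaches pure noise.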


UNet Architecture (Brief Background)

The diffusion model uses a UNet - a U-shaped convolutional neural network.

Structure:

Input (256×256×3) + Timestep t
        ↓
    [Encoder: Downsample]
    256×256 → 128×128 → 64×64 → 32×32 → 16×16
    channels increase: 256 → 512 → 1024
        ↓
    [Bottleneck: Process at lowest resolution]
    16×16 with 1024 channels
        ↓
    [Decoder: Upsample]
    16×16 → 32×32 → 64×64 → 128×128 → 256×256
    channels decrease: 1024 → 512 → 256
        ↓
Output (256×256×6)

Skip connections: Encoder ═══════► Decoder
                  (preserve details)

Key Components:

  • ResBlocks: Residual blocks that learn to denoise, conditioned on timestep $t$
  • Attention: Self-attention at certain resolutions (32×32, 16×16, 8×8) to focus on important regions
  • Skip connections: Copy features from encoder to decoder to preserve fine details
  • Timestep embedding: Converts timestep $t$ into a vector that tells each layer “how much noise to remove”

Why U-shape: Downsample to capture global context → process → upsample to reconstruct details.


3. RePaint Algorithm

Core Innovation

Use the unconditional pre-trained model for inpainting by conditioning at each denoising step

Two key techniques:

  1. Blend ground truth at each step (with correct noise level)
  2. Resample multiple times (jump back and denoise again)

3.1 Key Technique 1: Conditioning with Ground Truth

The Problem

At timestep $t$:

  • Unknown regions ($x_t$): generated from noise, have noise level $t$
  • Known regions: should come from clean ground truth $x^g$

Challenge: Can’t mix clean $x^g$ with noisy $x_t$ — different noise levels!

The Solution

Step 1: Add noise to ground truth

Add $t$-level noise to clean ground truth $x^g$ to match the noise level of $x_t$:

\[x_t^g = \sqrt{\bar{\alpha}_t} \cdot x^g + \sqrt{1-\bar{\alpha}_t} \cdot \epsilon\]

Now $x_t^g$ (noisy ground truth) has the same noise level as $x_t$ (generated noisy image).

Step 2: Blend with mask

\[x_t^{\text{input}} = m \cdot x_t^g + (1-m) \cdot x_t\]
  • $m \cdot x_t^g$: Known regions from noisy ground truth
  • $(1-m) \cdot x_t$: Unknown regions from generation

Step 3: Denoise with model

\[(\mu_\theta, \sigma_\theta) = \text{model}(x_t^{\text{input}}, t)\]

\[x_{t-1} = \mu_\theta + \sigma_\theta \cdot \epsilon', \quad \epsilon' \sim \mathcal{N}(0, I)\]
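The three steps of one conditioned denoising iteration can be sketched as follows (`model` is a placeholder for the pre-trained denoiser returning $(\mu_\theta, \sigma_\theta)$; the function name is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

def conditioned_step(x_t, x_clean, mask, alpha_bar_t, model, t):
    """One RePaint-style conditioned denoising step (sketch).

    1. Noise the clean ground truth up to level t.
    2. Blend known (mask = 1) and generated (mask = 0) regions.
    3. Denoise the blended image with the pre-trained model.
    """
    eps = rng.standard_normal(np.shape(x_clean))
    x_t_g = np.sqrt(alpha_bar_t) * x_clean + np.sqrt(1 - alpha_bar_t) * eps
    x_in = mask * x_t_g + (1 - mask) * x_t
    mu, sigma = model(x_in, t)
    return mu + sigma * rng.standard_normal(np.shape(x_t))
```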


3.2 Key Technique 2: Resampling

The Problem

Even with correct noise levels, a single denoising pass may produce:

  • Visible seams at boundaries
  • Poor harmony between known and unknown regions

The Solution: Jump Forward and Denoise Again

Resampling process:

  1. Denoise: $x_t \to x_{t-1} \to … \to x_{t-10}$
  2. Jump forward (add noise back): $x_{t-10} \to … \to x_t$
  3. Denoise again: $x_t \to x_{t-1} \to … \to x_{t-10}$
  4. Repeat 10 times

Parameters (from config):

jump_length: 10        # Jump 10 steps forward
jump_n_sample: 10      # Repeat 10 times

Schedule pattern at t=240:

Timestep sequence (having already denoised down to t=240):

240 → 241 → ... → 250 → 249 → ... → 240 → 241 → ... → 250 → ...
    [jump forward]     [denoise back]     [jump forward]

Repeat 10 times before continuing to t=239

Why it works: Each cycle gives the model another chance to harmonize the regions.

Analogy: Like a painter who:

  • Paints → blurs everything → paints again → blurs again
  • Each cycle makes the blend smoother

Add Noise Formula

Jump forward $x_{t-1} \to x_t$:

\[x_t = \sqrt{1-\beta_t} \cdot x_{t-1} + \sqrt{\beta_t} \cdot \epsilon\]
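The jump schedule driven by `jump_length` and `jump_n_sample` can be sketched like this (a simplified reconstruction of the idea behind the published `get_schedule_jump`; exact boundary details may differ from the released code):

```python
def get_schedule(t_T=250, jump_length=10, jump_n_sample=10):
    """Build a RePaint-style timestep schedule with resampling jumps.

    Returns a list of timesteps: consecutive decreases are denoising
    steps, consecutive increases are noise-adding (jump-forward) steps.
    """
    # each jump point gets (jump_n_sample - 1) extra passes
    jumps = {j: jump_n_sample - 1 for j in range(0, t_T, jump_length)}
    t, ts = t_T, [t_T]
    while t >= 1:
        t -= 1                      # denoise one step
        ts.append(t)
        if jumps.get(t, 0) > 0:     # jump forward and redo this stretch
            jumps[t] -= 1
            for _ in range(jump_length):
                t += 1
                ts.append(t)
    return ts
```

With the config above this yields 10 denoising passes over every 10-step stretch before the loop moves on.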

4. Summary

Pipeline

Input: $x^g$ (clean ground truth), $m$ (mask)

Step 1: Initialize $x_{250} \sim \mathcal{N}(0, I)$

Step 2: Generate schedule with jump points at $t = 0, 10, 20, …, 240$

Step 3: For each $(t_{\text{last}}, t_{\text{cur}})$ in schedule:

  • If $t_{\text{cur}} < t_{\text{last}}$ (Denoise):
    1. $x_t^g = \sqrt{\bar{\alpha}_t} \cdot x^g + \sqrt{1-\bar{\alpha}_t} \cdot \epsilon$
    2. $x_t^{\text{input}} = m \cdot x_t^g + (1-m) \cdot x_t$
    3. $(\mu_\theta, \sigma_\theta) = \text{model}(x_t^{\text{input}}, t)$
    4. $x_{t-1} = \mu_\theta + \sigma_\theta \cdot \epsilon'$
  • If $t_{\text{cur}} > t_{\text{last}}$ (Resample):
    1. $x_t = \sqrt{1-\beta_t} \cdot x_{t-1} + \sqrt{\beta_t} \cdot \epsilon$

Output: $x_0$ (final inpainted image)
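The whole pipeline above can be sketched in one loop (`model`, and the final paste of known pixels, are assumptions of this sketch rather than details from the summary; `alpha_bar` and `beta` are the precomputed noise-schedule arrays):

```python
import numpy as np

rng = np.random.default_rng(0)

def repaint(x_clean, mask, model, alpha_bar, beta, schedule):
    """RePaint inference loop (sketch of the summary above).

    model(x, t) -> (mu, sigma) is the pre-trained denoiser;
    schedule is a timestep list such as the one from Section 3.2.
    """
    x = rng.standard_normal(np.shape(x_clean))      # Step 1: x_T ~ N(0, I)
    for t_last, t_cur in zip(schedule, schedule[1:]):
        if t_cur < t_last:                          # denoise t_last -> t_cur
            a = alpha_bar[t_last]
            x_g = np.sqrt(a) * x_clean + np.sqrt(1 - a) * rng.standard_normal(x.shape)
            x_in = mask * x_g + (1 - mask) * x
            mu, sigma = model(x_in, t_last)
            x = mu + sigma * rng.standard_normal(x.shape)
        else:                                       # resample: add one step of noise
            b = beta[t_cur]
            x = np.sqrt(1 - b) * x + np.sqrt(b) * rng.standard_normal(x.shape)
    return mask * x_clean + (1 - mask) * x          # paste known pixels into x_0
```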


The Innovation

Use unconditional pre-trained diffusion models for inpainting by:

  1. Conditioning: At each denoising step, blend noisy ground truth into known regions
  2. Resampling: Jump forward and denoise again (10 times) to improve harmony

The Formula

At each timestep $t$:

\[x_t^{\text{conditioned}} = m \cdot \underbrace{(\sqrt{\bar{\alpha}_t} \cdot x^g + \sqrt{1-\bar{\alpha}_t} \cdot \epsilon)}_{x_t^g: \text{ noisy ground truth}} + (1-m) \cdot \underbrace{x_t}_{\text{generated noisy image}}\]

This ensures both regions have matching noise levels.

Three Steps to Remember

  1. Add noise to clean ground truth $x^g$ to get $x_t^g$ at noise level $t$
  2. Blend known regions (from $x_t^g$) and unknown regions (from $x_t$)
  3. Denoise with pre-trained model

Repeat with resampling for better results!


Callback to Problem

Basically, we have formulated the tabular learning problem as an analogy to the image inpainting problem.

The inpainting algorithm introduced in this paper (two key techniques: conditioning and resampling) can be transferred directly to the tabular data domain, covering missing-value imputation as well as regression and classification tasks.

However, the main problem is that there is no pre-trained diffusion model in the tabular learning domain.

Additionally, the pre-trained diffusion model in this paper has another constraint: the width/height of the generated image is fixed (e.g., 256×256). This would be a limitation for tabular data with varying feature dimensions and row counts, even if such a diffusion model existed.