Paper: CVPR 2022 [Link]
This note was generated with Claude Code while going through the published code to understand the algorithm introduced in the paper.
Thinking problem: how to transfer this RePaint algorithm to tabular data under the in-context learning paradigm, treating the tabular input as a 2-D image in which each pixel is a feature cell.
Notation Reference
| Symbol | Meaning |
|---|---|
| $x^g$ | Clean ground truth image (no noise) |
| $m$ | Binary mask (1=keep, 0=inpaint) |
| $x_t$ | Generated noisy image at timestep $t$ |
| $x_t^g$ | Ground truth with $t$-level noise added |
| $\bar{\alpha}_t$ | Noise schedule parameter (defines noise level) |
| $\beta_t$ | Noise variance at step $t$ |
| $\mu_\theta, \sigma_\theta$ | Model-predicted mean and std |
| $\epsilon$ | Gaussian noise ~ $\mathcal{N}(0, I)$ |
1. Task Definition
Problem: Image Inpainting
Fill missing regions in an image with realistic content.
Input & Output
| Component | Notation | Shape | Description |
|---|---|---|---|
| Ground truth (clean) | $x^g$ | (1, 3, 256, 256) | Original clean RGB image, normalized to $[-1, 1]$ |
| Mask | $m$ | (1, 3, 256, 256) | Binary: 1 = keep (known), 0 = inpaint (unknown) |
| Output | $x_0$ | (1, 3, 256, 256) | Completed image with filled regions |
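As a concrete illustration of the table above, a minimal NumPy sketch of these tensors (the variable names and the center-square mask layout are illustrative, not taken from the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)
x_g = rng.uniform(-1.0, 1.0, size=(1, 3, 256, 256))  # clean ground truth in [-1, 1]
mask = np.ones((1, 3, 256, 256))                     # 1 = keep (known)
mask[:, :, 96:160, 96:160] = 0.0                     # 0 = center square to inpaint

known_fraction = mask.mean()                         # fraction of known pixels
```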
Example
Input Image (x): Mask (m): Output (x_0):
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ ╔═══╗ │ │ ╔═══╗ │ │ ╔═══╗ │
│ ║ ║ face│ │ ║ 0 ║ keep│ │ ║ ✓ ║ face│
│ ╚═══╝ │ │ ╚═══╝ │ │ ╚═══╝ │
└─────────────┘ └─────────────┘ └─────────────┘
center missing center filled!
2. Pre-trained Diffusion Model
What is it?
A neural network (UNet) trained to remove noise from images.
Training (already done, we just use the pre-trained model):
- Take clean images → add noise progressively → train model to denoise
Key Point: The model is unconditional (trained only for denoising, NOT for inpainting)
Model Interface
INPUT:
- x_t: Noisy image at timestep t, shape (1, 3, 256, 256)
- t: Timestep (scalar), range [0, 250]
Higher t = more noise
OUTPUT:
- 6 channels: (1, 6, 256, 256)
• Channels 0-2: Mean prediction
• Channels 3-5: Variance prediction
These are processed to get:
μ_θ(x_t, t): Mean of p(x_{t-1} | x_t)
σ_θ(x_t, t): Std of p(x_{t-1} | x_t)
How to Use It
Sample one denoising step: $x_t \to x_{t-1}$
\[x_{t-1} = \mu_\theta(x_t, t) + \sigma_\theta(x_t, t) \cdot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)\]

Noise Schedule: $\bar{\alpha}_t$
Pre-computed values that define noise level at each timestep:
- At $t=0$: $\bar{\alpha}_0 \approx 1$ → clean image
- At $t=250$: $\bar{\alpha}_{250} \approx 0$ → pure noise
Usage: Add $t$-level noise to clean ground truth image: \(x_t^g = \sqrt{\bar{\alpha}_t} \cdot x^g + \sqrt{1-\bar{\alpha}_t} \cdot \epsilon\)
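The noise-schedule usage above can be sketched in NumPy. The linear $\beta$ schedule below is an assumption for illustration; the released model ships its own pre-computed schedule:

```python
import numpy as np

T = 250
betas = np.linspace(1e-4, 0.02, T)       # assumed linear beta schedule
alphas_bar = np.cumprod(1.0 - betas)     # abar_t = prod_{s<=t} (1 - beta_s)

def q_sample(x_g, t, rng):
    """Add t-level noise: x_t^g = sqrt(abar_t) * x^g + sqrt(1 - abar_t) * eps."""
    eps = rng.standard_normal(x_g.shape)
    return np.sqrt(alphas_bar[t]) * x_g + np.sqrt(1.0 - alphas_bar[t]) * eps

rng = np.random.default_rng(0)
x_g = rng.uniform(-1, 1, size=(1, 3, 256, 256))
x_early = q_sample(x_g, 0, rng)       # abar ~ 1: nearly clean
x_late = q_sample(x_g, T - 1, rng)    # abar ~ 0: nearly pure noise
```

Note how the two endpoints match the bullets above: at $t=0$ the sample is almost perfectly correlated with the clean image, at $t=249$ it is almost uncorrelated noise.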
UNet Architecture (Brief Background)
The diffusion model uses a UNet - a U-shaped convolutional neural network.
Structure:
Input (256×256×3) + Timestep t
↓
[Encoder: Downsample]
256×256 → 128×128 → 64×64 → 32×32 → 16×16
channels increase: 256 → 512 → 1024
↓
[Bottleneck: Process at lowest resolution]
16×16 with 1024 channels
↓
[Decoder: Upsample]
16×16 → 32×32 → 64×64 → 128×128 → 256×256
channels decrease: 1024 → 512 → 256
↓
Output (256×256×6)
Skip connections: Encoder ═══════► Decoder
(preserve details)
Key Components:
- ResBlocks: Residual blocks that learn to denoise, conditioned on timestep $t$
- Attention: Self-attention at certain resolutions (32×32, 16×16, 8×8) to focus on important regions
- Skip connections: Copy features from encoder to decoder to preserve fine details
- Timestep embedding: Converts timestep $t$ into a vector that tells each layer “how much noise to remove”
Why U-shape: Downsample to capture global context → process → upsample to reconstruct details.
3. RePaint Algorithm
Core Innovation
Use the unconditional pre-trained model for inpainting by conditioning at each denoising step
Two key techniques:
- Blend ground truth at each step (with correct noise level)
- Resample multiple times (jump back and denoise again)
3.1 Key Technique 1: Conditioning with Ground Truth
The Problem
At timestep $t$:
- Unknown regions ($x_t$): generated from noise, have noise level $t$
- Known regions: should come from clean ground truth $x^g$
Challenge: Can’t mix clean $x^g$ with noisy $x_t$ — different noise levels!
The Solution
Step 1: Add noise to ground truth
Add $t$-level noise to clean ground truth $x^g$ to match the noise level of $x_t$:
\[x_t^g = \sqrt{\bar{\alpha}_t} \cdot x^g + \sqrt{1-\bar{\alpha}_t} \cdot \epsilon\]

Now $x_t^g$ (noisy ground truth) has the same noise level as $x_t$ (the generated noisy image).
Step 2: Blend with mask
\[x_t^{\text{input}} = m \cdot x_t^g + (1-m) \cdot x_t\]

- $m \cdot x_t^g$: Known regions from noisy ground truth
- $(1-m) \cdot x_t$: Unknown regions from generation
Step 3: Denoise with model
\((\mu_\theta, \sigma_\theta) = \text{model}(x_t^{\text{input}}, t)\) \(x_{t-1} = \mu_\theta + \sigma_\theta \cdot \epsilon'\)
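The three steps above can be sketched as follows; `toy_model` is a hypothetical stand-in for the pre-trained UNet, so only the noising-and-blending logic mirrors the actual algorithm:

```python
import numpy as np

T = 250
betas = np.linspace(1e-4, 0.02, T)       # assumed linear beta schedule
alphas_bar = np.cumprod(1.0 - betas)

def toy_model(x, t):
    # Hypothetical stand-in for (mu_theta, sigma_theta) = model(x, t).
    return 0.9 * x, np.sqrt(betas[t])

def conditioned_step(x_t, x_clean, mask, t, rng):
    # Step 1: add t-level noise to the clean ground truth.
    eps = rng.standard_normal(x_clean.shape)
    x_t_g = np.sqrt(alphas_bar[t]) * x_clean + np.sqrt(1 - alphas_bar[t]) * eps
    # Step 2: blend -- known regions from x_t_g, unknown regions from x_t.
    x_in = mask * x_t_g + (1 - mask) * x_t
    # Step 3: one reverse (denoising) step with the model.
    mu, sigma = toy_model(x_in, t)
    return mu + sigma * rng.standard_normal(x_in.shape)

rng = np.random.default_rng(0)
shape = (1, 3, 256, 256)
x_clean = rng.uniform(-1, 1, size=shape)
mask = np.zeros(shape)
mask[..., :128] = 1.0                    # left half known, right half to inpaint
x_t = rng.standard_normal(shape)
x_prev = conditioned_step(x_t, x_clean, mask, t=100, rng=rng)
```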
3.2 Key Technique 2: Resampling
The Problem
Even with correct noise levels, a single denoising pass may produce:
- Visible seams at boundaries
- Poor harmony between known and unknown regions
The Solution: Jump Forward and Denoise Again
Resampling process:
- Denoise: $x_t \to x_{t-1} \to … \to x_{t-10}$
- Jump forward (add noise back): $x_{t-10} \to … \to x_t$
- Denoise again: $x_t \to x_{t-1} \to … \to x_{t-10}$
- Repeat 10 times
Parameters (from config):
jump_length: 10 # Jump 10 steps forward
jump_n_sample: 10 # Repeat 10 times
Schedule pattern at t=240:
Timestep sequence:
... → 241 → 240 (denoise) → 241 → ... → 250 (jump forward) → 249 → ... → 240 (denoise back) → 241 → ... (jump again)
Repeat 10 times before continuing to t=239.
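A plausible reconstruction of the schedule generator (the released RePaint code's schedule builder may differ in details; `schedule_with_jumps` is my name for it):

```python
def schedule_with_jumps(t_T=250, jump_length=10, jump_n_sample=10):
    # Jump points every `jump_length` steps; each allows jump_n_sample - 1
    # extra forward jumps, so each segment is denoised jump_n_sample times.
    jumps = {j: jump_n_sample - 1 for j in range(0, t_T - jump_length + 1, jump_length)}
    t, ts = t_T, [t_T]
    while t >= 1:
        t -= 1                            # denoise: t -> t - 1
        ts.append(t)
        if jumps.get(t, 0) > 0:
            jumps[t] -= 1
            for _ in range(jump_length):  # jump forward: add noise back
                t += 1
                ts.append(t)
    return ts
```

Consecutive entries always differ by one step, so the sequence can be consumed as $(t_{\text{last}}, t_{\text{cur}})$ pairs: a decrease means denoise, an increase means add noise back.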
Why it works: Each cycle gives the model another chance to harmonize the regions.
Analogy: Like a painter who:
- Paints → blurs everything → paints again → blurs again
- Each cycle makes the blend smoother
Add Noise Formula
Jump forward $x_{t-1} \to x_t$:
\[x_t = \sqrt{1-\beta_t} \cdot x_{t-1} + \sqrt{\beta_t} \cdot \epsilon\]

4. Summary
Pipeline
Input: $x^g$ (clean ground truth), $m$ (mask)
Step 1: Initialize $x_{250} \sim \mathcal{N}(0, I)$
Step 2: Generate schedule with jump points at $t = 0, 10, 20, …, 240$
Step 3: For each $(t_{\text{last}}, t_{\text{cur}})$ in schedule:
- If $t_{\text{cur}} < t_{\text{last}}$ (Denoise):
- $x_t^g = \sqrt{\bar{\alpha}_t} \cdot x^g + \sqrt{1-\bar{\alpha}_t} \cdot \epsilon$
- $x_t^{\text{input}} = m \cdot x_t^g + (1-m) \cdot x_t$
- $(\mu_\theta, \sigma_\theta) = \text{model}(x_t^{\text{input}}, t)$
- $x_{t-1} = \mu_\theta + \sigma_\theta \cdot \epsilon'$
- If $t_{\text{cur}} > t_{\text{last}}$ (Resample):
- $x_t = \sqrt{1-\beta_t} \cdot x_{t-1} + \sqrt{\beta_t} \cdot \epsilon$
Output: $x_0$ (final inpainted image)
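The whole pipeline above as one runnable sketch: `toy_model` is a hypothetical stand-in for the pre-trained UNet, the image is shrunk to 8×8 for speed, and the final paste of exact known pixels is a convenience assumption rather than a step stated above:

```python
import numpy as np

T, jump_length, jump_n_sample = 250, 10, 10
betas = np.linspace(1e-4, 0.02, T)        # assumed linear beta schedule
alphas_bar = np.cumprod(1.0 - betas)

def toy_model(x, t):                      # hypothetical stand-in for the UNet
    return 0.9 * x, np.sqrt(betas[t])

def make_schedule(t_T, jl, jn):           # schedule with jump points
    jumps = {j: jn - 1 for j in range(0, t_T - jl + 1, jl)}
    t, ts = t_T, [t_T]
    while t >= 1:
        t -= 1; ts.append(t)
        if jumps.get(t, 0) > 0:
            jumps[t] -= 1
            for _ in range(jl):
                t += 1; ts.append(t)
    return ts

def repaint(x_clean, mask, rng):
    x = rng.standard_normal(x_clean.shape)             # Step 1: x_T ~ N(0, I)
    ts = make_schedule(T, jump_length, jump_n_sample)  # Step 2: schedule
    for t_last, t_cur in zip(ts, ts[1:]):              # Step 3: walk the schedule
        if t_cur < t_last:                             # denoise t_last -> t_cur
            eps = rng.standard_normal(x.shape)
            x_t_g = (np.sqrt(alphas_bar[t_cur]) * x_clean
                     + np.sqrt(1 - alphas_bar[t_cur]) * eps)
            x_in = mask * x_t_g + (1 - mask) * x       # blend known/unknown
            mu, sigma = toy_model(x_in, t_cur)
            x = mu + sigma * rng.standard_normal(x.shape)
        else:                                          # resample: one forward step
            beta = betas[t_cur - 1]                    # beta_t (0-indexed array)
            x = np.sqrt(1 - beta) * x + np.sqrt(beta) * rng.standard_normal(x.shape)
    # Paste the exact known pixels into the result (convenience assumption).
    return mask * x_clean + (1 - mask) * x

rng = np.random.default_rng(0)
shape = (1, 3, 8, 8)                      # tiny toy size instead of 256x256
x_clean = rng.uniform(-1, 1, size=shape)
mask = np.zeros(shape); mask[..., :4] = 1.0   # left half known
x_0 = repaint(x_clean, mask, rng)
```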
The Innovation
Use unconditional pre-trained diffusion models for inpainting by:
- Conditioning: At each denoising step, blend noisy ground truth into known regions
- Resampling: Jump forward and denoise again (10 times) to improve harmony
The Formula
At each timestep $t$:
\[x_t^{\text{conditioned}} = m \cdot \underbrace{(\sqrt{\bar{\alpha}_t} \cdot x^g + \sqrt{1-\bar{\alpha}_t} \cdot \epsilon)}_{x_t^g: \text{ noisy ground truth}} + (1-m) \cdot \underbrace{x_t}_{\text{generated noisy image}}\]

This ensures both regions have matching noise levels.
Three Steps to Remember
- Add noise to clean ground truth $x^g$ to get $x_t^g$ at noise level $t$
- Blend known regions (from $x_t^g$) and unknown regions (from $x_t$)
- Denoise with pre-trained model
Repeat with resampling for better results!
Callback to the Problem
We have essentially formulated tabular learning as an analogy to image inpainting.
The inpainting techniques introduced in this paper (the two key techniques: conditioning and resampling) transfer directly to the tabular domain, covering missing-value imputation as well as regression and classification tasks.
However, the main obstacle is that no pre-trained diffusion model exists in the tabular learning domain.
Additionally, the pre-trained diffusion model in this paper has a further constraint: the width and height of the generated image are fixed (e.g., 256×256). Even if a tabular diffusion model existed, this would be a limitation for tabular data with varying feature dimensions and row counts.