Lingze Personal website for life and research

TabPFN for downstream unsupervised tasks [tabpfn-extensions]

This note was generated with Claude Code while I read the source code of tabpfn-extensions. It walks through TabPFN's three unsupervised capabilities: imputation, synthetic data generation, and outlier detection. I have kept the questions that concerned me throughout the reading.

Table of Contents

  1. Core Approach: Unsupervised → Supervised
  2. TabPFN Regression Architecture
  3. Task 1: Imputation
  4. Task 2: Synthetic Data Generation
  5. Task 3: Outlier Detection
  6. Comparison

1. Core Approach: Unsupervised → Supervised

Key Insight

All three unsupervised tasks leverage the pre-trained TabPFN by converting unsupervised problems into supervised learning problems.

The Main Pattern

# Unsupervised: Work with unlabeled data X

# TabPFN Solution: Treat features as labels!
X_input = X[:, conditioning_features]  # Some features as inputs
y_target = X[:, target_feature]        # One feature as "label"

model.fit(X_input, y_target)  # Standard supervised learning!

Example:

  • Original data: [Age, Income, Credit] (no labels)
  • Reframe: Use [Age, Income] to predict Credit
  • Learn: P(Credit | Age, Income) via supervised learning

Core Engine: density_() Function

The shared workhorse for all three tasks:

def density_(X_predict, X_fit, conditional_idx, column_idx, categorical):
    """
    Converts unsupervised → supervised.
    Learns: P(feature_column | features_conditional)
    """
    # 1. Reframe features: some columns as inputs, one column as the target
    X_train = X_fit[:, conditional_idx]      # Input features
    y_train = X_fit[:, column_idx]           # Target feature
    X_test = X_predict[:, conditional_idx]   # Same reframing for the rows to predict
    y_test = X_predict[:, column_idx]

    # 2. Select model (classifier for categorical targets, regressor otherwise)
    model = tabpfn_clf if categorical else tabpfn_reg

    # 3. Fit and return the model plus the reframed test split
    model.fit(X_train, y_train)
    return model, X_test, y_test

This function is called repeatedly with different feature combinations, building up the solution through multiple supervised problems!
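
To make the "called repeatedly" pattern concrete, here is a minimal runnable sketch of the driver loop. The `fit_conditional` helper is a hypothetical stand-in for `density_()`; the actual TabPFN fit is shown only as a comment, so the example just returns the reframed supervised problem for each column.

```python
import numpy as np

def fit_conditional(X_fit, conditional_idx, column_idx, is_categorical):
    """Hypothetical density_()-style helper: reframe one column as the
    supervised target and the remaining columns as inputs."""
    X_train = X_fit[:, conditional_idx]
    y_train = X_fit[:, column_idx]
    # In the real library this would be roughly:
    # model = TabPFNClassifier() if is_categorical else TabPFNRegressor()
    # model.fit(X_train, y_train)
    return X_train, y_train  # stand-in: return the reframed problem

X = np.array([[25.0, 50_000.0, 750.0],
              [30.0, 60_000.0, 720.0],
              [35.0, 70_000.0, 800.0]])

# One supervised problem per column: predict column j from all other columns.
problems = []
for j in range(X.shape[1]):
    others = [k for k in range(X.shape[1]) if k != j]
    problems.append(fit_conditional(X, others, j, is_categorical=False))

print(len(problems))         # 3 supervised problems from 3 features
print(problems[0][0].shape)  # (3, 2): inputs are the other two columns
```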

2. TabPFN Regression Architecture

The Surprising Truth: Regression = 50-Class Classification

TabPFN doesn’t do traditional regression. Instead, it performs classification over discretized value ranges.

How It Works

Step 1: Discretize Target Range into 50 Bins

Target range [100K, 500K] → 50 equal bins
Bin 0:  [100K, 108K]
Bin 1:  [108K, 116K]
...
Bin 49: [492K, 500K]

Step 2: Train as 50-Way Classifier

  • Convert continuous values to bin indices (class labels)
  • Train with cross-entropy loss (standard classification)
  • Output: Probabilities over 50 bins

Step 3: Two-Stage Sampling

Stage 1: Sample which bin (categorical sampling from probabilities)
         e.g., Bin 2 selected with 50% probability

Stage 2: Sample value uniformly within bin
         e.g., uniform(5.0, 7.5) → 6.2
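
The two-stage sampling above can be sketched in a few lines of NumPy. The bin probabilities here are a random placeholder for what the model would emit; the mechanics (categorical draw over bins, then a uniform draw within the chosen bin) match the description.

```python
import numpy as np

rng = np.random.default_rng(0)

# Setup: 50 equal-width bins over the target range, plus placeholder
# bin probabilities standing in for the model's output distribution.
low, high, n_bins = 100_000.0, 500_000.0, 50
edges = np.linspace(low, high, n_bins + 1)
probs = rng.dirichlet(np.ones(n_bins))

def sample_value(probs, edges, rng):
    # Stage 1: sample a bin index from the categorical distribution.
    b = rng.choice(len(probs), p=probs)
    # Stage 2: sample uniformly within that bin's range.
    return rng.uniform(edges[b], edges[b + 1])

samples = np.array([sample_value(probs, edges, rng) for _ in range(1000)])
assert low <= samples.min() and samples.max() <= high
```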

Key Properties

| Property       | Value                    | Notes                       |
|----------------|--------------------------|-----------------------------|
| Number of bins | 50 (typical)             | Fixed in architecture       |
| Bin ranges     | Dynamic                  | Adapt to dataset            |
| Output         | Probability distribution | Uncertainty quantification  |
| Loss           | Cross-entropy            | Classification loss         |
Advantages: uncertainty quantification, multi-modal output distributions, transformer-friendly training.
Trade-off: discretization error (up to ±half a bin width) in exchange for these uncertainty estimates.
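
The "±half bin width" bound can be checked numerically. This sketch maps continuous values to their bin and reconstructs them from the bin midpoint; the reconstruction error never exceeds half the bin width.

```python
import numpy as np

# Discretization error check: value → bin index → bin midpoint.
low, high, n_bins = 100_000.0, 500_000.0, 50
edges = np.linspace(low, high, n_bins + 1)
width = (high - low) / n_bins  # 8_000.0 per bin

values = np.random.default_rng(1).uniform(low, high, size=1000)
bins = np.clip(np.digitize(values, edges) - 1, 0, n_bins - 1)
midpoints = edges[bins] + width / 2

# Reconstructing from the midpoint is off by at most half a bin width.
assert np.all(np.abs(values - midpoints) <= width / 2 + 1e-9)
```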

3. Task 1: Imputation

Goal

Fill missing values (NaN) in datasets.

Strategy

Use all available features to predict each missing feature, processing column-by-column with row filtering.

Key Steps

  1. Identify columns with NaN - Find which features have missing values
  2. For each column with NaN:
    • Condition on ALL other features (maximize information)
    • Filter rows: Only process rows that have NaN in this specific column
    • Generate multiple permutations of conditioning features
    • Average predictions across permutations
    • Sample values (low temperature = deterministic)
    • Fill NaN positions in this column
  3. Move to next column and repeat
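
The column-by-column loop with row filtering can be sketched as below. To keep it runnable, `predict_column` is a stand-in that returns the column mean of observed rows; a real run would fit TabPFN conditioned on all other features instead.

```python
import numpy as np

def impute(X, predict_column):
    """Column-by-column imputation sketch with row filtering."""
    X = X.copy()
    for j in range(X.shape[1]):
        mask = np.isnan(X[:, j])   # rows with NaN in this specific column
        if not mask.any():
            continue               # skip columns without missing values
        X[mask, j] = predict_column(X[mask], j)
    return X

# Stand-in conditional: mean of the observed values in each column.
def mean_predictor_factory(X):
    means = np.nanmean(X, axis=0)
    return lambda X_cond, j: means[j]

X = np.array([[25.0, np.nan,    750.0],
              [30.0, 60_000.0,  np.nan],
              [35.0, 70_000.0,  800.0]])
X_filled = impute(X, mean_predictor_factory(X))
assert not np.isnan(X_filled).any()
```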

Example with Row Filtering

Input Data:
  Row 0: [25, NaN, 750]  ← Has NaN in column 1
  Row 1: [30, 60K, NaN]  ← Has NaN in column 2
  Row 2: [35, 70K, 800]  ← Complete, no NaN

Process Column 1 (Income):
  - Filter: Only Row 0 has NaN in column 1
  - Condition on: [Age=25, Credit=750] (all other features)
  - Predict: Income ≈ 52K
  - Fill: Row 0, Column 1 = 52K

Process Column 2 (Credit):
  - Filter: Only Row 1 has NaN in column 2
  - Condition on: [Age=30, Income=60K] (all other features)
  - Predict: Credit ≈ 745
  - Fill: Row 1, Column 2 = 745

Result:
  Row 0: [25, 52K, 750]  ✓ Imputed
  Row 1: [30, 60K, 745]  ✓ Imputed
  Row 2: [35, 70K, 800]  ✓ Unchanged

Key Characteristics

  • Column-wise processing: Iterate through columns with NaN
  • Row filtering: Only predict for rows with NaN in current column
  • Conditioning: Use ALL other features (maximum information)
  • Temperature: 0.000000001 (deterministic “best guess”)
  • Efficiency: Skip rows without missing values in current column

4. Task 2: Synthetic Data Generation

Goal

Generate realistic new samples from scratch.

Strategy

Sequential (autoregressive) generation: Generate features one-by-one, conditioning only on previously generated features.

Key Steps

  1. Start with all-NaN matrix
  2. Generate features sequentially (left to right)
  3. Each feature conditions on ONLY previous features
  4. Sample with higher temperature (diverse results)

How to Generate the First Feature?

Special case: The first feature has no previous features to condition on.

Solution: Learn the marginal distribution P(X₀)

  • Fit TabPFN with random noise as input and first feature as target
  • Model learns to ignore the meaningless input
  • Effectively learns: “What values does this feature typically take?”
  • Sample from this learned marginal distribution

Training: model.fit(random_noise, X_train[:, 0])
→ Learns P(Age) from training data

Prediction: model.predict(random_noise_test)
→ Samples Age values following training distribution

Example

Initial:  [NaN, NaN, NaN]

Step 1 - First Feature (Age):
  - No conditioning (no previous features)
  - Learn P(Age) using random noise as input
  - Sample: Age = 27
  Result:   [27, NaN, NaN]

Step 2 - Second Feature (Income):
  - Condition on: Age=27
  - Learn P(Income | Age=27)
  - Sample: Income = 53K
  Result:   [27, 53K, NaN]

Step 3 - Third Feature (Credit):
  - Condition on: Age=27, Income=53K
  - Learn P(Credit | Age=27, Income=53K)
  - Sample: Credit = 720
  Result:   [27, 53K, 720]

Synthetic sample generated!
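
The three steps above form a simple autoregressive loop. In this runnable sketch, `sample_feature` is a stand-in for "fit TabPFN on the previously generated columns, then sample"; here it draws from fixed normals purely to exercise the loop structure.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_feature(generated_prefix, j, rng):
    """Stand-in for a fitted TabPFN conditional P(X_j | X_0..X_{j-1});
    draws from a fixed normal so the loop is runnable."""
    locs = [27.0, 53_000.0, 720.0]
    return rng.normal(locs[j], 1.0)

n_samples, n_features = 4, 3
X_synth = np.full((n_samples, n_features), np.nan)  # start with all-NaN matrix

for j in range(n_features):      # generate features left to right
    for i in range(n_samples):
        # condition only on previously generated features X_synth[i, :j]
        X_synth[i, j] = sample_feature(X_synth[i, :j], j, rng)

assert not np.isnan(X_synth).any()
```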

Why Sequential?

When generating from scratch, future features don't exist yet, so we can only condition on what has been generated so far. This is the same causal, left-to-right factorization used by autoregressive language models.

Computational Cost

Each column requires TabPFN fitting:

  • Feature 0: Fit TabPFN to learn P(X₀)
  • Feature 1: Fit TabPFN to learn P(X₁ | X₀)
  • Feature 2: Fit TabPFN to learn P(X₂ | X₀,X₁)
  • …and so on

With permutations (n_permutations=3):

  • Each column: 3 fits (one per permutation)
  • Total for 5 features: 5 × 3 = 15 TabPFN fits

Why? Each conditional P(Xᵢ | X₀,…,Xᵢ₋₁) is a different supervised learning problem requiring a separate model fit.

Key Characteristics

  • First feature: Learn marginal P(X₀) using random noise input
  • Subsequent features: Condition on all previous features
  • Temperature: 1.0 (diverse samples)
  • Goal: Realistic diverse samples
  • Computation: One TabPFN fit per column (× n_permutations)

5. Task 3: Outlier Detection

Goal

Detect anomalous samples (entire tuples, not individual features).

Strategy

Compute joint probability P(X₁, X₂, …, Xₙ) using chain rule of probability.

Chain Rule Foundation

P(X₁, X₂, X₃) = P(X₁) × P(X₂|X₁) × P(X₃|X₁,X₂)

Different orderings are mathematically equivalent:
P(X₁, X₂, X₃) = P(X₂) × P(X₃|X₂) × P(X₁|X₂,X₃)
P(X₁, X₂, X₃) = P(X₃) × P(X₁|X₃) × P(X₂|X₁,X₃)
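
That ordering invariance can be verified numerically on a toy joint distribution. This sketch factorizes a full joint table over three binary features along several orderings and checks that every chain gives the same joint log-probability.

```python
import numpy as np

# Toy joint table P(X0, X1, X2) over three binary features.
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(8)).reshape(2, 2, 2)

def chain_log_prob(P, x, order):
    """log P(x) accumulated factor by factor along the given ordering."""
    log_p, fixed = 0.0, {}
    for axis in order:
        # Slice out already-conditioned axes, keep the rest free.
        idx = tuple(fixed.get(a, slice(None)) for a in range(3))
        sub = P[idx]
        free = [a for a in range(3) if a not in fixed]
        k = free.index(axis)
        # Marginalize the other free axes, then normalize to a conditional.
        marg = sub.sum(axis=tuple(i for i in range(sub.ndim) if i != k))
        log_p += np.log(marg[x[axis]] / marg.sum())
        fixed[axis] = x[axis]
    return log_p

x = (1, 0, 1)
orders = [(0, 1, 2), (2, 0, 1), (1, 2, 0)]
vals = [chain_log_prob(P, x, o) for o in orders]
assert np.allclose(vals, np.log(P[x]))  # all orderings agree with the joint
```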

Detailed Process: Step-by-Step Probability Evaluation

For each feature in the chain:

  1. Fit TabPFN on training data to learn P(current_feature | previous_features)
  2. Predict probability distribution for the test sample’s previous features
  3. Map the test sample’s ground truth value to this distribution
  4. Extract the probability of observing this specific ground truth value
  5. Accumulate this probability (multiply, or add in log space)

Each step = One TabPFN fit (different conditional, different model needed)

Algorithm Steps

  1. For each permutation of features:
    • Apply chain rule sequentially
    • Each feature conditions on previous features in ordering
    • For each step: Fit TabPFN → Get P(ground_truth | previous)
    • Multiply probabilities (add in log space)
  2. Average probabilities across permutations
  3. Return scores (lower = outlier)
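
The accumulation step is simple enough to show directly. Using the three conditional probabilities from the worked example in this section, we add log-probabilities and exponentiate at the end; averaging across permutations would repeat this per ordering and take the mean of the scores.

```python
import numpy as np

# Per-step conditionals for one permutation: P(X0), P(X1|X0), P(X2|X0,X1).
step_probs = [0.8, 0.001, 0.05]

log_p = np.sum(np.log(step_probs))  # accumulate in log space (numerically safer)
p_sample = np.exp(log_p)            # joint probability of the whole tuple

assert np.isclose(p_sample, 0.8 * 0.001 * 0.05)  # = 0.00004, a likely outlier
```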

Example with Detailed Steps

Test Sample: [Age=25, Income=200K, Credit=300]
Training Data: X_fit (100 samples)

Permutation: (Age, Income, Credit)

# ========================================
# Step 1: Evaluate P(Age=25)
# ========================================
Fit: model.fit(random_noise, X_fit[:, Age])
      Learns P(Age) from training distribution

Predict: distribution = model.predict(random_noise_test)
      Returns probability distribution over ages

Map: ground_truth = 25
     Extract: P(Age=25) from distribution = 0.8 

Accumulate: log_p = log(0.8) = -0.22

# ========================================
# Step 2: Evaluate P(Income=200K | Age=25)
# ========================================
Fit: model.fit(X_fit[:, Age], X_fit[:, Income])
      Learns P(Income | Age) from training

Predict: distribution = model.predict([Age=25])
      Returns probability distribution for Income given Age=25
      e.g., likely range [40K-60K] based on training

Map: ground_truth = 200K
     Extract: P(Income=200K | Age=25) from distribution = 0.001 
      200K is in the tail of the distribution! Unusual!

Accumulate: log_p = -0.22 + log(0.001) = -0.22 + (-6.91) = -7.13

# ========================================
# Step 3: Evaluate P(Credit=300 | Age=25, Income=200K)
# ========================================
Fit: model.fit(X_fit[:, [Age, Income]], X_fit[:, Credit])
      Learns P(Credit | Age, Income) from training

Predict: distribution = model.predict([[Age=25, Income=200K]])
      Returns probability distribution for Credit
      High income typically → high credit (700-850 range)

Map: ground_truth = 300
     Extract: P(Credit=300 | Age=25, Income=200K) = 0.05 
      300 is very low credit for 200K income!

Accumulate: log_p = -7.13 + log(0.05) = -7.13 + (-3.0) = -10.13

# ========================================
# Final Result
# ========================================
P(sample) = exp(log_p) = exp(-10.13) = 0.00004

→ OUTLIER! (Very low probability)

Tuple-Level Detection (Critical!)

We detect whether the ENTIRE SAMPLE is anomalous, not individual features.

Individual values (univariate):
- Age=25: ✓ Common
- Income=200K: ✓ Exists in training
- Credit=300: ✓ Exists in training

Combination (multivariate):
- Age=25 + Income=200K: ✗ Unusual relationship!
→ Sample is outlier due to violated relationships

Real-world example:

  • A 25-year-old earning $200K is unusual
  • Even though “25-year-olds” and “$200K earners” both exist separately
  • The relationship between age and income is what’s anomalous

Why Multiple Permutations?

Different orderings capture different anomaly patterns:

  • Permutation 1: “Income unusual for Age”
  • Permutation 2: “Age unusual for Income+Credit”
  • Permutation 3: “Credit unusual for Age+Income”

Averaging reduces sensitivity to ordering artifacts.

Computational Cost

Each sample evaluation:

  • Feature 0: Fit TabPFN to learn P(X₀), evaluate at ground truth
  • Feature 1: Fit TabPFN to learn P(X₁ | X₀), evaluate at ground truth
  • Feature 2: Fit TabPFN to learn P(X₂ | X₀,X₁), evaluate at ground truth
  • …and so on

With permutations (n_permutations=10):

  • Each sample: n_features × n_permutations TabPFN fits
  • For 100 samples with 5 features: 100 × 5 × 10 = 5,000 fits
  • Optimization: All samples use same models (fit once per feature per permutation)
  • Actual cost: 5 features × 10 permutations = 50 TabPFN fits total

Why? The same conditional P(Xᵢ | previous) applies to all samples, so fit once and evaluate for all.

Key Characteristics

  • Conditioning: PREVIOUS features only (chain rule)
  • Output: ONE score per sample (tuple-level)
  • Detection: Unusual relationships, not extreme values
  • Method: Mathematically rigorous (chain rule)
  • Computation: n_features × n_permutations TabPFN fits (shared across samples)

6. Comparison

| Aspect        | Imputation            | Synthesis              | Outlier Detection        |
|---------------|-----------------------|------------------------|--------------------------|
| Goal          | Fill NaN values       | Generate new samples   | Detect anomalies         |
| Input         | Partial data with NaN | n_samples (number)     | Complete samples         |
| Output        | Filled data           | Synthetic data         | Probability scores       |
| Conditioning  | ALL other features    | PREVIOUS features only | PREVIOUS features only   |
| Feature order | Column-wise (any)     | Sequential (0→n)       | Sequential (permutation) |
| Temperature   | 0.000000001           | 1.0                    | N/A (probability)        |
| Action        | Sample to fill        | Sample to generate     | Evaluate probability     |

Shared Components

All three tasks use:

  1. Pre-trained TabPFN (no retraining needed)
  2. density_() function (core engine)
  3. Feature reframing (features as labels)
  4. Permutation averaging (robustness)

Conclusion

TabPFN’s unsupervised extensions demonstrate a powerful paradigm:

Convert unsupervised problems into supervised ones, then leverage pre-trained models’ strengths.

The Key Technique: Treat features as both inputs and labels through the density_() function, enabling zero-shot performance on new datasets with uncertainty quantification and interpretable results.


Generated from technical analysis of tabpfn-extensions/src/tabpfn_extensions/unsupervised/