Lingze Personal website for life and research

TabPFN for downstream unsupervised tasks [tabpfn-extensions]

This note was generated with Claude Code while I read the source code of tabpfn-extensions. It walks through TabPFN's three unsupervised capabilities: imputation, synthetic data generation, and outlier detection. I have kept the questions that concerned me throughout the reading.

Table of Contents

  1. Core Approach: Unsupervised → Supervised
  2. TabPFN Regression Architecture
  3. Task 1: Imputation
  4. Task 2: Synthetic Data Generation
  5. Task 3: Outlier Detection
  6. Comparison

1. Core Approach: Unsupervised → Supervised

Key Insight

All three unsupervised tasks leverage the pre-trained TabPFN by converting unsupervised problems into supervised learning problems.

The Main Pattern

# Unsupervised: Work with unlabeled data X

# TabPFN Solution: Treat features as labels!
X_input = X[:, conditioning_features]  # Some features as inputs
y_target = X[:, target_feature]        # One feature as "label"

model.fit(X_input, y_target)  # Standard supervised learning!

Example:

  • Original data: [Age, Income, Credit] (no labels)
  • Reframe: Use [Age, Income] to predict Credit
  • Learn: P(Credit | Age, Income) via supervised learning

Core Engine: density_() Function

The shared workhorse for all three tasks:

def density_(X_predict, X_fit, conditional_idx, column_idx, categorical):
    """
    Converts unsupervised → supervised.
    Learns: P(feature_column | features_conditional)
    """
    # 1. Reframe features: some columns as inputs, one column as the target
    X_train = X_fit[:, conditional_idx]      # Input features
    y_train = X_fit[:, column_idx]           # Target feature
    X_test = X_predict[:, conditional_idx]   # Same reframing for the rows to predict
    y_test = X_predict[:, column_idx]

    # 2. Select model (classifier for categorical targets, regressor otherwise)
    model = tabpfn_clf if categorical else tabpfn_reg

    # 3. Fit and return the model plus the reframed test split
    model.fit(X_train, y_train)
    return model, X_test, y_test

This function is called repeatedly with different feature combinations, building up the solution through multiple supervised problems!
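
To make the "called repeatedly" pattern concrete, here is a minimal runnable sketch of the driver loop. The `fit_conditional` helper is a hypothetical stand-in for `density_()`; the actual TabPFN fit is shown only as a comment, so the example just returns the reframed supervised problem for each column.

```python
import numpy as np

def fit_conditional(X_fit, conditional_idx, column_idx, is_categorical):
    """Hypothetical density_()-style helper: reframe one column as the
    supervised target and the remaining columns as inputs."""
    X_train = X_fit[:, conditional_idx]
    y_train = X_fit[:, column_idx]
    # In the real library this would be roughly:
    # model = TabPFNClassifier() if is_categorical else TabPFNRegressor()
    # model.fit(X_train, y_train)
    return X_train, y_train  # stand-in: return the reframed problem

X = np.array([[25.0, 50_000.0, 750.0],
              [30.0, 60_000.0, 720.0],
              [35.0, 70_000.0, 800.0]])

# One supervised problem per column: predict column j from all other columns.
problems = []
for j in range(X.shape[1]):
    others = [k for k in range(X.shape[1]) if k != j]
    problems.append(fit_conditional(X, others, j, is_categorical=False))

print(len(problems))         # 3 supervised problems from 3 features
print(problems[0][0].shape)  # (3, 2): inputs are the other two columns
```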

2. TabPFN Regression Architecture

The Surprising Truth: Regression = 50-Class Classification

TabPFN doesn’t do traditional regression. Instead, it performs classification over discretized value ranges.

How It Works

Step 1: Discretize Target Range into 50 Bins

Target range [100K, 500K] → 50 equal bins
Bin 0:  [100K, 108K]
Bin 1:  [108K, 116K]
...
Bin 49: [492K, 500K]

Step 2: Train as 50-Way Classifier

  • Convert continuous values to bin indices (class labels)
  • Train with cross-entropy loss (standard classification)
  • Output: Probabilities over 50 bins

Step 3: Two-Stage Sampling

Stage 1: Sample which bin (categorical sampling from probabilities)
         e.g., Bin 2 selected with 50% probability

Stage 2: Sample value uniformly within bin
         e.g., uniform(5.0, 7.5) → 6.2
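
The two-stage sampling above can be sketched in a few lines of NumPy. The bin probabilities here are a random placeholder for what the model would emit; the mechanics (categorical draw over bins, then a uniform draw within the chosen bin) match the description.

```python
import numpy as np

rng = np.random.default_rng(0)

# Setup: 50 equal-width bins over the target range, plus placeholder
# bin probabilities standing in for the model's output distribution.
low, high, n_bins = 100_000.0, 500_000.0, 50
edges = np.linspace(low, high, n_bins + 1)
probs = rng.dirichlet(np.ones(n_bins))

def sample_value(probs, edges, rng):
    # Stage 1: sample a bin index from the categorical distribution.
    b = rng.choice(len(probs), p=probs)
    # Stage 2: sample uniformly within that bin's range.
    return rng.uniform(edges[b], edges[b + 1])

samples = np.array([sample_value(probs, edges, rng) for _ in range(1000)])
assert low <= samples.min() and samples.max() <= high
```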

Key Properties

| Property       | Value                    | Notes                       |
|----------------|--------------------------|-----------------------------|
| Number of bins | 50 (typical)             | Fixed in architecture       |
| Bin ranges     | Dynamic                  | Adapt to dataset            |
| Output         | Probability distribution | Uncertainty quantification  |
| Loss           | Cross-entropy            | Classification loss         |
Advantages: uncertainty quantification, multi-modal output distributions, transformer-friendly training.
Trade-off: discretization error (up to ±half a bin width) in exchange for these uncertainty estimates.
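
The "±half bin width" bound can be checked numerically. This sketch maps continuous values to their bin and reconstructs them from the bin midpoint; the reconstruction error never exceeds half the bin width.

```python
import numpy as np

# Discretization error check: value → bin index → bin midpoint.
low, high, n_bins = 100_000.0, 500_000.0, 50
edges = np.linspace(low, high, n_bins + 1)
width = (high - low) / n_bins  # 8_000.0 per bin

values = np.random.default_rng(1).uniform(low, high, size=1000)
bins = np.clip(np.digitize(values, edges) - 1, 0, n_bins - 1)
midpoints = edges[bins] + width / 2

# Reconstructing from the midpoint is off by at most half a bin width.
assert np.all(np.abs(values - midpoints) <= width / 2 + 1e-9)
```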

3. Task 1: Imputation

Goal

Fill missing values (NaN) in datasets.

Strategy

Use all available features to predict each missing feature, processing column-by-column with row filtering.

Key Steps

  1. Identify columns with NaN - Find which features have missing values
  2. For each column with NaN:
    • Condition on ALL other features (maximize information)
    • Filter rows: Only process rows that have NaN in this specific column
    • Generate multiple permutations of conditioning features
    • Average predictions across permutations
    • Sample values (low temperature = deterministic)
    • Fill NaN positions in this column
  3. Move to next column and repeat
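
The column-by-column loop with row filtering can be sketched as below. To keep it runnable, `predict_column` is a stand-in that returns the column mean of observed rows; a real run would fit TabPFN conditioned on all other features instead.

```python
import numpy as np

def impute(X, predict_column):
    """Column-by-column imputation sketch with row filtering."""
    X = X.copy()
    for j in range(X.shape[1]):
        mask = np.isnan(X[:, j])   # rows with NaN in this specific column
        if not mask.any():
            continue               # skip columns without missing values
        X[mask, j] = predict_column(X[mask], j)
    return X

# Stand-in conditional: mean of the observed values in each column.
def mean_predictor_factory(X):
    means = np.nanmean(X, axis=0)
    return lambda X_cond, j: means[j]

X = np.array([[25.0, np.nan,    750.0],
              [30.0, 60_000.0,  np.nan],
              [35.0, 70_000.0,  800.0]])
X_filled = impute(X, mean_predictor_factory(X))
assert not np.isnan(X_filled).any()
```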

Example with Row Filtering

Input Data:
  Row 0: [25, NaN, 750]  ← Has NaN in column 1
  Row 1: [30, 60K, NaN]  ← Has NaN in column 2
  Row 2: [35, 70K, 800]  ← Complete, no NaN

Process Column 1 (Income):
  - Filter: Only Row 0 has NaN in column 1
  - Condition on: [Age=25, Credit=750] (all other features)
  - Predict: Income ≈ 52K
  - Fill: Row 0, Column 1 = 52K

Process Column 2 (Credit):
  - Filter: Only Row 1 has NaN in column 2
  - Condition on: [Age=30, Income=60K] (all other features)
  - Predict: Credit ≈ 745
  - Fill: Row 1, Column 2 = 745

Result:
  Row 0: [25, 52K, 750]  ✓ Imputed
  Row 1: [30, 60K, 745]  ✓ Imputed
  Row 2: [35, 70K, 800]  ✓ Unchanged

Key Characteristics

  • Column-wise processing: Iterate through columns with NaN
  • Row filtering: Only predict for rows with NaN in current column
  • Conditioning: Use ALL other features (maximum information)
  • Temperature: 0.000000001 (deterministic “best guess”)
  • Efficiency: Skip rows without missing values in current column

4. Task 2: Synthetic Data Generation

Goal

Generate realistic new samples from scratch.

Strategy

Sequential (autoregressive) generation: Generate features one-by-one, conditioning only on previously generated features.

Key Steps

  1. Start with all-NaN matrix
  2. Generate features sequentially (left to right)
  3. Each feature conditions on ONLY previous features
  4. Sample with higher temperature (diverse results)

How to Generate the First Feature?

Special case: The first feature has no previous features to condition on.

Solution: Learn the marginal distribution P(X₀)

  • Fit TabPFN with random noise as input and first feature as target
  • Model learns to ignore the meaningless input
  • Effectively learns: “What values does this feature typically take?”
  • Sample from this learned marginal distribution

Training: model.fit(random_noise, X_train[:, 0])
→ Learns P(Age) from training data

Prediction: model.predict(random_noise_test)
→ Samples Age values following training distribution

Example

Initial:  [NaN, NaN, NaN]

Step 1 - First Feature (Age):
  - No conditioning (no previous features)
  - Learn P(Age) using random noise as input
  - Sample: Age = 27
  Result:   [27, NaN, NaN]

Step 2 - Second Feature (Income):
  - Condition on: Age=27
  - Learn P(Income | Age=27)
  - Sample: Income = 53K
  Result:   [27, 53K, NaN]

Step 3 - Third Feature (Credit):
  - Condition on: Age=27, Income=53K
  - Learn P(Credit | Age=27, Income=53K)
  - Sample: Credit = 720
  Result:   [27, 53K, 720]

Synthetic sample generated!
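
The three steps above form a simple autoregressive loop. In this runnable sketch, `sample_feature` is a stand-in for "fit TabPFN on the previously generated columns, then sample"; here it draws from fixed normals purely to exercise the loop structure.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_feature(generated_prefix, j, rng):
    """Stand-in for a fitted TabPFN conditional P(X_j | X_0..X_{j-1});
    draws from a fixed normal so the loop is runnable."""
    locs = [27.0, 53_000.0, 720.0]
    return rng.normal(locs[j], 1.0)

n_samples, n_features = 4, 3
X_synth = np.full((n_samples, n_features), np.nan)  # start with all-NaN matrix

for j in range(n_features):      # generate features left to right
    for i in range(n_samples):
        # condition only on previously generated features X_synth[i, :j]
        X_synth[i, j] = sample_feature(X_synth[i, :j], j, rng)

assert not np.isnan(X_synth).any()
```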

Why Sequential?

When generating from scratch, future features don't exist yet, so we can only condition on what has been generated so far. This is the same causal, left-to-right factorization used by autoregressive language models.

Computational Cost

Each column requires TabPFN fitting:

  • Feature 0: Fit TabPFN to learn P(X₀)
  • Feature 1: Fit TabPFN to learn P(X₁ | X₀)
  • Feature 2: Fit TabPFN to learn P(X₂ | X₀,X₁)
  • …and so on

With permutations (n_permutations=3):

  • Each column: 3 fits (one per permutation)
  • Total for 5 features: 5 × 3 = 15 TabPFN fits

Why? Each conditional P(Xᵢ | X₀,…,Xᵢ₋₁) is a different supervised learning problem requiring a separate model fit.

Key Characteristics

  • First feature: Learn marginal P(X₀) using random noise input
  • Subsequent features: Condition on all previous features
  • Temperature: 1.0 (diverse samples)
  • Goal: Realistic diverse samples
  • Computation: One TabPFN fit per column (× n_permutations)

5. Task 3: Outlier Detection

Goal

Detect anomalous samples (entire tuples, not individual features).

Strategy

Compute joint probability P(X₁, X₂, …, Xₙ) using chain rule of probability.

Chain Rule Foundation

P(X₁, X₂, X₃) = P(X₁) × P(X₂|X₁) × P(X₃|X₁,X₂)

Different orderings are mathematically equivalent:
P(X₁, X₂, X₃) = P(X₂) × P(X₃|X₂) × P(X₁|X₂,X₃)
P(X₁, X₂, X₃) = P(X₃) × P(X₁|X₃) × P(X₂|X₁,X₃)
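
That ordering invariance can be verified numerically on a toy joint distribution. This sketch factorizes a full joint table over three binary features along several orderings and checks that every chain gives the same joint log-probability.

```python
import numpy as np

# Toy joint table P(X0, X1, X2) over three binary features.
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(8)).reshape(2, 2, 2)

def chain_log_prob(P, x, order):
    """log P(x) accumulated factor by factor along the given ordering."""
    log_p, fixed = 0.0, {}
    for axis in order:
        # Slice out already-conditioned axes, keep the rest free.
        idx = tuple(fixed.get(a, slice(None)) for a in range(3))
        sub = P[idx]
        free = [a for a in range(3) if a not in fixed]
        k = free.index(axis)
        # Marginalize the other free axes, then normalize to a conditional.
        marg = sub.sum(axis=tuple(i for i in range(sub.ndim) if i != k))
        log_p += np.log(marg[x[axis]] / marg.sum())
        fixed[axis] = x[axis]
    return log_p

x = (1, 0, 1)
orders = [(0, 1, 2), (2, 0, 1), (1, 2, 0)]
vals = [chain_log_prob(P, x, o) for o in orders]
assert np.allclose(vals, np.log(P[x]))  # all orderings agree with the joint
```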

Detailed Process: Step-by-Step Probability Evaluation

For each feature in the chain:

  1. Fit TabPFN on training data to learn P(current_feature | previous_features)
  2. Predict probability distribution for the test sample’s previous features
  3. Map the test sample’s ground truth value to this distribution
  4. Extract the probability of observing this specific ground truth value
  5. Accumulate this probability (multiply, or add in log space)

Each step = One TabPFN fit (different conditional, different model needed)

Algorithm Steps

  1. For each permutation of features:
    • Apply chain rule sequentially
    • Each feature conditions on previous features in ordering
    • For each step: Fit TabPFN → Get P(ground_truth | previous)
    • Multiply probabilities (add in log space)
  2. Average probabilities across permutations
  3. Return scores (lower = outlier)
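
The accumulation step is simple enough to show directly. Using the three conditional probabilities from the worked example in this section, we add log-probabilities and exponentiate at the end; averaging across permutations would repeat this per ordering and take the mean of the scores.

```python
import numpy as np

# Per-step conditionals for one permutation: P(X0), P(X1|X0), P(X2|X0,X1).
step_probs = [0.8, 0.001, 0.05]

log_p = np.sum(np.log(step_probs))  # accumulate in log space (numerically safer)
p_sample = np.exp(log_p)            # joint probability of the whole tuple

assert np.isclose(p_sample, 0.8 * 0.001 * 0.05)  # = 0.00004, a likely outlier
```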

Example with Detailed Steps

Test Sample: [Age=25, Income=200K, Credit=300]
Training Data: X_fit (100 samples)

Permutation: (Age, Income, Credit)

# ========================================
# Step 1: Evaluate P(Age=25)
# ========================================
Fit: model.fit(random_noise, X_fit[:, Age])
      Learns P(Age) from training distribution

Predict: distribution = model.predict(random_noise_test)
      Returns probability distribution over ages

Map: ground_truth = 25
     Extract: P(Age=25) from distribution = 0.8 

Accumulate: log_p = log(0.8) = -0.22

# ========================================
# Step 2: Evaluate P(Income=200K | Age=25)
# ========================================
Fit: model.fit(X_fit[:, Age], X_fit[:, Income])
      Learns P(Income | Age) from training

Predict: distribution = model.predict([Age=25])
      Returns probability distribution for Income given Age=25
      e.g., likely range [40K-60K] based on training

Map: ground_truth = 200K
     Extract: P(Income=200K | Age=25) from distribution = 0.001 
      200K is in the tail of the distribution! Unusual!

Accumulate: log_p = -0.22 + log(0.001) = -0.22 + (-6.91) = -7.13

# ========================================
# Step 3: Evaluate P(Credit=300 | Age=25, Income=200K)
# ========================================
Fit: model.fit(X_fit[:, [Age, Income]], X_fit[:, Credit])
      Learns P(Credit | Age, Income) from training

Predict: distribution = model.predict([[Age=25, Income=200K]])
      Returns probability distribution for Credit
      High income typically → high credit (700-850 range)

Map: ground_truth = 300
     Extract: P(Credit=300 | Age=25, Income=200K) = 0.05 
      300 is very low credit for 200K income!

Accumulate: log_p = -7.13 + log(0.05) = -7.13 + (-3.0) = -10.13

# ========================================
# Final Result
# ========================================
P(sample) = exp(log_p) = exp(-10.13) = 0.00004

→ OUTLIER! (Very low probability)

Tuple-Level Detection (Critical!)

We detect whether the ENTIRE SAMPLE is anomalous, not individual features.

Individual values (univariate):
- Age=25: ✓ Common
- Income=200K: ✓ Exists in training
- Credit=300: ✓ Exists in training

Combination (multivariate):
- Age=25 + Income=200K: ✗ Unusual relationship!
→ Sample is outlier due to violated relationships

Real-world example:

  • A 25-year-old earning $200K is unusual
  • Even though “25-year-olds” and “$200K earners” both exist separately
  • The relationship between age and income is what’s anomalous

Why Multiple Permutations?

Different orderings capture different anomaly patterns:

  • Permutation 1: “Income unusual for Age”
  • Permutation 2: “Age unusual for Income+Credit”
  • Permutation 3: “Credit unusual for Age+Income”

Averaging reduces sensitivity to ordering artifacts.

Computational Cost

Each sample evaluation:

  • Feature 0: Fit TabPFN to learn P(X₀), evaluate at ground truth
  • Feature 1: Fit TabPFN to learn P(X₁ | X₀), evaluate at ground truth
  • Feature 2: Fit TabPFN to learn P(X₂ | X₀,X₁), evaluate at ground truth
  • …and so on

With permutations (n_permutations=10):

  • Each sample: n_features × n_permutations TabPFN fits
  • For 100 samples with 5 features: 100 × 5 × 10 = 5,000 fits
  • Optimization: All samples use same models (fit once per feature per permutation)
  • Actual cost: 5 features × 10 permutations = 50 TabPFN fits total

Why? The same conditional P(Xᵢ | previous) applies to all samples, so fit once and evaluate for all.

Key Characteristics

  • Conditioning: PREVIOUS features only (chain rule)
  • Output: ONE score per sample (tuple-level)
  • Detection: Unusual relationships, not extreme values
  • Method: Mathematically rigorous (chain rule)
  • Computation: n_features × n_permutations TabPFN fits (shared across samples)

6. Comparison

| Aspect        | Imputation            | Synthesis              | Outlier Detection        |
|---------------|-----------------------|------------------------|--------------------------|
| Goal          | Fill NaN values       | Generate new samples   | Detect anomalies         |
| Input         | Partial data with NaN | n_samples (number)     | Complete samples         |
| Output        | Filled data           | Synthetic data         | Probability scores       |
| Conditioning  | ALL other features    | PREVIOUS features only | PREVIOUS features only   |
| Feature order | Column-wise (any)     | Sequential (0→n)       | Sequential (permutation) |
| Temperature   | 0.000000001           | 1.0                    | N/A (probability)        |
| Action        | Sample to fill        | Sample to generate     | Evaluate probability     |

Shared Components

All three tasks use:

  1. Pre-trained TabPFN (no retraining needed)
  2. density_() function (core engine)
  3. Feature reframing (features as labels)
  4. Permutation averaging (robustness)

Conclusion

TabPFN’s unsupervised extensions demonstrate a powerful paradigm:

Convert unsupervised problems into supervised ones, then leverage pre-trained models’ strengths.

The Key Technique: Treat features as both inputs and labels through the density_() function, enabling zero-shot performance on new datasets with uncertainty quantification and interpretable results.


Generated from technical analysis of tabpfn-extensions/src/tabpfn_extensions/unsupervised/