This note was generated with Claude Code while reading the source code of tabpfn-extensions. It covers TabPFN's unsupervised learning capabilities: imputation, synthetic data generation, and outlier detection. It also records the questions that concerned me throughout the reading.
Table of Contents
- Core Approach: Unsupervised → Supervised
- TabPFN Regression Architecture
- Task 1: Imputation
- Task 2: Synthetic Data Generation
- Task 3: Outlier Detection
- Comparison
1. Core Approach: Unsupervised → Supervised
Key Insight
All three unsupervised tasks leverage the pre-trained TabPFN by converting unsupervised problems into supervised learning problems.
The Main Pattern
```python
# Unsupervised: work with unlabeled data X
# TabPFN solution: treat features as labels!
X_input = X[:, conditioning_features]  # some features as inputs
y_target = X[:, target_feature]        # one feature as the "label"
model.fit(X_input, y_target)           # standard supervised learning
```
Example:
- Original data: `[Age, Income, Credit]` (no labels)
- Reframe: use `[Age, Income]` to predict `Credit`
- Learn: `P(Credit | Age, Income)` via supervised learning
Core Engine: density_() Function
The shared workhorse for all three tasks:
```python
def density_(X_predict, X_fit, conditional_idx, column_idx):
    """
    Converts unsupervised → supervised.
    Learns: P(feature_column | features_conditional)
    """
    # 1. Extract features (reframe: one feature becomes the label)
    X_train = X_fit[:, conditional_idx]  # input features
    y_train = X_fit[:, column_idx]       # target feature
    # 2. Select model ('categorical' flags whether the target column is categorical)
    model = tabpfn_clf if categorical else tabpfn_reg
    # 3. Fit, then return the model plus the reframed prediction split
    model.fit(X_train, y_train)
    X_test = X_predict[:, conditional_idx]
    y_test = X_predict[:, column_idx]
    return model, X_test, y_test
```
This function is called repeatedly with different feature combinations, building up the solution through multiple supervised problems!
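The reframing step at the heart of `density_()` can be sketched in plain Python. The real function then fits a TabPFN classifier or regressor on the split; `split_features` and the toy matrix here are purely illustrative:

```python
# Toy stand-in for the feature-reframing step inside density_():
# split a data matrix into conditioning inputs and one target column.
def split_features(X, conditional_idx, column_idx):
    X_input = [[row[j] for j in conditional_idx] for row in X]
    y_target = [row[column_idx] for row in X]
    return X_input, y_target

X = [[25, 50_000, 750],
     [30, 60_000, 760],
     [35, 70_000, 800]]

# Condition on Age and Credit (columns 0, 2) to predict Income (column 1)
X_in, y = split_features(X, conditional_idx=[0, 2], column_idx=1)
```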
2. TabPFN Regression Architecture
The Surprising Truth: Regression = 50-Class Classification
TabPFN doesn’t do traditional regression. Instead, it performs classification over discretized value ranges.
How It Works
Step 1: Discretize Target Range into 50 Bins
Target range [100K, 500K] → 50 equal bins
Bin 0: [100K, 108K]
Bin 1: [108K, 116K]
...
Bin 49: [492K, 500K]
Step 2: Train as 50-Way Classifier
- Convert continuous values to bin indices (class labels)
- Train with cross-entropy loss (standard classification)
- Output: Probabilities over 50 bins
Step 3: Two-Stage Sampling
Stage 1: Sample which bin (categorical sampling from probabilities)
e.g., Bin 2 selected with 50% probability
Stage 2: Sample value uniformly within bin
e.g., uniform(5.0, 7.5) → 6.2
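The two-stage sampling above can be sketched in plain Python. The bin probabilities and edges here are toy values, not TabPFN output:

```python
import random

def sample_from_bins(bin_probs, bin_edges, rng=random.Random(0)):
    """Two-stage sampling: pick a bin by its probability, then a value
    uniformly inside that bin."""
    # Stage 1: categorical sampling over bins
    r, cum = rng.random(), 0.0
    idx = len(bin_probs) - 1
    for i, p in enumerate(bin_probs):
        cum += p
        if r < cum:
            idx = i
            break
    # Stage 2: uniform sampling within the chosen bin
    lo, hi = bin_edges[idx], bin_edges[idx + 1]
    return lo + rng.random() * (hi - lo)

# 4-bin toy distribution over the range [0, 10)
probs = [0.1, 0.5, 0.3, 0.1]
edges = [0.0, 2.5, 5.0, 7.5, 10.0]
draws = [sample_from_bins(probs, edges) for _ in range(1000)]
```

With many draws, the fraction landing in each bin approaches that bin's probability, while values within a bin stay uniform.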
Key Properties
| Property | Value | Notes |
|---|---|---|
| Number of bins | 50 (typical) | Fixed in architecture |
| Bin ranges | Dynamic | Adapt to dataset |
| Output | Probability distribution | Uncertainty quantification |
| Loss | Cross-entropy | Classification loss |
Advantages: uncertainty quantification, multi-modal distributions, transformer-friendly.
Trade-off: discretization error (±half bin width) in exchange for uncertainty estimates.
Task 1: Imputation
Goal
Fill missing values (NaN) in datasets.
Strategy
Use all available features to predict each missing feature, processing column-by-column with row filtering.
Key Steps
- Identify columns with NaN - Find which features have missing values
- For each column with NaN:
- Condition on ALL other features (maximize information)
- Filter rows: Only process rows that have NaN in this specific column
- Generate multiple permutations of conditioning features
- Average predictions across permutations
- Sample values (low temperature = deterministic)
- Fill NaN positions in this column
- Move to next column and repeat
Example with Row Filtering
Input Data:
Row 0: [25, NaN, 750] ← Has NaN in column 1
Row 1: [30, 60K, NaN] ← Has NaN in column 2
Row 2: [35, 70K, 800] ← Complete, no NaN
Process Column 1 (Income):
- Filter: Only Row 0 has NaN in column 1
- Condition on: [Age=25, Credit=750] (all other features)
- Predict: Income ≈ 52K
- Fill: Row 0, Column 1 = 52K
Process Column 2 (Credit):
- Filter: Only Row 1 has NaN in column 2
- Condition on: [Age=30, Income=60K] (all other features)
- Predict: Credit ≈ 745
- Fill: Row 1, Column 2 = 745
Result:
Row 0: [25, 52K, 750] ✓ Imputed
Row 1: [30, 60K, 745] ✓ Imputed
Row 2: [35, 70K, 800] ✓ Unchanged
Key Characteristics
- Column-wise processing: Iterate through columns with NaN
- Row filtering: Only predict for rows with NaN in current column
- Conditioning: Use ALL other features (maximum information)
- Temperature: 0.000000001 (deterministic “best guess”)
- Efficiency: Skip rows without missing values in current column
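The column-wise loop with row filtering can be sketched as follows. A 1-nearest-neighbour predictor stands in for TabPFN here, and `impute`/`make_nn_predictor` are illustrative names, not the extension's API:

```python
def impute(X, predict):
    """Column-by-column imputation: for each column with missing values,
    predict only the rows that are missing in that column, conditioning
    on all other columns (the filtering pattern described above)."""
    n_cols = len(X[0])
    for col in range(n_cols):
        missing = [i for i, row in enumerate(X) if row[col] is None]
        if not missing:
            continue  # skip columns without missing values
        others = [j for j in range(n_cols) if j != col]
        for i in missing:
            context = [X[i][j] for j in others]
            X[i][col] = predict(context, col)
    return X

def make_nn_predictor(X):
    """Stand-in predictor: 1-nearest neighbour over the complete rows."""
    complete = [row for row in X if None not in row]
    def predict(context, col):
        others = [j for j in range(len(complete[0])) if j != col]
        best = min(complete, key=lambda r: sum(
            (r[j] - c) ** 2 for j, c in zip(others, context)))
        return best[col]
    return predict

data = [[25, None, 750],   # NaN in column 1
        [30, 60, None],    # NaN in column 2
        [35, 70, 800]]     # complete
filled = impute([row[:] for row in data], make_nn_predictor(data))
```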
Task 2: Synthetic Data Generation
Goal
Generate realistic new samples from scratch.
Strategy
Sequential (autoregressive) generation: Generate features one-by-one, conditioning only on previously generated features.
Key Steps
- Start with all-NaN matrix
- Generate features sequentially (left to right)
- Each feature conditions on ONLY previous features
- Sample with higher temperature (diverse results)
How to Generate the First Feature?
Special case: The first feature has no previous features to condition on.
Solution: Learn the marginal distribution P(X₀)
- Fit TabPFN with random noise as input and first feature as target
- Model learns to ignore the meaningless input
- Effectively learns: “What values does this feature typically take?”
- Sample from this learned marginal distribution
```python
# Training: learns P(Age) from the training data
model.fit(random_noise, X_train[:, 0])

# Prediction: samples Age values following the training distribution
model.predict(random_noise_test)
```
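A minimal stand-in for the noise-input trick: a model that ignores its inputs and resamples the training targets, which is effectively what "learning the marginal" amounts to. `MarginalSampler` is illustrative, not part of the library:

```python
import random

class MarginalSampler:
    """Stand-in for 'fit TabPFN on random noise → learn the marginal':
    the inputs carry no signal, so prediction reduces to resampling
    the training targets."""
    def fit(self, X_noise, y):
        self.values = list(y)
        return self
    def predict(self, X_noise, rng=random.Random(0)):
        return [rng.choice(self.values) for _ in X_noise]

ages = [22, 25, 27, 31, 40]
noise = [[random.random()] for _ in ages]  # meaningless inputs
model = MarginalSampler().fit(noise, ages)
samples = model.predict([[0.5]] * 100)     # input value is irrelevant
```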
Example
Initial: [NaN, NaN, NaN]
Step 1 - First Feature (Age):
- No conditioning (no previous features)
- Learn P(Age) using random noise as input
- Sample: Age = 27
Result: [27, NaN, NaN]
Step 2 - Second Feature (Income):
- Condition on: Age=27
- Learn P(Income | Age=27)
- Sample: Income = 53K
Result: [27, 53K, NaN]
Step 3 - Third Feature (Credit):
- Condition on: Age=27, Income=53K
- Learn P(Credit | Age=27, Income=53K)
- Sample: Credit = 720
Result: [27, 53K, 720]
Synthetic sample generated!
Why Sequential?
When generating from scratch, future features don't exist yet, so each feature can only condition on what has been generated so far, much like the left-to-right (causal) ordering in autoregressive sequence models.
Computational Cost
Each column requires TabPFN fitting:
- Feature 0: Fit TabPFN to learn P(X₀)
- Feature 1: Fit TabPFN to learn P(X₁ | X₀)
- Feature 2: Fit TabPFN to learn P(X₂ | X₀, X₁)
- …and so on
With permutations (n_permutations=3):
- Each column: 3 fits (one per permutation)
- Total for 5 features: 5 × 3 = 15 TabPFN fits
Why? Each conditional P(Xᵢ | X₀, …, Xᵢ₋₁) is a different supervised learning problem requiring a separate model fit.
Key Characteristics
- First feature: Learn marginal P(X₀) using random noise input
- Subsequent features: Condition on all previous features
- Temperature: 1.0 (diverse samples)
- Goal: Realistic diverse samples
- Computation: One TabPFN fit per column (× n_permutations)
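The sequential generation loop can be sketched as below. `toy_conditional` replaces the per-column TabPFN fit with a trivial Gaussian rule and is purely illustrative:

```python
import random

def generate(n_samples, n_features, sample_conditional, rng=random.Random(0)):
    """Autoregressive generation: start all-None, fill features left to
    right, each conditioned only on the features generated so far."""
    X = [[None] * n_features for _ in range(n_samples)]
    for col in range(n_features):
        for row in X:
            prefix = row[:col]  # previously generated features only
            row[col] = sample_conditional(prefix, col, rng)
    return X

def toy_conditional(prefix, col, rng):
    """Toy conditional: each feature ~ previous feature + noise."""
    base = prefix[-1] if prefix else 25.0  # marginal for the first feature
    return base + rng.gauss(0, 1)

synthetic = generate(4, 3, toy_conditional)
```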
Task 3: Outlier Detection
Goal
Detect anomalous samples (entire tuples, not individual features).
Strategy
Compute joint probability P(X₁, X₂, …, Xₙ) using chain rule of probability.
Chain Rule Foundation
P(X₁, X₂, X₃) = P(X₁) × P(X₂|X₁) × P(X₃|X₁,X₂)
Different orderings are mathematically equivalent:
P(X₁, X₂, X₃) = P(X₂) × P(X₃|X₂) × P(X₁|X₂,X₃)
P(X₁, X₂, X₃) = P(X₃) × P(X₁|X₃) × P(X₂|X₁,X₃)
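The ordering-invariance of the chain rule can be checked numerically on a toy discrete joint distribution:

```python
# Joint distribution over two binary features, as an explicit table
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

def marginal(var, val):
    return sum(p for k, p in joint.items() if k[var] == val)

def conditional(var, val, given_var, given_val):
    num = sum(p for k, p in joint.items()
              if k[var] == val and k[given_var] == given_val)
    return num / marginal(given_var, given_val)

# Chain rule in two orderings: both must recover the joint probability
x0, x1 = 1, 0
order_a = marginal(0, x0) * conditional(1, x1, 0, x0)  # P(x0) * P(x1|x0)
order_b = marginal(1, x1) * conditional(0, x0, 1, x1)  # P(x1) * P(x0|x1)
```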
Detailed Process: Step-by-Step Probability Evaluation
For each feature in the chain:
- Fit TabPFN on training data to learn P(current_feature | previous_features)
- Predict the probability distribution given the test sample's previous features
- Map the test sample's ground-truth value onto this distribution
- Extract the probability of observing that specific ground-truth value
- Accumulate this probability (multiply, or add in log space)
Each step = one TabPFN fit (a different conditional requires a different model)
Algorithm Steps
- For each permutation of features:
  - Apply the chain rule sequentially
  - Each feature conditions on the previous features in the ordering
  - For each step: fit TabPFN → get P(ground_truth | previous)
  - Multiply probabilities (add in log space)
- Average probabilities across permutations
- Return scores (lower = more likely an outlier)
Example with Detailed Steps
Test Sample: [Age=25, Income=200K, Credit=300]
Training Data: X_fit (100 samples)
Permutation: (Age, Income, Credit)
# ========================================
# Step 1: Evaluate P(Age=25)
# ========================================
Fit: model.fit(random_noise, X_fit[:, Age])
→ Learns P(Age) from training distribution
Predict: distribution = model.predict(random_noise_test)
→ Returns probability distribution over ages
Map: ground_truth = 25
Extract: P(Age=25) from distribution = 0.8 ✓
Accumulate: log_p = log(0.8) = -0.22
# ========================================
# Step 2: Evaluate P(Income=200K | Age=25)
# ========================================
Fit: model.fit(X_fit[:, Age], X_fit[:, Income])
→ Learns P(Income | Age) from training
Predict: distribution = model.predict([Age=25])
→ Returns probability distribution for Income given Age=25
→ e.g., likely range [40K-60K] based on training
Map: ground_truth = 200K
Extract: P(Income=200K | Age=25) from distribution = 0.001 ✗
→ 200K is in the tail of the distribution! Unusual!
Accumulate: log_p = -0.22 + log(0.001) = -0.22 + (-6.91) = -7.13
# ========================================
# Step 3: Evaluate P(Credit=300 | Age=25, Income=200K)
# ========================================
Fit: model.fit(X_fit[:, [Age, Income]], X_fit[:, Credit])
→ Learns P(Credit | Age, Income) from training
Predict: distribution = model.predict([[Age=25, Income=200K]])
→ Returns probability distribution for Credit
→ High income typically → high credit (700-850 range)
Map: ground_truth = 300
Extract: P(Credit=300 | Age=25, Income=200K) = 0.05 ✗
→ 300 is very low credit for 200K income!
Accumulate: log_p = -7.13 + log(0.05) = -7.13 + (-3.0) = -10.13
# ========================================
# Final Result
# ========================================
P(sample) = exp(log_p) = exp(-10.13) = 0.00004
→ OUTLIER! (Very low probability)
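The log-space accumulation in the walkthrough above checks out numerically:

```python
import math

# Reproduce the log-space accumulation from the walkthrough above
step_probs = [0.8, 0.001, 0.05]  # P(Age), P(Income|Age), P(Credit|Age,Income)
log_p = sum(math.log(p) for p in step_probs)
joint_p = math.exp(log_p)
```

Summing logs instead of multiplying raw probabilities avoids numerical underflow when chains get long.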
Tuple-Level Detection (Critical!)
We detect whether the ENTIRE SAMPLE is anomalous, not individual features.
Individual values (univariate):
- Age=25: ✓ Common
- Income=200K: ✓ Exists in training
- Credit=300: ✓ Exists in training
Combination (multivariate):
- Age=25 + Income=200K: ✗ Unusual relationship!
→ Sample is outlier due to violated relationships
Real-world example:
- A 25-year-old earning $200K is unusual
- Even though “25-year-olds” and “$200K earners” both exist separately
- The relationship between age and income is what’s anomalous
Why Multiple Permutations?
Different orderings capture different anomaly patterns:
- Permutation 1: “Income unusual for Age”
- Permutation 2: “Age unusual for Income+Credit”
- Permutation 3: “Credit unusual for Age+Income”
Averaging reduces sensitivity to ordering artifacts.
Computational Cost
Each sample evaluation:
- Feature 0: Fit TabPFN to learn P(X₀), evaluate at ground truth
- Feature 1: Fit TabPFN to learn P(X₁ | X₀), evaluate at ground truth
- Feature 2: Fit TabPFN to learn P(X₂ | X₀, X₁), evaluate at ground truth
- …and so on
With permutations (n_permutations=10):
- Each sample: n_features × n_permutations TabPFN fits
- For 100 samples with 5 features: 100 × 5 × 10 = 5,000 fits
- Optimization: All samples use same models (fit once per feature per permutation)
- Actual cost: 5 features × 10 permutations = 50 TabPFN fits total
Why? The same conditional P(Xᵢ | previous) applies to all samples, so it is fit once and evaluated for all of them.
Key Characteristics
- Conditioning: PREVIOUS features only (chain rule)
- Output: ONE score per sample (tuple-level)
- Detection: Unusual relationships, not extreme values
- Method: Mathematically rigorous (chain rule)
- Computation: n_features × n_permutations TabPFN fits (shared across samples)
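The shared-model optimization can be sketched as follows. A marginal histogram stands in for each TabPFN conditional (so the prefix is ignored here), and all names are illustrative:

```python
import math

def outlier_scores(X_fit, X_test, fit_conditional):
    """Chain-rule scoring with the shared-model optimization: each
    conditional is fit ONCE on training data, then evaluated for
    every test sample."""
    n_features = len(X_fit[0])
    scores = [0.0] * len(X_test)
    for col in range(n_features):
        model = fit_conditional(X_fit, col)    # one fit per feature
        for i, row in enumerate(X_test):
            p = model(row[:col], row[col])     # P(ground truth | prefix)
            scores[i] += math.log(max(p, 1e-12))  # floor avoids log(0)
    return scores  # lower log-probability = more anomalous

def fit_marginal(X_fit, col):
    """Toy conditional: a marginal histogram that ignores the prefix."""
    values = [row[col] for row in X_fit]
    return lambda prefix, value: values.count(value) / len(values)

X_fit = [[0, 1], [0, 1], [1, 0], [1, 1]]
X_test = [[0, 1], [2, 2]]  # second sample's values never occur in training
scores = outlier_scores(X_fit, X_test, fit_marginal)
```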
Comparison
| Aspect | Imputation | Synthesis | Outlier Detection |
|---|---|---|---|
| Goal | Fill NaN values | Generate new samples | Detect anomalies |
| Input | Partial data with NaN | n_samples (number) | Complete samples |
| Output | Filled data | Synthetic data | Probability scores |
| Conditioning | ALL other features | PREVIOUS features only | PREVIOUS features only |
| Feature Order | Column-wise (any) | Sequential (0→n) | Sequential (permutation) |
| Temperature | 0.000000001 | 1.0 | N/A (probability) |
| Action | Sample to fill | Sample to generate | Evaluate probability |
Shared Components
All three tasks use:
- Pre-trained TabPFN (no retraining needed)
- The `density_()` function (core engine)
- Feature reframing (features as labels)
- Permutation averaging (robustness)
Conclusion
TabPFN’s unsupervised extensions demonstrate a powerful paradigm:
Convert unsupervised problems into supervised ones, then leverage pre-trained models’ strengths.
The Key Technique:
Treat features as both inputs and labels through the density_() function, enabling zero-shot performance on new datasets with uncertainty quantification and interpretable results.
Generated from technical analysis of tabpfn-extensions/src/tabpfn_extensions/unsupervised/