Conference: VLDB 2025 Workshop
Current feature stores (Feast, Feathr, Tecton) are reinventing the wheel—treating features as ML-specific objects when they’re conceptually just materialized views. SQL-ML argues feature stores should be a database problem, not an external ML stack component.
The Problem with Current Feature Stores
Modern feature stores like Feast and Feathr have several issues:
- Custom DSLs - Python-based APIs with embedded SQL, steep learning curve
- Proprietary Metadata - Each system maintains its own registry
- Lack of Standardization - No common interface across systems
- Reinventing Database Concepts - Features are essentially derived tables/materialized views
- Limited Optimizations - Only leverage underlying database optimizer, no cross-feature optimization
Comparison Table (from Paper)
| Feature Store | Language | Metadata | Optimizations |
|---------------|---------------------|-------------|----------------------------|
| SQL-ML | SQL | SQL engine | Host DB + Global opts |
| Feast | Python + SQL | Proprietary | Host DB only |
| Feathr | Python + Spark SQL | Proprietary | Spark optimizer |
SQL-ML Architecture
High-Level Design
SQL-ML operates as middleware between users and a host database:
[User]
↓ (SQL-ML Commands)
[SQL-ML Middleware]
├─ Parser (extended PostgreSQL)
├─ Analyzer (semantic checks)
├─ Planner (global optimizations)
└─ Metastore (feature metadata)
↓ (Standard SQL)
[Host Database - PostgreSQL]
└─ Actual feature data storage
Command Mapping
SQL-ML commands translate directly to database operations:
| SQL-ML Command | Host DB Command |
|---|---|
CREATE FEATURESTORE rideshare |
CREATE DATABASE rideshare |
DROP FEATURESTORE rideshare |
DROP DATABASE rideshare |
CREATE FEATURE trip_rollup AS SELECT... |
CREATE TABLE trip_rollup (...) |
DROP FEATURE trip_rollup |
DROP TABLE trip_rollup |
Key Features
1. SQL Extensions
Clean, declarative syntax for feature management:
-- Create a feature store
CREATE FEATURESTORE rideshare;
-- Define a feature with automatic partitioning
CREATE FEATURE rideshare.trip_rollup AS
SELECT driver_id, COUNT(*) AS total_trips
FROM rideshare_rawdata.driver_stats
GROUP BY driver_id
PARTITION BY creation_timestamp BY DAY;
-- Backfill historical data
UPDATE FEATURE rideshare.trip_rollup
WHERE creation_timestamp BETWEEN '2024-01-01' AND '2024-03-01';
-- Inspect feature metadata
DESCRIBE FEATURE rideshare.trip_rollup;
2. Partition-Based Updates
Features support 5 granularities:
- HOUR - Updates every hour
- DAY - Updates daily at midnight
- WEEK - Updates weekly (Sunday midnight)
- MONTH - Updates monthly (1st of month)
- YEAR - Updates yearly (New Year’s Day)
Each granularity has a background process that automatically refreshes matching features.
3. Global Optimization
The Innovation: When multiple features share source data, SQL-ML:
- Identifies common table scans with identical filters
- Creates shared temporary results (via SPOOL operator)
- Reuses intermediate data across features
Example:
-- Feature 1: Trip count per driver
CREATE FEATURE trip_rollup AS
SELECT driver_id, COUNT(*) as total_trips
FROM driver_stats
PARTITION BY creation_timestamp BY DAY;
-- Feature 2: Revenue per driver
CREATE FEATURE revenue_rollup AS
SELECT driver_id, SUM(revenue) as total_revenue
FROM driver_stats
PARTITION BY creation_timestamp BY DAY;
Without optimization: Scan driver_stats twice
With optimization: Scan once, spool results, aggregate separately
Traditional: Optimized:
Scan(driver_stats) Scan(driver_stats)
| |
| Filter(date)
+----+----+ |
| | Spool (temp)
Filter1 Filter2 / \
| | / \
Agg1 Agg2 Agg(COUNT) Agg(SUM)
4. Metadata Management
SQL-ML extends PostgreSQL’s catalog:
Extended Tables:
pg_namespace- Addednspkindcolumn to distinguish feature stores from schemaspg_class- Modified to distinguish features from regular tables/views
New System Views:
pg_feature- Query feature metadatapg_featurestore- Query feature store metadata
Example Query:
SELECT * FROM pg_feature;
-- Output:
feature_name | feature_store | partition_column | granularity | last_updated
-------------|---------------|------------------|-------------|-------------------
heart_data | health | record_dt | day | 01-22-2023 20:07:40