SQL-ML: A SQL-Centric Framework for Feature Stores

Conference: VLDB 2025 Workshop

Current feature stores (Feast, Feathr, Tecton) are reinventing the wheel—treating features as ML-specific objects when they’re conceptually just materialized views. SQL-ML argues feature stores should be a database problem, not an external ML stack component.

The Problem with Current Feature Stores

Modern feature stores like Feast and Feathr have several issues:

Custom DSLs - Python-based APIs with embedded SQL, steep learning curve
Proprietary Metadata - Each system maintains its own registry
Lack of Standardization - No common interface across systems
Reinventing Database Concepts - Features are essentially derived tables/materialized views
Limited Optimizations - Only leverage underlying database optimizer, no cross-feature optimization

Comparison Table (from Paper)

| Feature Store | Language            | Metadata    | Optimizations              |
|---------------|---------------------|-------------|----------------------------|
| SQL-ML        | SQL                 | SQL engine  | Host DB + Global opts      |
| Feast         | Python + SQL        | Proprietary | Host DB only               |
| Feathr        | Python + Spark SQL  | Proprietary | Spark optimizer            |

SQL-ML Architecture

High-Level Design

SQL-ML operates as middleware between users and a host database:

[User]
  ↓ (SQL-ML Commands)
[SQL-ML Middleware]
  ├─ Parser (extended PostgreSQL)
  ├─ Analyzer (semantic checks)
  ├─ Planner (global optimizations)
  └─ Metastore (feature metadata)
  ↓ (Standard SQL)
[Host Database - PostgreSQL]
  └─ Actual feature data storage

Command Mapping

SQL-ML commands translate directly to database operations:

SQL-ML Command	Host DB Command
`CREATE FEATURESTORE rideshare`	`CREATE DATABASE rideshare`
`DROP FEATURESTORE rideshare`	`DROP DATABASE rideshare`
`CREATE FEATURE trip_rollup AS SELECT...`	`CREATE TABLE trip_rollup (...)`
`DROP FEATURE trip_rollup`	`DROP TABLE trip_rollup`

Key Features

1. SQL Extensions

Clean, declarative syntax for feature management:

-- Create a feature store
CREATE FEATURESTORE rideshare;

-- Define a feature with automatic partitioning
CREATE FEATURE rideshare.trip_rollup AS
  SELECT driver_id, COUNT(*) AS total_trips
  FROM rideshare_rawdata.driver_stats
  GROUP BY driver_id
  PARTITION BY creation_timestamp BY DAY;

-- Backfill historical data
UPDATE FEATURE rideshare.trip_rollup
WHERE creation_timestamp BETWEEN '2024-01-01' AND '2024-03-01';

-- Inspect feature metadata
DESCRIBE FEATURE rideshare.trip_rollup;

2. Partition-Based Updates

Features support 5 granularities:

HOUR - Updates every hour
DAY - Updates daily at midnight
WEEK - Updates weekly (Sunday midnight)
MONTH - Updates monthly (1st of month)
YEAR - Updates yearly (New Year’s Day)

Each granularity has a background process that automatically refreshes matching features.

3. Global Optimization

The Innovation: When multiple features share source data, SQL-ML:

Identifies common table scans with identical filters
Creates shared temporary results (via SPOOL operator)
Reuses intermediate data across features

Example:

-- Feature 1: Trip count per driver
CREATE FEATURE trip_rollup AS
  SELECT driver_id, COUNT(*) as total_trips
  FROM driver_stats
  PARTITION BY creation_timestamp BY DAY;

-- Feature 2: Revenue per driver
CREATE FEATURE revenue_rollup AS
  SELECT driver_id, SUM(revenue) as total_revenue
  FROM driver_stats
  PARTITION BY creation_timestamp BY DAY;

Without optimization: Scan driver_stats twice With optimization: Scan once, spool results, aggregate separately

Traditional:                    Optimized:

  Scan(driver_stats)           Scan(driver_stats)
       |                              |
       |                         Filter(date)
  +----+----+                         |
  |         |                      Spool (temp)
Filter1  Filter2                    /    \
  |         |                      /      \
Agg1      Agg2            Agg(COUNT)   Agg(SUM)

4. Metadata Management

SQL-ML extends PostgreSQL’s catalog:

Extended Tables:

pg_namespace - Added nspkind column to distinguish feature stores from schemas
pg_class - Modified to distinguish features from regular tables/views

New System Views:

pg_feature - Query feature metadata
pg_featurestore - Query feature store metadata

Example Query:

SELECT * FROM pg_feature;

-- Output:
feature_name | feature_store | partition_column | granularity | last_updated
-------------|---------------|------------------|-------------|-------------------
heart_data   | health        | record_dt        | day         | 01-22-2023 20:07:40

SQL-ML: A SQL-Centric Framework for Feature Stores

The Problem with Current Feature Stores

Comparison Table (from Paper)

SQL-ML Architecture

High-Level Design

Command Mapping

Key Features

1. SQL Extensions

2. Partition-Based Updates

3. Global Optimization

4. Metadata Management

You may also enjoy:

A Row-Level Perspective on Feature Store

Optimizing Data Pipelines for Machine Learning in Feature Stores

Overivew: Enterprice-level Data Analytics Solution

Deep Retrieval in Bytedance RecSys

Relational Data Generation

End-to-end Optimization of Machine Learning Prediction Queries