Lingze Personal website for life and research

SQL-ML: A SQL-Centric Framework for Feature Stores

Conference: VLDB 2025 Workshop

Current feature stores (Feast, Feathr, Tecton) are reinventing the wheel—treating features as ML-specific objects when they’re conceptually just materialized views. SQL-ML argues feature stores should be a database problem, not an external ML stack component.


The Problem with Current Feature Stores

Modern feature stores like Feast and Feathr have several issues:

  1. Custom DSLs - Python-based APIs with embedded SQL, steep learning curve
  2. Proprietary Metadata - Each system maintains its own registry
  3. Lack of Standardization - No common interface across systems
  4. Reinventing Database Concepts - Features are essentially derived tables/materialized views
  5. Limited Optimizations - Only leverage underlying database optimizer, no cross-feature optimization

Comparison Table (from Paper)

| Feature Store | Language            | Metadata    | Optimizations              |
|---------------|---------------------|-------------|----------------------------|
| SQL-ML        | SQL                 | SQL engine  | Host DB + Global opts      |
| Feast         | Python + SQL        | Proprietary | Host DB only               |
| Feathr        | Python + Spark SQL  | Proprietary | Spark optimizer            |

SQL-ML Architecture

High-Level Design

SQL-ML operates as middleware between users and a host database:

[User]
  ↓ (SQL-ML Commands)
[SQL-ML Middleware]
  ├─ Parser (extended PostgreSQL)
  ├─ Analyzer (semantic checks)
  ├─ Planner (global optimizations)
  └─ Metastore (feature metadata)
  ↓ (Standard SQL)
[Host Database - PostgreSQL]
  └─ Actual feature data storage

Command Mapping

SQL-ML commands translate directly to database operations:

SQL-ML Command Host DB Command
CREATE FEATURESTORE rideshare CREATE DATABASE rideshare
DROP FEATURESTORE rideshare DROP DATABASE rideshare
CREATE FEATURE trip_rollup AS SELECT... CREATE TABLE trip_rollup (...)
DROP FEATURE trip_rollup DROP TABLE trip_rollup

Key Features

1. SQL Extensions

Clean, declarative syntax for feature management:

-- Create a feature store
CREATE FEATURESTORE rideshare;

-- Define a feature with automatic partitioning
CREATE FEATURE rideshare.trip_rollup AS
  SELECT driver_id, COUNT(*) AS total_trips
  FROM rideshare_rawdata.driver_stats
  GROUP BY driver_id
  PARTITION BY creation_timestamp BY DAY;

-- Backfill historical data
UPDATE FEATURE rideshare.trip_rollup
WHERE creation_timestamp BETWEEN '2024-01-01' AND '2024-03-01';

-- Inspect feature metadata
DESCRIBE FEATURE rideshare.trip_rollup;

2. Partition-Based Updates

Features support 5 granularities:

  • HOUR - Updates every hour
  • DAY - Updates daily at midnight
  • WEEK - Updates weekly (Sunday midnight)
  • MONTH - Updates monthly (1st of month)
  • YEAR - Updates yearly (New Year’s Day)

Each granularity has a background process that automatically refreshes matching features.

3. Global Optimization

The Innovation: When multiple features share source data, SQL-ML:

  1. Identifies common table scans with identical filters
  2. Creates shared temporary results (via SPOOL operator)
  3. Reuses intermediate data across features

Example:

-- Feature 1: Trip count per driver
CREATE FEATURE trip_rollup AS
  SELECT driver_id, COUNT(*) as total_trips
  FROM driver_stats
  PARTITION BY creation_timestamp BY DAY;

-- Feature 2: Revenue per driver
CREATE FEATURE revenue_rollup AS
  SELECT driver_id, SUM(revenue) as total_revenue
  FROM driver_stats
  PARTITION BY creation_timestamp BY DAY;

Without optimization: Scan driver_stats twice With optimization: Scan once, spool results, aggregate separately

Traditional:                    Optimized:

  Scan(driver_stats)           Scan(driver_stats)
       |                              |
       |                         Filter(date)
  +----+----+                         |
  |         |                      Spool (temp)
Filter1  Filter2                    /    \
  |         |                      /      \
Agg1      Agg2            Agg(COUNT)   Agg(SUM)

4. Metadata Management

SQL-ML extends PostgreSQL’s catalog:

Extended Tables:

  • pg_namespace - Added nspkind column to distinguish feature stores from schemas
  • pg_class - Modified to distinguish features from regular tables/views

New System Views:

  • pg_feature - Query feature metadata
  • pg_featurestore - Query feature store metadata

Example Query:

SELECT * FROM pg_feature;

-- Output:
feature_name | feature_store | partition_column | granularity | last_updated
-------------|---------------|------------------|-------------|-------------------
heart_data   | health        | record_dt        | day         | 01-22-2023 20:07:40