Lingze Personal website for life and research

Relational Data Generation

(ICML 2026) PLUREL: Synthetic Data unlocks Scaling laws for Relational Foundation Models.

In this blog, I only record the methodology for how PluRel generates the schema, the row-level foreign-key connectivity, and the temporal attribute.

Some of the examples were summarized from the PLUREL code repository by Claude Code.


1. Schema sampling

Stage 1 picks a DAG over tables. Each node = one table; each edge A → B means “table B has a FK column pointing to table A.”

Three layouts are sampled at the schema level:

  • BarabasiAlbert
  • ReverseRandomTree
  • WattsStrogatz

After the DAG is built, every node with no outgoing edges becomes an Activity table (large, gets a date column); everything else is an Entity table.
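The sink rule above can be sketched in a few lines. This is a minimal illustration, not the PLUREL code: the function name classify_tables and the four-table schema are made up here, and edges are written as (parent, child) pairs following the A → B convention.

```python
def classify_tables(nodes, edges):
    """Label sink nodes as Activity tables and all other nodes as Entity tables.

    `edges` holds (parent, child) pairs: the child table has an FK into the parent.
    """
    # A node with at least one outgoing edge is the parent of some other table.
    has_outgoing = {parent for parent, _child in edges}
    return {n: ("entity" if n in has_outgoing else "activity") for n in nodes}


# Hypothetical four-table schema, for illustration only.
nodes = ["users", "posts", "comments", "likes"]
edges = [
    ("users", "posts"),
    ("users", "comments"),
    ("posts", "comments"),
    ("posts", "likes"),
]
roles = classify_tables(nodes, edges)
```

Here comments and likes have no outgoing edges, so they end up as Activity tables, while users and posts are Entity tables.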

1.1 BarabasiAlbert — scale-free, hubs, multiple sinks

  • Preferential attachment: each new node connects to existing nodes with probability proportional to their current degree.
  • Result: power-law degree distribution → a few hub tables, most tables sparse.
  • Sinks are thinned (some incoming edges to leaf tables are dropped) so activity tables don’t get implausibly wide fan-in.
  • Many edges, multiple sinks, diamond patterns are common.
              users                  ← hub
           ╱  │  │  ╲
       posts  │  │   follows
         │ ╲  │  │  ╱  │
         │   ▼  ▼  ▼   │
         │  comments   │
         ▼     │       ▼
       likes   ▼     tags          ← 3 sinks  =  3 Activity tables
            reactions

The diamond users → posts → comments and users → comments means multiple paths exist between the same pair — features fan in through more than one route.
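A rough sketch of preferential attachment, written with the standard library only. Two things here are my assumptions rather than PLUREL's behavior: every edge is oriented from the older node to the newer one (which guarantees a DAG), and the sink-thinning step is left out. The name barabasi_albert_dag and the parameter m (edges added per new node) are hypothetical.

```python
import random


def barabasi_albert_dag(n, m, seed=0):
    """Grow a preferential-attachment graph and orient each edge old -> new.

    The old -> new orientation is an assumption made to keep the graph acyclic.
    Requires 1 <= m < n.
    """
    rng = random.Random(seed)
    edges = []
    # Each node appears in `pool` once per incident edge, so a uniform draw
    # from `pool` picks a node with probability proportional to its degree.
    pool = []
    targets = list(range(m))  # the first new node links to the m seed nodes
    for new in range(m, n):
        for old in targets:
            edges.append((old, new))  # table `new` gets an FK into table `old`
        pool.extend(targets)
        pool.extend([new] * m)
        # Draw m distinct targets for the next node, degree-proportionally.
        chosen = set()
        while len(chosen) < m:
            chosen.add(rng.choice(pool))
        targets = sorted(chosen)
    return edges


edges = barabasi_albert_dag(n=8, m=2, seed=1)
```

With m = 2 the edge count is 2(n − 2), which is where an edge count of roughly 2n comes from.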

1.2 ReverseRandomTree — fan-in to a single sink

  • A uniformly random tree with all edges oriented toward a randomly chosen root.
  • Exactly n − 1 edges, exactly one sink, no diamonds.
   customers   products   employees   coupons
        │         │           │         │
        └──┐   ┌──┘           └──┐  ┌───┘
           ▼   ▼                 ▼  ▼
         promos                  pay
            │                     │
            └─────────┐  ┌────────┘
                      ▼ ▼
                     orders        ← the single sink

Classic star / snowflake schema: many dimension tables feed exactly one fact table.
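One way to get a uniformly random labeled tree is to decode a random Prüfer sequence and then point every edge toward a random root. The sketch below does that; it is my own illustration (reverse_random_tree is a hypothetical name), not necessarily how PLUREL builds the tree.

```python
import heapq
import random
from collections import deque


def reverse_random_tree(n, seed=0):
    """Return (edges, root): a uniform random labeled tree with edges oriented toward root.

    Requires n >= 2. Each edge (child, parent) points one step closer to the root.
    """
    rng = random.Random(seed)
    # A uniform Pruefer sequence of length n - 2 decodes to a uniform labeled tree.
    prufer = [rng.randrange(n) for _ in range(n - 2)]
    degree = [1] * n
    for v in prufer:
        degree[v] += 1
    leaves = [v for v in range(n) if degree[v] == 1]
    heapq.heapify(leaves)
    undirected = []
    for v in prufer:
        leaf = heapq.heappop(leaves)
        undirected.append((leaf, v))
        degree[v] -= 1
        if degree[v] == 1:
            heapq.heappush(leaves, v)
    undirected.append((heapq.heappop(leaves), heapq.heappop(leaves)))

    # Orient every edge toward a randomly chosen root with a breadth-first search.
    root = rng.randrange(n)
    adj = {v: [] for v in range(n)}
    for a, b in undirected:
        adj[a].append(b)
        adj[b].append(a)
    edges, seen, queue = [], {root}, deque([root])
    while queue:
        v = queue.popleft()
        for u in adj[v]:
            if u not in seen:
                seen.add(u)
                edges.append((u, v))  # u -> v, one step closer to the root
                queue.append(u)
    return edges, root


edges, root = reverse_random_tree(9, seed=3)
```

Every non-root node has exactly one outgoing edge, so the root is the only sink and the edge count is n − 1.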

1.3 WattsStrogatz — small-world ring + shortcuts

  • Start from a ring where each node connects to its k nearest neighbors, then randomly rewire a small fraction of edges.
  • Result: mostly local/sequential edges plus a few long-range shortcuts.
  • High local clustering, short average path length.
   raw_events ──> sessions ──> daily_agg ──> weekly_agg ──> reports
        │                          ▲                            ▲
        └──────── shortcut ────────┘                            │
                              │                                 │
                              └────────── shortcut ─────────────┘

Feel: staged pipeline, mostly stage-by-stage with occasional cross-stage joins.
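A small-world sketch in the same spirit, standard library only. The ring-plus-rewiring part follows the description above; orienting each edge from the lower to the higher node index is an assumption I add so that the result is a DAG. The name watts_strogatz_dag and the parameters k and p are illustrative.

```python
import random


def watts_strogatz_dag(n, k, p, seed=0):
    """Ring lattice with k nearest neighbors, rewired with probability p.

    Requires even k with k < n. Edges are oriented low index -> high index,
    which is an assumption made here to obtain a DAG.
    """
    rng = random.Random(seed)
    undirected = set()
    for offset in range(1, k // 2 + 1):
        for v in range(n):
            undirected.add(frozenset((v, (v + offset) % n)))
    for offset in range(1, k // 2 + 1):
        for v in range(n):
            if rng.random() >= p:
                continue  # keep the local ring edge
            old = frozenset((v, (v + offset) % n))
            # Rewire the far endpoint to any node not already adjacent to v.
            candidates = [
                w for w in range(n)
                if w != v and frozenset((v, w)) not in undirected
            ]
            if candidates:
                undirected.remove(old)
                undirected.add(frozenset((v, rng.choice(candidates))))
    return sorted((min(e), max(e)) for e in undirected)


edges = watts_strogatz_dag(n=10, k=4, p=0.2, seed=0)
```

Rewiring replaces edges one for one, so the edge count stays at nk/2.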

1.4 Comparison

                               BarabasiAlbert   ReverseRandomTree   WattsStrogatz
Edge count                     ≈ 2n             n − 1               ≈ nk/2
# of sinks (Activity tables)   many             exactly 1           varies
Diamond / multi-path joins     yes              no                  rare
Hub location                   anywhere         root only           none
Real-world analogue            social graph     star/snowflake      staged pipeline

2. Connectivity generation

The schema DAG says “table C has a FK into table P” but not which specific parent row each child row links to. Stage 2 fills in those FK values using a hierarchical stochastic block model (HSBM) on the bipartite (parent, child) graph.

2.1 Pipeline

For each (parent, child) relationship in the schema:

  1. Sample HSBM hyperparameters.
    • num_levels (a small integer, e.g. 1–5).
    • For each level, a cluster count for each side (a small integer, e.g. 1–3).
    • Fresh draw per FK relationship — independent block structures.
  2. Assign a hierarchical cluster label to every row on each side. Rows are split contiguously by row index into base clusters; each row gets an L-tuple of cluster IDs across the L levels.

  3. Sample a block-probability matrix per level. For each level, a small matrix indexed by (parent cluster, child cluster):

    • off-diagonal entries: very small (e.g. ~0.001).
    • diagonal entries: large (e.g. 0.9). Strong same-cluster bias.
  4. Score every (parent, child) pair as a product across levels:

    score(a, b) = Π_l  P_l[ cluster_a[a, l], cluster_b[b, l] ]
    

    With L = 2 levels, a (parent, child) pair falls into one of four “distance” cases:

    case                          example score
    same fine cluster             0.9 × 0.9 = 0.81
    same coarse, different fine   0.9 × 0.001 ≈ 9e-4
    different coarse, same fine   0.001 × 0.9 ≈ 9e-4
    totally different             0.001 × 0.001 ≈ 1e-6
  5. Normalize per child row. For each child row b, normalize the scores across all candidate parents so they sum to 1. Children are normalized independently.

  6. Sample one parent per child from that distribution (inverse-CDF / categorical sampling).

  7. Write the sampled parent indices into the child’s FK column.
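The seven steps can be sketched roughly as follows. This is a simplified illustration, not the repository code: the hyperparameters are fixed instead of sampled, both sides use the same cluster counts, and the names hierarchical_labels and sample_fk_column are hypothetical.

```python
import random


def hierarchical_labels(num_rows, clusters_per_level):
    """Split rows contiguously into base clusters; return one tuple of per-level IDs per row."""
    num_base = 1
    for c in clusters_per_level:
        num_base *= c
    labels = []
    for row in range(num_rows):
        base = row * num_base // num_rows  # contiguous split by row index
        ids = []
        for c in reversed(clusters_per_level):  # peel off the finest level first
            ids.append(base % c)
            base //= c
        labels.append(tuple(reversed(ids)))
    return labels


def sample_fk_column(parent_labels, child_labels, block_probs, rng):
    """Sample one parent row index per child row from the HSBM scores."""
    fk = []
    for child in child_labels:
        scores = []
        for parent in parent_labels:
            s = 1.0
            for level, P in enumerate(block_probs):
                s *= P[parent[level]][child[level]]  # product across levels
            scores.append(s)
        # random.choices normalizes the weights, which is the per-child normalization.
        fk.append(rng.choices(range(len(parent_labels)), weights=scores, k=1)[0])
    return fk


rng = random.Random(0)
clusters_per_level = [2, 2]           # fixed here; the pipeline samples these
P = [[0.9, 0.001], [0.001, 0.9]]      # example block matrix, reused at both levels
parents = hierarchical_labels(8, clusters_per_level)
children = hierarchical_labels(12, clusters_per_level)
fk = sample_fk_column(parents, children, [P, P], rng)
```

The list fk holds one parent row index per child row, which is what gets written into the FK column.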

2.2 Worked example — posts(12) → users(8)

With num_levels = 2 and 2 clusters per level on each side (4 clusters total per side):

users:               posts:
  0,1 → (0,0)          0,1,2    → (0,0)
  2,3 → (0,1)          3,4,5    → (0,1)
  4,5 → (1,0)          6,7,8    → (1,0)
  6,7 → (1,1)          9,10,11  → (1,1)

Block-probability matrix (same shape at both levels):

            cluster_b=0   cluster_b=1
cluster_a=0   0.9          0.001
cluster_a=1   0.001        0.9

Multiplying the per-level entries gives the cluster-level score matrix, and the resulting bipartite join is approximately block-diagonal.

                      posts
              (0,0) (0,1) (1,0) (1,1)
       (0,0)  ████   ·     ·     ·
users  (0,1)   ·    ████   ·     ·
       (1,0)   ·     ·    ████   ·
       (1,1)   ·     ·     ·    ████
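To put numbers on the picture, the snippet below builds the 4 × 4 cluster score matrix from the example block matrix and normalizes one column. It is a small check of the arithmetic, using only the values from this example.

```python
from itertools import product

P = [[0.9, 0.001], [0.001, 0.9]]              # same matrix at both levels
clusters = list(product(range(2), repeat=2))  # (0,0), (0,1), (1,0), (1,1)

# score[(user_cluster, post_cluster)] = product of the per-level entries
score = {
    (a, b): P[a[0]][b[0]] * P[a[1]][b[1]]
    for a in clusters
    for b in clusters
}

# Every user cluster holds 2 rows, so cluster sizes cancel in the normalization.
child = (0, 0)
total = sum(score[(a, child)] for a in clusters)
mass = {a: score[(a, child)] / total for a in clusters}
```

For a post in cluster (0,0), mass[(0, 0)] comes out at roughly 0.998, so only about 0.2% of the probability lands off-block with these values.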

A typical draw:

post 0  → user 0      post 6  → user 4
post 1  → user 1      post 7  → user 5
post 2  → user 0      post 8  → user 1   ← rare cross-cluster stray
post 3  → user 3      post 9  → user 7
post 4  → user 2      post 10 → user 6
post 5  → user 3      post 11 → user 7

2.3 Determining the PK-FK pair

Sampling directly from the per-row distribution is mathematically equivalent to:

  1. Sample a parent cluster with probability proportional to (cluster size) × score(parent_cluster, child_cluster).
  2. Sample a parent row uniformly within that cluster.

So you only really need the small K_a × K_b cluster-block score matrix (plus cluster sizes) — you don’t need to compute scores per row pair.
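A quick numerical check of this equivalence, for one fixed child cluster. The cluster sizes and scores below are made-up illustration values, and both function names are hypothetical.

```python
import math


def row_level_probs(parent_cluster_of_row, cluster_score):
    """Normalize the per-row scores directly."""
    weights = [cluster_score[c] for c in parent_cluster_of_row]
    total = sum(weights)
    return [w / total for w in weights]


def two_stage_probs(parent_cluster_of_row, cluster_score):
    """Pick a cluster with weight size * score, then a row uniformly inside it."""
    sizes = {}
    for c in parent_cluster_of_row:
        sizes[c] = sizes.get(c, 0) + 1
    cluster_weight = {c: sizes[c] * cluster_score[c] for c in sizes}
    total = sum(cluster_weight.values())
    return [cluster_weight[c] / total / sizes[c] for c in parent_cluster_of_row]


# Six parent rows in clusters of uneven size (3, 1, 2); scores for one child cluster.
parent_cluster_of_row = [0, 0, 0, 1, 2, 2]
cluster_score = {0: 0.81, 1: 9e-4, 2: 1e-6}

direct = row_level_probs(parent_cluster_of_row, cluster_score)
staged = two_stage_probs(parent_cluster_of_row, cluster_score)
same = all(math.isclose(d, s) for d, s in zip(direct, staged))
```

The cluster-size factor is what makes the two agree when clusters are uneven; without it, rows in small clusters would be oversampled.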

2.4 Properties

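  • Block-diagonal joins. Most child rows link to parents in the same fine cluster. With the example values the diagonal entry is ~0.9 per level, and after per-child normalization nearly all of the probability mass sits on the matching block; the rare off-block strays come from the small off-diagonal probability.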
  • Hierarchical bleed-through. Off-block penalties multiply across levels. With more levels, totally-different cluster pairs become effectively unreachable, while “same coarse / different fine” pairs stay possible.
  • Many-to-one is natural. Children sample independently, so one parent can be picked by many children; some parents may be picked by none. No constraints enforce “every parent used.”
  • Multiple FKs are independent. If a child has FKs into two parents, each FK gets its own freshly sampled HSBM. There is no joint coupling across FK columns at this stage.

3. Temporal attribute

3.1 How date is generated

  • Only Activity tables (sinks) get a date column. Entity tables have no timestamp.
  • The dataset picks a random sub-window [min_ts, max_ts] from a wide configured range. This same window is shared by every Activity table in the database.
  • For each Activity table with N rows, the date column is N evenly-spaced timestamps between min_ts and max_ts.
  • val_timestamp and test_timestamp are placed at the 80% and 90% points of the window — used as eval split markers.

Key consequence: row index encodes time order. Row 0 is the earliest, row N−1 is the latest. Rows are never shuffled.

3.2 Time-series-flavored features

Independently of date, some feature columns can be generated as a function of row_idx with three components:

value(row_idx) = trend(row_idx) + cycle(row_idx) + AR(1)-noise(row_idx)

Activity tables get non-zero trend and cycle scales (so features look genuinely temporal); Entity tables get zero trend/cycle and high noise (so features look like pure noise — appropriate, since entities have no time).

When such a feature is plotted against the date column, it looks like a real time series — but this is purely a side-effect of indexing the generator by row position.
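A toy version of such a generator is below. The linear trend, the sinusoidal cycle, the period, and the AR(1) coefficient phi are all assumptions picked for illustration; the source only fixes the three-component structure.

```python
import math
import random


def temporal_feature(n, trend_scale, cycle_scale, noise_scale,
                     period=50, phi=0.8, seed=0):
    """value(row_idx) = trend + cycle + AR(1) noise, indexed by row position."""
    rng = random.Random(seed)
    values, ar = [], 0.0
    for row_idx in range(n):
        trend = trend_scale * row_idx / max(n - 1, 1)  # assumed linear trend
        cycle = cycle_scale * math.sin(2 * math.pi * row_idx / period)
        ar = phi * ar + rng.gauss(0.0, noise_scale)    # AR(1) recursion
        values.append(trend + cycle + ar)
    return values


# Activity-style column: visible trend and cycle, light noise.
activity_col = temporal_feature(200, trend_scale=2.0, cycle_scale=1.0, noise_scale=0.1)
# Entity-style column: no trend or cycle, noise only.
entity_col = temporal_feature(200, trend_scale=0.0, cycle_scale=0.0, noise_scale=1.0)
```

Nothing in the function reads a timestamp; the temporal look comes entirely from row_idx.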

3.3 Emergent temporal cohorts

Two facts combine into an interesting consequence:

  • HSBM cluster assignment is contiguous by row index.
  • The date column is monotonic with row index.

Therefore HSBM “communities” automatically become temporal cohorts:

posts cluster (0,0)  →  earliest 25% of the time window
posts cluster (0,1)  →  next 25%
posts cluster (1,0)  →  next 25%
posts cluster (1,1)  →  latest 25%

Because user-cluster c connects almost exclusively to post-cluster c, each user effectively gets a lifetime window:

user cluster   when those users post
(0, 0)         early period
(0, 1)         early-middle
(1, 0)         late-middle
(1, 1)         late

Rare off-diagonal links become “old user comes back” events. With more HSBM levels, cohorts nest into super-cohorts and sub-cohorts.
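The cohort effect follows from the two facts alone, which a few lines can confirm. The sketch expresses time as a fraction of the window and uses the 12-post, 4-cluster sizes from the worked example; cohort_windows is a hypothetical helper.

```python
def cohort_windows(num_rows, num_clusters):
    """Return, per base cluster, the (start, end) fraction of the time window it covers.

    Requires num_rows >= 2.
    """
    windows = {}
    for row in range(num_rows):
        cluster = row * num_clusters // num_rows  # contiguous split by row index
        t = row / (num_rows - 1)                  # evenly spaced date, as a fraction
        start, end = windows.get(cluster, (t, t))
        windows[cluster] = (min(start, t), max(end, t))
    return windows


windows = cohort_windows(num_rows=12, num_clusters=4)
```

Each cluster covers its own slice of the window and the slices never overlap, which is the cohort structure described above.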

3.4 Disadvantages

  • No event bursts, no business-hour or weekday effects. Real activity logs are spiky; here every inter-event gap is identical.
  • No referential-time consistency. A child row at time t1 can reference a parent activity row at a later time t2 > t1 — the HSBM ignores timestamps.
  • Hard cohort boundaries. Off-diagonal probabilities are tiny, so cohort boundaries are sharp, whereas real users have long tails of activity.
  • Independent FK cohorts. Each FK’s HSBM is sampled independently, so on a child table with two FKs the two cohort timings are uncorrelated. No “users from 2015 wrote about topics from 2015” coupling.
  • date is post-hoc, not causal. The timestamp column is added after features are generated; it is not a node in the causal DAG. Temporal-looking features are an artifact of generator indexing, not of the timestamp itself.