Relational Foundation Models -> the future of data engineering?
The End of Feature Engineering? How Relational Foundation Models are Rewriting the AI Playbook
- Introduction: The Data Scientist’s Burden
For decades, the life of a data scientist has been defined by manual labor. Before a single prediction can be made, teams are buried under a mountain of "drudgery": endless SQL queries, fragile ETL pipelines, and the painful maintenance of feature stores. This process of "flattening" data into a single table for a model to digest is not only a bottleneck—it is fundamentally reductive.
Jure Leskovec—Stanford professor and co-founder of Kumo—envisions a radical shift. Instead of humans hand-crafting summaries of data for machines, he proposes a world where AI reasons directly over raw, multi-tabular databases. This transition from manual feature engineering to Relational Foundation Models (RFMs) promises to turn databases into reasoning engines that require zero specific model training to deliver accurate business insights.
- Takeaway 1: Your Database is Actually a Hidden Graph
The fundamental shift begins with a change in perspective. While most enterprise data is stored in rows and columns, Leskovec argues that multi-tabular data is actually a complex network of nodes and edges.
In a standard three-table schema (Customer, Product, and Transaction), the data creates a "path" for a neural network to follow. A User ID connects to a Transaction ID, which in turn connects to a Product ID. By treating these connections as a graph, a model can navigate the relationships between entities rather than viewing them as isolated entries.
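The three-table example above can be sketched in a few lines of Python. This is a minimal illustration, not Kumo's implementation: the table contents, the column names, and the helpers `build_graph` and `products_reached` are all assumptions made up for this example. Each row becomes a node, and each foreign key becomes an edge.

```python
from collections import defaultdict

# Toy rows for the three tables; every name and value here is illustrative.
customers = [{"customer_id": "c1"}, {"customer_id": "c2"}]
products = [{"product_id": "p1"}, {"product_id": "p2"}]
transactions = [
    {"transaction_id": "t1", "customer_id": "c1", "product_id": "p1"},
    {"transaction_id": "t2", "customer_id": "c1", "product_id": "p2"},
    {"transaction_id": "t3", "customer_id": "c2", "product_id": "p2"},
]


def build_graph(customers, products, transactions):
    """Turn rows into nodes and foreign keys into undirected edges."""
    adjacency = defaultdict(set)
    for row in customers:
        adjacency[("customer", row["customer_id"])]  # register the node
    for row in products:
        adjacency[("product", row["product_id"])]
    for row in transactions:
        t_node = ("transaction", row["transaction_id"])
        for table, key in (("customer", "customer_id"), ("product", "product_id")):
            other = (table, row[key])
            adjacency[t_node].add(other)
            adjacency[other].add(t_node)
    return adjacency


def products_reached(graph, customer_id):
    """Follow customer -> transaction -> product paths."""
    found = set()
    for t_node in graph[("customer", customer_id)]:
        for node in graph[t_node]:
            if node[0] == "product":
                found.add(node[1])
    return found


graph = build_graph(customers, products, transactions)
print(sorted(products_reached(graph, "c1")))  # ['p1', 'p2']
```

Nothing is aggregated along the way: the path from a customer to a product still passes through each individual transaction row.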
"Biology—cells and molecules—and enterprise data are essentially the same: a graph of interactions. My view is that I can use these digital traces as a telescope into human behaviors or biological processes."
By applying graph transformers, AI can attend to these paths directly, performing a type of feature discovery that humans—with their inherent biases—simply cannot replicate.
- Takeaway 2: The "Outlandish" Reality of Zero-Shot Business Prediction
The most disruptive capability of the Relational Foundation Model (RFM) is its ability to perform "zero-shot" predictions. This means the model can make accurate predictions on a database it has never seen before, for a task it wasn't specifically trained on, without any additional weight updates.
This is achieved through In-Context Learning for databases. Much like a Large Language Model (LLM) uses a prompt, the RFM uses "labeled in-context examples" from the database.
- Mechanism: The model identifies subgraphs of historical data to "learn" patterns in a single forward pass. Because the model uses frozen, pre-trained weights, it doesn't need to learn what the columns mean ahead of time; it builds the predictive model "in its brain" during that forward pass.
- Domain-Agnostic Reasoning: The model identifies the "right relationships" and mathematical functions that appear in relational structures, making it truly domain-agnostic.
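As a loose analogy for the in-context idea described above, the sketch below makes a prediction from labeled context examples without fitting any parameters. It is a toy stand-in, not the RFM architecture: the function `predict_in_context`, the two-number feature tuples, and the churn labels are all hypothetical, and a real model would work from subgraphs rather than hand-made features. The only point it illustrates is that the labeled examples arrive as input, and the prediction is computed in one pass.

```python
import math


def predict_in_context(context, query, temperature=1.0):
    """Similarity-weighted average of the labels in `context`.

    No parameters are fitted: the labeled context examples are the only
    task-specific information, and they are consumed in a single pass.
    """
    # Negative squared distance plays the role of an attention score.
    scores = [
        -sum((a - b) ** 2 for a, b in zip(features, query)) / temperature
        for features, _ in context
    ]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return sum(w * label for w, (_, label) in zip(weights, context)) / total


# Hypothetical context: (orders in last 30 days, days since last order) -> churned?
context = [
    ((0.0, 9.0), 1),
    ((1.0, 8.0), 1),
    ((6.0, 1.0), 0),
    ((7.0, 2.0), 0),
]

print(predict_in_context(context, (6.5, 1.5)) < 0.5)  # True: resembles the active customers
print(predict_in_context(context, (0.5, 8.5)) > 0.5)  # True: resembles the churned customers
```

Swapping in a different set of labeled examples changes the task without touching the function, which is the property the zero-shot claim rests on.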
Furthermore, this architecture helps address the "black box" problem. By running the model backwards and using an LLM to translate the model's attention maps into human-readable text, practitioners can receive a textual explanation of which cells and columns drove a specific prediction. This represents a "GPT moment" for structured data: moving from bespoke, task-specific models to general-purpose reasoning.
- Takeaway 3: Why the "Single Table" is a Data Science Trap
The industry standard of "denormalization"—flattening tables for tools like XGBoost—is a trap. Leskovec argues that once you aggregate data, you lose the "nuance" that drives high-accuracy predictions.
When data is aggregated into a single table, critical signals are destroyed:
- Nuanced Statistics: Humans often choose a "median price" or "average price," losing the raw distribution of purchase behavior.
- Temporal Specificity: Aggregations lose the distinction between "morning shopping" versus "evening shopping."
- Contextual Sensitivity: Human-engineered features often miss how behaviors change during specific windows, such as holiday sleep patterns or daylight saving time transitions.
While single-table problems are largely "solved," the real "hidden wins" are in the raw, multi-table structure. RFMs avoid the trap by attending over millions of cells directly, allowing the neural network to perform its own feature discovery via gradient descent.
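A small example makes the loss of signal concrete. The two customers below are invented for illustration, as are the helper names `flatten`, `price_spread`, and `morning_share`. After flattening to an average and a median price, the two are indistinguishable, even though the raw rows differ in both price distribution and time of day.

```python
from statistics import mean, median

# Hypothetical purchase histories: (hour of day, price).
alice = [(8, 10.0), (9, 50.0), (8, 90.0)]     # morning shopper, wide price range
bob = [(20, 50.0), (21, 50.0), (22, 50.0)]    # evening shopper, constant price


def flatten(purchases):
    """The kind of per-customer summary a single-table model would see."""
    prices = [price for _, price in purchases]
    return {"avg_price": mean(prices), "median_price": median(prices)}


def price_spread(purchases):
    """Range of prices, visible only in the raw rows."""
    prices = [price for _, price in purchases]
    return max(prices) - min(prices)


def morning_share(purchases):
    """Fraction of purchases made before noon, visible only in the raw rows."""
    return sum(1 for hour, _ in purchases if hour < 12) / len(purchases)


print(flatten(alice) == flatten(bob))              # True: identical after aggregation
print(price_spread(alice), price_spread(bob))      # 80.0 0.0
print(morning_share(alice), morning_share(bob))    # 1.0 0.0
```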
- Takeaway 4: Building a "Digital Twin" from a Single Drop of Blood
The power of relational reasoning extends into the "AI Virtual Cell." Leskovec’s research demonstrates that biological systems can be modeled using the same hierarchical graph logic: Proteins → Cells → Patients.
By representing a cell as a collection of protein molecules (using 20,000-dimensional vectors of protein abundance), the model can pick up biological structure through self-supervised learning. The AI isn't programmed with human biological knowledge; instead, it learns cell types and states purely from the data.
A practical application is the "digital twin" created from a single drop of blood. Because blood circulates through the entire body, it captures the body's overall immune state. By profiling every cell in that drop using single-cell RNA-seq data, the model can detect diseases and predict patient trajectories without human bias interfering in the feature selection.
- Takeaway 5: Why AI Agents Will Fail Without High-Level Data APIs
As organizations deploy autonomous AI agents, the complexity of raw code has become a bottleneck. When agents are asked to navigate low-level implementation, they make "subtle mistakes," such as the information leakage seen in an Expedia fraud model project, where an agent incorrectly aggregated transactions up to midnight rather than up to the current time, so features could include activity from after the moment of prediction.
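The sketch below shows how this kind of leakage can arise, under one reading of the description above: a daily spend feature that is summed to the end of the day instead of stopping at the prediction time. The transactions, timestamps, and function names (`spend_until`, `end_of_day`) are hypothetical and are not taken from the project mentioned.

```python
from datetime import datetime

# Hypothetical transactions for one account: (timestamp, amount).
transactions = [
    (datetime(2024, 3, 1, 9, 0), 20.0),
    (datetime(2024, 3, 1, 14, 0), 35.0),
    (datetime(2024, 3, 1, 22, 0), 900.0),  # happens AFTER the prediction is made
]
prediction_time = datetime(2024, 3, 1, 15, 0)


def spend_until(transactions, cutoff):
    """Total amount spent up to and including `cutoff`."""
    return sum(amount for ts, amount in transactions if ts <= cutoff)


def end_of_day(ts):
    """Last representable instant of the calendar day containing `ts`."""
    return ts.replace(hour=23, minute=59, second=59, microsecond=999999)


# Leaky feature: the window runs to midnight, so it sees the 22:00 transaction.
leaky = spend_until(transactions, end_of_day(prediction_time))
# Correct feature: the window stops at the moment the prediction is made.
correct = spend_until(transactions, prediction_time)

print(leaky)    # 955.0
print(correct)  # 55.0
```

Both versions run without error and produce plausible numbers, which is why the mistake is easy to miss in a long hand-written pipeline.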
For agents to be effective, they require a higher level of abstraction:
- PyTorch/Raw Code: ~1,000 lines (20 steps). Too many opportunities for an agent to get lost.
- XGBoost: ~500 lines. Intermediate complexity, still prone to engineering errors.
- Relational API: ~50 lines (2 steps). Using a high-level API allows the agent to focus on the goal rather than the drudgery of implementation.
"For agents to be effective, they need 'agent-friendly APIs'—abstractions that allow them to make decisions without being bogged down by the drudgery of low-level implementation."
- The Verdict: From CPU Summaries to GPU Reasoning
The industry is witnessing a fundamental shift in compute. We are moving from CPU-heavy summarization (where humans write code to simplify data) to GPU-driven reasoning (where models ingest raw data). Crucially, this reasoning is scalable because the attention mechanism in these models is non-quadratic—it follows the graph structure, allowing it to handle millions of cells where standard LLMs would fail.
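A rough way to see why graph-structured attention scales differently is to count how many pairs each scheme has to score. The sketch below assumes each node attends only to itself and its direct neighbors. That is a simplification chosen for illustration, since the text does not specify how the model restricts attention, and the chain-shaped graph and function names are made up.

```python
def chain_graph(n):
    """Adjacency lists for n nodes linked in a line (a deliberately sparse toy graph)."""
    return {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}


def dense_attention_pairs(n):
    """Full self-attention scores every node against every node."""
    return n * n


def graph_attention_pairs(adjacency):
    """Each node attends to itself plus its direct neighbors only."""
    return sum(len(neighbors) + 1 for neighbors in adjacency.values())


graph = chain_graph(1000)
print(dense_attention_pairs(len(graph)))  # 1000000
print(graph_attention_pairs(graph))       # 2998
```

In this toy case the dense count grows with the square of the number of nodes, while the graph-following count grows with the number of edges.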
The real-world impact is already significant:
- Reddit: Achieved double-digit increases in ad click-through rates (CTR) by using graph embeddings alongside traditional features.
- DoorDash: Revolutionized restaurant recommendations and notification timing, impacting revenue by hundreds of millions of dollars.
- Coinbase: Scaled relational models across the entire Bitcoin blockchain to detect fraud.
As Relational Foundation Models take over the task of feature discovery, the role of the "Data Scientist" must evolve. The value of the expert shifts from manual data engineering to formulating the right questions. The future of AI lies not in how well we can summarize our databases, but in how effectively we can let the models reason across them.