TL;DR
How AI agents actually translate a product analytics question into governed SQL on your warehouse, why naive text-to-SQL breaks on retention and funnels, and how the semantic layer doubles as an access-control boundary.
When teams say "an AI agent that queries our data warehouse," they are usually picturing a chat box that turns a sentence into SQL. The interesting engineering is not the chat box. It is the layer between the prompt and the SQL: how the agent resolves ambiguous business language to specific entities, picks the right analytical method, and executes against governed data without leaking what the asking user is not allowed to see. This guide walks through the four steps an AI analytics agent actually runs, shows a side-by-side of naive versus governed SQL on a real product analytics question, and explains why the semantic layer ends up doing double duty as an access-control boundary.
The four steps an AI agent actually runs
Every production text-to-SQL agent for product analytics follows the same loop, even when vendors describe it differently:
- Intent parsing. Translate the question into a structured analytical intent: what kind of question is it (lookup, segmentation, funnel, cohort retention, attribution), what entities are involved, what time window is implied.
- Schema and semantic resolution. Map every business term in the question to a concrete entity, metric, or event in the semantic layer. This is where most failures originate.
- SQL generation. Compose SQL that implements both the metric definition and the analytical method, then validate it against schema and policy before it runs.
- Execution and verification. Run the SQL on live warehouse data using the asking user's permissions, return the result, and surface the SQL plus assumptions so an analyst can review.
A generic LLM can do step 1 and a passable version of step 3. The trouble starts at step 2, and step 4 is where governance lives. For a category-level treatment of how these systems differ from generic chat, see what is agentic analytics.
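The four steps above can be sketched as a pipeline of four functions. This is a minimal illustration, not a real implementation: every name (`Intent`, `parse_intent`, `resolve_semantics`, and so on) is hypothetical, and in production step 1 is an LLM call rather than a stub.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical structures for illustration; production agents use richer ones.
@dataclass
class Intent:
    question_class: str                      # "lookup", "funnel", "cohort_retention", ...
    entities: list = field(default_factory=list)
    time_window: Optional[str] = None

def parse_intent(question: str) -> Intent:
    # Step 1: in production this is an LLM call; here a stub showing the shape.
    return Intent(question_class="cohort_retention",
                  entities=["onboarding_completed"],
                  time_window="last quarter")

def resolve_semantics(intent: Intent, semantic_layer: dict) -> dict:
    # Step 2: every business term must map to an approved definition,
    # or the question is rejected before any SQL exists.
    return {term: semantic_layer[term] for term in intent.entities}

def generate_sql(intent: Intent, resolution: dict) -> str:
    # Step 3: compose SQL from metric definitions plus method rules.
    return f"-- SQL for {intent.question_class} over {sorted(resolution)}"

def execute(sql: str, user: str) -> dict:
    # Step 4: run under the asking user's identity; return SQL for review.
    return {"user": user, "sql": sql, "rows": []}
```

The point of the shape is that step 2 is a hard gate: a term that does not resolve against the semantic layer never reaches SQL generation at all.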
Why naive text-to-SQL breaks on product analytics questions
On simple lookups against a clean star schema, modern LLMs hit accuracy numbers that look impressive in a demo. Recent evaluation work, including Snowflake's Cortex Analyst paper and the widely shared text-to-SQL performance cliff analysis, agrees on the same pattern: the moment questions involve cohorts, funnels, time windows, or stitched user identities, accuracy drops sharply. Product analytics questions live almost entirely on the wrong side of that cliff.
The failure modes are predictable. A model staring at raw event tables will invent columns it has seen in training data, pick the wrong join key when a user identity is split across anonymous and authenticated sessions, treat a phrase like "last quarter" as a rolling 90-day window, skip deduplication because it looks like a stylistic choice, and apply funnel ordering rules that no analyst would accept. None of these are syntax errors. The SQL runs and returns a number.
The number is wrong, and nothing about the output flags that.
This is also where the difference between a generic LLM assistant and an analytics agent shows up. We covered the failure surface in detail in AI analytics hallucinations and SQL transparency and the methodology gap in semantic layer plus agentic analytics.
A real example: "What was our 30-day retention for users who completed onboarding last quarter?"
Assume a typical product analytics warehouse layout. There is an events table with one row per event (user_id, anonymous_id, event_name, event_time, properties) and a users table with profile attributes. The question seems narrow, but it implies four analytical decisions: which event marks onboarding completion, what calendar window counts as last quarter, how to define a 30-day retention window per user, and how to count a user once across stitched identities.
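The "last quarter" decision alone is worth pinning down. Here is a minimal sketch (function name illustrative) of resolving it to strict calendar boundaries rather than the rolling 90-day window a naive model tends to guess:

```python
from datetime import date

def last_quarter_bounds(today: date) -> tuple:
    """Strict calendar boundaries for 'last quarter' as a [start, end) pair."""
    q = (today.month - 1) // 3           # current quarter index, 0-3
    start_month = 3 * ((q - 1) % 4) + 1  # first month of the previous quarter
    year = today.year if q > 0 else today.year - 1
    start = date(year, start_month, 1)
    end_month = start_month + 3
    end = date(year + 1, 1, 1) if end_month > 12 else date(year, end_month, 1)
    return start, end
```

Asked in May 2026, this returns Q1 2026 as `[2026-01-01, 2026-04-01)`; asked in February 2026, it correctly rolls back to Q4 2025. A rolling `CURRENT_DATE - 90 days` window agrees with neither.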
Here is what a naive LLM with raw schema access typically produces. It looks reasonable on first read.
```sql
-- Naive LLM output: runs cleanly, returns a number, but the number is wrong.
SELECT
  COUNT(DISTINCT u.id) AS retained_users,
  COUNT(DISTINCT u.id) * 100.0 / NULLIF(
    (SELECT COUNT(*) FROM users WHERE onboarding_completed), -- invented boolean column
    0
  ) AS retention_rate
FROM users u
JOIN events e
  ON e.user_id = u.id -- ignores anonymous_id stitching pre-login
WHERE u.onboarding_completed = TRUE -- invented again
  AND e.event_time >= CURRENT_DATE - INTERVAL '90 days' -- "last quarter" guessed as 90 days
  AND e.event_time <= CURRENT_DATE; -- no per-user 30-day window from onboarding
```

Five things are wrong, and an analyst would catch all of them in seconds. Onboarding completion is an event, not a column on the users table. Last quarter has explicit calendar boundaries, not a rolling 90 days. The 30-day retention window has to be measured from each user's onboarding event, not from today. Joining only on user_id ignores the pre-login activity that lives under anonymous_id. And the denominator is a separate population from the numerator, so the rate is meaningless even if the rest were right.
Here is what an agent grounded in a semantic layer generates for the same prompt. The entity names (onboarding_completed, the retention event set, the stitched user_id) come from definitions that an analyst already approved. The agent does not invent them.
```sql
-- Governed SQL generated against semantic-layer entities.
WITH cohort AS (
  SELECT
    user_id, -- stitched identity from semantic layer
    MIN(event_time) AS onboarded_at
  FROM analytics.events
  WHERE event_name = 'onboarding_completed'
    AND event_time >= TIMESTAMP '2026-01-01 00:00:00'
    AND event_time < TIMESTAMP '2026-04-01 00:00:00' -- strict Q1 2026 boundaries
  GROUP BY user_id
),
retained AS (
  SELECT DISTINCT c.user_id
  FROM cohort c
  JOIN analytics.events e
    ON e.user_id = c.user_id
    AND e.event_time > c.onboarded_at + INTERVAL '1' DAY
    AND e.event_time <= c.onboarded_at + INTERVAL '30' DAY
  WHERE e.event_name IN ('app_opened', 'session_started') -- retention event set, defined once in the semantic layer
)
SELECT
  COUNT(DISTINCT c.user_id) AS cohort_size,
  COUNT(DISTINCT r.user_id) AS retained_users,
  COUNT(DISTINCT r.user_id) * 100.0
    / NULLIF(COUNT(DISTINCT c.user_id), 0) AS retention_30d_pct
FROM cohort c
LEFT JOIN retained r USING (user_id);
```

Each fix maps back to something the agent could only get right with semantic context. The cohort is a real event-derived population. The window is a strict calendar quarter. Retention is measured per user, in a 1-to-30 day window from that user's own onboarding moment.
The retention event set is a named definition that an analyst maintains, not a guess. And because the underlying query runs on the warehouse with the asking user's permissions, row-level filters (region, environment, customer segment) are applied automatically without anyone having to remember them.
Governance: the semantic layer as an access-control boundary
The first question every enterprise prospect asks when you say "AI can query your warehouse" is who controls what it can see. The honest answer is layered, and worth stating out loud because most posts skip it.
Layer one is the warehouse itself. Snowflake, BigQuery, Databricks, and Redshift already enforce row-level security, column masking, and role-based access. A well-designed agent runs SQL using the asking user's identity, which means it inherits those controls. It cannot bypass them by construction.
If a row would be filtered for that user in a SQL editor, it is filtered for them through the agent.
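One way to make "inherits those controls by construction" concrete is to make the end-user identity a required input to execution, so there is no code path that falls back to a shared service account. This is a sketch under that assumption; `WarehouseRunner` and the `connect` callable are hypothetical names, standing in for a real warehouse driver.

```python
from typing import Callable, Optional

class WarehouseRunner:
    """Sketch of execution under the asking user's identity.
    Names are hypothetical; a real connector wraps a warehouse driver."""

    def __init__(self, connect: Callable):
        self._connect = connect  # callable: user_id -> session with that user's roles

    def run(self, sql: str, user_id: Optional[str]):
        if not user_id:
            # Failing hard beats silently falling back to a shared service account.
            raise PermissionError("query requires an end-user identity")
        session = self._connect(user_id)  # RLS and masking apply inside this session
        return session.execute(sql)
```

The design choice is the `PermissionError`: an agent that cannot run a query anonymously cannot accidentally widen its access beyond the asking user's.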
Layer two is the semantic layer, and this is the part that earns the access-control boundary framing. Warehouse RBAC controls which rows return. The semantic layer controls which entities and metrics a role is even allowed to ask about. A growth manager can ask anything in their semantic scope; revenue tables outside that scope are not in the agent's vocabulary, so the question simply never produces a query that touches them.
This is more restrictive than warehouse RBAC alone, and it is also auditable: every generated SQL statement carries the semantic resolution that produced it, the asking user, and the timestamp.
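Scoping the agent's vocabulary per role can be as simple as a lookup that fails closed. A minimal sketch, with hypothetical role and entity names:

```python
# Hypothetical per-role semantic scopes: the entities a role may reference.
SEMANTIC_SCOPE = {
    "growth_manager": {"onboarding_completed", "app_opened", "session_started"},
    "finance_analyst": {"mrr", "invoice_paid"},
}

def resolve_entity(term: str, role: str) -> str:
    """Map a business term to a semantic entity, scoped by role."""
    scope = SEMANTIC_SCOPE.get(role, set())
    if term not in scope:
        # Out-of-scope terms never reach SQL generation at all.
        raise LookupError(f"{term!r} is not in the semantic scope for {role!r}")
    return term
```

A growth manager asking about `mrr` gets a resolution failure before any SQL exists, which is the stronger guarantee: the revenue tables are not merely filtered, they are outside the question surface entirely.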
Layer three is the explicit anti-pattern: copying warehouse data into a separate vector store, application database, or "AI cache" so the agent can answer faster. This is what kills enterprise deals. It breaks RBAC because the copy has its own permission model. It breaks audit because the agent now answers from a snapshot the warehouse cannot reproduce. It introduces a second source of truth for governed metrics. A warehouse-native agent does none of this. For the buyer-side rubric, see the natural language to SQL analytics checklist and ask questions about data without SQL: governance model.
What a production text-to-SQL agent for product analytics actually looks like
Both Meta's internal analytics agent and the publicly written-up reference implementations (the "talk to your database" pattern, Snowflake Cortex Analyst, the comparable agents reviewed in 2026 text-to-SQL roundups) converge on the same component set. The names differ; the layers do not.
- Warehouse connector. Reads schema and runs SQL using the asking user's existing permissions. No data copy, no parallel store.
- Semantic layer. Maps business language to entities, metrics, dimensions, and event definitions. Often layered on dbt or an equivalent governed source.
- Method layer. Encodes how each question class is operationalized: cohort entry, retention windows, funnel ordering, attribution rules. This is the layer naive text-to-SQL skips.
- NL-to-SQL engine. Generates SQL from intent, semantic resolution, and method. Usually combines retrieval over the semantic layer with constrained generation.
- Execution layer. Runs the SQL on the live warehouse with the user's identity. No staging, no copy.
- Transparency layer. Surfaces generated SQL, semantic resolution, and assumptions for analyst review.
- Monitoring layer. Tracks KPIs and alerts on anomalies through the same semantic definitions.
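In the simplest case, the transparency layer above reduces to an audit record attached to every answer: the generated SQL, the semantic resolution that produced it, the asking user, and a timestamp. A sketch with illustrative field names, not a real product schema:

```python
import json
from datetime import datetime, timezone

def audit_record(sql: str, resolution: dict, user: str) -> str:
    """Sketch of what a transparency layer attaches to every answer;
    field names are illustrative."""
    return json.dumps({
        "sql": sql,
        "semantic_resolution": resolution,
        "user": user,
        "generated_at": datetime.now(timezone.utc).isoformat(),
    }, sort_keys=True)
```

Because the record carries the semantic resolution and not just the SQL, a reviewer can see both what ran and why the agent believed each business term mapped the way it did.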
For a deeper look at how the semantic and method layers interact when generating SQL for funnels and cohorts, read agentic analytics and the semantic layer.
How to roll this out without losing trust
The teams that succeed with text-to-SQL on a warehouse share four habits. They start in one KPI domain with stable definitions, usually activation or conversion, before opening the agent up across the organization. They keep generated SQL visible by default, not on request, so analysts can review without ceremony. They route executive, financial, and externally shared metrics through analyst approval, regardless of how confident the agent looks.
And they review correction rate weekly, treating ambiguous prompts as semantic-layer bugs rather than user errors.
For a more complete buyer rubric, including pilot question packs and procurement scoring, pair this guide with the natural language to SQL analytics checklist and verified SQL and trust.
Sources
- Talk to Your Database: Building a Natural Language SQL Agent
- Inside Meta's Home Grown AI Analytics Agent
- The Text-to-SQL Performance Cliff (2026): Why Natural Language to SQL Breaks
- Snowflake Cortex Analyst: Evaluating Text-to-SQL Accuracy for Real-World BI
- Best Text to SQL Query Tools in 2026
FAQ
Can an AI agent query my warehouse safely?
Yes, when it executes SQL using the asking user's warehouse identity and answers from a semantic layer instead of from a copied data store. Row-level security, column masking, and role-based access are inherited from the warehouse, and the semantic layer narrows the question surface to entities the asking role is allowed to use.
Why does naive text-to-SQL fail on retention and funnel questions?
Retention and funnel questions require analytical methodology, not just metric definitions. A model that only sees raw schema cannot reliably pick cohort entry rules, time windows, ordering constraints, or user deduplication, so it produces SQL that runs but answers a different question than the one asked.
Does the agent see PII?
Only the PII the asking user is already allowed to see in the warehouse. Column masking, row-level security, and semantic-layer scoping all apply. A warehouse-native architecture is preferable here because it does not create a second copy of the data with its own permission model.
Can I review the SQL the agent generates?
In a trustworthy implementation the SQL is visible by default, alongside the semantic resolution that produced it and the assumptions the agent made. Reviewing SQL should be a routine workflow, not a special-case admin feature.
Try a warehouse-native AI analytics agent
Mitzu runs the architecture described above: a semantic layer grounded in your warehouse, governed text-to-SQL for product analytics questions, transparent generated SQL on every answer, and execution under the asking user's permissions. If you are evaluating how this fits into your stack, start at mitzu.io or book a demo.