Funnels with SQL: The good, the bad and the ugly way
Understanding your users' journey is more than just advantageous. It's critical for success. Funnel metrics are the linchpin of product analytics, offering a clear view of the user's pathway from initial engagement to conversion. They are the roadmap that guides businesses to identify bottlenecks, optimise user experiences, and make data-driven decisions.
During the development of Mitzu, we optimised our funnel query-building engine to support all our customers. In this process, we interviewed various data analysts to understand how they craft funnel queries in SQL.
We identified three ways data teams write these queries:
- The good
- The bad
- The ugly
Each of these has specific characteristics that we will discuss in this post.
The Setup
Before we jump into writing SQL, let's first look at the typical data these queries run on.
This dataset is called a behavioural event dataset, or simply an event dataset. In this post, we will examine a simple funnel question:
What is the daily ratio of users who visited our website and paid for something within a day?
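Throughout this post, I'll assume a minimal hypothetical events table in which each row records a single user action (the table and column names are illustrative assumptions, not part of any standard):

```sql
-- Hypothetical event dataset layout (names are illustrative)
CREATE TABLE events (
    user_id    VARCHAR,    -- who performed the action
    event_name VARCHAR,    -- e.g. 'page_visit' or 'payment'
    event_time TIMESTAMP   -- when the action happened
);
```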
The Bad (naive approach)
Let's start with the worst of all. Why is it the bad one? Simply because it doesn't yield correct results.
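A minimal sketch of the naive approach against that hypothetical events table might look like this:

```sql
-- Naive funnel: divides two disconnected daily aggregates.
-- Visitors and payers are never matched by user_id, so the
-- ratio can exceed 100%.
SELECT
    DATE(event_time) AS day,
    COUNT(DISTINCT CASE WHEN event_name = 'payment'    THEN user_id END) * 100.0
  / COUNT(DISTINCT CASE WHEN event_name = 'page_visit' THEN user_id END)
    AS conversion_rate_pct
FROM events
GROUP BY 1;
```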
Junior data analysts typically write this query. The main issue is that it doesn't connect the users who performed a "page visit" event with those who performed a "payment" event.
It simply divides two numbers, which can yield strange results. Often the conversion rate exceeds 100%, which should already raise concern.
Yes, the query references both events, but the division is between two disconnected aggregations.
When would you use such a query? Never. To give credit where it's due, though, it is straightforward to write and extend.
It is also the fastest of the three queries we describe.
The Good (left joins)
Next in line comes "the good" query (as we call it). This query connects users across their events and applies the correct funnel window filter.
It is "the good" solution because it yields correct results. However, it is not perfect, as you will see.
The LEFT JOIN with the user_id condition ensures we count only the converted users, and the timestamp condition ensures we attribute conversions only to the correct funnel window.
The main problem with this solution is the join itself. Imagine a different scenario where we look for a one-year funnel window and the number of events is in the billions. The JOIN can then approach O(n^2) space and time complexity, and the space complexity is the bigger problem.
Why? Because every data warehouse or data lake computation engine has limited memory. You can only increase cluster memory linearly, so a quadratically growing intermediate result eventually exhausts it.
Exceeding cluster memory causes data to spill to disk, which results in extremely slow queries.
Therefore, I recommend using this SQL query only for "smaller" datasets.
The Ugly (window functions)
The previous approach gave correct results but became slow on larger datasets. The third solution, "the ugly," is here to save us. The O(n^2) space complexity comes from the fact that we try to match every "page visit" event with every "payment" event; at least, that is what the JOIN produces. However, a single payment event after a page visit is enough to count as a successful conversion.
The solution is not to use joins but UNION ALL and window functions. Modern SQL-based data warehouses and data lakes support window functions; they are easy to start with but tricky to master. In the following query, the idea is to sort the unioned events by their timestamp so that each user's events appear in chronological order. Then, a clever use of the LEAD window function finds the timestamp of the first payment event after each page visit. The key is to look up only the first following payment event, not every one, as the JOIN does.
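Here is a sketch of that query (the LEAD ... IGNORE NULLS syntax follows the form used by engines such as Trino and Snowflake; both UNION ALL branches read the same hypothetical events table here, though in practice each funnel step might live in a different table):

```sql
-- Window-function funnel: no join, linear space.
WITH funnel_events AS (
    SELECT user_id, event_name, event_time
    FROM events
    WHERE event_name = 'page_visit'
    UNION ALL
    SELECT user_id, event_name, event_time
    FROM events
    WHERE event_name = 'payment'
),
with_next_payment AS (
    SELECT
        user_id,
        event_name,
        event_time,
        -- Timestamp of the FIRST payment by the same user after
        -- this row; IGNORE NULLS skips the non-payment rows.
        LEAD(CASE WHEN event_name = 'payment' THEN event_time END)
            IGNORE NULLS
            OVER (PARTITION BY user_id ORDER BY event_time)
            AS next_payment_time
    FROM funnel_events
)
SELECT
    DATE(event_time) AS day,
    COUNT(DISTINCT CASE
              WHEN next_payment_time < event_time + INTERVAL '1' DAY
              THEN user_id
          END) * 100.0
  / COUNT(DISTINCT user_id) AS conversion_rate_pct
FROM with_next_payment
WHERE event_name = 'page_visit'
GROUP BY 1;
```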
Finally, we filter on the funnel window and count the converted users. The IGNORE NULLS argument of the LEAD window function is essential, as it skips the rows in the sorted dataset that don't match the criteria. Some data warehouses and lakes don't support it yet; in that case, you can replace the LEAD function with a framed MIN, at the cost of slower query performance. This query yields the same conversion rate as the JOIN-based query. The time complexity may still be O(n^2) in the worst case, but the space complexity is only O(n), which is unlikely to cause a data spill in the cluster.
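As a sketch of that fallback, the LEAD expression in the with_next_payment step above could be swapped for a framed MIN window aggregate (an assumed replacement, typically slower because the frame is re-evaluated per row):

```sql
-- Fallback: earliest following payment per user, computed with
-- a window frame instead of LEAD ... IGNORE NULLS.
MIN(CASE WHEN event_name = 'payment' THEN event_time END)
    OVER (PARTITION BY user_id
          ORDER BY event_time
          ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING)
    AS next_payment_time
```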
Conclusion
As you can see, there are multiple ways to write funnel SQL queries over an event dataset.
Some are good, some are bad, and some are plain ugly. However, sometimes "ugly" yields the best results.
In Mitzu, we apply the "ugly" funnel queries to every data lake and data warehouse that supports the LEAD window function with the IGNORE NULLS argument, as this query is the fastest and is the least likely to cause issues in the cluster.
This approach opens the door for warehouse-native product analytics.
It nullifies the argument that data warehouses can't query event datasets fast enough. We have seen companies with billions of events per month stored in their data warehouse execute funnel queries within 3-4 seconds.
In an upcoming blog post, I will cover how to apply window functions to retention queries, so stay tuned!