Data engineering

February 28, 2024

10 mins

      Funnels with SQL: The good, the bad and the ugly way

      István Mészáros

Understanding your users' journey is more than just advantageous; it's critical for success. Funnel metrics are the linchpin of product analytics, offering a clear view of the user's pathway from initial engagement to conversion. They are the roadmap guiding businesses to identify bottlenecks, optimise user experiences, and make data-driven decisions.

During the development of Mitzu, we optimised our funnel query-building engine to support all our customers. In the process, we interviewed various data analysts to understand how they craft funnel queries in SQL.

      We have identified three ways data teams in companies write these queries:

      • The good
      • The bad
      • The ugly

      Each of these has specific characteristics that we will discuss in this post.

      The Setup

      Before we jump into SQL query writing, let us first explore the typical data upon which the data team executes these queries.

      Event dataset

The typical input is called a behavioural event dataset, or simply an event dataset. In this post, we are going to examine a simple funnel question:

      What is the daily ratio of users who visited our website and paid for something within a day?
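For reference, here is a minimal sketch of the events table that the queries below assume, with only the columns they actually use (real event tables usually carry many more properties):

create table events (
	user_id    varchar,    -- who performed the event
	event      varchar,    -- event name, e.g. 'page_visit' or 'payment'
	event_time timestamp   -- when the event happened
);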

      The Bad (naive approach)

Let's start with the worst of all. Why is it the bad one? Simply because it doesn't yield correct results.
Junior data analysts typically write this query. The main issue is that it doesn't connect the users who performed a "page visit" event and then a "payment" event.
It simply divides two unrelated counts, which can yield weird results: if 50 users pay on a given day but only 40 distinct users visit a page that day, the "conversion rate" exceeds 100%, which is an immediate red flag.

      
with page_visits as (
	-- daily count of distinct visitors
	select 
		date(event_time) as day, 
		count(distinct user_id) as user_count
	from events
	where event = 'page_visit'
	group by 1
),
payments as (
	-- daily count of distinct paying users
	select 
		date(event_time) as day, 
		count(distinct user_id) as user_count
	from events
	where event = 'payment'
	group by 1
)
      
      select 
      	t1.day, 
      	t2.user_count * 100 / t1.user_count as conversion_rate
      from page_visits t1
      join payments t2 using(day)
      

Yes, the query references both events, but the division happens between two disconnected aggregations.

When would I use such a query? Never. To give it credit, though, it is straightforward to write and extend, and it is the fastest of the three queries we describe.

Characteristic          Comment
Code complexity         very simple
Time complexity         O(n)
Space complexity        O(n)
Results                 inaccurate
Usage recommendation    never

      The Good (left joins)

Next in line comes "the good" query (as we call it). It connects users across their events and applies the correct funnel window filter.
We call it "the good" because it yields correct results; however, it is not a perfect solution, as you will see.

      
      with page_visits as (
      	select 
      		event_time, 
      		user_id
      	from events
      	where event = 'page_visit'
      ),
      payments as (
      	select 
      		event_time,
      		user_id
      	from events
      	where event = 'payment'
      )
      
select 
	date(t1.event_time) as day, 
	-- converted users divided by all users who visited that day
	count(distinct t2.user_id) * 100 / count(distinct t1.user_id) as conversion_rate
from page_visits t1
left join payments t2 on 
	t1.user_id = t2.user_id 
	and t2.event_time > t1.event_time
	and t2.event_time <= t1.event_time + interval '1' day
group by 1
      

The left join on user_id keeps every page visit and attaches the matching payment events, so converted users appear with a non-null t2.user_id. The two timestamp conditions ensure we attribute only the payments that fall within the one-day funnel window after the visit.

The main problem with this solution is the join itself. Imagine a different scenario where we look for a one-year funnel window and the number of events is in the billions. The JOIN will lead to O(n^2) time and space complexity, and the space complexity is the bigger problem.

The left JOIN causes O(n^2) space complexity, as it must compare every page-visit event to every payment event.

Why? Because any data warehouse or data lake's computation engine has limited memory. You can only increase your cluster's memory linearly, which eventually leads to resource exhaustion.
Exceeding cluster memory causes data to spill to disk, which results in extremely slow queries.
Therefore, I recommend this SQL query only for "smaller" datasets.

Characteristic          Comment
Code complexity         medium
Time complexity         O(n^2)
Space complexity        O(n^2)
Results                 precise
Usage recommendation    small to medium size datasets

      The Ugly (window functions)

The previous approach gave correct results but becomes slow on larger datasets. The third solution, "the ugly," is here to save us. The O(n^2) space complexity comes from the fact that the JOIN tries to match every "page visit" event with every "payment" event, even though a single payment event after a page visit is enough to count as a successful conversion.

The solution is to avoid joins and use UNION ALL and window functions instead. Modern SQL-based data warehouses and data lakes support window functions; they are easy to start with but tricky to master. The idea in the following query is to sort all events per user by their timestamp, so a page visit and the payments that follow it end up next to each other. The LEAD window function then finds the timestamp of the first payment event after each page visit. The key is that we look up only the first following payment event, not every one, as the JOIN does.
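In our setup, both event types already live in a single events table, so the all_events CTE below can simply select from it. If they lived in separate tables, you would union them first; a minimal sketch, assuming hypothetical page_visit_events and payment_events tables:

with all_events as (
	-- merge the two event streams into one dataset before applying window functions
	select user_id, 'page_visit' as event, event_time from page_visit_events
	union all
	select user_id, 'payment' as event, event_time from payment_events
)
select * from all_events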

      
      with all_events as (
      	select 
      		user_id,
      		event,
      		event_time
      	from events
      ),
      
conversion_times as (
	select 
		user_id,
		event,
		event_time as event_time_1,
		-- for page visits, find the timestamp of the first following payment
		case when event = 'page_visit'
			then lead(case when event = 'payment' then event_time end) ignore nulls 
					over (partition by user_id order by event_time asc)
			else null 
		end as event_time_2
	from all_events
),
      
filtered_conversions as (
	select 
		user_id,
		event_time_1,
		-- keep the user only if the payment happened within the one-day window
		case when event_time_2 <= event_time_1 + interval '1' day 
			then user_id 
			else null 
		end as converting_user_id
	from conversion_times
	where event = 'page_visit'
)

select 
	date(event_time_1) as day, 
	count(distinct converting_user_id) * 100 / count(distinct user_id) as conversion_rate
from filtered_conversions
group by 1
      

Finally, we filter by the funnel window and count the converted users. The IGNORE NULLS argument of the LEAD window function is essential, as it skips the rows in the sorted dataset that don't match the criteria (the non-payment events). Some data warehouses and lakes don't support it yet; in that case, you can replace the LEAD function with a MIN window function over the following rows, at the cost of slower query performance. This query should give you the same conversion rate as the JOIN-based query. The time complexity remains O(n^2), but the space complexity is only O(n), which is unlikely to cause a data spill in the cluster.
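As a rough sketch (assuming your engine supports explicit window frames), the event_time_2 expression in the conversion_times CTE could be rewritten without IGNORE NULLS like this:

case when event = 'page_visit'
	-- earliest payment timestamp among the rows after the current page visit
	then min(case when event = 'payment' then event_time end)
			over (partition by user_id
				order by event_time asc
				rows between 1 following and unbounded following)
	else null 
end as event_time_2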

Event conversion lookup with window functions: no need to match every event with every other.
Characteristic          Comment
Code complexity         can get very complex
Time complexity         O(n^2)
Space complexity        O(n)
Results                 precise
Usage recommendation    large datasets, millions or billions of events

      Conclusion

As you can see, there are multiple ways to write funnel SQL queries over an event dataset.
Some are good, some are bad, some are plain ugly. However, sometimes "ugly" leads to the best results.

In Mitzu, we apply the "ugly" funnel queries to all data lakes and data warehouses that support the LEAD window function with the IGNORE NULLS argument, as this query is the fastest and the least likely to cause issues in the cluster.
This approach opens the door to warehouse-native product analytics.
It refutes the argument that data warehouses can't query event datasets fast enough. We have seen companies with billions of events per month stored in their data warehouse run funnel queries within 3-4 seconds.

      In an upcoming blog post I will cover how to apply window functions to retention queries, so stay tuned!
