Back to Blog
Data Engineering

Identifying Users In The Data Warehouse

Connected components algorithm for user unification

Discover how to use the connected components algorithm to unify user aliases in a single table, enabling seamless tracking and analysis across all datasets.

István Mészáros
István Mészáros

Co-founder & CEO

November 2, 2023
5 min read
Identifying Users In The Data Warehouse

TL;DR

Discover how to use the connected components algorithm to unify user aliases in a single table, enabling seamless tracking and analysis across all datasets. User identification in data is a recurring problem for warehouse-first B2C companies.

Why User Identification is Complex in B2C Companies?

User identification in data is a recurring problem for warehouse-first B2C companies. A single user can be identified with multiple aliases in the datasets:

✅ user_id - ID after sign-in✅ device_id - Mobile or desktop device's ID✅ anonymous_id - Tracking cookie✅ customer_id - Stripe✅ contact_id - Hubspot✅ email - Used for sign-in... and many more.

The Risks of Using a Single Identifier for Users

Picking a single identifier for user identification is risky. You may end up:❌ not being able to use that identifier for all your users❌ multiple identifiers for the same user

Retroactive User Recognition: Merging IDs for Accurate Tracking

ℹ️ The correct solution is to merge all these IDs into a single merged ID and create a user alias table where all aliases are associated with their merged IDs.

This approach is called retroactive user recognition and it simplifies data modelling significantly. You won't have to think about which ID to use in joins. You can safely use the merged ID once they are backfilled to the all tables.

Example: How to Handle Multiple User Aliases in Clickstream Data

It is the simplest to explain this challenge with an example. Imagine you have a dataset that contains page visits of users.

It has four columns:

  • event_time
  • anonymous_id
  • email
  • url
Clickstream events table

The anonymous_id is always filled from the browser's cookie. However, it changes if the browser cookie is refreshed. The email is filled once the user logs into the application. However, it changes if the user logs in with another email.

Now, you need to answer a simple question:

Creating a User Alias Table: Simplifying Data Modeling with Merged IDs

As you can see, no single column can uniquely identify the user. In fact, all of its anonymous_ids and emails identify the users. The anonymous_id and the emails are only aliases for the same user. What you need is a user_alias table that contains all aliases.

The alias table is the association of all the users' aliases to a single merged_id that can be used to track that user across all datasets.

Example aliases table

Creating and backfilling all aliases where it is missing is the process of retroactive user recognition.

Why Building a User Alias Table is More Than Just SQL Joins?

Creating the user aliases table is way more complex than you think. Most people approach it with some special joins and group bys. However, building that alias table is impossible with only simple joins and group bys with SQL.

Why? Because the problem of user aliases is, in fact, a graph problem.

User with anonymous_id aid_1 signs in with email (john.doe@gmail.com). The browser clears the cookie at some point, so the user is also signed out. The user signs in again while having the new cookie. At another point, the user switches to the company email (john.doe@professional.com), which is typical for SaaS businesses.

We are still talking about the same user who has four aliases.

Here is the change graph of alias associations for our user:

User alias association graph

As you can see, none of the anonymous_ids or emails are an excellent identifier for the user. We need to define a new ID uniquely associated with the user. The best way to do this is to have the whole graph as the source of the unique ID.

Once we have a unique ID, we can backfill it to all events in the table and be able to support the two original requirements.

Walking the Graph: Using Recursive CTEs to Map User Aliases

To map out all the aliases, we must walk through all edges in this graph. Graph walking in SQL is only possible to do with recursive CTEs. In this example we are using Postgres SQL, with minor changes this SQL can ported to all SQL dialects.

First, we need to define the edges:

create table default.user_aliases as 
WITH RECURSIVE 
-- collecting the graph edges
edges as (
  select 
    anonymous_id as a,
    coalesce(email, anonymous_id) as b -- making sure users without sign up are also included
  from page_visits  
),

-- making sure the graph is bidirectional
bidirectional as (
  select a,
    b
  from edges
  union all
  select b as a,
    a as b
  from edges
)
create table default.user_aliases as 
WITH RECURSIVE 
-- collecting the graph edges
edges as (
  select 
    anonymous_id as a,
    coalesce(email, anonymous_id) as b -- making sure users without sign up are also included
  from page_visits  
),

-- making sure the graph is bidirectional
bidirectional as (
  select a,
    b
  from edges
  union all
  select b as a,
    a as b
  from edges
)
-- The connected components algorithm in SQL:
-- walking the graph to collect all aliases
walks AS (
  SELECT 
    a as a,
    a as b,
    1 as depth
  FROM bidirectional
  UNION
  SELECT w.a,
    e.b,
    depth + 1 as depth
  FROM walks w
    JOIN bidirectional e ON w.b = e.a
    and not (
      e.a = w.a
      and e.b = w.b
    )
    and not (
      e.b = w.a
      and e.a = w.b
    )
  -- Termination condition
  where depth < 10 
)
create table default.user_aliases as 
WITH RECURSIVE 
-- collecting the graph edges
edges as (
  select 
    anonymous_id as a,
    coalesce(email, anonymous_id) as b -- making sure users without sign up are also included
  from page_visits 
),

-- making sure the graph is bidirectional
bidirectional as (
  select a,
    b
  from edges
  union all
  select b as a,
    a as b
  from edges
),

-- The connected components algorithm in SQL:
-- walking the graph to collect all aliases
walks AS (
  SELECT 
    a as a,
    a as b,
    1 as depth
  FROM bidirectional
  UNION
  SELECT w.a,
    e.b,
    depth + 1 as depth
  FROM walks w
    JOIN bidirectional e ON w.b = e.a
    and not (
      e.a = w.a
      and e.b = w.b
    )
    and not (
      e.b = w.a
      and e.a = w.b
    )
  -- Termination condition
  where depth < 10 
)

-- picking the merged_id for all aliases
select 
  min(b) merged_id,
  a as alias
from walks
group by 2
create table default.user_aliases as 
WITH RECURSIVE 
edges as (
  select 
    anonymous_id as a,
    coalesce(email, anonymous_id) as b 
  from page_visits
  union all
  select 
    contat_id as a,
    email as b
  from hubspot_contacts -- extending with Hubspot
  union all
  select 
    customer_id as a,
    email as b
  from stripe_customers  -- extending with Stripe
),

bidirectional as (
  select a,
    b
  from edges
  union all
  select b as a,
    a as b
  from edges
),

walks AS (
  SELECT 
    a as a,
    a as b,
    1 as depth
  FROM bidirectional
  UNION
  SELECT w.a,
    e.b,
    depth + 1 as depth
  FROM walks w
    JOIN bidirectional e ON w.b = e.a
    and not (
      e.a = w.a
      and e.b = w.b
    )
    and not (
      e.b = w.a
      and e.a = w.b
    )
  -- Termination condition
  where depth < 10 
)

select 
  min(b) merged_id,
  a as alias
from walks
group by 2

Key Takeaways

  • Discover how to use the connected components algorithm to unify user aliases in a single table, enabling seamless tracking and analysis across all datasets.

About the Author

István Mészáros

Co-founder & CEO

LinkedIn: https://www.linkedin.com/in/imeszaros/

Co-founder and CEO of Mitzu. Passionate about product analytics and helping companies make data-driven decisions.

Share this article

Subscribe to our newsletter

Get the latest insights on product analytics.

Ready to transform your analytics?

See how Mitzu can help you gain deeper insights from your product data.

Get Started

How to get started with Mitzu

Start analyzing your product data in three simple steps

Connect your data warehouse

Securely connect Mitzu to your existing data warehouse in minutes.

Define your events

Map your product events and user properties with our intuitive interface.

Start analyzing

Create funnels, retention charts, and user journeys without writing SQL.