Identifying Users In The Data Warehouse
User identification in data is a recurring problem for warehouse-first B2C companies. A single user can be identified with multiple aliases in the datasets:
✅ user_id - ID after sign-in
✅ device_id - Mobile or desktop device's ID
✅ anonymous_id - Tracking cookie
✅ customer_id - Stripe
✅ contact_id - Hubspot
✅ email - Used for sign-in
... and many more.
Picking a single identifier for user identification is risky. You may end up:
❌ not being able to use that identifier for all your users
❌ multiple identifiers for the same user
ℹ️ The correct solution is to merge all these IDs into a single merged ID and create a user alias table where all aliases are associated with their merged IDs.
This approach is called retroactive user recognition and it simplifies data modelling significantly.
You won't have to think about which ID to use in joins. You can safely use the merged ID once they are backfilled to the all tables.
Retroactive User Recognition Example
It is the simplest to explain this challenge with an example. Imagine you have a dataset that contains page visits of users.
It has four columns:
- event_time
- anonymous_id
- url
The anonymous_id is always filled from the browser's cookie. However, it changes if the browser cookie is refreshed. The email is filled once the user logs into the application. However, it changes if the user logs in with another email. Now, you need to answer a simple question:
What was the first URL that a user visited?
As you can see, no single column can uniquely identify the user. In fact, all of its anonymous_ids and emails identify the users. The anonymous_id and the emails are only aliases for the same user. What you need is a user_alias table that contains all aliases. The alias table is the association of all the users' aliases to a single merged_id that can be used to track that user across all datasets.
Creating and backfilling all aliases where it is missing is the process of retroactive user recognition.
Creating The User Aliases Table
Creating the user aliases table is way more complex than you think. Most people approach it with some special joins and group bys. However, building that alias table is impossible with only simple joins and group bys with SQL.
Why? Because the problem of user aliases is, in fact, a graph problem.
User with anonymous_id aid_1 signs in with email (john.doe@gmail.com). The browser clears the cookie at some point, so the user is also signed out. The user signs in again while having the new cookie. At another point, the user switches to the company email (john.doe@professional.com), which is typical for SaaS businesses. We are still talking about the same user who has four aliases.
Here is the change graph of alias associations for our user:
As you can see, none of the anonymous_ids or emails are an excellent identifier for the user. We need to define a new ID uniquely associated with the user. The best way to do this is to have the whole graph as the source of the unique ID.
Once we have a unique ID, we can backfill it to all events in the table and be able to support the two original requirements.
Walking The Graph
To map out all the aliases, we must walk through all edges in this graph. Graph walking in SQL is only possible to do with recursive CTEs.
In this example we are using Postgres SQL, with minor changes this SQL can ported to all SQL dialects.
First, we need to define the edges:
Once we have, we must apply the so-called "connected components" algorithm to walk through the graph edges and detect if the nodes are part of the same sub-graph.
Here is the result of the Walks CTE. You can see which node (a) are reachable from the nodes (b) in the graph. Of course in our single example every node is reachable.
We need some additional cleaning and finalization:
Below, you can see what our final user_aliases table will look like (notice the merged_id is one of the aliases):
Now, you can join this table from the events and do a backfill.
Supporting All Tables
In our example, we had only a table and two user alias types. The added benefit of this approach is that it is extendable to any number of tables and aliases. Extendibility is the beauty of the connected components algorithm.
Here is an example of how a typical SaaS company would use this algorithm to recognize all of its users retroactively. In this example, we have:
- Hubspot contact_id and email pairs
- Stripe customer_id and email pairs
- Product anonymous_id and email pairs
By adding the additional edges, we will have a more complete alias table with Stripe customer IDs and Hubspot contact IDs.
The connected components covering all these IDs enable the data team to join all datasets.
Joining across all data sets from different domains unlocks access to the entire user lifecycle:
- Landing page visit
- Sign up
- Engagement with the product
- Subscription started (e.g. in Stripe)
- The sales team contacted the person for upselling
- Subscription cancelled
Here is an example how to extend this SQL to support different datasources. With this creating the user aliases table has never been easier before:
Final Thoughts
With the help of the connected components algorithm, we can map out correctly all the aliases of every user into a single table. We can then use this table to backfill all event (fact) tables with the merged_id.
Once we backfill the merged IDs to the event tables, we can easily track the users across all datasets. With that, a whole new world of data analytics reveals itself as joining across datasets becomes a simple exercise.