István Mészáros
Collecting, storing, and modeling events in a data lake or warehouse has many benefits.
However, the biggest benefit of all is that event-based data models form a framework in your data warehouse that can support all business analytics use cases.
In this post, I will cover three models for storing events in the data warehouse and the factors you should consider before deciding which model fits your use case.
We have seen all three of these models while integrating Mitzu with customer data warehouses.
Three factors are at play in each of these event models: ease of maintenance, query performance, and clarity for analysts. As you will soon see, there is no silver-bullet solution; which one you should choose depends on your business.
While working with customers at Mitzu, we identified three major event models that most companies use.
I am sure there are more models out there, but these three come most naturally.
A single table per event is the model that most likely comes to mind first for a data engineer.
There will be precisely one table in the data warehouse for every event type the client application produces.
The strength of this model lies in its simplicity. It is easy to explain and, for the most part, easy to query. Every table has its own set of columns corresponding to the event's properties.
It is also fast to query, as each table can be partitioned or indexed by the event timestamp, and the table name describes the event type precisely.
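A minimal sketch of this model (table and column names are illustrative):

```sql
-- One table per event type; each table carries only its own properties.
CREATE TABLE event_page_view (
    user_id    VARCHAR,
    event_time TIMESTAMP,      -- natural partition / index key
    page_url   VARCHAR,
    referrer   VARCHAR
);

CREATE TABLE event_purchase (
    user_id    VARCHAR,
    event_time TIMESTAMP,
    product_id VARCHAR,
    amount_usd DECIMAL(10, 2)
);

-- Querying is straightforward: the table name is the event type.
SELECT COUNT(*)
FROM event_purchase
WHERE event_time >= DATE '2024-01-01';
```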
The problem comes when the client application produces hundreds or thousands of different event types.
Just imagine a complex mobile application or a game. Maintaining that many tables in modeling infrastructure such as dbt becomes increasingly challenging.
Renaming an event property (a column) or fixing column values across each of the 1,000 tables is a daunting task.
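To make that concrete, here is a hedged sketch of what a single property rename means under this model (table names are illustrative, and the exact ALTER syntax varies by warehouse):

```sql
-- Renaming one shared property means one migration per affected table...
ALTER TABLE event_page_view RENAME COLUMN referrer TO referrer_url;
ALTER TABLE event_sign_up   RENAME COLUMN referrer TO referrer_url;
ALTER TABLE event_purchase  RENAME COLUMN referrer TO referrer_url;
-- ...and so on for every remaining event table that carries the property.
```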
The wide one-big-table model is the natural answer to the disadvantages of the table-per-event model.
If the data team knows that the client application will produce 1,000 different event types, they can decide to store all of them in a single giant table.
The upside of this event model is that the data team can maintain it relatively easily compared to 1,000 separate event tables.
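A hedged sketch of the wide variant (column names are illustrative); every event type shares one schema, so any given row uses only a handful of the columns:

```sql
-- One wide table for all event types; event_name tells the rows apart.
CREATE TABLE events (
    user_id     VARCHAR,
    event_time  TIMESTAMP,
    event_name  VARCHAR,       -- 'page_view', 'purchase', ...
    -- page_view properties:
    page_url    VARCHAR,
    referrer    VARCHAR,
    -- purchase properties:
    product_id  VARCHAR,
    amount_usd  DECIMAL(10, 2)
    -- ...and hundreds more columns, one per distinct event property
);
```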
However, this solution has multiple issues, rendering it impractical.
The Problems:
The biggest downside of the wide one-big-table model is that, for any given event type, most of the thousand columns are filled with NULLs, and SQL dialects offer no option to ignore null-filled columns.
This makes the model impractical for data analysts to understand and use; determining which columns contain values for which event type is a constant struggle.
The narrow one-big-table model is a variation of the wide one-big-table model that solves this clarity problem.
In this model, we still store every event type in the same table, but we keep the individual event properties in a single column with a dynamic type such as JSON (or, in some cases, MAP).
Modeling event properties with a JSON column works well because every event type can have its own schema inside the same table.
The downside of this solution is that data analysts need to use JSON operations in every SQL query they write. This requires some education, and it decreases query performance, as filtering on JSON columns always requires extra CPU time for parsing.
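A hedged sketch of the narrow variant and a typical query against it (JSON_VALUE is shown here, but the exact extraction function and column type vary by warehouse, e.g. properties->>'page_url' in PostgreSQL or a VARIANT column in Snowflake):

```sql
-- One narrow table; per-event properties live in a single JSON column.
CREATE TABLE events (
    user_id    VARCHAR,
    event_time TIMESTAMP,
    event_name VARCHAR,
    properties JSON            -- schema varies per event type
);

-- Every query that touches an event property needs a JSON extraction.
SELECT COUNT(*)
FROM events
WHERE event_name = 'page_view'
  AND JSON_VALUE(properties, '$.page_url') LIKE '%/pricing%';
```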
Most companies will choose one of the three models that I described above. Here is a quick guide on which to choose for your scenario.
Final Comparison Table

| Model | Maintenance | Analyst experience | Query performance |
| --- | --- | --- | --- |
| Table per event | Hard with hundreds or thousands of event types | Easy to understand and query | Fast; partitioned or indexed by event timestamp |
| One big table (wide) | Relatively easy | Mostly-NULL columns are a constant struggle | – |
| One big table (narrow, JSON) | Relatively easy | Clear per-event schemas, but every query needs JSON operations | Slower; JSON filtering costs extra CPU |