•

Transforming Product Analytics With Databricks And Mitzu

Discover how Databricks products and Mitzu.io work together for real-time insights and accurate revenue attribution.

István Mészáros

September 14, 2023

•

5 min read

Share this post

Clickhouse and Mitzu warehouse-native integration

Table of contents

Revolutionizing SaaS Analytics: From Billions of Events to Actionable Insights

Understanding user behavior is a critical component in the world of Software as a Service (SaaS), particularly in B2C contexts. The ability to analyze and interpret billions of user events can spell the difference between a stagnant product and an ever-evolving, user-centric service.

The Challenge: Processing 8 Billion User Events in B2C SaaS

Recently, I had the opportunity to address a complex product analytics problem involving the management and processing of a staggering 8 billion user events over a two-year period, generated by a leading B2C SaaS CAD company’s application.

The primary problem was that access to relevant insights from this amount of data was sluggish at best. Answering some simple questions for product team often took hours or even days.
‍

Streaming and Organizing Data: From JSON Files to S3 Data Lake

The data was streamed to an AWS S3 data lake in JSON-line files. Each file represented 15 minutes’ worth of application usage data and was methodically organized into distinct date partitions. This system allowed for daily iterative handling of the data.

Bronze layer partitions for the raw JSON-lines data

‍

Example for events stored in JSON-line files:


{"event_type":"application_opened", "event_time":"2023-06-01T16:11:54.815479Z", "user_id":"4c17...", "event_properties":{"first_start": true}}
{"event_type":"button_clicked", "event_time":"2023-06-01T16:12:14.721323Z", "user_id":"2ec5...", "event_properties":{"button_name": "save"}}
{"event_type":"button_clicked", "event_time":"2023-06-01T16:12:28.479144Z", "user_id":"d8cd...", "event_properties":{"button_name": "close"}}
{"event_type":"viewport_zoomed", "event_time":"2023-06-01T16:13:21.332156Z", "user_id":"aaf5...", "event_properties":{"direction": "down", "amount": 0.5}}
{"event_type":"application_closed", "event_time":"2023-06-01T16:16:01.123579Z", "user_id":"aaf5...", "event_properties":{}}
...
...


-- Databricks table Schema:

event_type: STRING,
event_time: TIMESTAMP,
user_id: STRING,
event_properties: MAP,
partition_cols: [date]

-- The day partition was represented on S3 as the "folder" prefix.
-- s3://our-bucket-where-we-stored-the-data/usage-events/date=2023-06-01/...

Event properties were compactly stored in a single flat JSON object. These properties were modelled as Spark MAP<string, string> types. This strategy allowed us to minimize the number of columns and maintain a clean data structure. In addition, it ensured the scalability of the number of events and it’s properties. The best part of this approach was that we kept every event in a single databricks table.

Transforming Raw Data Daily for Faster Queries

The data processing utilized Spark SQL capabilities to generate a single delta table from the continuously ingested JSON-line files. The delta table was partitioned by the event_type column, a step that greatly enhanced query efficiency (more about this later). Data for the delta table was appended daily from the most recent “date” partition from the raw dataset. Then it was optimized and z-ordered by the event_time column, which further increased table efficiency. And finally a vacuum operation was performed to keep everything tidy on S3.

Raw data processing for query efficiency


                          -- Optimiziation and vacuuming of the full table (daily):

OPTIMIZE silver.all_events ZORDER BY (event_time);
VACUUM silver.all_events;

Enhancing Query Efficiency with Partitioning and Optimization

Scalability was an inherent feature of this project. As the data grew, our system efficiently accommodated 500 individual event types, resulting in an equivalent number of partitions in the delta table. Having only a single table ensures minimal maintenance costs. We could allocate a single data engineer part-time to keep the system up and running.

This was all possible thanks to the MAP<string, string> type. The alternative option would have been to create 500 individual tables with different schemas. It is difficult to imagine maintaining that as easily as a single table.

Of course the MAP<string, string> comes with its drawbacks. Namely, there is no strong typing of map values. However, Spark is smart enough to figure out what to cast in case there is a comparison in the SQL queries over the event property values.

We also thought of creating 3 or 4 different maps each with its own data type, however in the end it wasn’t worth doing.

Speeding Up Insights: Sub-10 Second Query Performance

As it turns out, 90% of questions product managers ask cover one to three event types and only look back six months. Most of these questions are about daily usage of certain features or user funnel questions. Some queries may be more complex such as calculating retention which requires delicate joins. In spite of that, those were still completed within a minute.

Partitioning The Data Based On Event_type And Z-Ordering By Event_time Ensured That 75% Of Ad-Hoc Analytics Queries Finished In Under 10 Seconds Over The 8 Billion Rows Of Data.

With an ever-evolving product, “time to insight” is essential. Product teams ask hundreds of questions about users and their behaviour. If they needed to wait hours for every piece of information the company’s decision making could slow down to a halt.

Empowering Non-Technical Teams with Mitzu.io Product Analytics Tool

Having data in the data lakehouse is one thing, but having it accessible for everyone, even non-tech people, is another. To ensure everyone can access product data insights, we introduced Mitzu.io.

Mitzu is a self-serve, no-code product analytics tool that directly queries the data warehouse. It is an open-source project developed entirely in Python and Dash.

It proved to be an invaluable tool for us. It successfully interpreted the complex delta table schema and identified variations in event properties. Mitzu.io essentially gave us unified access to all our usage data without ever touching the SQL editor.

Retention calculation in Mitzu with Databricks Serverless SQL

Mitzu.io further reduced the “time to insight” for our stakeholders. With it stakeholders didn’t need to wait for data analysts to answer their questions, they could do it for themselves.

By facilitating native data warehouse querying via Databricks’ Serverless SQL the time taken to generate insights was reduced from hours or days to mere seconds.

Reducing Time-to-Insight with Self-Serve Product Analytics

In summary, this project represented a successful journey from a seemingly overwhelming 8 billion user events table to an easily accessible and insightful resource. This was achieved through the strategic use of Databricks features such as partitioning and z-ordering of the delta tables, and the power of analytics tools such as Mitzu.io.

The Modern Data Stack Advantage: In-Warehouse Product Analytics

With the emergence of the modern data stack we no longer need to move product data to external analytics tools such as Mixpanel or Amplitude. Everything can be done in place, in the data warehouse.

Leveraging Dash for Rapid Development of Mitzu.io Features

Mitzu.io is entirely written in Dash. It is an excellent example of what can be achieved with this open-source Python library. As a result, we could develop this project completely and iterate over features extremely fast due to its simplicity. The inbuilt charting and data grid solutions allowed us to ship data visualization features in just a couple of days. This otherwise would have taken weeks or months.

Kudos to the Dash team!

Blogs for your growth

View all

SaaS Data Analytics

Transforming Product Analytics With Databricks And Mitzu

Revolutionizing SaaS Analytics: From Billions of Events to Actionable Insights

The Challenge: Processing 8 Billion User Events in B2C SaaS

Streaming and Organizing Data: From JSON Files to S3 Data Lake

Transforming Raw Data Daily for Faster Queries

Enhancing Query Efficiency with Partitioning and Optimization

Speeding Up Insights: Sub-10 Second Query Performance

Empowering Non-Technical Teams with Mitzu.io Product Analytics Tool

Reducing Time-to-Insight with Self-Serve Product Analytics

The Modern Data Stack Advantage: In-Warehouse Product Analytics

Leveraging Dash for Rapid Development of Mitzu.io Features

Blogs for your growth

The First Data Expert in a Startup

Warehouse Native vs. First-Generation Product Analytics

Warehouse-Native Analytics: Transforming Travel Data Insights

How to get started?

Collect data
‍

Setup Mitzu
‍

Start making better decisions faster

Unbeatable solution for all of your analytics needs

Revolutionizing SaaS Analytics: From Billions of Events to Actionable Insights

The Challenge: Processing 8 Billion User Events in B2C SaaS

Streaming and Organizing Data: From JSON Files to S3 Data Lake

Transforming Raw Data Daily for Faster Queries

Enhancing Query Efficiency with Partitioning and Optimization

Speeding Up Insights: Sub-10 Second Query Performance

Empowering Non-Technical Teams with Mitzu.io Product Analytics Tool

Reducing Time-to-Insight with Self-Serve Product Analytics

The Modern Data Stack Advantage: In-Warehouse Product Analytics

Leveraging Dash for Rapid Development of Mitzu.io Features

Blogs for your growth

The First Data Expert in a Startup

Warehouse Native vs. First-Generation Product Analytics

Warehouse-Native Analytics: Transforming Travel Data Insights

How to get started?

Collect data‍

Setup Mitzu‍

Start making better decisions faster

Unbeatable solution for all of your analytics needs

Collect data
‍

Setup Mitzu
‍