István Mészáros
In the realm of data analytics, warehouse-native product analytics platforms, such as Mitzu, are revolutionizing how companies leverage their data warehouses, including those powered by Trino. By directly interfacing with Trino, Mitzu eliminates the need for traditional reverse ETL processes, enabling real-time analytics directly on top of existing data infrastructure. This approach ensures that businesses can harness the full power of their data without duplicating it across disparate systems. The integration between Mitzu and Trino offers unprecedented scalability, flexibility, and speed in querying vast datasets, making it easier for organizations to find actionable insights from their product data.
This blog post delves into how leveraging Mitzu over Trino for product analytics can transform data strategies by providing more profound, more actionable insights with significantly reduced overhead and complexity.
In the ever-evolving landscape of big data analytics, Trino stands out as a powerful distributed SQL query engine designed for interactive query performance over large-scale data lakes. Originally developed by Facebook and known as PrestoSQL, it was rebranded to Trino to avoid confusion with PrestoDB. Trino facilitates querying data where it lives, including in Hadoop, S3, Azure Blob Storage, Google Cloud Storage, and many relational database systems, without data movement or transformation.
Trino's architecture is built for speed and efficiency, employing a low-latency, high-performance engine that supports a variety of data sources through its connector framework. This makes it an ideal tool for data analysts, scientists, and engineers who require real-time insights across different data stores. Its ability to execute analytical queries against various data sources with ANSI SQL support allows for complex data analysis and aggregation without needing separate data pipelines.
This post will look at Trino as a classic data lake solution. Trino has many great features, like joining between two completely different data sources or functioning as an ELT/ETL solution. However, those features are not relevant for product data analytics.
As I have discussed in this blog post, there are only three ways to model product events in a data lake or warehouse.
I am not going to cover the benefits and drawbacks here. Instead, I chose the single event table model and showcased warehouse native product analytics with an example.
Trino has a three-level namespace system where your product event tables sit in a single catalog under a single Schema.
These product event tables may be partitioned on the storage side (e.g., S3).
Ideally, there is only one partition column, DATE. Having multiple date-type partition columns in a table, like YEAR, MONTH, or DAY, is impractical. It makes SQL query development a pain for data analysts.
Another essential thing to mention is mandatory columns. Each product event table has a user identifier and event time columns. While the name of the table represents the name of the action the user performed in the product. It is essential to have all mandatory columns be named the same, e.g., USER_ID, CLIENT_EVENT_TIME.
Traditionally, there were two ways of learning about your product usage with Trino:
Both options are mediocre at best. 3rd party solutions are costly as they need to copy all your data. Besides that, their insights often diverge from your SQL-based insights.
Product analytics tools, however, are essential to a data-driven organization as they enable the non-tech folk (product managers, marketing, sales, customer success) to understand product usage behavior.
There is a new and emerging way for product analytics that is best described as warehouse-native. This approach kind of merges the best of the two worlds:
On top of this, they are significantly cheaper than 3rd party solutions (Amplitude, Mixpanel).
These solutions achieve cost efficiency by not copying your data. The warehouse-native product analytics solution is a thin user interface over your data lake. Last but not least, data security is not compromised.
Mitzu is one of the leading warehouse native product analytics solutions.
It enables the aforementioned non-tech folk to learn directly from the data lake the following insights:
Besides that, it has excellent charting capabilities:
Users of Mitzu can interact with the data in Trino as they would interact with data in Amplitude or Mixpanel. Each click shows a new learning about the users of your product.
Creating user cohorts, dashboards and showing user properties (like email or origin country) is always one click away.
As your company doesn't have to set up a reverse ETL pipeline or any other data ingestion mechanism to 3rd party tools, Mitzu can run on a fixed price model.
It doesn't care about your monthly tracked users or events. Typical warehouse-native applications have either seat-based pricing or fixed pricing models.
Trino is a robust and capable technology. It optimizes the generated SQL queries sufficiently, so a typical product analytics query runs almost as fast as it would run in Amplitude or Mixpanel over their proprietary storage solution.
Integrating Mitzu with Trino for a typical data lake that has, for example, 500 event tables takes around 5 minutes.
You only need to create a user in Trino that has permission to access (read-only) the tables in the catalog.
Once the user is ready, you must share the credentials with Mitzu. Remember, you control what permissions you give to your user.
Select the 500 tables containing your product usage events as a second step. Choose the user identification and event time columns and, optionally, the date partition (if you have those).
As the last step, click "Save and Update Catalog". This step should take a minute or two.
After this step, head to the Explore page and start understanding your users.
In this post, I wanted to showcase how you can benefit from using Trino and a warehouse-native product analytics solution like Mitzu.
Integrating Mitzu with your Trino data lake only takes 5 minutes, and you can control which tables it can reach and which can't. After the setup is ready, you and your non-tech colleagues, like product managers, marketing specialists, and sales representatives, can access the data lake in a self-served way without ever learning SQL.
See how you can benefit from warehouse native product analytics