István Mészáros
Google Analytics 4 (GA4), introduced in October 2020, is positioned as the successor to Universal Analytics (UA).
GA4 aimed to enhance privacy standards in data collection and incorporated more AI and machine learning than UA, offering potential benefits for marketers.
In a nutshell, people at Google wanted GA4 to become the next marketing and product analytics solution for any business.
Google announced that GA4 would replace Universal Analytics in July 2023, and historical data would not transfer from UA to GA4.
Two years later, it is painfully clear the transition was neither smooth nor well-received.
Here are the top complaints about GA4:
While the GA4 user interface may be sufficient for basic requirements, answering more complex questions can be impossible. Luckily, you can export all your GA4 data to BigQuery for free (up to a limit).
Here is a great description and video from Google on how to do it:
https://support.google.com/analytics/answer/9358801?hl=en
You can set up daily batch exports or use the streaming approach to BigQuery.
While the streaming approach requires you to share your credit card details, it solves the issue of data lag.
Your GA4 tracked events will appear in BigQuery almost immediately.
If data lag is not something you are concerned about, stick to daily batch exports. They are free.
Having your GA4 data in BigQuery comes with a lot of benefits. The obvious one is that you can connect Looker Studio or similar data tools to it.
However, there is a caveat. The GA4 data model is very complex. You must be an "expert" in SQL to answer any questions from your raw BigQuery data.
The main issue with the BigQuery GA4 data is the REPEATED type of the event_params column.
Repeated types are the arrays in BigQuery. The main issue with arrays in SQL is that you need to know the index of the element you want to access, and the position of a given parameter is not guaranteed to be the same from one event to the next.
The only sensible way to access the repeated types is by using the UNNEST function in BigQuery.
However, this multiplies the number of rows per event in the dataset, which is often impractical to query.
It also breaks a fundamental concept of event modeling in the data warehouse: every user interaction with the landing page or app you are tracking is modeled as a single row.
The single-row-per-event concept is essential for larger projects because it keeps the data easy to reason about for a data analyst or data scientist.
With multiple rows per event, you always have to keep deduplication in mind.
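To make the row multiplication concrete, here is a minimal Python sketch that imitates what unnesting does to a GA4-style export. The sample events, their values, and the `unnest` helper are made up for illustration. Only the general shape (an array of key/value records per event) follows the export.

```python
# Hypothetical sample rows shaped like a GA4 export; all values are invented.
events = [
    {
        "event_name": "page_view",
        "event_params": [
            {"key": "page_location", "value": {"string_value": "https://example.com/"}},
            {"key": "ga_session_id", "value": {"int_value": 1001}},
            {"key": "engagement_time_msec", "value": {"int_value": 250}},
        ],
    },
    {
        "event_name": "click",
        "event_params": [
            # Note: ga_session_id sits at index 0 here but at index 1 above.
            {"key": "ga_session_id", "value": {"int_value": 1001}},
            {"key": "link_url", "value": {"string_value": "https://example.com/pricing"}},
        ],
    },
]


def unnest(rows):
    """Emit one output row per (event, parameter) pair, as unnesting an array does."""
    flattened = []
    for row in rows:
        for param in row["event_params"]:
            flattened.append(
                {
                    "event_name": row["event_name"],
                    "param_key": param["key"],
                    "param_value": param["value"],
                }
            )
    return flattened


flat_rows = unnest(events)

# Counting rows after unnesting overstates the number of events.
naive_page_views = sum(1 for r in flat_rows if r["event_name"] == "page_view")
true_page_views = sum(1 for e in events if e["event_name"] == "page_view")

print(f"{len(events)} events became {len(flat_rows)} rows")
print(f"naive page_view count: {naive_page_views}, actual: {true_page_views}")
```

The single page view is counted once per parameter after unnesting, which is exactly the deduplication burden described above.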
A widely accepted data model for product or landing page events is called one big table (OBT). GA4 data already follows this model.
Dealing with event properties in REPEATED types can be a real headache in SQL. As discussed, unnesting these repeated types breaks the fundamental concept of one event per row.
Currently, the only viable solution in BigQuery for avoiding REPEATED types while keeping the single-row-per-event concept is to model the properties that are initially stored in REPEATED types as a JSON column.
Why? The keys of the JSON object are the names of the event properties, and the JSON values are the corresponding property values.
This way, a property value is addressed not by its index in an array (REPEATED type) but by the property's name.
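The idea can be sketched in a few lines of Python. This is not the transformation code the article refers to, only an illustration of the mapping under stated assumptions: the sample event is invented, `param_value` and `params_to_json` are hypothetical helpers, and each parameter value is assumed to carry at most one populated typed field (`string_value`, `int_value`, `float_value`, or `double_value`).

```python
import json

# Typed fields assumed to be present on each GA4-style parameter value record.
VALUE_FIELDS = ("string_value", "int_value", "float_value", "double_value")


def param_value(value):
    """Return the first populated typed field of a value record, or None."""
    for field in VALUE_FIELDS:
        if value.get(field) is not None:
            return value[field]
    return None


def params_to_json(params):
    """Collapse an array of key/value records into a single JSON object string."""
    return json.dumps({p["key"]: param_value(p["value"]) for p in params}, sort_keys=True)


# Hypothetical event in the original, array-based shape.
event = {
    "event_name": "page_view",
    "event_params": [
        {"key": "page_location", "value": {"string_value": "https://example.com/"}},
        {"key": "ga_session_id", "value": {"int_value": 1001}},
    ],
}

# Still one row per event; the parameters now live in one JSON column.
transformed = {
    "event_name": event["event_name"],
    "event_params": params_to_json(event["event_params"]),
}

properties = json.loads(transformed["event_params"])
print(transformed["event_params"])
print(properties["ga_session_id"])  # addressed by name, not by array index
```

The event remains a single row, and the order of the parameters in the original array no longer matters.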
More about event modeling concepts here: Modeling events in a data warehouse
One caveat is that this model change requires a periodic data transformation.
The easiest way is to create a new daily table containing all the event and user properties in JSON columns.
The code below transforms the REPEATED types of varying lengths to JSON columns instead, preserving the single row per event concept while making the properties accessible not by an index but by their name.
Visit GitHub for the most up-to-date code.
The new schema will look like this:
The transformed data will look like this:
Once the transformation is ready, you can access the events and their properties in a human-readable way. It is true that you now need an extra built-in function, such as JSON_VALUE, to read a property.
This is something that should be addressed by folks at BigQuery. Other data warehouse solutions support JSON or MAP types natively.
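As a rough illustration of querying the transformed model, the Python sketch below counts page views per page by looking a property up by name. The rows are invented, and `json_value` is a hypothetical stand-in for the warehouse's JSON extraction function, not a real BigQuery call.

```python
import json
from collections import Counter

# Hypothetical transformed rows: one row per event, parameters in a JSON string.
rows = [
    {"event_name": "page_view",
     "event_params": '{"ga_session_id": 1001, "page_location": "https://example.com/"}'},
    {"event_name": "page_view",
     "event_params": '{"ga_session_id": 1002, "page_location": "https://example.com/pricing"}'},
    {"event_name": "click",
     "event_params": '{"ga_session_id": 1001, "link_url": "https://example.com/pricing"}'},
    {"event_name": "page_view",
     "event_params": '{"ga_session_id": 1003, "page_location": "https://example.com/"}'},
]


def json_value(json_text, key):
    """Look a property up by name; returns None when the key is absent."""
    return json.loads(json_text).get(key)


# One row per event, so a plain count needs no deduplication step.
page_views = Counter(
    json_value(row["event_params"], "page_location")
    for row in rows
    if row["event_name"] == "page_view"
)

for page, count in page_views.most_common():
    print(page, count)
```

Because every event is still a single row, the aggregate is a plain filter and count, with one extra function call per property.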
The transformed model is now sufficient for answering complex questions.
A warehouse-native product analytics solution such as Mitzu can easily create SQL queries over a modeled dataset with JSON columns.
Mitzu provides a clean and easy-to-understand UI for the GA4 events from BigQuery.
Creating funnel insights or measuring visitor retention this way becomes a simple task.
As a data analyst or data scientist, you don't need to worry about how to write the SQL queries.
See how you can benefit from warehouse-native product analytics