# Overview

## Context

Google Analytics 4 event data has a schema which can be extremely difficult to work with, even for SQL experts. All automatic and custom parameters and properties (e.g. `event_params`, `user_properties`) are returned in `ARRAY`s of complex key-value `STRUCT` pairs, which require prior knowledge of the names and data types of the parameters and properties. This is further complicated by the fact that the automatic parameters can change over time, and the data types are not always consistent.
### Example: parameter data type inconsistency

An example is the `ga_session_id`, which is typically an `INT64`, but which we have observed in a number of streams as a `STRING`. This means that extracting the `INT64` value using the following example query - which should return all raw event data plus the `ga_session_id` as a separate column - would lose data whenever the value was returned as a `STRING`.
```sql
SELECT *,
  (SELECT event_param.value.int_value
   FROM UNNEST(event_params) AS event_param
   WHERE event_param.key = 'ga_session_id') AS ga_session_id
FROM `project_id.analytics_##########.events_2023*`
```
This would result in under-reporting of session counts.
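A type-safe alternative is to read both typed value fields and cast through `STRING`, so that values landing in either field are preserved. The following is a minimal sketch (not the project's actual extract function), assuming the standard GA4 export schema:

```sql
SELECT *,
  (SELECT SAFE_CAST(COALESCE(
            CAST(event_param.value.int_value AS STRING),  -- normal INT64 case
            event_param.value.string_value                -- misconfigured STRING case
          ) AS INT64)
   FROM UNNEST(event_params) AS event_param
   WHERE event_param.key = 'ga_session_id') AS ga_session_id
FROM `project_id.analytics_##########.events_2023*`
```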
## Approach

### Profiling

The approach we take, in order to enable users to remodel their raw data into a usable structure, is to run a set of profiler functions over the raw data to determine:

- `event_name`: unique values
- `event_params`: unique names, data types and data type occurrences
- `user_properties`: unique names, data types and data type occurrences

Neither the `ecommerce` nor the `items` fields are included in the profiler at this stage.
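The profiler essentially unnests each key-value array and counts which typed value field is populated. A minimal sketch for `event_params` (assuming the standard GA4 export schema; the actual profiler functions may differ):

```sql
-- Count observed data types per event_params key.
SELECT
  event_param.key,
  CASE
    WHEN event_param.value.string_value IS NOT NULL THEN 'STRING'
    WHEN event_param.value.int_value IS NOT NULL THEN 'INT64'
    WHEN event_param.value.float_value IS NOT NULL
      OR event_param.value.double_value IS NOT NULL THEN 'FLOAT64'
  END AS observed_data_type,
  COUNT(*) AS occurrences
FROM `project_id.analytics_##########.events_2023*`,
  UNNEST(event_params) AS event_param
GROUP BY 1, 2
ORDER BY 1, 3 DESC
```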
### Data Type Selection

In order to avoid potential data loss due to data type inconsistency, the profiling stage identifies observed data types and occurrence counts. This enables identification of any data type inconsistencies, which can arise because implementation is prone to human error. If there are inconsistencies, we can deploy custom code for these edge cases in line with the BigQuery conversion rules.

The potential data conversion rules between the different possible data types (using the `CAST` or `SAFE_CAST` functions) are:

```mermaid
flowchart TB
    INT64 --> STRING
    FLOAT64 --> STRING
    INT64 --> FLOAT64
```

The logic to define output data types for each parameter or property is governed by these relationships and the data types identified in the `event_profile`.
| Observed Data Type(s) | Potential Output Data Type |
|---|---|
| `STRING` | `STRING` |
| `INT64` | `INT64` |
| `FLOAT64` | `FLOAT64` |
| `STRING`, `INT64` | `STRING` |
| `STRING`, `FLOAT64` | `STRING` |
| `INT64`, `FLOAT64` | `FLOAT64` |
| `STRING`, `INT64`, `FLOAT64` | `STRING` |
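As an illustration, a parameter observed as both `INT64` and `FLOAT64` would be decoded to `FLOAT64` by safely promoting the integer values. A sketch, using a hypothetical `price` parameter:

```sql
SELECT
  (SELECT COALESCE(
            event_param.value.float_value,
            event_param.value.double_value,
            SAFE_CAST(event_param.value.int_value AS FLOAT64))  -- promote INT64
   FROM UNNEST(event_params) AS event_param
   WHERE event_param.key = 'price') AS price
FROM `project_id.analytics_##########.events_2023*`
```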
### Decoding Expressions

Once the data has been profiled and we understand the event characteristics, we can build the SQL required to decode the complex structures into the desired output structures. For `event_name`, `event_params` and `user_properties` the strategies employed are:
#### event_name

| Attribute | Value |
|---|---|
| Column Name | `event_name` |
| Column Schema | `STRING` |
| Default Values | `click`, `first_visit`, `page_view`, `scroll`, `session_start`, `user_engagement`, `purchase` |
| Challenges | While each event has a single, non-null `event_name`, this structure can be difficult to use in downstream reporting tools. For example, in Looker Studio it is simple to display one of these metrics (e.g. `page_view`) as a Scorecard by adding a chart filter (e.g. `event_name = 'page_view'`) and adding the records metric to the scorecard. However, if you wanted to plot multiple `event_name` metrics on a single time series this would be impossible, as the filters required would cancel each other out. It is possible to work around this by creating a calculated field for each filter, but this is laborious, adds unnecessary complexity and can introduce human error, as formula errors are common. |
| Approach | The original `event_name` column is retained in the data as it can be useful for a variety of use-cases. However, an additional `STRUCT` flag column `event_count` is included, with a flag (`1` or `NULL`) column for each `event_name` value. This means that default events like `page_view` become their own metric `event_count.page_view`, which can be treated as a normal metric, allowing usage of normal mathematical functions. Custom events are also included as their own columns within the `event_count` `STRUCT`, and plotting metrics against each other on a single chart is now trivial. |
| Output Schema | `event_name STRING, event_count STRUCT<event_name_1 INT64, event_name_2 INT64, ... event_name_n INT64>` |
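A minimal sketch of how such a flag `STRUCT` might be constructed (the real expression is generated from the `event_profile`, so the column list below is illustrative only):

```sql
SELECT *,
  STRUCT(
    IF(event_name = 'page_view', 1, NULL) AS page_view,
    IF(event_name = 'session_start', 1, NULL) AS session_start,
    IF(event_name = 'purchase', 1, NULL) AS purchase
  ) AS event_count
FROM `project_id.analytics_##########.events_2023*`
```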
#### event_count.conversions

Conversions are not explicitly identified in BigQuery GA4 events; however, we add an additional `event_count.conversions` column in the `event_count` output `STRUCT` to enable them to be used as a metric. The default logic for this is aligned to the Google Analytics 4 definition, and includes all events with `event_name = 'purchase'`. This can be updated if additional events are marked as conversions; however, this will not happen automatically.
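A minimal sketch of the default flag, assuming the same table naming as above; extend the `IN` list if additional events are marked as conversions:

```sql
SELECT *,
  STRUCT(
    IF(event_name IN ('purchase'), 1, NULL) AS conversions  -- GA4 default definition
  ) AS event_count
FROM `project_id.analytics_##########.events_2023*`
```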
#### event_params

| Attribute | Value |
|---|---|
| Column Name | `event_params` |
| Column Schema | `event_params ARRAY<STRUCT<key STRING, value STRUCT<string_value STRING, int_value INT64, float_value FLOAT64, double_value FLOAT64>>>` |
| Default Values | `event_params.key`: `anonymize_ip`, `campaign`, `content`, `dclid`, `debug_mode`, `engaged_session_event`, `engagement_time_msec`, `entrances`, `ga_session_id`, `ga_session_number`, `gclid`, `ignore_referrer`, `language`, `link_classes`, `link_domain`, `link_id`, `link_url`, `medium`, `outbound`, `page_location`, `page_referrer`, `page_title`, `parent_term`, `percent_scrolled`, `screen_resolution`, `session_engaged`, `source`, `term` |
| Challenges | Extracting values from this complex `ARRAY` column requires prior knowledge of both the `event_params.key` to be extracted and its data type, as well as use of the `UNNEST` operator to access array elements, which is a complicated syntax to master, e.g. `(SELECT event_param.value.int_value FROM UNNEST(event_params) AS event_param WHERE event_param.key = 'ga_session_id')`. Using this structure in a downstream reporting tool is near-impossible, as it requires complex calculated fields for every possible key expected in the data. |
| Approach | The decoding expression for the `event_params` column is generated programmatically from the `event_profile` derived in the profiling stage, and leverages a set of extract functions (`extract_event_params_double_value`, `extract_event_params_float_value`, `extract_event_params_int_value`, `extract_event_params_string_value`) to avoid data loss due to type inconsistencies. |
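The extract function names above come from the docs; their definitions are not shown here, but one plausible shape (a sketch, not the actual deployed code) is a UDF that falls back to a safe cast of the `string_value` field:

```sql
CREATE OR REPLACE FUNCTION `project_id.dataset_name.extract_event_params_int_value`(
  event_params ARRAY<STRUCT<key STRING,
                            value STRUCT<string_value STRING,
                                         int_value INT64,
                                         float_value FLOAT64,
                                         double_value FLOAT64>>>,
  param_key STRING)
RETURNS INT64
AS ((
  -- Prefer the correctly-typed field, falling back to a safe cast of the
  -- STRING field to guard against type inconsistencies.
  SELECT COALESCE(event_param.value.int_value,
                  SAFE_CAST(event_param.value.string_value AS INT64))
  FROM UNNEST(event_params) AS event_param
  WHERE event_param.key = param_key
));
```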
#### user_properties

| Attribute | Value |
|---|---|
| Column Name | `user_properties` |
| Column Schema | `user_properties ARRAY<STRUCT<key STRING, value STRUCT<string_value STRING, int_value INT64, float_value FLOAT64, double_value FLOAT64, set_timestamp_micros INT64>>>` |
| Default Values | None |
| Challenges | Extracting values from this complex `ARRAY` column requires prior knowledge of both the `user_properties.key` to be extracted and its data type, as well as use of the `UNNEST` operator to access array elements, which is a complicated syntax to master, e.g. `(SELECT user_property.value.int_value FROM UNNEST(user_properties) AS user_property WHERE user_property.key = 'ga_session_id')`. Using this structure in a downstream reporting tool is near-impossible, as it requires complex calculated fields for every possible key expected in the data. |
| Approach | The decoding expression for the `user_properties` column is generated programmatically from the `event_profile` derived in the profiling stage, and leverages a set of extract functions (`extract_user_properties_double_value`, `extract_user_properties_float_value`, `extract_user_properties_int_value`, `extract_user_properties_string_value`) to avoid data loss due to type inconsistencies. |
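Usage then reduces to one function call per decoded column; for example (the `user_tier` key and dataset name here are hypothetical):

```sql
SELECT *,
  `project_id.dataset_name.extract_user_properties_string_value`(
    user_properties, 'user_tier') AS user_tier
FROM `project_id.analytics_##########.events_2023*`
```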
### Standard Base Query

We start with a base query, which is structured as a sequence of common table expressions (explicitly named, SQL-defined logical steps). This query then calls the custom queries (`add_event_counts`, `decode_event_params` and `decode_user_properties`), which are deployed in either the inbound GA4 dataset or a separate, regionally co-located dataset. We can then periodically refresh the custom queries to ensure that new `event_name`, `event_params` and `user_properties` values are reflected in the downstream data.
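A minimal sketch of this CTE structure (step names follow the summary logic table below; the custom function calls and most steps are simplified or omitted):

```sql
WITH
get_raw_events AS (
  SELECT *
  FROM `project_id.analytics_##########.events_*`
  WHERE _TABLE_SUFFIX BETWEEN '20230101' AND '20230131'
),
add_event_id AS (
  -- Unique event_id: SHA256 hash of the JSON representation of each row.
  SELECT raw.*, TO_HEX(SHA256(TO_JSON_STRING(raw))) AS event_id
  FROM get_raw_events AS raw
),
convert_dates_and_timestamps AS (
  -- GA4 exports event_date as a 'YYYYMMDD' STRING and timestamps as UNIX micros.
  SELECT * REPLACE (PARSE_DATE('%Y%m%d', event_date) AS event_date),
    TIMESTAMP_MICROS(event_timestamp) AS event_ts
  FROM add_event_id
)
SELECT * FROM convert_dates_and_timestamps
```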
### Decoder Table Function

This base query (along with the referenced custom queries) is then deployed in the GA4 BigQuery dataset as a date-parameterised table function called `events`, which can be queried efficiently by specifying `start_date` and `end_date` `DATE` parameters, e.g. for the past 30 days:
#### GA4 Dataset

```sql
SELECT *
FROM `[project_id].analytics_########.events`(CURRENT_DATE - 30, CURRENT_DATE);
```
#### Separate Dataset

```sql
SELECT *
FROM `[project_id].[dataset_name].events`(CURRENT_DATE - 30, CURRENT_DATE);
```
It can also be queried directly from Looker Studio with report date parameters, using the following syntax to parse the `DS_START_DATE` and `DS_END_DATE` parameters into SQL `DATE` values:

```sql
SELECT *
FROM `[project_id].analytics_########.events`(
  PARSE_DATE("%Y%m%d", @DS_START_DATE),
  PARSE_DATE("%Y%m%d", @DS_END_DATE));
```
For event streams which have a high query load, a large amount of data, or both, we can use `bqtools` to build and periodically update a date-partitioned table, to which downstream tools can connect directly. This optimises performance and cost, but requires a scheduled query to keep the table updated with the latest partitions.
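One possible shape for that materialisation, sketched in plain BigQuery DDL rather than `bqtools`, assuming the decoded output exposes an `event_date` `DATE` column:

```sql
-- Rebuild the partitioned table for the trailing 30 days; a scheduled
-- query would typically MERGE or append only the newest partitions.
CREATE OR REPLACE TABLE `[project_id].[dataset_name].events_partitioned`
PARTITION BY event_date AS
SELECT *
FROM `[project_id].[dataset_name].events`(CURRENT_DATE - 30, CURRENT_DATE);
```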
### Multi-Property Decoding

It is also possible to 'merge' multiple `event_profile` objects to build a universal decoder, which results in a perfectly aligned schema across multiple GA4 properties, regardless of discrepancies seen in the actual observed data. This means that the downstream data can be integrated trivially.
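With aligned schemas, integration is a simple `UNION ALL` (dataset names here are hypothetical):

```sql
SELECT 'property_1' AS property, *
FROM `[project_id].[dataset_1].events`(CURRENT_DATE - 30, CURRENT_DATE)
UNION ALL
SELECT 'property_2' AS property, *
FROM `[project_id].[dataset_2].events`(CURRENT_DATE - 30, CURRENT_DATE);
```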
# Event Decoding

## Summary Logic

The decoder function is a common-table-expression structured, date-parameterised table function. It takes a `start_date` and an `end_date` parameter which pre-filter for a range of date shards in the inbound event data, and then executes a set of logical steps to remodel the data in a clean and extensible manner.

Each of these logical steps is implemented as a single sequential common table expression (CTE), separating out the different steps for ease of readability, debugging and extensibility.
| Name | Description |
|---|---|
| `get_raw_events` | Get inbound events by date shard range. |
| `add_event_id` | Add a hexadecimal SHA256 hash of the JSON representation of each row as a unique `event_id`. |
| `add_session_id` | Add a hexadecimal SHA256 hash of the JSON representation of the `user_pseudo_id` and the `ga_session_id` extracted from the `event_params` `ARRAY`. |
| `convert_dates_and_timestamps` | Converts UNIX dates and timestamps to `DATE` and `TIMESTAMP` data types. |
| `add_event_counts` | Custom function for each GA4 property. Returns an `event_count` `STRUCT` containing count columns for each observed `event_name` value, in addition to total `events` and total observed `conversions`. |
| `decode_event_params` | Custom function for each GA4 property. Returns an `event_param` `STRUCT` for each event, with a column for each observed value in the `event_params` `ARRAY`. Output data types are determined by the data types observed in the profiling stage and the logical approach outlined in the docs. |
| `decode_user_properties` | Custom function for each GA4 property. Returns a `user_property` `STRUCT` for each event, with a column for each observed value in the `user_properties` `ARRAY`. Output data types are determined by the data types observed in the profiling stage and the logical approach outlined in the docs. |
| `correct_attribution` | Optional fix for the known misattribution bug for `Google / Paid Search`. |
| `additional_logic` | Placeholder for additional custom logic. |
| `exclude_columns` | Configurable exclusion of non-required columns for schema simplification purposes (e.g. the `event_params` and `user_properties` `ARRAY`s); can be trivially extended if e.g. `items` or `ecommerce` are not required. |
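As an illustration of the `add_session_id` step, a minimal sketch (assuming `ga_session_id` arrives as an `INT64` here for brevity; the real implementation may use the type-safe extract functions described above):

```sql
SELECT *,
  -- session_id: hash of the user_pseudo_id and ga_session_id pair.
  TO_HEX(SHA256(TO_JSON_STRING(STRUCT(
    user_pseudo_id,
    (SELECT event_param.value.int_value
     FROM UNNEST(event_params) AS event_param
     WHERE event_param.key = 'ga_session_id') AS ga_session_id
  )))) AS session_id
FROM `project_id.analytics_##########.events_2023*`
```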