Overview

Context

Google Analytics 4 event data has a schema which can be extremely difficult to work with, even for SQL experts. All automatic and custom parameters and properties (e.g. event_params, user_properties) are returned as ARRAYs of complex key-value STRUCTs, which require prior knowledge of both the names and the data types of the parameters and properties.

This is further complicated by the fact that the automatic parameters can change over time, and the data types are not always consistent.

example: parameter data type inconsistency

An example is ga_session_id, which is typically an INT64 but which we have observed as a STRING in a number of streams. This means that extracting the INT64 value using the following example query - which should return all raw event data plus the ga_session_id as a separate column - would lose data whenever the value was returned as a STRING.

SELECT *,
  (SELECT event_param.value.int_value
   FROM UNNEST(event_params) AS event_param
   WHERE event_param.key = 'ga_session_id') AS ga_session_id
FROM `project_id.analytics_##########.events_2023*`

This would result in under-reporting of session counts.
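One way to guard against this inconsistency - a sketch rather than the exact production pattern - is to COALESCE across the typed value slots, using SAFE_CAST so that STRING values which cannot be converted return NULL instead of raising an error:

```sql
-- Returns ga_session_id whether the stream populated the INT64 or the
-- STRING slot; SAFE_CAST yields NULL (not an error) on unparseable values.
SELECT *,
  (SELECT COALESCE(
            event_param.value.int_value,
            SAFE_CAST(event_param.value.string_value AS INT64))
   FROM UNNEST(event_params) AS event_param
   WHERE event_param.key = 'ga_session_id') AS ga_session_id
FROM `project_id.analytics_##########.events_2023*`
```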

Approach

Profiling

To enable users to remodel their raw data into a usable structure, we run a set of profiler functions over the raw data to determine:

  • event_name: unique values
  • event_params: unique names, data types and data type occurrences
  • user_properties: unique names, data types and data type occurrences

Neither the ecommerce nor the items columns are included in the profiler at this stage.

Data Type Selection

To avoid potential data loss due to data type inconsistency, the profiling stage identifies the observed data types and their occurrence counts. This makes any data type inconsistencies visible - implementations are prone to human error - so that we can deploy custom code for these edge cases in line with the BigQuery conversion rules.

The potential data conversion rules between the different possible data types (using the CAST or SAFE_CAST functions) are:

  • INT64 → STRING
  • FLOAT64 → STRING
  • INT64 → FLOAT64

The logic to define output data types for each parameter or property is governed by these relationships and the data types identified in the event_profile.

| Observed Data Type(s)   | Potential Output Data Type |
|-------------------------|----------------------------|
| STRING                  | STRING                     |
| INT64                   | INT64                      |
| FLOAT64                 | FLOAT64                    |
| STRING, INT64           | STRING                     |
| STRING, FLOAT64         | STRING                     |
| INT64, FLOAT64          | FLOAT64                    |
| STRING, INT64, FLOAT64  | STRING                     |
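For example, for a parameter observed as both STRING and INT64 (output type STRING, per the table above), the decoding expression for a single element might look like this sketch:

```sql
-- Inside the UNNEST subquery for one key: output everything as STRING,
-- casting the INT64 slot so no rows are lost to type inconsistency.
COALESCE(
  event_param.value.string_value,
  CAST(event_param.value.int_value AS STRING))
```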

Decoding Expressions

Once the data has been profiled and we understand the event characteristics, we can build the SQL required to decode the complex structures into the desired output structures. For the event_name, event_params and user_properties the strategies employed are:

event_name

Column Name: event_name
Column Schema: STRING
Default Values: click, first_visit, page_view, scroll, session_start, user_engagement, purchase
Challenges: While each event has a single, non-null event_name, this structure can be difficult to use in downstream reporting tools. For example, in Looker Studio it is simple to display one of these metrics (e.g. page_view) as a Scorecard by adding a chart filter (e.g. event_name = 'page_view') and adding the records metric to the scorecard. However, if you wanted to plot multiple event_name metrics on a single time series this would be impossible, as the filters required would cancel each other out.

It is possible to work around this by creating calculated fields for each filter, but this is laborious, adds unnecessary complexity and can introduce human error, as formula mistakes are common.
Approach: The original event_name column is retained in the data as it is useful for a variety of use-cases. However, an additional STRUCT column event_count is included, with a flag (1 or NULL) column for each event_name value. This means that default events like page_view become their own metric event_count.page_view, which can be treated as a normal metric, allowing use of standard mathematical functions.

Custom events are also included as their own columns within the event_count STRUCT, and plotting metrics against each other on a single chart is now trivial.
Output Schema: event_name STRING, event_count STRUCT<event_name_1 INT64, event_name_2 INT64, ... event_name_n INT64>
event_count.conversions

Conversions are not explicitly identified in the BigQuery GA4 event data, however we add an additional event_count.conversions column in the event_count output STRUCT to enable them to be used as a metric. The default logic is aligned to the Google Analytics 4 definition, and includes all events with event_name = 'purchase'. This can be updated if additional events are marked as conversions; however, this will not happen automatically.
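As an illustration (column subset assumed; the real column list is generated from the profiled event_name values), the generated event_count expression might look like:

```sql
-- Flag columns are 1 or NULL, so e.g. SUM(event_count.page_view)
-- works as an ordinary metric in downstream tools.
STRUCT(
  IF(event_name = 'page_view', 1, NULL) AS page_view,
  IF(event_name = 'session_start', 1, NULL) AS session_start,
  IF(event_name = 'purchase', 1, NULL) AS purchase,
  IF(event_name = 'purchase', 1, NULL) AS conversions
) AS event_count
```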

event_params

Column Name: event_params
Column Schema: event_params ARRAY<STRUCT<key STRING, value STRUCT<string_value STRING, int_value INT64, float_value FLOAT64, double_value FLOAT64>>>
Default Values: event_params.key: anonymize_ip, campaign, content, dclid, debug_mode, engaged_session_event, engagement_time_msec, entrances, ga_session_id, ga_session_number, gclid, ignore_referrer, language, link_classes, link_domain, link_id, link_url, medium, outbound, page_location, page_referrer, page_title, parent_term, percent_scrolled, screen_resolution, session_engaged, source, term
Challenges: Extracting values from this complex ARRAY column requires prior knowledge of both the event_params.key to be extracted and its data type, as well as use of the UNNEST operator to access array elements, which is a complex syntax to master, e.g.

(SELECT event_param.value.int_value FROM UNNEST(event_params) AS event_param WHERE event_param.key = 'ga_session_id')

Using this structure in a downstream reporting tool is near-impossible, as it requires complex calculated fields for every possible key expected in the data.
Approach: The decoding expression for the event_params column is generated programmatically from the event_profile derived in the profiling stage, and leverages a set of extract functions (extract_event_params_double_value, extract_event_params_float_value, extract_event_params_int_value, extract_event_params_string_value) to avoid data loss due to type inconsistencies.
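The extract function bodies are not reproduced here, but one plausible implementation of e.g. extract_event_params_int_value (the dataset path is hypothetical) is a SQL UDF of this shape:

```sql
CREATE OR REPLACE FUNCTION `project_id.dataset.extract_event_params_int_value`(
  params ARRAY<STRUCT<key STRING,
                      value STRUCT<string_value STRING,
                                   int_value INT64,
                                   float_value FLOAT64,
                                   double_value FLOAT64>>>,
  param_key STRING)
AS ((
  -- Prefer the INT64 slot, fall back to a safe cast of the STRING slot.
  SELECT COALESCE(p.value.int_value,
                  SAFE_CAST(p.value.string_value AS INT64))
  FROM UNNEST(params) AS p
  WHERE p.key = param_key
));
```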

user_properties

Column Name: user_properties
Column Schema: user_properties ARRAY<STRUCT<key STRING, value STRUCT<string_value STRING, int_value INT64, float_value FLOAT64, double_value FLOAT64, set_timestamp_micros INT64>>>
Default Values: None
Challenges: Extracting values from this complex ARRAY column requires prior knowledge of both the user_properties.key to be extracted and its data type, as well as use of the UNNEST operator to access array elements, which is a complex syntax to master, e.g.

(SELECT user_property.value.int_value FROM UNNEST(user_properties) AS user_property WHERE user_property.key = 'ga_session_id')

Using this structure in a downstream reporting tool is near-impossible, as it requires complex calculated fields for every possible key expected in the data.
Approach: The decoding expression for the user_properties column is generated programmatically from the event_profile derived in the profiling stage, and leverages a set of extract functions (extract_user_properties_double_value, extract_user_properties_float_value, extract_user_properties_int_value, extract_user_properties_string_value) to avoid data loss due to type inconsistencies.

Standard Base Query

We start with a standard base query, structured as a sequence of common table expressions (explicitly named, SQL-defined logical steps). This query calls the custom functions (add_event_counts, decode_event_params and decode_user_properties), which are deployed in either the inbound GA4 dataset or a separate, regionally co-located dataset. We can then periodically refresh the custom functions to ensure that new event_name, event_params and user_properties values are reflected in the downstream data.

Decoder Table Function

This base query (along with the referenced custom functions) is then deployed in the GA4 BigQuery dataset as a date-parameterised table function called events, which can be queried efficiently by specifying start_date and end_date DATE parameters, e.g. for the past 30 days:

GA4 Dataset
SELECT * 
FROM `[project_id].analytics_########.events`(CURRENT_DATE - 30, CURRENT_DATE);
Separate Dataset
SELECT * 
FROM `[project_id].[dataset_name].events`(CURRENT_DATE - 30, CURRENT_DATE);

It can also be queried directly from Looker Studio with report date parameters, using the following syntax to parse the DS_START_DATE and DS_END_DATE parameters into SQL DATE values:

SELECT * 
FROM `[project_id].analytics_########.events`(
PARSE_DATE("%Y%m%d", @DS_START_DATE), 
PARSE_DATE("%Y%m%d", @DS_END_DATE));
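For reference, a minimal deployment of such a date-parameterised table function - shown here without the decoding CTEs, which the real definition includes - looks like:

```sql
CREATE OR REPLACE TABLE FUNCTION `[project_id].analytics_########.events`(
  start_date DATE, end_date DATE)
AS
SELECT *
FROM `[project_id].analytics_########.events_*`
WHERE _TABLE_SUFFIX BETWEEN FORMAT_DATE('%Y%m%d', start_date)
                        AND FORMAT_DATE('%Y%m%d', end_date);
```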

For event streams which have a high query load, a large amount of data or both, we can use bqtools to build and periodically update a date-partitioned table, to which downstream tools can connect directly. This optimises performance and cost, but requires a scheduled query to keep the table updated with the latest partitions.
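A sketch of this pattern (table and column names assumed; the event_date DATE column comes from the convert_dates_and_timestamps step):

```sql
-- One-off build of a date-partitioned table from the table function;
-- a scheduled query then re-runs a shorter window to refresh partitions.
CREATE TABLE IF NOT EXISTS `[project_id].[dataset_name].events_decoded`
PARTITION BY event_date
AS
SELECT *
FROM `[project_id].[dataset_name].events`(CURRENT_DATE - 365, CURRENT_DATE);
```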

Multi-Property Decoding

It is also possible to 'merge' multiple event_profile objects to build a universal decoder, which results in a perfectly aligned schema across multiple GA4 properties regardless of discrepancies in the actual observed data. This means that the downstream data can be integrated trivially.

Event Decoding

Summary Logic

The decoder function is a common-table-expression-structured, date-parameterised table function. It takes start_date and end_date parameters, which pre-filter for a range of date shards in the inbound event data, and then executes a set of logical steps to remodel the data in a clean and extensible manner.

Each of these logical steps is implemented as a single sequential common table expression (CTE), separating out the different steps for ease of readability, debugging and extensibility.

| Name | Description |
|------|-------------|
| get_raw_events | Get inbound events by date shard range. |
| add_event_id | Add a hexadecimal SHA256 hash of the JSON representation of each row as a unique event_id. |
| add_session_id | Add a hexadecimal SHA256 hash of the JSON representation of the user_pseudo_id and the ga_session_id extracted from the event_params ARRAY. |
| convert_dates_and_timestamps | Convert UNIX dates and timestamps to DATE and TIMESTAMP data types. |
| add_event_counts | Custom function for each GA4 property. Returns an event_count STRUCT containing event_name count columns for each observed event_name value, in addition to total events and total observed conversions. |
| decode_event_params | Custom function for each GA4 property. Returns an event_param STRUCT for each event with a column for each observed value in the event_params ARRAY. Output data types are determined by the observed data types in the profiling stage and the logical approach outlined in the docs. |
| decode_user_properties | Custom function for each GA4 property. Returns a user_property STRUCT for each event with a column for each observed value in the user_properties ARRAY. Output data types are determined by the observed data types in the profiling stage and the logical approach outlined in the docs. |
| correct_attribution | Optional fix for the known misattribution bug for Google / Paid Search. |
| additional_logic | Placeholder for additional custom logic. |
| exclude_columns | Configurable exclusion of non-required columns for schema simplification purposes (e.g. the event_params and user_properties ARRAYs; can be trivially extended if e.g. items or ecommerce are not required). |
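Putting these steps together, the decoder's CTE chain has the following skeleton (most CTE bodies are elided as pass-throughs; shard filtering is shown for the first step):

```sql
WITH
get_raw_events AS (
  SELECT *
  FROM `[project_id].analytics_########.events_*`
  WHERE _TABLE_SUFFIX BETWEEN FORMAT_DATE('%Y%m%d', start_date)
                          AND FORMAT_DATE('%Y%m%d', end_date)),
add_event_id AS (
  SELECT *, TO_HEX(SHA256(TO_JSON_STRING(e))) AS event_id
  FROM get_raw_events AS e),
-- Remaining steps shown as pass-throughs; each would apply the logic
-- described in the table above.
add_session_id               AS (SELECT * FROM add_event_id),
convert_dates_and_timestamps AS (SELECT * FROM add_session_id),
add_event_counts             AS (SELECT * FROM convert_dates_and_timestamps),
decode_event_params          AS (SELECT * FROM add_event_counts),
decode_user_properties       AS (SELECT * FROM decode_event_params),
correct_attribution          AS (SELECT * FROM decode_user_properties),
additional_logic             AS (SELECT * FROM correct_attribution),
exclude_columns              AS (SELECT * FROM additional_logic)
SELECT * FROM exclude_columns
```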