
Architecture

Summary

In order to transform GA4 event data in BigQuery into a usable structure, the complex nested event_params and user_properties objects need to be decoded, whilst controlling for potential data type inconsistencies. This means that it is necessary to profile the data and deploy a set of custom functions, unique to each property (or group of properties, if schema alignment and, potentially, data integration is desired).
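For context, this is the kind of decoding the custom functions automate. The following rough sketch (not the deployed functions' logic) extracts a single event_params key by hand, coalescing across the typed value columns; page_location and the table path are illustrative placeholders:

SELECT
  event_name,
  -- correlated subquery over the nested array, coalescing typed values into one STRING
  (SELECT COALESCE(value.string_value,
                   CAST(value.int_value AS STRING),
                   CAST(value.double_value AS STRING),
                   CAST(value.float_value AS STRING))
   FROM UNNEST(event_params)
   WHERE key = 'page_location') AS page_location
FROM `[project_id].analytics_########.events_*`;

Repeating this for every key, with the correct output type per key, is what the generated decoder functions encapsulate.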

Resource Deployment

The following resources are deployed to either the ga4_dataset or a different dataset in the same region:

| Name | Type | Arguments | Description |
|------|------|-----------|-------------|
| events | TABLE FUNCTION | start_date DATE, end_date DATE | Main transformation function containing all transformation steps. |
| add_event_counts | FUNCTION | event_name STRING | Returns an event_count STRUCT containing event_name count columns for each observed event_name value, in addition to total_events and total observed conversion events. |
| decode_event_params | FUNCTION | event_params ARRAY<STRUCT<key STRING, value STRUCT<string_value STRING, int_value INT64, float_value FLOAT64, double_value FLOAT64>>> | Decoder for event_params objects. Returns an event_param STRUCT containing event_params values for each observed event_params key, with the appropriate data types. |
| decode_user_properties | FUNCTION | user_properties ARRAY<STRUCT<key STRING, value STRUCT<string_value STRING, int_value INT64, float_value FLOAT64, double_value FLOAT64, set_timestamp_micros INT64>>> | Decoder for user_properties objects. Returns a user_property STRUCT containing user_properties values for each observed user_properties key, with the appropriate data types. |
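Once deployed, the events TABLE FUNCTION can be queried like an ordinary table, passing DATE arguments; for example (assuming deployment to a dataset named [dataset_name]):

SELECT *
FROM `[project_id].[dataset_name].events`(DATE '2024-01-01', DATE '2024-01-31');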

These resources are related according to the following dependency diagram:

flowchart LR

subgraph transformation_dataset["CUSTOM FUNCTIONS: Transformation Dataset"]
    events
    add_event_counts
    decode_event_params
    decode_user_properties
end

subgraph ga4_dataset["SOURCE DATA: analytics_########.events"]
    events_YYYYMMDD
end

events_YYYYMMDD --> |start_date,<br>end_date| events
events --> |event_name| add_event_counts
events --> |event_params| decode_event_params
events --> |user_properties| decode_user_properties

Note that custom functions can be deployed to the inbound GA4 dataset; however, permissions limitations might mean that deploying them into a separate but regionally co-located dataset is preferable.
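For orientation, the deployed events function follows the general shape below. This is a simplified sketch only: the generated SQL is property-specific, and the column list and aliases are assumptions based on the resource table above.

CREATE OR REPLACE TABLE FUNCTION
  `[project_id].[dataset_name].events`(start_date DATE, end_date DATE)
AS
SELECT
  -- derive a DATE column from the source table shard suffix
  PARSE_DATE('%Y%m%d', _TABLE_SUFFIX) AS date,
  event_name,
  `[project_id].[dataset_name].add_event_counts`(event_name) AS event_count,
  `[project_id].[dataset_name].decode_event_params`(event_params) AS event_param,
  `[project_id].[dataset_name].decode_user_properties`(user_properties) AS user_property
FROM `[project_id].analytics_########.events_*`
-- restrict the shard scan to the requested date range
WHERE _TABLE_SUFFIX BETWEEN FORMAT_DATE('%Y%m%d', start_date)
                        AND FORMAT_DATE('%Y%m%d', end_date);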

Looker Studio Connection

The events TABLE FUNCTION can be directly connected to a Looker Studio report, whilst leveraging report date parameters by using the following custom SQL:

SELECT *
FROM `[project_id].analytics_########.events`(
  PARSE_DATE("%Y%m%d", @DS_START_DATE),
  PARSE_DATE("%Y%m%d", @DS_END_DATE));

This is highly efficient as it only queries the date partitions (and therefore upstream source table shards) that are selected in the report, improving performance and minimising query load and cost.

Resource Update

Once deployed, resource SQL code can be updated manually via the BigQuery Console (EDIT PERSISTENT FUNCTION for add_event_counts, decode_event_params or decode_user_properties and EDIT TABLE FUNCTION for events). Any of these functions can also be refreshed at any point by re-running the appropriate deployment function.

One recommended approach is to periodically update the three sub-functions (add_event_counts, decode_event_params, decode_user_properties) to ensure that new values are picked up. Minor updates to the overall logic can then be maintained in the events TABLE FUNCTION.

Additional Custom Logic

If there is significant additional downstream logic, another approach is to build one or more linked, date-parameterised table functions in which to encapsulate it. The first common table expression (CTE) in such a function would then be:

WITH
get_events AS (
  SELECT *
  FROM `[project_id].[dataset_name].events`(start_date, end_date)
)
...

You would then use this table function for report connection or table creation.
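As an illustration, a linked function wrapping the core events function might look like the sketch below; reporting_events and the placeholder comment are illustrative only:

CREATE OR REPLACE TABLE FUNCTION
  `[project_id].[dataset_name].reporting_events`(start_date DATE, end_date DATE)
AS
WITH
get_events AS (
  -- pass the outer parameters straight through to the core function
  SELECT *
  FROM `[project_id].[dataset_name].events`(start_date, end_date)
)
-- additional downstream logic would be built on top of get_events here
SELECT *
FROM get_events;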

Performance Optimisation

If you are expecting high query load or have a large volume of data, it might be preferable to use the output table function to create a date-partitioned table, and to connect downstream tools to this table. This might also be required if your downstream reporting tool cannot use custom SQL or report date parameters.

Full Table Refresh

In the simplest form, this can be actioned and scheduled as a full table refresh using the following example syntax (note that this would be for the last 365 days only, but the date parameters can be changed to any arbitrary DATE values):

CREATE OR REPLACE TABLE
  `[project_id].[dataset_name].[destination_table_name]`
PARTITION BY date
AS
SELECT *
FROM `[project_id].[dataset_name].events`(CURRENT_DATE - 365, CURRENT_DATE);

Note that full table refreshes are inefficient as you will be repeatedly copying the same data, and the query load and cost will scale linearly over time. This may be acceptable for small data sources, but an incremental approach is typically preferable.

Incremental Refresh

By consistently using date-partitioned tables and table functions and treating table date partitions as immutable objects, an incremental approach is relatively simple to deploy. This will result in lower refresh costs which will only scale with volume increases.

We leverage some of the tools in our bqtools library to deploy this in a concise and efficient manner.
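For reference, a minimal hand-rolled version of the pattern (without bqtools) appends only the date partitions missing from the destination table, stopping at yesterday so that only complete partitions are written; the table names mirror the full-refresh example above:

-- determine the first date not yet present in the destination table,
-- falling back to a 365-day backfill on the first run
DECLARE refresh_start DATE DEFAULT (
  SELECT COALESCE(DATE_ADD(MAX(date), INTERVAL 1 DAY), CURRENT_DATE - 365)
  FROM `[project_id].[dataset_name].[destination_table_name]`);

-- append only the missing partitions
INSERT INTO `[project_id].[dataset_name].[destination_table_name]`
SELECT *
FROM `[project_id].[dataset_name].events`(refresh_start, CURRENT_DATE - 1);

Each scheduled run therefore only scans and writes the new date range, so cost scales with incoming data volume rather than total history.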