To transform GA4 event data in BigQuery into a usable structure, the complex nested `event_params` and `user_properties` objects need to be decoded, while controlling for potential data type inconsistencies. This means profiling the data and deploying a set of custom functions, unique to each property (or group of properties, if schema alignment and potentially data integration is desired).
The following resources are deployed to either the `ga4_dataset` or a different dataset in the same region:

| Resource | Type | Description |
|---|---|---|
| `events` | Table function | Main transformation function containing all transformation steps. |
| `add_event_counts` | Persistent function | Derives event counts from `event_name`. |
| `decode_event_params` | Persistent function | Decodes the nested `event_params` object. |
| `decode_user_properties` | Persistent function | Decodes the nested `user_properties` object. |
These resources are related according to the following dependency diagram:

```mermaid
flowchart LR
  subgraph transformation_dataset["CUSTOM FUNCTIONS: Transformation Dataset"]
    events
    add_event_counts
    decode_event_params
    decode_user_properties
  end
  subgraph ga4_dataset["SOURCE DATA: analytics_########.events"]
    events_YYYYMMDD
  end
  events_YYYYMMDD --> |start_date,<br>end_date| events
  events --> |event_name| add_event_counts
  events --> |event_params| decode_event_params
  events --> |user_properties| decode_user_properties
```
Note that custom functions can be deployed to the inbound GA4 dataset; however, permission limitations might mean that deploying them into a separate but regionally co-located dataset is preferable.
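As an illustrative sketch only, the deployed `events` table function might be declared along these lines. The dataset name `transformation_dataset`, the output column names, and the function signatures shown here are assumptions; the actual deployed SQL contains the full, profiled decoding logic:

```sql
-- Hypothetical sketch of the deployed table function (names are placeholders).
CREATE OR REPLACE TABLE FUNCTION `[project_id].transformation_dataset.events`(
  start_date DATE,
  end_date DATE
)
AS (
  SELECT
    PARSE_DATE('%Y%m%d', event_date) AS date,
    -- Each sub-function decodes one nested input, as per the dependency diagram.
    `[project_id].transformation_dataset.add_event_counts`(event_name) AS event_counts,
    `[project_id].transformation_dataset.decode_event_params`(event_params) AS params,
    `[project_id].transformation_dataset.decode_user_properties`(user_properties) AS user_props
  FROM `[project_id].analytics_########.events_*`
  -- Restrict the scan to the date shards inside the requested range.
  WHERE _TABLE_SUFFIX BETWEEN FORMAT_DATE('%Y%m%d', start_date)
                          AND FORMAT_DATE('%Y%m%d', end_date)
);
```

Filtering on `_TABLE_SUFFIX` is what allows downstream date parameters to limit the scan to the relevant source table shards.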
Looker Studio Connection
The `events` TABLE FUNCTION can be directly connected to a Looker Studio report, while leveraging report date parameters, by using the following custom SQL:
```sql
SELECT *
FROM `[project_id].analytics_########.events`(
  PARSE_DATE("%Y%m%d", @DS_START_DATE),
  PARSE_DATE("%Y%m%d", @DS_END_DATE));
```
This is extremely efficient as it will only query the date partitions (and therefore upstream source table shards) which are selected in the report, improving performance and minimising query load and cost.
Once deployed, resource SQL code can be updated manually via the BigQuery Console (EDIT PERSISTENT FUNCTION for `add_event_counts`, `decode_event_params`, and `decode_user_properties`; EDIT TABLE FUNCTION for `events`). Any of these functions can also be refreshed at any point by re-running the appropriate deployment function.
One recommended approach is to periodically update the three sub-functions (`add_event_counts`, `decode_event_params`, and `decode_user_properties`) to ensure that new values are picked up. Minor updates to the overall logic can then be maintained in the `events` table function.
Additional Custom Logic
If there is significant additional downstream logic, another approach is to build one or more additional linked date-parameterised table functions in which you can encapsulate additional logic. The first common table expression (CTE) in this function would then be:
```sql
WITH get_events AS (
  SELECT *
  FROM `[project_id].dataset_name.events`(start_date, end_date)
)
...
```
You would then use this table function for report connection or table creation.
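A minimal sketch of such a wrapper, assuming a hypothetical function name `events_enriched`, might look as follows; the date parameters declared on the outer function are passed straight through to the inner `events` call:

```sql
-- Hypothetical wrapper table function encapsulating additional downstream logic.
CREATE OR REPLACE TABLE FUNCTION `[project_id].dataset_name.events_enriched`(
  start_date DATE,
  end_date DATE
)
AS (
  WITH get_events AS (
    SELECT *
    FROM `[project_id].dataset_name.events`(start_date, end_date)
  )

  -- Additional custom logic would be layered on here.
  SELECT *
  FROM get_events
);
```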
If you are expecting high query load or have a large volume of data, it might be preferable to use the output table function to create a date-partitioned table, and to connect downstream tools to this table. This might also be required if your downstream reporting tool cannot use custom SQL or report date parameters.
Full Table Refresh
In its simplest form, this can be actioned and scheduled as a full table refresh using the following example syntax (note that this would be for the last 365 days only, but the date parameters can be changed to any arbitrary range):
```sql
CREATE OR REPLACE TABLE `[project_id].[dataset_name].[destination_table_name]`
PARTITION BY date
AS
SELECT *
FROM `[project_id].dataset_name.events`(CURRENT_DATE - 365, CURRENT_DATE)
```
Note that full table refreshes are inefficient as you will be repeatedly copying the same data, and the query load and cost will scale linearly over time. This may be acceptable for small data sources, but an incremental approach is typically preferable.
By consistently using date-partitioned tables and table functions and treating table date partitions as immutable objects, an incremental approach is relatively simple to deploy. This will result in lower refresh costs which will only scale with volume increases.
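As an illustrative sketch of the incremental pattern: delete only the most recent partitions (which may still be receiving late-arriving data) and rebuild them from the table function. The destination table name and the seven-day overwrite window here are assumptions, not fixed recommendations:

```sql
-- Remove only the partitions that may still change...
DELETE FROM `[project_id].[dataset_name].[destination_table_name]`
WHERE date >= CURRENT_DATE - 7;

-- ...then rebuild exactly those partitions from the table function.
INSERT INTO `[project_id].[dataset_name].[destination_table_name]`
SELECT *
FROM `[project_id].dataset_name.events`(CURRENT_DATE - 7, CURRENT_DATE);
```

Because older partitions are treated as immutable, each scheduled run only queries a fixed-size window of recent data, so cost no longer grows with the age of the table.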
We leverage some of the tools in our `bqtools` library to deploy this in a concise and efficient manner.