Google BigQuery: Hacks for Advanced users

BigQuery: Advanced SQL query hacks

This is a list of time-saving, cost-saving and life-saving SQL query hacks you need to know if you consider yourself an advanced BigQuery user.

Select all columns by excluding only some of them

When you need to select all the columns in the table but a couple of them you can use the EXCEPT keyword to do so:

SELECT * EXCEPT (col1, col2) 
FROM `project_id.dataset_name.table_name`

Convert JSON into structured data

If you have JSON formatted text in a string column you can use JavaScript UDF to convert the JSON text into an array which will be displayed as a nested field in BigQuery:

CREATE TEMP FUNCTION unnestJson(str STRING)
RETURNS ARRAY<STRING>
LANGUAGE js AS r"""
  var obj = JSON.parse(str);
  var keys = Object.keys(obj);
  var arr = [];
  for (i = 0; i < keys.length; i++) {
    arr.push(JSON.stringify(obj[keys[i]] || ""));
  }
  return arr;
""";

SELECT col1, unnestJson(col2) FROM `project_id.dataset_name.table_name`

Define queries and static values once if you have multiple references

If you need to reference a certain subquery multiple times in your main query or have to reference a certain value (e.g. a specific date) multiple times within the main query, you can use WITH to define it as a common table expression (CTE) and/or DEFINE to define a value as a static SQL variable.

In your actual query you can reuse all these definitions just by referencing them.

DECLARE start_date DATE DEFAULT DATE('2020-09-29');

WITH

valid_dataset AS (
  SELECT *
  FROM `project_id.dataset_name.table_name`
  WHERE col1 = TRUE
),

valid_dataset2 AS (
  SELECT *
  FROM `project_id.dataset_name.table_name`
  WHERE col2 = TRUE
)

SELECT col2
FROM valid_dataset
WHERE col3 > start_date 
UNION ALL
SELECT col2
FROM valid_dataset2
WHERE col3 > start_date 

UNNEST will automatically CROSS JOIN results

Be careful when using UNNEST, as it will automatically CROSS JOIN the results from the parent table and the untested table. This means that results from the parent table will disappear if there are no relevant rows in the untested table.

I’ve found that in most cases it would be safer to UNNEST by using LEFT JOIN instead. You can see the syntax for both options in the example below. Test this with your dataset to understand what works best for you.

-- UNNESTING BY CROSS JOIN
SELECT
  session.id as ses_id,
  h.id as hit_id
FROM
  `project_id.dataset_id.sessions` AS session ,
  UNNEST(session.hits) AS h ,
  UNNEST (h.events) as e


-- UNNESTING BY LEFT JOIN
SELECT
  session.id as ses_id,
  h.id as hit_id
FROM
  `project_id.dataset_id.sessions` AS session` AS session
  LEFT JOIN UNNEST(session.hits) AS h
  LEFT JOIN UNNEST (h.events) as e

Testing queries that will update your data (like a Pro)

When updating data with a DML SQL query (e.g. UPDATE, INSERT) you need to be very careful. The easiest approach is to use a SELECT query that matches what you want to do in your final SQL query (see example below).

-- TESTING USING A SELECT QUERY
SELECT
 additional_information,
 regexp_replace(additional_information,'Old value', 'New value') AS new_additional_information
FROM
 `project.dataset.table`
WHERE 
 account_id = 123
 AND additional_information LIKE '%Old value%'

-- MAKING CHANGES WITH AN UPDATE QUERY
UPDATE  
 `project.dataset.table`
SET 
 additional_information = regexp_replace(additional_information,'Old value', 'New value')
WHERE 
 account_id = 123
 AND additional_information LIKE '%Old value%'
But this approach is not foolproof, as there’s always a small chance to have a typo or small differences between the 2 versions.

Ideally, you want to be able to run your query, see the changes in your data, and only after you verify that everything looks OK, apply the changes! You can do that with a ROLLBACK statement!

-- TRANSACTION WITH ROLLBACK
BEGIN TRANSACTION;  INSERT INTO 
    `project.dataset.table`
  VALUES ('New value', 'Another new value', 123);  SELECT 
    * 
  FROM 
    `project.dataset.table`;ROLLBACK TRANSACTION;

Cut down query costs with Partitioned or Clustered tables

When using a DB for a project chances are that you will have a lot of data to work with and in most cases, there’s some kind of date value related to that data that you use to filter the results and focus your query on a specific timeframe. With BigQuery, when querying a table with a date (or any other field) in the WHERE clause you will be charged based on the complete size of the table for all the fields in your SELECT clause!

You need to split your table into smaller tables to cut down your costs. You can do that using Fragmented tables, Partitioned tables, or Clustered tables. In all of these cases, you can use a date field (or even a few other data types for clustered tables) to limit the scope of your query to a specific timeframe and reduce its cost. While at the same time you are also able to easily query your complete table if you need to.

With fragmented tables, you just add the date in the name of your table and use REGEX rules to define the timeframe you want to query. CAUTION: This is an easy way to split your dataset, but will hurt the performance of queries using a large set of tables. You may also come across limitations if you use them for a very long time.

With clustered tables, the date becomes part of the table name and you use that to specify the timeframe you want to query.

With partitioned tables, you can just use a table field (the field you used to partition your table or the _PARTITIONTIME field if you used ingestion-time partition) to prune down your partitions.

-- SUM sales data for January 2021

-- FRAGMENTED TABLES
SELECT
  SUM(totalSale)
FROM
  `mydataset.SalesData-2021-01-*`


-- PARTITIONED TABLES (TIME-UNIT COLUMN-PARTITIONED TABLE)
SELECT
  SUM(totalSale)
FROM
  `mydataset.PartitionedSalesData`
WHERE
  invoice_date >= '2021-01-01') AND invoice_date <= '2021-01-31'

-- PARTITIONED TABLES (INGESTION-TIME)
SELECT
  SUM(totalSale)
FROM
  `mydataset.PartitionedSalesData`
WHERE
  _PARTITIONTIME BETWEEN TIMESTAMP('2021-01-01') AND TIMESTAMP('2021-01-31')


-- CLUSTERED TABLES
SELECT
  SUM(totalSale)
FROM
  `mydataset.ClusteredSalesData`
WHERE
  EXTRACT(MONTH from invoice_date) = 1 AND EXTRACT(YEAR from invoice_date) = 2021

Pivot tables in BigQuery

A pivot table is extremely helpful when summarizing or analyzing datasets in tabular format. Let’s use the “baseball” public BigQuery dataset (pitch-by-pitch data for Major League Baseball (MLB) games in 2016) as an example.

The SQL query below would return us the average number of “out” balls by venue and day or night game:

SELECT
  venueCity,
  dayNight,
  AVG(outs) AS outs_avg
FROM
  `bigquery-public-data.baseball.games_wide`
GROUP BY
  venueCity,
  dayNight


-- Sample data:
| venueCity | dayNight | outs_avg |
|-----------|----------|----------|
| Miami     | N        | 1.1097   |
| Miami     | D        | 1.1306   |
| Atlanta   | D        | 1.0978   |
| Atlanta   | N        | 1.1051   |
| ...       | ...      | ...      |

It would be ideal if we could pivot the dayNight dimension into a column, to be able to see in 1 row the average out balls per venue both for day and night games. This can be achieved with the following SQL query which utilizes the PIVOT operator:

SELECT
  *
FROM (
  SELECT
    venueCity,
    dayNight,
    AVG(outs) AS outs_avg
  FROM
    `bigquery-public-data.baseball.games_wide`
  GROUP BY
    venueCity,
    dayNight ) PIVOT(AVG(outs_avg) FOR dayNight IN ("D", "N"))

-- Sample data:
| venueCity  | D      | N      |
|------------|--------|--------|
| Miami      | 1.1306 | 1.1097 |
| Atlanta    | 1.0978 | 1.1051 |
| Fort Bragg | null   | 1.1043 |
| ...        | ...    | ...    |

For more advanced use cases of the PIVOT operator, check out this article on Medium.

BigQuery functions

BigQuery, with native and remote routines, is really powerful. But sometimes SQL can be limiting and it cannot work or is too time-consuming for several special cases (e.g. Creating fake datasets, Removing personal data, Array transformations, String transformations, etc.).

This is where BigFunctions comes into play. An open-source package of pre-defined BigQuery-compatible functions. BigFunctions are open-source BigQuery routines that give you SQL superpowers in BigQuery. All BigFunctions represented by a ‘yaml’ file in bigfunctions folder are automatically deployed in public datasets so that you can call them directly without installing them from your BigQuery project. To explore available BigFunctions and to get started, visit the BigFunctions website.

ROLLUP instead of UNION

Using the ROLLUP function you can generate cross-joins avoiding the use of long UNION ALL SQL queries. Imagine having a table like the following:

DeviceOSVisits
MobileAndroid5
MobileiOS10
TabletAndroid5
TabletiOS10
DesktopWindows5
Desktopmacos5

You can use the following SQL query to calculate both the totals by OS and the totals for all OS & device combinations:

SELECT
  os,
  device,
  SUM(visits) AS visits_total
FROM
  datatable
GROUP BY
  ROLLUP(os, device)
ORDER BY
  OS

The result would look like the following table:

Travel back in time

BigQuery allows you to view table data as of a previous point in time, within a pre-defined travel window (by default set to 7 days). Let’s say for instance that you have a table with 10 rows of data on Jan 1st and on Jan 2nd you add 10 more rows of data.

You can use the following SQL query to count the number of rows that were available for that table on Jan 1st:

SELECT Count(*) FROM `condenast-baresquare.qa_automation.output` 

FOR SYSTEM_TIME AS OF TIMESTAMP('2023-08-25 00:00:00.000 UTC')

You can also change the travel window on a dataset level using the following SQL query:

ALTER SCHEMA your_table_name
       SET OPTIONS (
              max_time_travel_hours = your_number_of_hours_here
       );

Or you can even join tables using the time travel option just by including the FOR SYSTEM_TIME AS OF clause after the table name in your SQL query.

Use LLMs in SQL

The integration of VertexAI Foundation Models in BigQuery makes it easy for businesses to use ML to analyze their data – This can help you to improve your operations, make better decisions, and gain a competitive advantage. For instance, you can do:

  • Content generation: Analyze customer feedback and generate personalized email content right inside BigQuery without the need for complex tools
  • Summarization: Summarize text stored in BigQuery columns such as online reviews or chat transcripts
  • Data enhancement: Obtain a country name for a given city name
  • Rephrasing: Correct spelling and grammar in textual content such as voice-to-text transcriptions
  • Feature extraction: Extract key information or words from large text files such as in online reviews and call transcripts
  • Sentiment analysis: Understand human sentiment about specific subjects in a text

Learn more here
Watch demo here
Step-by-step tutorial here
Codelab tutorial here

Source: https://www.linkedin.com/in/jansmerhovsky/

Analyze job history

Did you know that you can just use an SQL query to collect the complete history of jobs/queries executed against your project? The INFORMATION_SCHEMA.JOBS view contains near real-time metadata about all BigQuery jobs in the current project.

You can use an SQL query like the following to get the list of SQL queries ordered by each user (see sample query below) or even an approximation of the cost for each query by using the total_bytes_billed and total_bytes_processed columns. This would allow you to create a dashboard with statistics, alerts on cost or even report on the most used columns.

SELECT
  user_email,
  query
FROM
  `region-us.INFORMATION_SCHEMA.JOBS`
WHERE
  creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
  AND creation_time < TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)

The underlying data is partitioned by the creation_time column and clustered by project_id and user_email. The query_info column contains additional information about your query jobs. You can find the complete schema here.

QUALIFY: Filtering window function results

QUALIFY is used to filter your data based on the results of window functions.

In other words, it is to window functions what HAVINGi s to GROUP BY and WHERE is to FROM.

Source

Panagiotis

Written By

Panagiotis (pronounced Panayotis) is a passionate G(r)eek with experience in digital analytics projects and website implementation. Fan of clear and effective processes, automation of tasks and problem-solving technical hacks. Hands-on experience with projects ranging from small to enterprise-level companies, starting from the communication with the customers and ending with the transformation of business requirements to the final deliverable.