SQL is an amazing language, it lets you declaratively say what you want, and the engine figures out for you the best way to return it to you. Or should I say, it figures out the best way to return it to you given the information it has and the capabilities of the engine itself.

In this post, we’ll discuss a performance optimization technique for BigQuery (also other advanced enough Enteprise Data Warehouses and databases support SEMI JOINS, but I'll focus on BigQuery since it's the one I use the most these days): using semi joins.

A SEMI JOIN returns rows from the first table (left table) where one or more matches are found in the second table (right table), but it does not return rows from the second table. This can significantly improve query performance in certain scenarios.”

Let's look at an example query using their public datasets, the NYC taxi dataset, which contains a log of taxi trips in NYC.

Suppose that we want to know the distinct pickup_location_id values where both yellow and green taxis picked up clients in 2022.

One way to express this query might be the following:

SELECT DISTINCT yellow.pickup_location_id
  FROM `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2022` yellow
  join `bigquery-public-data.new_york_taxi_trips.tlc_green_trips_2022` green
  on yellow.pickup_location_id = green.pickup_location_id

If we look at the performance, on my project at the time of running this query's performance is the following (results might change a bit based on your google cloud project, slot availability, time of day, etc.):

  • Total elapsed time: 4min 45sec
  • Slot time consumed: 3h40min (A slot in BigQuery is a unit of computational capacity required to execute SQL queries)

If we look at the execution graph, we see the following Execution Graph, showing that the join took more than 4 minutes

The join is definitely the step that takes the longest! If we click the join step in the graph, BigQuery really helpfully shows us more information:

Join Detail, showing the join stats with a lot more rows produced than consumed

The UI shows that the join produced a lot more rows than it consumed, due to how the join was applied. You can also see that it uses a INNER HASH JOIN.

In theory, this is a very efficient kind of join, as it builds a hash table with the join keys of one table (the smaller one), and then probes the other table (the larger one) to find the row keys that have a match in the hash table.

The problem in this case is not much in how the join happens, but in what it produces. As you can peek from the screenshot above, the number of rows produced is 83 million! This is due to the many-to-many relationship that we have in this join, where a pickup_location_id can happen multiple times in either table.

In a sense, BigQuery here does way more than we need to, as it would be enough to find one match in the "green" table to consider the row in yellow valid. In other words, we need a SEMI JOIN.

A SEMI JOIN returns rows from the first table (left table) where one or more matches are found in the second table (right table), but it does not return rows from the second table.

How do we rewrite the previous query to make BigQuery use a semi join?

For this specific query we have (at least) three options, which all make use of the SEMI HASH JOIN in the query plan:

Option 1: use EXISTS

SELECT DISTINCT yellow.pickup_location_id
  FROM `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2022` yellow
  WHERE EXISTS (
    SELECT 1
    FROM `bigquery-public-data.new_york_taxi_trips.tlc_green_trips_2022` green
    WHERE green.pickup_location_id = yellow.pickup_location_id
  )

Option 2: use IN

SELECT DISTINCT yellow.pickup_location_id
  FROM `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2022` yellow
  WHERE yellow.pickup_location_id IN (
    SELECT DISTINCT green.pickup_location_id
    FROM `bigquery-public-data.new_york_taxi_trips.tlc_green_trips_2022` green
  )

Option 3: use INTERSECT DISTINCT

SELECT DISTINCT yellow.pickup_location_id
FROM `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2022` yellow
INTERSECT DISTINCT 
SELECT DISTINCT green.pickup_location_id
FROM `bigquery-public-data.new_york_taxi_trips.tlc_green_trips_2022` green

Comparison

The three options all return the same result of the original query, and they all produce a query plan with a semi hash join, with a much better performance:

  • Total elapsed time: 1sec
  • Slot time consumed: 40sec

While they generate a similarly shaped plan, and in this case produce the same result, they are not the same from a logical point of view.

EXISTS

Option 1 (EXISTS) is the most flexible, because it lets you write multiple predicates in the WHERE clause. So you can for example write:

SELECT DISTINCT yellow.pickup_location_id
  FROM `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2022` yellow
  WHERE EXISTS (
    SELECT 1
    FROM `bigquery-public-data.new_york_taxi_trips.tlc_green_trips_2022` green
    WHERE green.pickup_location_id = yellow.pickup_location_id
    -- add another predicate
    AND green.dropoff_location_id = yellow.dropoff_location_id
  )

This would still do a SEMI HASH JOIN, but now we also filter the rows so that the result returns locations where both the pickup and dropoff was the same.

From a performance point of view, even if this is fast, this still scans 83,869,625 rows in the join phase.

If we want to reduce the number of rows scanned, in this case we can do it with a WITH statement, like this:

WITH green AS (
  SELECT DISTINCT pickup_location_id
  FROM `bigquery-public-data.new_york_taxi_trips.tlc_green_trips_2022`
)
SELECT DISTINCT yellow.pickup_location_id
  FROM `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2022` yellow
  WHERE EXISTS (
    SELECT 1
    FROM green
    WHERE green.pickup_location_id = yellow.pickup_location_id
  )

This query usually performs a bit faster (for me it's around 900ms), uses slightly less slot seconds (for me about 30s), and scans less rows in the join phase (now 36,272,535). The main difference is that we do a DISTINCT before joining.

To reduce the join to a minimum, we can also do one more step and do a distinct on yellow too, and we get

WITH green AS (
  SELECT DISTINCT pickup_location_id
  FROM `bigquery-public-data.new_york_taxi_trips.tlc_green_trips_2022`
),
yellow AS (
  SELECT DISTINCT yellow.pickup_location_id
  FROM `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2022` yellow
)
SELECT DISTINCT yellow.pickup_location_id
  FROM yellow
  WHERE EXISTS (
    SELECT 1
    FROM green
    WHERE green.pickup_location_id = yellow.pickup_location_id
  )

Now the performance is even faster (for me around 800ms), uses even less slot seconds (about 20s), and more deterministically we can say that it only scans 16,813in the join phase.

IN

With IN we are a bit more constrained (unless we do ugly string concatenation things), as we can really only compare one element per statement.

So to express a query where we want pickup_location_id and dropoff_location_id to be the same you'd have to write:

SELECT DISTINCT yellow.pickup_location_id
  FROM `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2022` yellow
  WHERE yellow.pickup_location_id IN (
    SELECT DISTINCT green.pickup_location_id
    FROM `bigquery-public-data.new_york_taxi_trips.tlc_green_trips_2022` green
  )
  AND yellow.dropoff_location_id IN (
    SELECT DISTINCT green.dropoff_location_id
    FROM `bigquery-public-data.new_york_taxi_trips.tlc_green_trips_2022` green
  )

Since this uses two statements, in the JOIN step in the query it does two joins!

Join Detail, showing that it does two joins

The IN is more perfomant than the original EXISTS query written above, and is more comparable to the EXISTS query where we use WITH green AS... to do the select distinct. Also in the IN case we scan 36,272,535rows (like in the first improved exists).

We can reach a comparable performance to the second improved exists (the one with WITH green AS..., yellow AS...) if we do

WITH yellow AS (
  SELECT DISTINCT yellow.pickup_location_id
  FROM `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2022` yellow
)
SELECT DISTINCT yellow.pickup_location_id
FROM yellow
WHERE yellow.pickup_location_id IN (
  SELECT DISTINCT green.dropoff_location_id
  FROM `bigquery-public-data.new_york_taxi_trips.tlc_green_trips_2022` green
)

Also here, we scan 16,819 rows in the JOIN.

There is one more caveat with IN, as sometimes IN and EXISTS don't always produce the same result! See NOT IN and NOT EXISTS don't always produce the same result

INTERSECT DISTINCT

I'll admit it, I crafted the query so that intersect distinct would make the cut as well, as I particularly like it syntax wise.

INTERSECT DISTINCT in our case produces exactly the same result as IN and EXISTS but has a different limitation: the columns selected need to be the same in both tables AND the result needs to be distinct (there is no such thing as an INTERSECT ALL).

From a performance profile, this immediately produces the most efficient result, scanning only 16,819 rows in the JOIN (which again, is a SEMI HASH JOIN).

Conclusion

In conclusion, we saw that SEMI HASH JOIN can be a powerful optimization, especially when dealing with many-to-many relationships, and can be done in several ways. Each method has its strengths and limitations, but in the end the main optimization is the impact on the query plan of moving from a INNER to a SEMI join.