SQL is an amazing language, it lets you declaratively say what you want, and the engine figures out for you the best way to return it to you. Or should I say, it figures out the best way to return it to you given the information it has and the capabilities of the engine itself.
In this post, we’ll discuss a performance optimization technique for BigQuery (also other advanced enough Enteprise Data Warehouses and databases support SEMI JOINS, but I'll focus on BigQuery since it's the one I use the most these days): using semi joins.
A SEMI JOIN returns rows from the first table (left table) where one or more matches are found in the second table (right table), but it does not return rows from the second table. This can significantly improve query performance in certain scenarios.”
Let's look at an example query using their public datasets, the NYC taxi dataset, which contains a log of taxi trips in NYC.
Suppose that we want to know the distinct pickup_location_id values where both yellow and green taxis picked up clients in 2022.
One way to express this query might be the following:
SELECT DISTINCT yellow.pickup_location_id
FROM `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2022` yellow
join `bigquery-public-data.new_york_taxi_trips.tlc_green_trips_2022` green
on yellow.pickup_location_id = green.pickup_location_id
If we look at the performance, on my project at the time of running this query's performance is the following (results might change a bit based on your google cloud project, slot availability, time of day, etc.):
- Total elapsed time: 4min 45sec
- Slot time consumed: 3h40min (A slot in BigQuery is a unit of computational capacity required to execute SQL queries)
If we look at the execution graph, we see the following
The join is definitely the step that takes the longest! If we click the join step in the graph, BigQuery really helpfully shows us more information:
The UI shows that the join produced a lot more rows than it consumed, due to how the join was applied. You can also see that it uses a INNER HASH JOIN
.
In theory, this is a very efficient kind of join, as it builds a hash table with the join keys of one table (the smaller one), and then probes the other table (the larger one) to find the row keys that have a match in the hash table.
The problem in this case is not much in how the join happens, but in what it produces. As you can peek from the screenshot above, the number of rows produced is 83 million! This is due to the many-to-many relationship that we have in this join, where a pickup_location_id
can happen multiple times in either table.
In a sense, BigQuery here does way more than we need to, as it would be enough to find one match in the "green" table to consider the row in yellow valid. In other words, we need a SEMI JOIN
.
A
SEMI JOIN
returns rows from the first table (left table) where one or more matches are found in the second table (right table), but it does not return rows from the second table.
How do we rewrite the previous query to make BigQuery use a semi join?
For this specific query we have (at least) three options, which all make use of the SEMI HASH JOIN
in the query plan:
Option 1: use EXISTS
SELECT DISTINCT yellow.pickup_location_id
FROM `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2022` yellow
WHERE EXISTS (
SELECT 1
FROM `bigquery-public-data.new_york_taxi_trips.tlc_green_trips_2022` green
WHERE green.pickup_location_id = yellow.pickup_location_id
)
Option 2: use IN
SELECT DISTINCT yellow.pickup_location_id
FROM `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2022` yellow
WHERE yellow.pickup_location_id IN (
SELECT DISTINCT green.pickup_location_id
FROM `bigquery-public-data.new_york_taxi_trips.tlc_green_trips_2022` green
)
Option 3: use INTERSECT DISTINCT
SELECT DISTINCT yellow.pickup_location_id
FROM `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2022` yellow
INTERSECT DISTINCT
SELECT DISTINCT green.pickup_location_id
FROM `bigquery-public-data.new_york_taxi_trips.tlc_green_trips_2022` green
Comparison
The three options all return the same result of the original query, and they all produce a query plan with a semi hash join, with a much better performance:
- Total elapsed time: 1sec
- Slot time consumed: 40sec
While they generate a similarly shaped plan, and in this case produce the same result, they are not the same from a logical point of view.
EXISTS
Option 1 (EXISTS
) is the most flexible, because it lets you write multiple predicates in the WHERE clause. So you can for example write:
SELECT DISTINCT yellow.pickup_location_id
FROM `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2022` yellow
WHERE EXISTS (
SELECT 1
FROM `bigquery-public-data.new_york_taxi_trips.tlc_green_trips_2022` green
WHERE green.pickup_location_id = yellow.pickup_location_id
-- add another predicate
AND green.dropoff_location_id = yellow.dropoff_location_id
)
This would still do a SEMI HASH JOIN
, but now we also filter the rows so that the result returns locations where both the pickup and dropoff was the same.
From a performance point of view, even if this is fast, this still scans 83,869,625
rows in the join phase.
If we want to reduce the number of rows scanned, in this case we can do it with a WITH statement, like this:
WITH green AS (
SELECT DISTINCT pickup_location_id
FROM `bigquery-public-data.new_york_taxi_trips.tlc_green_trips_2022`
)
SELECT DISTINCT yellow.pickup_location_id
FROM `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2022` yellow
WHERE EXISTS (
SELECT 1
FROM green
WHERE green.pickup_location_id = yellow.pickup_location_id
)
This query usually performs a bit faster (for me it's around 900ms), uses slightly less slot seconds (for me about 30s), and scans less rows in the join phase (now 36,272,535
).
The main difference is that we do a DISTINCT
before joining.
To reduce the join to a minimum, we can also do one more step and do a distinct on yellow too, and we get
WITH green AS (
SELECT DISTINCT pickup_location_id
FROM `bigquery-public-data.new_york_taxi_trips.tlc_green_trips_2022`
),
yellow AS (
SELECT DISTINCT yellow.pickup_location_id
FROM `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2022` yellow
)
SELECT DISTINCT yellow.pickup_location_id
FROM yellow
WHERE EXISTS (
SELECT 1
FROM green
WHERE green.pickup_location_id = yellow.pickup_location_id
)
Now the performance is even faster (for me around 800ms), uses even less slot seconds (about 20s), and more deterministically we can say that it only scans 16,813
in the join phase.
IN
With IN
we are a bit more constrained (unless we do ugly string concatenation things), as we can really only compare one element per statement.
So to express a query where we want pickup_location_id
and dropoff_location_id
to be the same you'd have to write:
SELECT DISTINCT yellow.pickup_location_id
FROM `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2022` yellow
WHERE yellow.pickup_location_id IN (
SELECT DISTINCT green.pickup_location_id
FROM `bigquery-public-data.new_york_taxi_trips.tlc_green_trips_2022` green
)
AND yellow.dropoff_location_id IN (
SELECT DISTINCT green.dropoff_location_id
FROM `bigquery-public-data.new_york_taxi_trips.tlc_green_trips_2022` green
)
Since this uses two statements, in the JOIN
step in the query it does two joins!
The IN
is more perfomant than the original EXISTS
query written above, and is more comparable to the EXISTS
query where we use WITH green AS...
to do the select distinct. Also in the IN
case we scan 36,272,535
rows (like in the first improved exists).
We can reach a comparable performance to the second improved exists (the one with WITH green AS..., yellow AS...
) if we do
WITH yellow AS (
SELECT DISTINCT yellow.pickup_location_id
FROM `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2022` yellow
)
SELECT DISTINCT yellow.pickup_location_id
FROM yellow
WHERE yellow.pickup_location_id IN (
SELECT DISTINCT green.dropoff_location_id
FROM `bigquery-public-data.new_york_taxi_trips.tlc_green_trips_2022` green
)
Also here, we scan 16,819
rows in the JOIN.
There is one more caveat with IN
, as sometimes IN
and EXISTS
don't always produce the same result! See NOT IN and NOT EXISTS don't always produce the same result
INTERSECT DISTINCT
I'll admit it, I crafted the query so that intersect distinct would make the cut as well, as I particularly like it syntax wise.
INTERSECT DISTINCT
in our case produces exactly the same result as IN
and EXISTS
but has a different limitation: the columns selected need to be the same in both tables AND the result needs to be distinct (there is no such thing as an INTERSECT ALL
).
From a performance profile, this immediately produces the most efficient result, scanning only 16,819
rows in the JOIN (which again, is a SEMI HASH JOIN
).
Conclusion
In conclusion, we saw that SEMI HASH JOIN
can be a powerful optimization, especially when dealing with many-to-many relationships, and can be done in several ways. Each method has its strengths and limitations, but in the end the main optimization is the impact on the query plan of moving from a INNER
to a SEMI
join.