Mockable unit testing methodology completed only with BigQuery
BigQuery, do you use it? I am building data using BigQuery in my work. For quality assurance, I want to test against BigQuery SQL. In this article, I will introduce a unit testing method that can be completed only with BigQuery and can replace Mock data.
Motivation
To put it simply, we want to check whether degregation occurs when modifying BigQuery SQL.
We write BigQuery SQL in our work, but we don't have unit tests for it.
When developing applications in languages such as Python or Javascript, it is common to write unit tests using unit testing frameworks such as XUnit. However, I have not heard of unit tests for SQL (as far as I have observed).
It seems that you can use a tool called dbt (Data Build Tool) to apply unit tests to SQL, but I don't know the pros and cons of that. (Out of curiosity, I'd like to give it a try).
Instead, I was wondering if I could write unit tests for SQL using the standard techniques of the language, rather than learning a new library/tool.
So I came up with my own unit test method for BigQuery SQL.
xUnit
xUnit is a framework for unit testing.
xUnit is the collective name for several unit testing frameworks that derive their structure and functionality from Smalltalk's SUnit.
xUnit has influenced each programming language. It allows unit testing of JUnit for Java and unittest (strictly inspired by JUnit) for Python.
xUnit x Python
Before BigQuery, we first introduce unit tests using Python.
Suppose you have a function get_less_than
, which get_less_than
fruits that cost less than 100 yen.
# fruits.py
class Fruits(object):
def _get_fruits(self):
return [
{"price": 130, "name": "apple"},
{"price": 120, "name": "banana"},
{"price": 110, "name": "grape"},
{"price": 100, "name": "lemon"},
{"price": 90, "name": "orange"},
]
def get_less_than(self, price):
fruits = self._get_fruits()
return list(filter(lambda x: x["price"] < price, fruits))
The _get_fruits
function is a simplified one, where the actual data is referenced.
If you were to write unit tests for the get_less_than
function, the code would look like this.
# test_fruits.py
import unittest
from unittest.mock import patch
from fruits import Fruits
class TestFruits(unittest.TestCase):
@patch("fruits.Fruits._get_fruits")
def test_get_less_than(self, mock_get_fruits):
# Arrange
mock_get_fruits.return_value = [
{"price": 110, "name": "apple"},
{"price": 100, "name": "banana"},
{"price": 90, "name": "orange"},
]
fruits = Fruits()
# Act
actual = fruits.get_less_than(price=100)
# Assert
assert len(actual) == 1
[Best Practices for Writing Unit Tests#Placement of tests](https://docs.microsoft.com/ja-jp/dotnet/core/testing/unit-testing-best-practices#arranging-your- tests), unit tests are easier to understand if they are written in the following three phases.
- Arrange: placement, preparation
- Act: execute Assert: verification
Unit tests can be run with the command
$ python -m unittest test_fruits
.
---
Ran 1 test in 0.001s
OK.
The test succeeded. Now let's make it fail.
Change the get_less_than
logic in the fruits.py code to the wrong one.
The code is as follows.
# fruits.py
class Fruits(object):
...
def get_less_than(self, price):
fruits = self._get_fruits()
# return list(filter(lambda x: x["price"] < price, fruits))
return list(filter(lambda x: x["price"] <= price, fruits))
With this in place, let's run the unit test we just ran.
$ python3 -m unittest test_fruits
F
======================================================================
FAIL: test_get_less_than (test_fruits.TestFruits)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/unittest/mock.py", line 1348, in patched
return func(*newargs, **newkeywargs)
File "test_fruits.py", line 22, in test_get_less_than
assert len(actual) == 1
AssertionError
----------------------------------------------------------------------
Ran 1 test in 0.005s
FAILED (failures=1)
Failed, as expected. Missing logic is almost certain to occur in the course of continued maintenance. When this happens, it is important to be able to recognise that logic is wrong, i.e. to recognise degreasing. One of the advantages of writing unit tests is that you can detect degrees.
xUnit x BigQuery
How do we test BigQuery? First, you need to prepare your source data for writing BigQuery SQL.
-- fruits_ct.sql
-- destination: shop
-- dataset: shop
-- table: fruits
CREATE OR REPLACE TABLE
shop.fruits AS
SELECT
price, name
SELECT price, name
FROM (
SELECT
130 AS price, "apple" AS name
"apple" AS name
UNION ALL
SELECT
120 AS price, "apple" AS name
"banana" AS name
UNION ALL
SELECT
110 AS price, "grape" AS name
"grape" AS name
UNION ALL
SELECT
100 AS price, "lemon" AS name
"lemon" AS name
UNION ALL
SELECT
90 AS price, "orange" AS name )
"orange" AS name )
From the shop.fruits table, prepare the SQL equivalent of the get_less_than
function from the Python code above.
-- fruits_less_than_100.sql
-- destination.
-- dataset: shop
-- table: fruits_less_than_100
SELECT
price, name
SELECT price, name
FROM
shop.fruits
WHERE
price < 100
Write unit tests against the output fruits_less_than_100
table.
-- test_fruits_less_than_100.sql
ASSERT
(
SELECT
COUNT(name) = 1
FROM
shop.fruits_less_than_100) AS "There's one fruit less than 100."
This test ensured that for the fruits_less_than_100
table
- There's one fruit with a price less than 100.
However, this is data generated by reference to real data (shop.fruits table). Therefore, the following problems exist
- The test is unstable because it refers to real data
- The test fails if
orange
is missing in the real data
- The test fails if
- Long feedback cycles.
- The more real data there is, the longer the test execution time becomes.
Therefore, table functions are used to replace real data with mock data.
What is a table function?
From the official BigQuery page, the following is written about table functions.
A table function (also known as a table value function or TVF) is a user-defined function that returns a table. Table functions can be used wherever tables can be used. Table functions are similar to views, but table functions can take parameters.
The definition of a table function has the following sample code.
-- names_by_year.tvf.sql
CREATE OR REPLACE TABLE
FUNCTION mydataset.names_by_year(y INT64) AS
SELECT
YEAR,
name,
SUM(number) AS total
FROM
`bigquery-public-data.usa_names.usa_1910_current`
WHERE
year = y
GROUP BY
year, y
NAME
Features CREATE OR REPLACE TABLE FUNCTION
. It defines a function and takes the table information as the return value. In this case, it is a table with fields year, name, total
.
The defined table function is used as follows.
-- names_by_year.sql
SELECT
*
FROM
mydataset.names_by_year(1950)
ORDER BY
total DESC
LIMIT
5
Table functions can be used in the FROM clause
. This is where I focused my attention.
I wondered if it would be possible to return mock data when calling a function in a FROM clause.
I found a means to do this in the following chapter.
xUnit x BigQuery (Mock)
How to replace the data is to turn it into a table function in the following two orders.
- make the original data into a table function so that it can be replaced by Mock
- replace the FROM clause of the SQL containing the logic with the table function in (1) and convert it to a table function.
The following steps are explained in order.
---The first is to replace the source data' with the
source data'.
The first step is to convert the `original data into a table function so that it can be replaced with Mock.
The original data is the shop.fruits
table, so it is converted to a table function so that it can be replaced by Mock.
The following code is a Mock replaceable table function.
-- fruits_tvf.sql
-- function:.
-- - shop.fruits_inject().
-- - shop.fruits(is_test BOOL)
CREATE OR REPLACE TABLE
FUNCTION shop.fruits_inject() AS
SELECT
*
FROM (
SELECT
130 AS price,
"apple" AS name )
WHERE
1 <> 1;
CREATE OR REPLACE TABLE
FUNCTION shop.fruits(is_test BOOL) AS
SELECT
name,
SELECT name, price
FROM
shop.fruits
WHERE
NOT is_test
UNION ALL
SELECT
name,
SELECT name, price
FROM
shop.fruits_inject()
WHERE
is_test
This SQL contains two statements.
- definition of the table function shop.fruits_inject
- definition of the table function shop.fruits
The first is a function for later Mock replacement (overwriting).
The second is a function with the value is_test BOOL
as an argument, with the following conditional branch.
- If
is_test
is True- Mock data (shop.fruits_inject) is returned
- If is_test is False
- Real data (shop.fruits) is returned.
It is assumed that is_test is False for production code and is_test is True for test code.
The second one is replacing the FROM clause of SQL containing logic with the table function in (1) to make it a table function
.
The SQL containing logic, in this case the fruits_less_than_100
table, is converted to a function.
The following code is the table function replaced by the table function in (1).
-- fruits_less_than_tvf.sql
-- function:
-- - shop.fruits_less_than(is_test BOOL, p INT)
CREATE OR REPLACE TABLE
FUNCTION shop.fruits_less_than(is_test BOOL,
p INT) AS
SELECT
price, name, p INT
FUNCTION shop.fruits_less_than(is_test BOOL p INT)
FROM
shop.fruits(is_test)
WHERE
price < p
In the FROM clause, call the shop.fruits table function defined in item (i) above.
The arguments of the shop.fruits table function are set to the is_test BOOL
passed from the shop.fruits_less_than function.
The shop.fruits_less_than function also has an argument p INT
to allow flexibility in the price of the less_than.
This completes the table functionisation. Next, I will show you the product code and test code.
Let's write the production code first.
-- fruits_less_than_100.sql
-- destination.
-- dataset: shop
-- table: fruits_less_than_100
SELECT
price,
name,
FROM
shop.fruits_less_than(False, 100)
There is nothing important to take away. The production code sets is_test to False.
The following is the main part of this article, the unit test code for BigQuery's SQL.
-- test_fruits_less_than_100.sql
-- arrange
CREATE OR REPLACE TABLE
FUNCTION shop.fruits_inject() AS
SELECT
110 AS price,
"grape" AS name
UNION ALL
SELECT
100 AS price, "lemon" AS name
"lemon" AS name
UNION ALL
SELECT
90 AS price, "lemon" AS name; "orange" AS name; "lemon" AS name
"orange" AS name;
-- ACT
CREATE TEMP TABLE fruits_less_than_100 AS
SELECT
price,
name,
FROM
shop.fruits_less_than(True,
100);
-- assert
ASSERT
(
SELECT
COUNT(name) = 1
FROM
fruits_less_than_100) AS "There's one fruit less than 100."
This SQL has three statements.
- shop.fruits_inject function override (mock insert)
- shop.fruits_less_than function, generated with is_test=True and saved in a temporary table
- assertion against the temporary table in (ii)
Importantly, number (i) overrides the shop.fruits_inject function. Then the shop.fruits_inject function result referenced by the shop.fruits function is switched to the overwritten one.
Afterwards, call the shop.fruits_less_than function at (ii) to save the temporary table, and make an assertion at (iii).
Now let's make the test fail. Rewrite the shop.fruits_less_than function, which contains the logic, as follows.
-- fruits_less_than_tvf.sql
-- function:
-- - shop.fruits_less_than(is_test BOOL, p INT)
CREATE OR REPLACE TABLE
FUNCTION shop.fruits_less_than(is_test BOOL,
p INT) AS
SELECT
price, name, p INT
FUNCTION shop.fruits_less_than(is_test BOOL p INT)
FROM
shop.fruits(is_test)
WHERE
-- price < p
price <= p
Since the price in the WHERE clause is less than or equal to 100 instead of less than 100, two items are extracted, orange
and lemon
.
Let's redefine this function and run the test again. If you do so, you will get the following message.
Query error: there's one fruit less than 100. at [25:1].
As expected, the test failed! 🎉
Once you are able to unit test BigQuery, you will next want to incorporate it into your CI. Since BigQuery is a GCP service, we will make use of Cloud Build, a well-known GCP CI service. The definition file for Cloud Build is shown below.
-- cloudbuild.yaml
steps:
# If you want to receive notifications from Slack, uncomment it.
# Then follow the README.md at the following URL to set it up.
# https://github.com/GoogleCloudPlatform/cloud-builders-community/tree/master/slackbot
# - name: "gcr.io/$PROJECT_ID/slackbot"
# id: "WATCH"
# args: [
# "--build", "$BUILD_ID",
# "--webhook", "$_SLACK_WEB_HOOK"
# ]
- name: gcr.io/cloud-builders/gcloud-slim
id: fruits_ct
entrypoint: 'bash'
args: ['run_query.sh', 'fruits_ct.sql']
- name: gcr.io/cloud-builders/gcloud-slim
id: fruits_tvf
entrypoint: 'bash'
args: ['run_query.sh', 'fruits_tvf.sql']
waitFor:
- fruits_ct
- name: gcr.io/cloud-builders/gcloud-slim
id: fruits_less_than_tvf
entrypoint: 'bash'
args: ['run_query.sh', 'fruits_less_than_tvf.sql']
waitFor:
- fruits_tvf
- name: gcr.io/cloud-builders/gcloud-slim
id: test_fruits_less_than_100
entrypoint: 'bash'
args: ['run_query.sh', 'test_fruits_less_than_100.sql']
waitFor:
- fruits_less_than_tvf
#! /bin/bash
# run_query.sh
if [ $# -eq 0 ]; then
echo "No arguments supplied"
fi
bq query --use_legacy_sql=false < $1
If you integrate Cloud Build with Git platforms such as Github, you can run unit tests on every Git event (Push, Merge, etc.).
- The
id:fruits_ct
,id:fruits_tvf
andid:fruits_less_than_tvf
in the sample cloudbuild.yaml are not meant to run many times, I write this for clarity of the article.
xUnit x BigQuery advantages and disadvantages
This section lists the advantages and disadvantages of BigQuery's Mock replaceable unit tests.
- Advantages.
- You only need to learn BigQuery.
- Complete with only the standard BigQuery functionality
- Short feedback cycle
- Because it does not refer to real data, but mock data
- Testing is stable
- Because it references mock data instead of real data
- You only need to learn BigQuery.
- Disadvantages
- Need to control SQL execution order of BigQuery
- In this case, it is necessary to execute queries in the following order
-
- fruits_tvf.sql
-
- fruits_less_than_tvf.sql
-
- test_fruits_less_than_100.sql
-
- Workaround.
- Use BigQuery Scripting
- Use
waitFor
in Cloud Build
- In this case, it is necessary to execute queries in the following order
- Unexpected behaviour when running in parallel
- Define table functions with `CREATE OR REPLACE TABLE FUNCTION
- shop.fruits_inject function is redefined many times
CREATE TEMP TABLE FUNCTION
is not supported- Workaround.
- Use BigQuery transactions
- Table functions not supported
- Use BigQuery transactions
- Define table functions with `CREATE OR REPLACE TABLE FUNCTION
- Not versatile
- Cannot use the same technique (table functions) for other database engines (MySQL, PostgreSQL, etc.)
- Workaround
- Use dbt (maybe).
- Workaround
- Cannot use the same technique (table functions) for other database engines (MySQL, PostgreSQL, etc.)
- Need to control SQL execution order of BigQuery
Conclusion
This time, we found that we can write unit tests for BigQuery SQL by using this technique. This makes it possible to develop safely while ensuring a certain level of quality for the data.
Other testing techniques you may also want to consider include
- Data model testing with dbt
- Policy testing with Open Policy Agent (OPA)
- SQL decomposition into AST → JSON and testing with OPA
- Unit testing using BigQuery client library
P.S.
I would like to try dbt.
Share
Related tags
- Trying Out container-structure-test on a Docker Image
- Integration Testing for Next.js and DB using Testcontainers
- Trying Million Lint
- Trying the Playwright Component Test
- I want to do Document Testing with Typescript
- How to Mock tRPC Communication on Storybook with MSW
- E2E Testing with Cucumber and Screenplay Design
- Comprehensive Testing Pattern Guide for Web Frontend
- Trying ArchUnit with Typescript
- Investigated and Summarized Testing Perspectives for Web Apps (25 Selections)
- I want to test with Google Apps Script too! (Clasp + Typescript + Jest)
- Things to Keep in Mind When Writing Unit Tests
- Continuously Monitor Visual Regression Tests with CircleCI + BackstopJS (Puppeteer)
- Getting Started with Feature Flags in Unleash
- Automating Synchronization with GitHub Actions and Pull Requests
- The GraphQL Guild Ecosystem is Convenient, Isn't It?
- Crawlee is Handy for Quick Crawling
- Self-hosting a Cache Server with turborepo-remote-cache
- Trying Data Transformation with urql
- How to display Twitter embedded content in an iframe after rendering
- Introduction to LLVM - Compiling JavaScript to LLVM (Rust:inkwell) JIT
- Memo Micro Frontends
- Want to be a Virtual Beauty on Mac! (Zoom + Gachikoe + 3Tene or Reality)