Intro
DuckDB is an in-process OLAP database that is gaining significant traction in the Big Data world. While implemented in C++, it supports extending its capabilities with other languages, including Python. This functionality is exposed via the Function API in DuckDB. Assuming that we want to implement our UDFs in Python, we can break the work down into two orthogonal problems:
implementing the business logic (behavior) of a given UDF;
ensuring smooth integration (binding) between the DuckDB core and our UDF code.
In this post, we will focus on the latter.
Without further ado, let’s go!
Basics
Luckily, starting is extremely easy. DuckDB can infer binding parameters from idiomatic Python type hints. First, we define a type-hinted callable, then we register it in the DuckDB session via the create_function method. Consider the following example:
import duckdb
def repeat_str(v: str, n: int) -> str:
    return v * n

def f1_simple() -> None:
    conn = duckdb.connect(":memory:")
    conn.create_function("repeat_str", repeat_str)
    ret = conn.execute("SELECT repeat_str('my_str', 3) AS result").fetchall()
    print(ret)

if __name__ == "__main__":
    f1_simple()

This produces (as expected) a single row with a single varchar column, where the output column contains the input column repeated three times:

[('my_strmy_strmy_str',)]

These implicit bindings are supported for the most common scalar Python types: bool, float, int, str, and bytes/bytearray.
Of course, the developer is not forced to use the implicit behavior. Expected types can also be defined explicitly in the create_function call.
import duckdb
from duckdb.typing import VARCHAR, INTEGER
def repeat_str_typeless(v, n):
    return v * n

def f2_explicit() -> None:
    conn = duckdb.connect(":memory:")
    conn.create_function("repeat_str", repeat_str_typeless, [VARCHAR, INTEGER], VARCHAR)
    ret = conn.execute("SELECT repeat_str('my_str', 3) AS result").fetchall()
    print(ret)

if __name__ == "__main__":
    f2_explicit()

This produces the same output as the type-hinted code. We can also observe that the callable name is irrelevant when calling a UDF; all that matters is the name under which the callable was registered in the DuckDB session.
Complex types (arrays, structs)
Let’s inspect the types available in the duckdb.typing module:
>>> import duckdb.typing
>>> dir(duckdb.typing)
['BIGINT', 'BIT', 'BLOB', 'BOOLEAN', 'DATE', 'DOUBLE', 'DuckDBPyType', 'FLOAT', 'HUGEINT', 'INTEGER', 'INTERVAL', 'SMALLINT', 'SQLNULL', 'TIME', 'TIMESTAMP', 'TIMESTAMP_MS', 'TIMESTAMP_NS', 'TIMESTAMP_S', 'TIMESTAMP_TZ', 'TIME_TZ', 'TINYINT', 'UBIGINT', 'UHUGEINT', 'UINTEGER', 'USMALLINT', 'UTINYINT', 'UUID', 'VARCHAR', '__all__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__']

After a quick scan, we can conclude that only scalar types are exposed. What about arrays and structs?
Even though the docs do not describe this behavior, at the time of writing this post (with DuckDB version 1.1.2):
idiomatic type hints for lists are operational
non-standard, TypeScript-inspired generic type hints for dicts are operational
I am not sure whether this is intentional (the DuckDB docs do not mention this exact form); nevertheless, it works!
import duckdb
def reverse_list(v: list[int | str]) -> list[int | str]:
    return v[::-1]

def add_key_to_struct(v: {"a": str}) -> {"a": str, "b": int}:
    return {"a": v["a"], "b": len(v["a"])}

def f3_complex() -> None:
    conn = duckdb.connect(":memory:")
    conn.create_function("reverse_array", reverse_list)
    conn.create_function("add_key_to_struct", add_key_to_struct)
    ret = conn.execute("SELECT reverse_array([1,2,3]) AS result").fetchall()
    print(ret)
    ret = conn.execute("SELECT reverse_array(['a','b','c']) AS result").fetchall()
    print(ret)
    ret = conn.execute("SELECT add_key_to_struct({'a': 'duckdb'}) AS result").fetchall()
    print(ret)

if __name__ == "__main__":
    f3_complex()

[([3, 2, 1],)]
[(['c', 'b', 'a'],)]
[({'a': 'duckdb', 'b': 6},)]

For the type-hint-less approach, LIST and STRUCT types are not defined explicitly; instead, we need to wrap our intended type definition in DuckDBPyType.
import duckdb
from duckdb.typing import DuckDBPyType
def reverse_list_typeless(v):
    return v[::-1]

def add_key_to_struct_typeless(v):
    return {"a": v["a"], "b": len(v["a"])}

def f4_complex_typeless() -> None:
    conn = duckdb.connect(":memory:")
    LIST_TYPE = DuckDBPyType(list[int | str])
    STRUCT_INPUT_TYPE = DuckDBPyType({"a": str})
    STRUCT_OUTPUT_TYPE = DuckDBPyType({"a": str, "b": int})
    conn.create_function("reverse_array", reverse_list_typeless, [LIST_TYPE], LIST_TYPE)
    conn.create_function(
        "add_key_to_struct",
        add_key_to_struct_typeless,
        [STRUCT_INPUT_TYPE],
        STRUCT_OUTPUT_TYPE,
    )
    ret = conn.execute("SELECT reverse_array([1,2,3]) AS result").fetchall()
    print(ret)
    ret = conn.execute("SELECT reverse_array(['a','b','c']) AS result").fetchall()
    print(ret)
    ret = conn.execute("SELECT add_key_to_struct({'a': 'duckdb'}) AS result").fetchall()
    print(ret)

if __name__ == "__main__":
    f4_complex_typeless()
The output is 100% consistent with the earlier, type-hinted snippet.
Idempotence
By default, UDFs are marked as idempotent. This means that the DuckDB runtime is allowed to reuse call results for the same arguments. If your function is not idempotent (e.g. it relies on randomness), the outputs will be confusing.
import random
import duckdb
def random_func() -> float:
    return random.random()

def f5_idempotent_wrong():
    conn = duckdb.connect(":memory:")
    conn.create_function("random_func", random_func)
    ret = conn.execute("SELECT random_func() FROM (VALUES (1,),(2,),(3,))").fetchall()
    print(ret)

if __name__ == "__main__":
    f5_idempotent_wrong()

This gives the same “random” value for each row (the output below is just an example; exact values will change from run to run):

[(0.024425629043193342,), (0.024425629043193342,), (0.024425629043193342,)]

The improved version marks the UDF as not idempotent via the side_effects keyword:
def f6_idempotent_correct():
    conn = duckdb.connect(":memory:")
    conn.create_function("random_func", random_func, side_effects=True)
    ret = conn.execute("SELECT random_func() FROM (VALUES (1,),(2,),(3,))").fetchall()
    print(ret)

This correctly produces a different random value for each row (the output below is just an example; exact values will change from run to run):

[(0.5618496699562575,), (0.0006475665491705307,), (0.6747498448194491,)]

Please note that how your function behaves and how your function is registered are fully independent:
a deterministic function marked as deterministic (side_effects=False) and a non-deterministic function marked as non-deterministic (side_effects=True) are the desired configurations
a deterministic function marked as non-deterministic will work correctly, but performance will decrease due to unneeded function calls
a non-deterministic function marked as deterministic will work incorrectly and reuse cached values when that is not desired
Null handling
By default, the presence of a NULL argument skips the Python code call and returns NULL unconditionally.
import duckdb
def length_or_minus_one(maybe_null: str) -> int:
    if maybe_null is None:
        return -1
    return len(maybe_null)

def f7_null_handling():
    conn = duckdb.connect(":memory:")
    conn.create_function("length_or_minus_one", length_or_minus_one)
    ret = conn.execute(
        "SELECT length_or_minus_one('s'), length_or_minus_one(''), length_or_minus_one(NULL)"
    ).fetchall()
    print(ret)

if __name__ == "__main__":
    f7_null_handling()
# output: [(1, 0, None)]

As we can see from the UDF code, this was not intended, because NULL handling is built into our Python function. The improved version sets null_handling to "special", which forces the actual code to be executed:
def f7_null_handling_force_call():
    conn = duckdb.connect(":memory:")
    conn.create_function(
        "length_or_minus_one",
        length_or_minus_one,
        null_handling="special",
    )
    ret = conn.execute(
        "SELECT length_or_minus_one('s'), length_or_minus_one(''), length_or_minus_one(NULL)"
    ).fetchall()
    print(ret)

if __name__ == "__main__":
    f7_null_handling_force_call()
# output: [(1, 0, -1)]

We no longer return None for NULL input; -1 is returned instead.
Exceptions
A UDF, like any Python code, can raise exceptions. By default, the DuckDB runtime re-throws the exception:
def div_or_throw(a: int | float, b: int | float) -> float:
    return a / b

def f8_exception():
    conn = duckdb.connect(":memory:")
    conn.create_function("div_or_throw", div_or_throw)
    ret = conn.execute("SELECT div_or_throw(1, 2), div_or_throw(1, 0)").fetchall()
    print(ret)

if __name__ == "__main__":
    f8_exception()

If for any reason this is undesired, we can force the UDF to return NULL instead. This is driven by the exception_handling argument:
def f8_exception_return_null():
    conn = duckdb.connect(":memory:")
    conn.create_function("div_or_throw", div_or_throw, exception_handling="return_null")
    ret = conn.execute("SELECT div_or_throw(1, 2), div_or_throw(1, 0)").fetchall()
    print(ret)  # [(0.5, None)]

Vectorization
All examples discussed earlier executed a single function call for each row. We can confirm this behavior via code that tracks the total call count:
import duckdb
call_count = 0

def func_with_call_count(v: int) -> int:
    global call_count
    call_count += 1
    return call_count

def v9_vectorization_native():
    conn = duckdb.connect(":memory:")
    conn.create_function(
        "call_count_helper",
        func_with_call_count,
        side_effects=True,
    )
    ret = conn.execute(
        "SELECT call_count_helper(i) FROM generate_series(1, 100) s(i)"
    ).fetchall()
    print(ret)  # [(1,), (2,), (3,), (4,), ..., (97,), (98,), (99,), (100,)]

This per-row calling is often undesired because function call overhead is high. Instead, UDFs can be called on a batch of data. This vastly reduces overhead and can bring significant performance improvements. Batching can be achieved via the type argument. Please note that the pyarrow dependency needs to be installed for this to work.
call_count_vec = 0

def func_with_call_count_vectorized(v: int) -> int:
    print(v)
    global call_count_vec
    call_count_vec += 1
    return [call_count_vec for _ in v]

def v9_vectorization():
    conn = duckdb.connect(":memory:")
    conn.create_function(
        "call_count_helper",
        func_with_call_count_vectorized,
        type="arrow",
        side_effects=True,
    )
    ret = conn.execute(
        "SELECT call_count_helper(i) FROM generate_series(1, 100) s(i)"
    ).fetchall()
    print(ret)  # [(1,), (1,), ..., (1,), (1,)]
Let’s note the most important remarks:
The UDF body had to be changed. Since the input is a batch of data, the output needs to be a batch with the same row count as the input.
The UDF signature has not changed. This means that the type hints are inaccurate and do not properly model that the inputs are no longer scalar values (they are pyarrow.ChunkedArray instead).
This means that (as of DuckDB version 1.1.2) PyArrow-based UDFs and type inference for UDFs are mutually exclusive: you either get incorrect type-checking behavior inside your UDFs, or you are forced to define the types explicitly in the create_function call.
Summary and key learnings
DuckDB allows you to extend its capabilities with UDFs implemented in Python, enabling custom functionality beyond its built-in features.
Native UDFs provide a way to achieve functional parity with other OLAP systems, adding support for functions not natively available in DuckDB.
Vectorized UDFs enable high-performance custom code by processing data in batches, reducing function call overhead.
However, certain conventions for implementing Python UDFs can result in inaccurate or unsupported type hints, which may mislead other Python developers and automated code quality tools. When using these conventions, it’s advisable to include comments and type-checking exclusions.

