Speeding up Python code with mypyc

Compiling Python code for immediate speed gains

Nov 12, 2024

Intro

mypy and other type checkers are popular tools for ensuring code quality in modern production-grade Python code. A little-known fact is the existence of mypyc. Mypyc uses the same facilities as mypy to analyze Python code, but instead of reporting type violations, it can emit IR code which later gets compiled as C-extension. It’s used to compile mypy itself, as well as other popular tools in Python ecosystem, e.g. black.

Dabbling into compiled Python extensions was long on my TODO list, and mypyc is a great substep to achieve that.

Without further ado, let’s go!

Baseline

A toy example to demonstrate the solution will be the 2021/09 Advent of Code problem, which I solved some time ago and published to my GitHub.

Advent of Code problems have a couple of properties which make them suitable for this demonstration:

Problems resemble competitive programming problems, often promoting low-level implementation, native looping, etc. This style of code is notoriously slow and discouraged in Python
Problems are structured as two-step problems, where the second problem is often designed to uncover performance issues with naive implementation achieved in the step before. This tends to promote problems that are heavily CPU-bound, which (once again) are notoriously slow in Python

For subsequent steps, the code was revamped.

aoc_2021_09.py contains all logic:

import functools
import heapq


def get_input(path: str) -> list[list[int]]:
    with open(path) as fh:
        return [[*map(int, line.strip())] for line in fh]


def solve(heightmap: list[list[int]]) -> int:
    ret = 0
    num_rows = len(heightmap)
    num_cols = len(heightmap[0])
    for row_number, row in enumerate(heightmap):
        for col_number, value in enumerate(row):
            to_compare: set[int] = set()
            if row_number > 0:
                to_compare.add(heightmap[row_number - 1][col_number])
            if col_number > 0:
                to_compare.add(heightmap[row_number][col_number - 1])
            if row_number < num_rows - 1:
                to_compare.add(heightmap[row_number + 1][col_number])
            if col_number < num_cols - 1:
                to_compare.add(heightmap[row_number][col_number + 1])
            if all(c > value for c in to_compare):
                ret += value + 1
    return ret


def solve_p2(heightmap: list[list[int]]) -> int:
    # this is basically problem of finding connected components
    visited = [[False] * len(heightmap[0]) for _ in range(len(heightmap))]
    sizes: list[int] = []
    for row_number, row in enumerate(heightmap):
        for col_number, value in enumerate(row):
            size = dfs(visited, heightmap, row_number, col_number)
            if size > 0:
                sizes.append(size)
    return functools.reduce(lambda x, y: x * y, heapq.nlargest(3, sizes))


def dfs(
    visited: list[list[bool]],
    heightmap: list[list[int]],
    row_number: int,
    col_number: int,
) -> int:
    size = 0
    num_rows = len(heightmap)
    num_cols = len(heightmap[0])
    if not visited[row_number][col_number]:
        visited[row_number][col_number] = True
        if heightmap[row_number][col_number] == 9:
            return 0
        size += 1
        if row_number > 0:
            size += dfs(visited, heightmap, row_number - 1, col_number)
        if col_number > 0:
            size += dfs(visited, heightmap, row_number, col_number - 1)
        if row_number < num_rows - 1:
            size += dfs(visited, heightmap, row_number + 1, col_number)
        if col_number < num_cols - 1:
            size += dfs(visited, heightmap, row_number, col_number + 1)
    return size

runner.py contains just invocation:

import aoc_2021_09

my_input = aoc_2021_09.get_input("input.txt")


def benchmark():
    """Will be called by timeit"""
    aoc_2021_09.solve_p2(my_input)


if __name__ == "__main__":
    print(aoc_2021_09.__file__)
    print(aoc_2021_09.solve_p2(my_input))

The reader is left to explore the 2021/09 problem definition and solution code.

Let’s just check if we can run our code:

$ python runner.py 
C:\Users\rogal\repo\play-with-mypyc\aoc_2021_09.py
1050192

We can run the code and get the expected result. Awesome!

Compilation

Let’s start with the installation of mypy with relevant extras:

pip install mypy[mypyc]
...
Installing collected packages: mypy
Successfully installed mypy-1.13.0

Let’s check if our type model is accurate.

$ mypy --strict aoc_2021_09.py 
Success: no issues found in 1 source file

Perfect! Typically, violations found here will result in an aborted compilation process later.

The compilation is as simple as telling mypyc to compile module (-m) named aoc_2021_09:

$ mypyc -m aoc_2021_09
running build_ext
copying build\lib.win-amd64-cpython-312\aoc_2021_09.cp312-win_amd64.pyd ->

In our filesystem, we can notice a new file:

$ ls
__pycache__/  aoc_2021_09.cp312-win_amd64.pyd*  aoc_2021_09.py  build/  ex.txt  input.txt  runner.py

I am on Windows, so *.pyd file is created. On the Linux box, *.so would be created instead.

The created file is fully equivalent to the original *.py file and will be used by the interpreter during execution:

$ python runner.py
C:\Users\rogal\repo\play-with-mypyc\aoc_2021_09.cp312-win_amd64.pyd
1050192

Now, let’s take a look at some numbers!

Benchmarks

The simplest benchmarks (based on timeit) give following results on my laptop.

Run with native Python:

$ python -m timeit 'import runner; runner.benchmark()'
50 loops, best of 5: 7.74 msec per loop

Run with compiled extension:

$ python -m timeit 'import runner; runner.benchmark()'
500 loops, best of 5: 939 usec per loop

Numbers vary from run to run, but they consistently show up as 8x speed-up. As discussed earlier, the example shown here (CPU bound, highly recursive) is almost designed to produce the highest speed-up possible. In my experiments, I’ve seen outputs on the full spectrum:

performance degradations
performance parity
minor speedups
big improvements such as shown in the results above

In practical terms, performance impact must be measured case by case.

Next steps

Of course, toy projects are a great way to start, but they do not represent real-life problems accurately. The next step on my TODO list is to apply these learnings on bigger codebase and hopefully find some performance improvements there.

Wish me luck!

Are you interested in more content like this post?

If you believe that I omitted something essential here, let me know!

Join Łukasz Rogalski’s subscriber chat

Available in the Substack app and on web

Łukasz’s Substack

Discussion about this post