Proving Code Works With Z3

At the PyCon SG Education Summit today, Melvin’s lighting talk on “Writing Proofs in Python” began with a subtle bug in this mid-point calculation (often used in binary search or sort) in languages like Java, C/C++, Go, etc.

low = ...
high = ...
mid = (low + high) / 2

Since the integers are fixed-width, this triggers an overflow when low + high exceeds the maximum integer value.

Even popular libraries like Pandas had this bug until 2019. In fact, even Python’s native list.sort() had this sort of bug until 2015! Read the details.

Z3 can find counterexamples

Melvin showed how to use Z3 - a popular theorem prover library - to prove or find errors in code.

from z3 import *

low, high = BitVecs("low high", 32)
s = Solver()
s.add(low >= 0, high >= 0, low <= high)
mid = (low + high) / 2   # signed 32-bit arithmetic

# Find a case where the "obvious" midpoint is wrong
s.add(mid < low)

print(s.check())
print(s.model())
# Prints something like [high = 2142306236, low = 47120451]

The output has found a case where the midpoint is less than low, which is clearly wrong.

ChatGPT can run Z3 to find issues

I had a conversation with ChatGPT that ran like this:

Me: Prove something with Z3 and teach me.
ChatGPT: OK. Here’s how you find the midpoint overflow error.
Me: How can Z3 help me?
ChatGPT: Find edge cases, counter-examples, refactor safely, validate rules, write tests, model workflows, etc.
Me: Any alternatives to Z3?
ChatGPT: Z3 is best, but consider Dafny for production code, Alloy for data, TLA+ for workflows, …
Me: Any popular Python tutorials have Z3-detectable bugs?
ChatGPT: Several! GeeksforGeeks’ Interpolation search hits a ZeroDivisionError; in Programiz’ Bucket sort, the bucket index can exceed the number of buckets, …
Me: Find production Python code with Z3-detectable bugs.
ChatGPT: Sure! There are plenty of ZeroDivisionErrors in networkx, rich, seaborn, xarray, plotly, matplotlib, …
Me: Find real errors - not nitpicky ones!
ChatGPT: OK, xarray, branca, mizani, shap, matplotlib, …
Me: If these don’t have existing bugs filed, write an issue.
ChatGPT: Sure! Here they are.

Some issues are not bad

One issue was quite relevant. I raised #222 on python-visualization/branca which roughly says:

LinearColormap(["red", "blue"], vmin=0, vmax=1).to_step(n=1) fails. But there are times when I just have one bucket/class to display, so shouldn’t this work?

Another looked pretty relevant, too. I raised #11397 on pydata/xarray which roughly says:

np.linspace(0, 1, num=1) works. Shouldn’t RangeIndex.linspace(0, 1, num=1, dim="x") do the same?

But not all issues are worth reporting

Maybe not. The branca LinearColorMap issue feels more real to me than the xarray RangeIndex.linspace one - maybe because I’ve faced it.

Maybe we could do this:

When you find a bug, use Gen AI to report it. It’s a real need.
If Gen AI finds a bug, report it only if you’ll really need it.

Still, Z3 + ChatGPT is quite effective

When testing your own code, the ability to prove it correct or find counterexamples is very powerful.

Since you don’t need to know how to use these tools (AI does it for you), the cost of using these is very low.

This adds a useful layer of defense against vibe coded technical debt.

Z3 can find counterexamples#

ChatGPT can run Z3 to find issues#

Some issues are not bad#

But not all issues are worth reporting#

Still, Z3 + ChatGPT is quite effective#

Z3 can find counterexamples

ChatGPT can run Z3 to find issues

Some issues are not bad

But not all issues are worth reporting

Still, Z3 + ChatGPT is quite effective