Scott Adams, the author of Dilbert, passed away last month. While his work will live on, I was curious about the best way to build a Dilbert search engine.
The first step is to extract the text. Pavan tested over half a dozen LLMs on ~30 Dilbert strips to see which one did the best.
Summary: Gemini 3 Flash performs best, and would cost ~$20 to process the entire Dilbert archive. If you want a local solution, Qwen 3 VL 32b is the strongest option.
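To put that $20 in perspective, here's a back-of-the-envelope per-strip cost, assuming the ~12,000-strip archive size mentioned at the end of this post:

```python
# Rough cost estimate: ~$20 total for the full ~12,000-strip Dilbert archive
total_cost_usd = 20
num_strips = 12_000

per_strip = total_cost_usd / num_strips
print(f"~${per_strip:.4f} per strip")  # ~$0.0017 per strip
```

So even at a few OCR retries per strip, the whole archive stays in the tens of dollars.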
| Model | Score (%) | Text (40) | Speaker (25) | Caps (15) | Panel (10) | Halluc. (10) |
|---|---|---|---|---|---|---|
| gemini-3-flash-preview | 99.3% | 39.9 | 24.4 | 15.0 | 10.0 | 10.0 |
| qwen3-vl-32b-instruct | 96.0% | 39.8 | 21.6 | 15.0 | 9.9 | 9.7 |
| llama-4-maverick | 85.1% | 38.5 | 16.3 | 13.2 | 9.1 | 8.1 |
| llama-4-scout | 84.1% | 39.0 | 16.4 | 12.5 | 8.7 | 7.5 |
| gemma-3-27b-it | 81.3% | 37.8 | 13.1 | 14.4 | 8.4 | 7.6 |
| nemotron-nano-12b-v2-vl-free | 81.3% | 38.6 | 13.1 | 14.4 | 8.5 | 6.6 |
| molmo-2-8b-free | 70.4% | 36.2 | 16.4 | 0.5 | 8.8 | 8.4 |
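The overall score appears to be the sum of the five category scores, out of 100 points (40 for text accuracy, 25 for speaker attribution, 15 for capitalization, 10 for panel segmentation, 10 for avoiding hallucinations). A minimal sketch of that weighting, assuming a plain sum (the exact rubric isn't published here):

```python
# Category maxima from the rubric above (sums to 100 points)
MAXES = {"text": 40, "speaker": 25, "caps": 15, "panel": 10, "halluc": 10}

def total_score(scores: dict[str, float]) -> float:
    """Sum the per-category points; since maxima total 100, the sum is a percentage."""
    assert set(scores) == set(MAXES), "missing or extra categories"
    for cat, pts in scores.items():
        assert 0 <= pts <= MAXES[cat], f"{cat} score out of range"
    return round(sum(scores.values()), 1)

gemini = {"text": 39.9, "speaker": 24.4, "caps": 15.0, "panel": 10.0, "halluc": 10.0}
print(total_score(gemini))  # 99.3, matching the table
```

The numbers in the table are consistent with this: each model's category points add up to its reported percentage.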
That accuracy of 99.3% is impressive. Here’s the biggest error it made:
- Dogbert: CHAPTER IV. “TIME MANAGEMENT”
- Dogbert: “ALWAYS POSTPONE MEETINGS WITH TIME-WASTING MORONS.”
- Dilbert: “HOW DO YOU DO THAT?”
- Dogbert: CAN I GET BACK TO YOU ON THAT?
Can you spot the error? The model attributed the text to Dogbert instead of the computer. (But you could argue that Dogbert is the one typing it…)
Here’s another error:
- Dilbert: I’VE DECIDED WE SHOULD OPERATE ALONG MORE CLASSIC LINES, LIKE DR. FRANKENSTEIN’S LAB.
- Dogbert: YOU KNOW WHAT THAT MAKES YOU?
- Dogbert: I’VE GOT A HUNCH…
- Dilbert: LET’S PRACTICE…
- Dilbert: DOGBERT, FETCH ME A BRAIN!
- Dogbert: LIKE YOUR PRESENT MODEL, OR ONE THAT WORKS?
Can you spot the error? In Panel 2, it’s Dilbert speaking, not Dogbert.
In fact, the only transcription errors Gemini 3 Flash made were writing “McDONALD’S” instead of “MCDONALD’S” (see panel 2), and dropping the line-break hyphen in “PRESEN-TATION” (see panel 4).
Qwen 3 VL 32b made almost as few errors. The bigger gap is in speaker detection, where the other models fall off steeply.
I spent 7 years typing out every one of the ~3,000 Calvin & Hobbes strips by hand. For these ~12,000 Dilbert strips, the same job might take a few hours and a few dollars.