
I wrote about my changing AI opinions. At least some of this is because the industry is moving so fast that the bottlenecks keep shifting.
Here are four examples of how we AI couldn’t do something (the bottleneck), but that became possible, and the bottleneck shifted - changing the way we work.
It’s good to keep this in mind when thinking about AI.
Coding:
- “It can’t write useful code. We can’t get real help.”
- “It writes code but doesn’t know our codebase. We can’t let it touch real projects.”
- But in Feb 2024: Gemini 1.5 Pro has 1M-token context ~ 30K LOC". Cursor indexes code.
- “It understands the repo but can’t ship a fix on its own. We can’t hand it a whole issue.”
- But in Mar 2024: Devin solves 14% of SWE-bench - up from 2%.. Verified SWE-Bench is now 70%+.
- “It ships fixes, but we can’t review them fast enough or trust they’re stable.”
Agents
- “It does one step. We can’t chain actions.”
- “Every integration is bespoke. We can’t connect it to all our systems.”
- “It can act and connect, but over a long task its errors compound. We can’t trust a 20-step run.”
- Now: Mar 2025: METR finds autonomous task horizon doubling ~every 7 months. Reliability is a challenge.
- But Claude Mythos, with a ~16 hour reliable execution, might fix this.
Enterprise knowledge work
- “It only knows the public internet. We can’t use it on our own documents.”
- “It reads our documents but can’t fit enough of them. We can’t ask across the whole corpus.”
- But May 2023: Claude’s 100K-token context and Feb 2024: Gemini 1.5’s 1M tokens reduce chunking needs.
- “It runs on our data, but we can’t trust it without a way to measure when it’s silently wrong.”
- Now: the Morgan Stanley deployment relies on an eval framework - evals are the bottleneck.
Document processing
- “It needs thousands of labeled samples. We can’t stand up new doc types quickly.”
- “It learns fast but reads only text. We can’t handle scans, charts, and tables.”
- But Sep 2023: GPT-4V vision model and May 2024: GPT-4o native multimodal solved this.
- “It sees the page but can’t understand long, layout-heavy documents. We can’t trust it on real multi-page files.”
- Now: NeurIPS 2024: on MMLongBench-Doc, GPT-4o scored under ~50 on multi-page chart/table documents.
- But Gemini 3.5 Flash, GPT 5.5, Claude 4.8 Opus, etc. have excellent vision and need to be tested.