
LLMs are smarter than us in many areas. How do we manage them?
This is not a new problem.
- VC partners evaluate deep-tech startups.
- Science editors review Nobel laureates.
- Managers manage specialist teams.
- Judges evaluate expert testimony.
- Coaches train Olympic athletes.
… and they manage and evaluate “smarter” outputs in many ways:
- Verify. Check against an “answer sheet”.
- Checklist. Evaluate against pre-defined criteria.
- Sampling. Randomly review a subset.
- Gating. Accept low-risk work automatically. Scrutinize critical work.
- Benchmark. Compare against others.
- Red-team. Probe to expose hidden flaws.
- Double-blind review. Mask identity to curb bias.
- Reproduce. Does re-running give the same output?
- Consensus. Aggregate multiple responses. Wisdom of crowds (sketched below).
- Outcome. Did it work in the real world?
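Some of these are cheap to automate. Here is a minimal Python sketch of Verify plus Consensus, assuming a hypothetical `llm_answer()` stand-in where a real model client would go:

```python
import random
from collections import Counter

def llm_answer(question: str) -> str:
    """Hypothetical stand-in for a real LLM call; swap in your own client."""
    return random.choice(["Paris", "Paris", "Lyon"])  # simulated noisy answers

def consensus(question: str, n: int = 5) -> str:
    """Consensus: ask several times, keep the majority answer."""
    votes = Counter(llm_answer(question) for _ in range(n))
    return votes.most_common(1)[0][0]

def verify(question: str, answer_sheet: dict[str, str]) -> bool:
    """Verify: compare the consensus answer against a known-good answer sheet."""
    return consensus(question) == answer_sheet[question]

print(verify("Capital of France?", {"Capital of France?": "Paris"}))
```

Majority voting won't catch a systematic error, but it filters out one-off slips.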
For example:
- Vibe coding: Non-programmers might glance at lint checks (Checklist) and see if it works (Outcome).
- LLM image designs: Developers might check whether a few images look good (Sampling) and ask a few marketers (Consensus).
- LLM news articles: A journalist might run a Checklist, do a Double-blind review with experts, and Verify critical facts (Gating).
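Sampling and Gating combine the same way: spot-check a random handful, and route anything high-stakes to a human. A rough sketch, with a made-up `risk` score standing in for whatever signal you actually trust:

```python
import random

# Hypothetical batch of LLM outputs, each with a made-up risk score.
outputs = [{"id": i, "risk": random.random()} for i in range(100)]

# Sampling: spot-check a random handful of the low-risk outputs.
low_risk = [o for o in outputs if o["risk"] < 0.8]
spot_checks = random.sample(low_risk, k=min(5, len(low_risk)))

# Gating: everything above the risk threshold goes to a human reviewer.
needs_review = [o for o in outputs if o["risk"] >= 0.8]

print(f"{len(spot_checks)} spot checks, {len(needs_review)} gated for human review")
```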
You already know many of these. You learnt them in Auditing. Statistics. Law. System controls. Policy analysis. Quality engineering. Clinical epidemiology. Investigative journalism. Design critique.
Worth brushing up these skills. They're even more important in the AI era.