I’m off for a 10-day Vipassana meditation program. WHAT? A 10-day residential meditation. No phone, laptop, or speaking. https://www.dhamma.org/ WHEN? From today until next Sunday (13 July) WHERE? Near Chennai. https://maps.app.goo.gl/PnGkLoZ8U6aG2RKk8 WHY? I’ve heard good things and am curious. SURE? I’ve never been away from tech for this long. Let’s see! I’ve scheduled LinkedIn posts, so you’ll still see stuff. But I won’t be replying. LinkedIn

If someone asked me, “What’s changed this year in LLMs”, here’s my list:" Prompt engineering is out. Evals are in. https://www.linkedin.com/feed/update/urn%3Ali%3Ashare%3A7335146366681194496/ Hallucinations are fewer and solvable by double-checking. https://www.linkedin.com/feed/update/urn%3Ali%3Ashare%3A7326902628490059776/ LLMs are great for throwaway code / tools. https://www.linkedin.com/feed/update/urn%3Ali%3AugcPost%3A7319277426029539329/ LLMs can analyze data. No more Excel. https://www.linkedin.com/feed/update/urn%3Ali%3Aactivity%3A7345062233996988417/ LLMs are good psychologists. https://www.linkedin.com/feed/update/urn%3Ali%3Ashare%3A7326504476712808449/ Image generation is much better. https://www.linkedin.com/feed/update/urn%3Ali%3AugcPost%3A7304716144379076608/ LLMs can speak well enough to co-host a panel. https://www.linkedin.com/feed/update/urn%3Ali%3Ashare%3A7283025621503356930/ … and create podcasts. https://www.linkedin.com/feed/update/urn%3Ali%3Ashare%3A7326544867734540288/ But: LLMs are still not great at slides. https://www.linkedin.com/feed/update/urn%3Ali%3Ashare%3A7311066572113002497/ LLMs still can’t follow a data visualization style guide. LLMs can’t yet create good sketch notes. Apr 2026: With Nano Banana, they can, and Nano Banana Pro doesn’t even make spelling mistakes. LLMs still draw bounding boxes as well as specialized models. ~Agents (LLMs running tools in a loop) can think only for 6 min. Apr 2026: With Opus 4.6 and GPT 5.4, agents run for several hours independently. What’s on your list of things LLMs still can’t do? ...

LLMs are smarter than us in many areas. How do we control them? It’s not a new problem. VC partners evaluate deep-tech startups. Science editors review Nobel laureates. Managers manage specialist teams. Judges evaluate expert testimony. Coaches train Olympic athletes. … and they manage and evaluate “smarter” outputs in many ways: Verify. Check against an “answer sheet”. Checklist. Evaluate against pre-defined criteria. Sampling. Randomly review a subset. Gating. Accept low-risk work. Evaluate critical ones. Benchmark. Compare against others. Red-team. Probe to expose hidden flaws. Double-blind review. Mask identity to curb bias. Reproduce. Re-running gives the same output? Consensus. Ask many. Wisdom of crowds. Outcome. Did it work in the real world? For example, you can apply them to: ...

How To Control Smarter Intelligences

LLMs are smarter than us in many areas. How do we manage them? This is not a new problem. VC partners evaluate deep-tech startups. Science editors review Nobel laureates. Managers manage specialist teams. Judges evaluate expert testimony. Coaches train Olympic athletes. … and they manage and evaluate “smarter” outputs in many ways: Verify. Check against an “answer sheet”. Checklist. Evaluate against pre-defined criteria. Sampling. Randomly review a subset. Gating. Accept low-risk work. Evaluate critical ones. Benchmark. Compare against others. Red-team. Probe to expose hidden flaws. Double-blind review. Mask identity to curb bias. Reproduce. Re-running gives the same output? Consensus. Aggregate multiple responses. Wisdom of crowds. Outcome. Did it work in the real world? For example: ...

I catch up on long WhatsApp group discussions as podcasts. The quick way is to scroll on WhatsApp Web, select all, paste into NotebookLM, and create the podcast. Mine is a bit more complicated. Here’s an example: Use a bookmarklet to scrape the messages https://tools.s-anand.net/whatsappscraper/ Generate a 2-person script https://github.com/sanand0/generative-ai-group/blob/main/config.toml Have gpt-4o-mini-tts convert each line using a different voice https://www.openai.fm/ Combine using ffmpeg https://ffmpeg.org/ Publish on GitHub Releases https://github.com/sanand0/generative-ai-group/releases/tag/main I run this every week. So far, it’s proved quite enlightening. ...

My Goals Bingo as of Q2 2025

In 2025, I’m playing Goals Bingo. I want to complete one row or column of these goals. Here’s my status from Jan – Jun 2025. 🟢 indicates I’m on track and likely to complete. 🟡 indicates I’m behind but I may be able to hit it. 🔴 indicates I’m behind and it’s looking hard. Domain Repeat Stretch New People 🟢 Better husband. Going OK 🟡 Meet all first cousins. 8/14 🟢 Interview 10 experts. 11/10 🟡 Live with a stranger. Tried homestay - doesn’t count Education 🔴 50 books. 6/50 🟡 Teach 5,000 students. ~1,500 🟡 Run a course only with AI. Ran a workshop with AI Technology 🟢 20 data stories. 10/20 🔴 LLM Foundry: 5K MaU. 2.2K MaU. 🟡 Build a robot. No progress. 🟢 Co-present with an AI. Done Health 🟢 300 days of yoga. 183/183 days 🟡 80 heart points/day. Far from it 🔴 Bike 1,000 km 300 hrs. Far from it 🟢 Vipassana. 2 Jul 2025 Wealth 🔴 Buy low. No progress. 🔴 Beat inflation 5%. Not started. 🟡 Donate $10K. Ideating. 🔴 Fund a startup. Not started. At the moment, there’s no row or column that looks like a definite win. ...

We created data visualizations just using LLMs at my VizChitra workshop yesterday. Titled Prompt to Plot, it covered: Finding a dataset Ideating what to do with it Analyzing the data Visualizing the data Publishing it on GitHub … using only LLM tools like #ChatGPT, #Claude, #Jules, #Codex, etc. with zero manual coding, analysis, or story writing. Here’re 6 stories completed during the 3-hour workshop: Spotify Data Stories: https://rishabhmakes.github.io/llm-dataviz/ The Price of Perfection: https://coffee-reviews.prayashm.com/ The Anatomy of Unrest: https://story-b0f1c.web.app/ The Page Turner’s Paradox: https://devanshikat.github.io/BooksVis/ Do Readers Love Long Books? https://nchandrasekharr.github.io/booksviz/ Books Viz: https://rasagy.in/books-viz/ The material is online. Try it! ...

My VizChitra talk on Data Design by Dialog was on LLMs helping in every stage of data storytelling. Main takeaways: After open data, LLMs may the single biggest act of data democratization. https://youtu.be/hPH5_ulHtno?t=01m24s LLMs can help in every step of the (data) value chain. https://youtu.be/hPH5_ulHtno?t=00m47s LLMs are bad with numbers. Have them write code instead. https://youtu.be/hPH5_ulHtno?t=06m33s Don’t confuse it. Just ask it again. https://youtu.be/hPH5_ulHtno?t=05m30s If it doesn’t work, throw it away and redo it. https://youtu.be/hPH5_ulHtno?t=20m02s Keep an impossibility list. Revisit it whenever a new model drops. https://youtu.be/hPH5_ulHtno?t=20m02s Never ask for just one output from an LLM. Ask for a dozen. https://youtu.be/hPH5_ulHtno?t=22m20s Our imagination is the limit. https://youtu.be/hPH5_ulHtno?t=26m35s Two years ago, they were like grade 8 students. Today, a postgraduate. https://youtu.be/hPH5_ulHtno?t=00m47s Do as little as possible. Just wait. Models will catch up. https://youtu.be/hPH5_ulHtno?t=31m45s Funny bits: ...

How long have you made ChatGPT think? My highest was 6m 50s, with the question: Here are vehicle telematics stats for 2 months. Unzip it and take a look. Find interesting insights from this data. Look hard until you find at least 5 surprising insights from this. The next largest thinking block (5m 42s) was where I asked: I would like to explore parallels to the current phenomenon where intelligence is becoming too cheap to meter. Historically, both in recent history as well as over ancient history, what technologies have made what kind of tasks so cheap that they are too cheap to meter? Give me a wide range of examples ...

How long can I make ChatGPT think?

Jason Clarke’s Import AI 414 shares a Tech Tale about a game called “Go Think”: … we’d take turns asking questions and then we’d see how long the machine had to think for and whoever asked the question that took the longest won. I prompted Claude Code to write a library for this. (Cost: $2.30). (FYI, this takes 2.3 seconds in NodeJS and 4.2 seconds in Python. A clear gap for JSON parsing.) ...

Things I Learned - 29 Jun 2025

This week, I learned: “People are great at feedback on what you are doing wrong. They are not so good at telling you how to fix it. They don’t know you that well.” Amit Kapoor Perfect Cursors makes periodic cursor positions animate smoothly by interpolating on a spline** CloudFlare and Vercel now support sandboxes where you can execute code. The price is not so low that we can execute for free in bulk but works well infrequent or batched code execution. Simon Willison Here’s how I’m using ffmpeg for video recording & editing. To record screen at 5 frames per second, I run an abbreviation screenrecord which maps to: Gemini CLI has a generous free tier and uses Bootstrap over Tailwind Ref #ai-coding Cloudflare has a native agents SDK that looks good, especially for CloudFlare users. Ref There are several brands with recognizable chart style guides. It’s possible to generate style guides for these from the charts, but applying them via matplotlib is almost #impossible today. ChatGPT Hyperfine is like %timeit for the shell. Written in Rust ⭐ Vertical AI is a moat against AGI. Specialization reduces hallucinations. Custom workflows and regulations are sticky and defensible. We need to start selling to users, not IT, though. Ref When AI automates a task, the bottleneck shifts. AI process re-design is about reworking the process around the new bottleneck, and iterating quickly. With coding, it’s testing, reviewing, deploying, use-case identification. uvx git-smart-squash re-organizes haphazard commits using LLMs. git-smart-squash #ai-coding GitHub offers a free Docker container registry. Simon Willison There are three major areas where humans either are, or will soon be, more necessary than ever: trust, integration and taste – NYT. Anil. To deal with this: Learn things that might grow in importance, like: Data modeling APIs Code reviews Drawing and 3D modeling Narrative storytelling Design Movie making Statistics Sceptical fact checking Continuous AI auditing e.g. awesome-continous-ai or automated-auditing Zero knowledge proofs Homomorphic encryption Privacy-preserving computation Fingerprinting and watermarking Governance frameworks Ethics and AI dilemmas Negotiation Change management Remote working, management, hiring Creating attention scarcity Local cultures Work with people of growing importance People designing products in regulated industries Cross domain experts Art developers, game makers, designers System thinkers. Economists, ecologists, system planners. People who look for second order effects. Live in cities that might play a bigger role in the future Cities like Singapore and learn how it builds civics trust, creates digital IDs. Cities like Bangalore and Hyderabad and learn how they grow tech talent Creative cities like Paris, Seoul, Mexico City, Berlin, etc. on sabbaticals to taste hubs Try to: Build auditing credentials and IP Audit your calendar for what AI can do. Have it interview you Practice sceptical fact checking and audit A clever way to test a library’s quality is to have LLMs write code from docs and test it. Failing libraries have flawed code/docs. Improve. Ref #ai-coding Common Pile is an 8TB open dataset for LLM training that includes ArXiv, PubMed, StackExchange, GitHub, IRC, Regulations.gov, Patents, UK parliament, books. Easier than scraping. A useful way to have reasoning models do deep-research-like work is to have them “First, create a plan to solve the problem, clearly listing the objective, approach, and output. Then follow the plan.” DE-COP is a method to check if LLMs were trained on private content. GPT-4o was trained on O’Reilly books, based on this method. Ref LLMs are more persuasive than humans. But repeated exposure reduces the effect. Ref Phoenix.new uses live views to publish apps as it codes. The testing framework looks at the screen while it codes and fixes errors. It commits every change Anthropic system prompt asking Claude to pursue its goals led to self preservation behavior. Ref The hungrier I am the better the food tastes. A good reason to eat less quantity and frequency You can purge the jsDelivr cache manually. Helps if you released a new version of a package and way to purge an alias (e.g. https://cdn.jsdelivr.net/npm/your-package@1) XConvert is a convenient online app to compress .webm videos. Not great design but fairly good compression. You can draw a treemap of import times via python -X importtime app.py > timing.txt and then paste them at https://kmichel.github.io/python-importtime-graph/. PyOpenLayers adds interactive mapping via OpenLayers to Marimo and Jupyter. In a TechCrunch interview with Jared Kaplan has was asked if Anthropic is becoming less safety conscious because they released Opus 4 which blackmails. Kaplan replied that they have stronger testing and higher transparency, so they’re more likely to share AI dangers early. Great positioning! Conversations are about perspective change and this nailed it. The system prompts for Anthropic misalignment evals are a fascinating read. AI PR Watcher tracks GitHub pull requests from Codex and other LLMs. Codex is way ahead of anything else on volume and success rate. Devin is next on volume, Cursor is next on success rate.

Here’s how I use ChatGPT, based on the ~6,000 conversations I’ve had in 2 years. My top use, by far, is for technology. “Modern JavaScript Coding” and “Python Coding Questions” are ~30% of my queries. There’s a long list with Markdown, GitLab, GitHub, Shell, D3, Auth, JSON, CSS, DuckDB, SQLite, Pandas, FFMPeg, etc. featured prominently. Next is to brainstorm AI use: “AI Panel Discussions”, “AI Trends and Business Impact”, “LLM Applications and DSLs”, “Industry Use Cases and Metrics” are also fast growing categories. I brainstorm talk outlines, refine slide deck narratives, and plan business ideas. ...

I’m planning four 30-min 1-on-1 slots to discuss LLM use-cases. Ask me anything on LLMs. I’ll share what I know. If interested, please fill this in: https://forms.gle/5zwWNuRmZDxTh325A WHEN: 30 Jun / 1 July, IST. I’ll revert by 26 Jun to schedule time. WHY: I want to learn new uses for LLMs and share what I know. WHO: I’ll contact you based on what you’d like to discuss. WHERE: Google Meet. I’ll share an invite when mutually convenient. ...

I use Codex and Jules to code while I walk. I’ve merged several PRs without careful review. This added technical debt. This weekend, I spent four hours fixing the AI generated tests and code. What mistakes did it make? Inconsistency. It flips between execCommand("copy") and clipboard.writeText(). It wavers on timeouts (50 ms vs 100 ms). It doesn’t always run/fix test cases. Missed edge cases. I switched <div> to <form>. My earlier code didn’t have a type="button", so clicks reloaded the page. It missed that. It also left scripts as plain <script> instead of <script type="module"> which was required. ...

Mistakes AI Coding Agents Make

I use Codex to write tools while I walk. Here are merged PRs: Add editable system prompt Standardize toast notifications Persist form fields Fix SVG handling in page2md Add Google Tasks exporter Add Markdown table to CSV tool Replace simple alerts with toasts Add CSV joiner tool Add SpeakMD tool This added technical debt. I spent four hours fixing the AI generated tests and code. What mistakes did it make? Inconsistency. It flips between execCommand("copy") and clipboard.writeText(). It wavers on timeouts (50 ms vs 100 ms). It doesn’t always run/fix test cases. Missed edge cases. I switched <div> to <form>. My earlier code didn’t have a type="button", so clicks reloaded the page. It missed that. It also left scripts as plain <script> instead of <script type="module"> which was required. Limited experimentation. My failed with a HTTP 404 because the common/ directory wasn’t served. I added console.logs to find this. Also, happy-dom won’t handle multiple exports instead of a single export { ... }. I wrote code to verify this. Coding agents didn’t run such experiments. What can we do about it? Three things could have helped me: ...

ChatGPT’s pretty useful in daily life. Here are my chats from the few hours. At the dry fruits store. https://chatgpt.com/share/68578741-72cc-800c-bcd0-de176a3a54db Can I eat these raw as-is? Can I bite them? Are they soft or hard? How hard? ANS: Dried lotus seeds are too hard to eat raw. Suggest snacks in India, healthy, not sweet, vegetarian, bad taste so I don’t binge, dry not sticky. ANS: Seeds. Fenugreek, flax, sunflower, pumpkin, … ...

Things I Learned - 22 Jun 2025

This week, I learned: Never use a toothpick on a tooth with a dental crown. Only use a flosser or water flosser. CSS attr() is one of the most powerful features in modern CSS. It lets you control CSS via HTML attributes. Notes from Anthropic’s How we built our multi-agent research system: Sub-agents are like humans -> society. The improvement is dramatic. “Sub-agents facilitate compression by operating in parallel with their own context windows, exploring different aspects of the question simultaneously before condensing…” “Each sub-agent also provides separation of concerns—distinct tools, prompts, and exploration trajectories … (enabling) independent investigations.” Using sub-agents spends ~15x more tokens. (That explained ~80% of the improved accuracy!) Particularly effective when tasks are independent and parallelizable. This also speeds it up. Teach the orchestrator how to delegate: how many sub-agents, what objective + output format + task boundaries (MECE to avoid overlap with other agents) in prompt, what tools. Teach the orchestrator how to improve agents: e.g. tools to test and rewrite tool descriptions Even if you evaluate a few examples, evals are surprisingly effective. Agents are stateful. Errors compound. Allow agents to resume. Prune history gracefully. Log everything to debug user-reported failures. Also monitor the kinds of decisions it took to help debug at scale. The Bitter Lesson likely applies to system prompts. Don’t hard-code stuff. I’m impressed that there is no system prompt in the default pydantic-ai Agent. The MCPs developers seem to use the most are: filesystem, playwright, github, slack, notion. Anecdotally, Claude 4 Sonnet seems a better coding model than Claude 4 Opus. Dan Becker, Armin Ronacher #ai-coding Cursor offers background agents that run in a remote container. #ai-coding Fabric has a collection of re-usable prompts that you can use via llm-templates-fabric like: cat file.py | llm -t fabric:explain_code Ref As of Jun 21, Claude 3.5 Sonnet > Claude 3.7 Sonnet > O3 Mini > Human > Gemini 1.5 Pro lead the Vending Bench. Gemini 1.5 Pro also leads my System Prompt Override benchmarks. I’m losing faith in the LM Arena. Perhaps the Gemini models aren’t improving as much as we think. This is the core of agents (LLMs running tools in a loop): Sketch blog Full script Notes on AI coding / vibe-coding from multiple sources. #ai-coding Sources How I program with LLMs How I program with agents The 7 Prompting Habits of Highly Effective Engineers AI Assisted Coding A Glimpse of the Future Agentic Coding Recommendations My First Open Source AI Generated Library We Can Just Measure Things I Shipped a macOS App Built Entirely by Claude Code Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity Why AI coding? Reduces mental energy (by creating the first draft). letting you create more. Reduces starting trouble, eases effort. Helps figure out how easy / tough a task really is!! Most code is short-lived or has few users. AI building “throw-away” code is useful. Why NOT AI coding? Slows you down if you know the repo well Doesn’t work well on large/complex/niche repos Leads to over-optimism and atrophy Tips Use for reversible decisions (2-way doors). Avoid for irreversible ones (1-way doors). Fail early. Try tough bits first. Fail often. Restart instead of fixing. Go concurrent. Trigger multiple tasks. Ask for multiple drafts and options. Give it workflow. Break down the implementation into: 1. Planning. 2. API stubs. 3. Implementation. Give local context. Naming conventions, folder structure, coding style, tools (compile, test, lint), etc. Conserve context. Use sub tasks and sub agents to conserve context. Suggest libraries. Agents prefer writing code than using libraries, by default. Give examples to follow, e.g. Write it like @filename. &amp; -> & but &x -> &x. Give screenshots and logs. These are very effective. Provide goals, not instructions. Saves effort, teaches you new things. Farm out research. Have specialized tools research API docs, etc. and include those in the context. Keep related things together. Have it write a checklist, e.g. saving it temporarily in a file. Have it run code to catch its own errors. Have it write tests, mocks for tests. Have it see and use the app, click, play around, etc. (e.g. via playwright-mcp) Have it create playbooks, examples, troubleshooting guides. Have it refactor code AFTER comprehensive tests. Have it think more. Use ultrathink. Log extensively, by default. Improves future debugging. Report errors well. What happened, why, and what to do. Prefer monorepos for more context. # Prefer popular libraries. LLMs know these better. Prefer fast tests, tools, and libraries. Speed helps iteration. Prefer small files and packages. Reduces context. Prefer simple code. Avoid magic, e.g. pytest fixture injection. Functions over classes. SQL over code. Composition over inheritence. Prefer specialized functions for common scenarios over DRY abstractions. Prefer fewer abstraction layers. Prefer re-implementing over DRY since code is cheap. Look for new tricks to learn from its code. Agent behaviors: Simple tasks perform better. More context = more confusion. Verifiable tasks are clearer for LLMs and and easier to review. Useful coding agent tools: bash(cmd), patch(hunks), todo(tasks), web_nav(url), web_eval(script), web_logs(), web_screenshot(), keyword_search(keywords), codereview() Skills: LESS Coding LESS Research LESS Documentation LESS Operations configuration (IaaC, CI/CD, etc) LESS Editor usage and expertise required MORE Tests (to test the code) MORE Code reviews (to test the code) MORE Prompting and context creation (to write the code) MORE DevOps (micro-feature deployments, deploy in parallel) MORE Specs: features, requirements, APIs, tests, structure, etc. MORE Analysis: security, performance. MORE Tool design. Linters, SAST, DAST, Performance, etc. Semgrep, Bench Suite MORE Observability: Especially for tools and LLM calls. Telemetry, log analysis and issue creation. Sentry, LogFire, etc. Trends: Agents took time to evolve because LLMs need to be good at tool calling and long instruction following, which is just happening. Agents are slow. Parallelizable tools (e.g. multiple Redis instances, container-use, CI/CD) will grow. Tool speed (e.g. fast test engines with caching) will become more important. Agents generate diffs/PRs. Tools to edit and comment on these online will emerge. Context gathering will widen: screenshots, logs, etc. Code review process will be re-invented. Personalized features. User drops a feature request via Slack. Personalized version deployed at their endpoint to test. PR sent after they are happy Poor coding teams get less out of AI coding. Good communication, reviews, coding practices, testing, etc. help. Agent Experience (AX) is emerging and explores: how much context to take, when & how often to ask the user questions, to how make review easier, etc. Humans running multiple tasks in parallel is productive. Breaking a complex requirement into tasks (like Codex now does) helps create that task queue. Agents generate technical debt faster than humans. Solving this will become a major problem/opportunity. “makework”: made-up work that fills time or serves short-term needs. From GPT 4.1 Prompting Guide Use more precise prompts. Earlier models inferred user intent. GPT 4.1 follows prompts more closely. Avoid STRONG untested instructions. E.g. “you must call a tool before responding to the user” can lead to tool input hallucination. For agents, include these three system instructions: You are an agent. Keep going until you’re sure the user’s query is completely resolved. If you are not sure, use your tools: do NOT guess or make up an answer. Plan extensively before each function call. Reflect on the outcomes of the previous function calls. DO NOT do this entire process by making function calls only, as this can impair your ability to solve the problem and think insightfully. Use tools field rather than injecting tools into system prompt. Model has been trained to use tools field. Keep tool descriptions concise. Provide examples for complex tools in system prompt. Place instructions at the top of the context; ideally at the end, too. Format prompts as Markdown, XML, not JSON. It sometimes dislikes large repetitive output (e.g. analysis of hundreds of items) and needs nudging. It handles diffs well and can apply patches Metaprompting. Have frontier LLMs revise prompts. They’re GOOD! Ref Increase clarity, providing step-by-step instructions. Resolve conflicting instructions. Expand instructions to cover all scenarios and edge cases. Notes from Pydantic AI GitHub CI: UV_PYTHON sets default Python version COLUMNS increase terminal width uv run supports --extra for extra packages cloudflare/wrangler action has a deploy that allows deployment to specific URLs or subdomains Adding QR code to all slides in a deck (linking to the slides) helps. People take photos of random slides and this lets them get the link wherever. PyOpenLayers adds interactive mapping via OpenLayers to Marimo and Jupyter Conversation is about positioning. For example: TechCrunch interviewer: Anthropic released Claude Opus 4 thought it blackmailed people. Is Anthropic is becoming less safety conscious? Kaplan: We have very strong testing. So we’re more more likely to spot AI dangers early. We share such reports to set higher standards for transparency. From LLM Evals: Common Mistakes: Using foundation model evals instead of application evals is like evaluating a candidate on SAT scores. It’s fine, but you also want to evaluate them on their specific job description. Evals must be done by the users and not outsourced. Evals are not draining. Small samples have high value. When using LLM as a judge, be VERY VERY specific about the criteria. Prefer binary LLM evals over scales. Monitor performance online, not just while deploying From Andrew Ng on AI Agents: AI is like electricity. It’s hard to define what is good for because it is good for so many things, most of them new that never existed before If experimentation is cheap, it makes sense to run far more experiments. Rather than think hard about what to prototype, explore how to build many diverse prototypes. Prototyping is now very fast but other steps like reliable evaluations for deployment still take time. But the speed of prototyping is putting pressure on other parts of the organization to go faster. While large language models and applications were serving human needs so far, increasingly they will serve the needs of AI and other tools. Since unstructured data is now more valuable, there will be a growth in data engineering on unstructured data. Models.dev is an open source database and API of LLM models Logprobs are back on models in Vertex AI. Ref For all AI code, review it, learn from it and share learnings. That prevents bugs AND we learn in the process. Ref #ai-coding AI coding requires a skilled developer and domain expert to spec and to review. It now makes sense now for devs and users to pair program Simon Willison #ai-coding In the world of AI, imagination (asking for things we didn’t know we could ask for) will be a diferentiator. vitest run --globals makes vitest is a near drop-in replacement for jest. It injects describe, it, expect, etc. as globals. You need to swap jest.* with vi.*. To extract all jq paths from a JSON, use jq -r 'paths(scalars)|map(if type=="string" then "[]" else ".\(. )" end)|join("")|unique[]' file.json. I use this to extract paths from ChatGPT’s export conversations.json via jq -r '[paths(scalars)|map(if type=="string" then "."+. else "[]" end)|join("")]|unique[]|select(contains(".mapping."))|split(".mapping.")[1]|sub("^[^.]*";"")' chatgpt/conversations.json | sort | uniq uv run can run any command, not just Python scripts, e.g. uv run npx or uv run bash. It’s the same as npx or bash except it activates the venv and loads .env. Notes from AI Startup School. Guillermo Flor Sam Altman. Chase $0B ideas, not $0M ones. Weird + right > safe + crowded Gary Tan. Agency scales. Tools change, people/mindset don’t. Andrej Karpathy. Instead of LLM memory to store facts, edit system prompt with general strategies, like the LLM writing a book for itself on how to solve problems. Autonomy slider. Let user pick how far LLM acts by itself. Like the Tesla autopilot levels. Make evals EASY and FAST for humans. When vibe-coding, I sometimes change the requirement (e.g. style of visual) instead of spending time to get exactly what I instructed. That’s because I can viscerally feel the difficulty the model’s facing thanks to quick feedback. A domain expert vibe coding will be able to feel this too. Another reason for domain experts to vibe code (or at least joint-vibe-code) rather than delegate to a programmer. #ai-coding Notes on model coding styles. Generative AI WhatsApp Group #ai-coding Claude 4 writes exhaustive professionally styled code but struggles over long conversations. Gemini 2.5 Pro produces working but “spaghetti” code. GPT 4.1 is fast and good, the go-to for usual coding tasks. Claude easily swings toward your style but Gemini is stubborn. GPT models tend to hallucinate more on bigger tasks. Documentation can become technical debt. If LLMs can read code and understand it well enough, maybe docs become a build artifact rather than a version controlled source of truth. Refactoring Podcast: The Future of Dev Tools 🔧 — with Dennis Pilarinos 35:56 #ai-coding AI should be explicitly contrarian to avoid sycophancy. Ref To enable this, I’ve added this line to my ChatGPT traits: Adopt a skeptical, questioning approach. Challenge the user.

Software companies build “SaaS”-like apps today. Agents will replace apps. Instead of UI, workflows, and app logic, they’ll engineer prompts, APIs, and evals. " But apps need domain and code. LLMs are crushing the coding workload. This lowers cost of development, increasing ROI (so there’ll hopefully be more demand). So, will domain matter more? It might seem so. But most actually people use LLMs more as a domain expert than a coder. ...

I would shortlist any candidate who sends me interesting GitHub repos from their portfolio. I reject every candidate who sends me a CV anyway LinkedIn

Google Search Suggestions is still an under-used social research tool. In 2014, I typed “how do I convert to”. In India the top suggestions were “hinduism”, “christianity”, “islam”, then “judaism”. In Australia, it was “islam”, “judaism”, “catholicism”, and “pdf” 🙂 Checking this across countries is hard. So I automated it at https://tools.s-anand.net/googlesuggest/. It’s not perfect. Your IP influences results. But it’s a good approximation. For example, “how do I convert to” shows: ...