Wage Rates of Nations and LLMs

How much does an LLM charge per hour for its services? If we multiply the Cost Per Output Token by Tokens Per Second, we get the cost of what an LLM produces in Dollars Per Hour. (We're ignoring the input cost, but it's not the main driver of time.) Over time, different models have been released at different billing rates. Most new powerful models like O3 or Gemini 2.5 Pro cost ~$7 - $11 per hr. ...
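The conversion is simple arithmetic: $/hour = ($/token) x (tokens/second) x 3600. A minimal sketch with made-up, illustrative rates (not any specific model's pricing):

```python
def dollars_per_hour(price_per_million_output_tokens: float, tokens_per_second: float) -> float:
    """Hourly 'wage': output price per token x generation speed x 3600 s/hr."""
    return price_per_million_output_tokens / 1e6 * tokens_per_second * 3600

# Illustrative numbers only: $10 per million output tokens at 250 tokens/sec
print(dollars_per_hour(10, 250))  # 9.0 dollars/hour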

How to create a Technical Architecture from code with ChatGPT and PlantUML

Earlier, I used Mermaid for technical architectures. But PlantUML seems a better option for cloud architecture diagrams.

STEP 1: Copy the code. Here’s a one-liner using files-to-prompt to copy all files in the current directory:

```sh
fd | xargs uvx files-to-prompt --cxml | xclip -selection clipboard
```

Or, you can specify individual files:

```sh
uvx files-to-prompt --cxml README.md ... | xclip -selection clipboard
```

STEP 2: Extract the cloud icons ...

Top 8 ways I use ChatGPT in 2025

I extracted the titles of the ~1,600 conversations I had with ChatGPT in 2025 so far and classified them against the list in How People Are Really Using Gen AI in 2025. Here are the top 8 things I use it for, along with representative chat titles. (The % match in brackets tells you how similar the chat title is to the use case.)

1. Improving code (clearly, I code a lot)
2. Troubleshooting (usually code)
3. Corporate LLM/Copilot (this is mostly LLM research I do)
4. Generating code (more code)
5. Generating ideas (yeah, I’ve stopped thinking)
6. Simple explainers (slightly surprising how often I ask for simple explanations)
7. Generating relevant images (surprising, but I think I generated a lot of images for blog/LinkedIn posts)
8. Specific search (actually, this is mis-classified: this is where I’m searching for search engines!)

My classification has errors. For example, “Reduce Code Size” was classified under “Generating code” but should have been “Improving code”. But it’s not too far off. ...
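Similarity matching like this is straightforward with embeddings. A minimal sketch of the idea, assuming OpenAI's embeddings API; the model name and the use-case subset below are illustrative, not my exact setup:

```python
# Classify chat titles against use cases by embedding similarity.
# Assumes: pip install openai numpy; OPENAI_API_KEY set.
import numpy as np
from openai import OpenAI

client = OpenAI()
use_cases = ["Improving code", "Troubleshooting", "Generating ideas"]  # illustrative subset
titles = ["Reduce Code Size", "Fix nginx 502 error"]  # hypothetical chat titles

def embed(texts):
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

cases, chats = embed(use_cases), embed(titles)
# These embeddings are unit-normalized, so a dot product is cosine similarity.
sims = chats @ cases.T
for title, row in zip(titles, sims):
    best = row.argmax()
    print(f"{title} -> {use_cases[best]} ({row[best]:.0%} match)")
```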

When to Vibe Code? If Speed Beats Certainty

I spoke about vibe coding at SETU School last week. Transcript: https://sanand0.github.io/talks/#/2025-05-10-vibe-coding/ Here are the top messages from the talk:

What is vibe coding? It’s where we ask the model to write & run code, don’t read the code, and just inspect the behaviour. It’s a coder’s tactic, not a methodology. Use it when speed trumps certainty.

Why it’s catching on:
- Non-coders can now ship apps - no mental overhead of syntax or structure.
- Coders think at a higher level - stay in problem space, not bracket placement.
- Model capability keeps widening - the “vibe-able” slice grows every release.

How to work with it day-to-day ...

The New Superpower: Detailed Single-Shot Prompt For Instant Apps

I built a podcast generator app in one shot: I wrote a prompt, fed it to an LLM, and it generated working output without errors. I tested three LLMs:

ChatGPT (o4-mini-high): Functional but missed my specs in three ways:
- No error if I skip the API key
- No progress indicator for audio generation
- Both voices default to “ash” (should be “ash” and “nova”)

Gemini 2.5 Pro: Works and looks great!

Claude 3.7 Sonnet: Works great and looks even better!

It still took me an hour to craft the prompt – even after I’d built a Python prototype and my colleague built a similar web version. ...
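For reference, the two-voice spec is easy to express in code. A minimal sketch of alternating “ash” and “nova” speakers, assuming OpenAI's speech API; the script lines and file names are made up for illustration, and this mirrors the Python prototype rather than the generated web app:

```python
# Alternate two voices across podcast script lines.
# Assumes: pip install openai; OPENAI_API_KEY set.
from openai import OpenAI

client = OpenAI()
script = [
    ("ash", "Welcome to the show!"),           # hypothetical lines
    ("nova", "Today: one-shot app building."),
]

for i, (voice, line) in enumerate(script):
    audio = client.audio.speech.create(model="gpt-4o-mini-tts", voice=voice, input=line)
    audio.write_to_file(f"segment-{i:03d}.mp3")  # stitch segments afterwards
```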

How to create a Technical Architecture from code with ChatGPT

Here’s my current workflow to create technical architecture diagrams from code.

STEP 1: Copy the code. Here’s a one-liner using files-to-prompt to copy all files in the current directory:

```sh
fd | xargs uvx files-to-prompt --cxml | xclip -selection clipboard
```

Or, you can specify individual files:

```sh
uvx files-to-prompt --cxml README.md ... | xclip -selection clipboard
```

STEP 2: Prompt for a Mermaid diagram. Mermaid is a Markdown-friendly charting language. I use this prompt with O4-Mini-High or O3: ...

ChatGPT is a psephologist and data analyst

After having O4-Mini-High scrape the Singapore 2025 election results, I asked it to create 3 data stories with this prompt:

> That worked. Now, I’m sharing the scraped CSV as well as the electoral GeoJSON. First, analyze the data and think of a few interesting data stories to tell. Pick the 3 most interesting, perhaps surprising, stories. Create a BEAUTIFUL, APT data visualization of each of these 3 stories suitable for The Straits Times and write a short accompanying article.

...

How To Double-Check Personality Flaws With AI

After ChatGPT's discussion of my personality flaws, I asked it to list specific conversations as evidence for each flaw. Then I asked 3 LLMs to rate the evidence.

| Personality Flaw | ChatGPT | DeepSeek | Grok | Average |
|---|---|---|---|---|
| Self-deprecation as a shield. Sarcastic humor | 4 | 7 | 5 | 5.3 |
| Hypocrisy: seeks limelight but preaches humility | 5 | 6 | 5 | 5.3 |
| Overconfidence in technical opinions | 5 | 6 | 6 | 5.7 |
| Ghosts vague messages | 4 | 6 | 8 | 6.0 |
| Over-analyzes / over-engineers | 5 | 5 | 8 | 6.0 |
| Fact-checks trivialities, derailing discussion | 7 | 7 | 7 | 7.0 |
| Shiny object syndrome. Distracted by new technologies | 8 | 7 | 7 | 7.3 |
| Intolerant of fools | 6 | 8 | 9 | 7.7 |
| Dominates discussions | 6 | 8 | 9 | 7.7 |
| Verbosity. Long-winded communication | 8 | 8 | 7 | 7.7 |
| Hyperfocus on efficiency | 7 | 9 | 8 | 8.0 |
| Intense. “Always on” | 8 | 9 | 9 | 8.7 |

🟢 LOW: Self-deprecation / sarcastic humour as shield
- ChatGPT: 4/10. Bio and a few quips reveal dry self-mockery, yet direct chat evidence of humour as defence is sparse; it surfaces only occasionally.
- DeepSeek: 7/10. Explicit self-awareness goals and dry humor noted, but little evidence of defensive/sarcastic use.
- Grok: 5/10. Some self-deprecating humor exists, but its use as a shield is not strongly supported.

🟢 LOW: Hypocrisy: seeks limelight yet preaches humility
- ChatGPT: 6/10. You aspire to humility but routinely headline conferences and TEDx talks; the coexistence of public speaking and stated modesty gives the charge a middling footing.
- DeepSeek: 6/10. TEDx/PyCon prominence contrasts with stated humility goals, though self-improvement efforts are genuine.
- Grok: 5/10. Public speaking contrasts with humility goals, though this may indicate self-awareness, not hypocrisy.

🟢 LOW: Over-confidence in technical opinions
- ChatGPT: 7/10. You assert strong views (e.g., Node vs FastAPI) but also invite verification and accept corrections, yielding only moderate evidence of over-confidence.
- DeepSeek: 6/10. States strong preferences upfront but routinely invites validation and course-corrects.
- Grok: 6/10. Strong opinions are expressed, but openness to correction moderates the perception of overconfidence.

🟢 LOW: Ghosts vague messages
- ChatGPT: 5/10. You’ve stated you ignore low-context e-mails, but chat logs show you usually follow up or clarify rather than disappear.
- DeepSeek: 6/10. Documented tendency to ignore low-context questions, though chat logs show active clarification habits.
- Grok: 8/10. Multiple instances show the user ignoring or not responding to unclear or low-context messages.

🟢 LOW: Over-analyzes / over-engineers
- ChatGPT: 5/10. Designing auto-re-flow SVG chevrons and exhaustive RAG specs hint at depth beyond the simple solution, yet you also champion “smallest libraries” and lightweight rate-limiters.
- DeepSeek: 5/10. Some prototype/structuring tendencies, but strong preference for minimal solutions counterbalances this.
- Grok: 8/10. The user often dives deeply into tasks, sometimes creating complex solutions for simple problems.

🟡 MEDIUM: Fact-checks trivialities, derailing discussion
- ChatGPT: 8/10. Many requests revolve around “double-check,” “fact-check,” or recalculating tiny details, which lengthens threads even when stakes are small.
- DeepSeek: 7/10. Frequent accuracy demands and citation hunts elongate discussions, but stays focused when content is solid.
- Grok: 7/10. Evidence of prioritizing precision, like fact-checking minor details, can disrupt conversation flow.

🟡 MEDIUM: Shiny-object syndrome
- ChatGPT: 6/10. Parallel explorations of Godot, Matrix, Cloudflare Workers, SVG SmartArt, esp32-eye, etc. reveal a persistent pull toward new tech before finishing earlier threads.
- DeepSeek: 7/10. Broad simultaneous explorations (Godot/WASM/DuckDB), though sustains core projects long-term.
- Grok: 7/10. Exploration of new technologies is frequent, but focus on core projects suggests a balanced curiosity.

🟡 MEDIUM: Intolerant of fools / low clarity
- ChatGPT: 7/10. Frequent demands for precise formats, citations, and “be specific” instructions show a low tolerance for ambiguity, though you still help patiently on non-technical questions.
- DeepSeek: 8/10. Strong pattern of rejecting ambiguity/fluff, though softened by humor and non-technical engagements.
- Grok: 9/10. The user consistently demands clarity and efficiency, showing little patience for ambiguity or fluff.

🟡 MEDIUM: Dominates discussions
- ChatGPT: 4/10. Long, multi-part prompts and the need for TL;DRs suggest you can overshadow lighter voices, but you periodically insist on brevity and delegate tasks, tempering the effect.
- DeepSeek: 8/10. Long technical messages (avg ~3k chars) and intense Q&A patterns create conversational dominance.
- Grok: 9/10. Detailed, technical responses consistently steer and dominate conversations.

🟡 MEDIUM: Verbosity / long-winded communication
- ChatGPT: 5/10. Average messages run several thousand characters, and you often request exhaustive, multi-section outputs; though you sometimes tell the assistant “Shorter please,” verbosity remains a clear pattern.
- DeepSeek: 8/10. Multi-step technical requests dominate interactions, though periodic "shorter please" self-corrections exist.
- Grok: 7/10. Messages are often lengthy and detailed, though the user also requests concise responses from others.

🔴 HIGH: Hyper-focus on efficiency
- ChatGPT: 8/10. Recurrent themes of cost monitoring, smallest-lib preference, and GPU-utilization alerts show you relentlessly optimise workflows and expect others to keep pace.
- DeepSeek: 9/10. Unambiguous prioritization of concise formats, minimal libraries, and metric-driven optimization.
- Grok: 8/10. Repeated emphasis on concise, efficient communication and minimal tools is evident.

🔴 HIGH: Intense – “always on”
- ChatGPT: 8/10. Around-the-clock, multi-topic requests (weekend cron jobs, 2 a.m. prototypes) portray sustained high energy, with only a few calmer “travel nerves” moments as counter-weight.
- DeepSeek: 9/10. High-volume technical queries across domains, 6am research routines, and rapid prototyping habits.
- Grok: 9/10. High energy and engagement are consistently shown through task initiation and scheduling.

My actions: Learn from fools and focus on the big picture. Specifically: ...

Automating a podcast from GitHub commits

Here’s an LLM-generated podcast of what I coded last week, inspired by NotebookLM. The process proved straightforward:

1. Get my GitHub commits for the week.
2. Get the repositories I committed to for more context.
3. Have an LLM generate a podcast script. I’m using GPT 4.1 Mini but might shift to Gemini 2.5 Flash or DeepSeek V3, with a detailed prompt beginning with “You are a podcast script assistant for “Anand’s Weekly Codecast.” This episode is for the week of {WEEK}. …”. Here’s a sample output.
4. Convert the script to audio. I’m using GPT 4o Mini TTS with customized voices of Ash and Nova.

These now appear on my GitHub repo as a weekly summary. ...
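A minimal sketch of the first steps, assuming the public GitHub commit-search API; the username, date range, and prompt wording below are placeholders, not my exact script:

```python
# Fetch a week's commits for a user, then ask an LLM for a podcast script.
# Assumes: pip install requests openai; OPENAI_API_KEY set.
import requests
from openai import OpenAI

USER, WEEK = "octocat", "2025-05-05..2025-05-11"  # placeholders
resp = requests.get(
    "https://api.github.com/search/commits",
    params={"q": f"author:{USER} committer-date:{WEEK}", "per_page": 100},
    headers={"Accept": "application/vnd.github+json"},
)
# One line per commit: "repo: first line of commit message"
commits = [
    f'{item["repository"]["full_name"]}: {item["commit"]["message"].splitlines()[0]}'
    for item in resp.json()["items"]
]

client = OpenAI()
script = client.chat.completions.create(
    model="gpt-4.1-mini",
    messages=[{
        "role": "user",
        "content": "Write a two-host podcast script about these commits:\n" + "\n".join(commits),
    }],
).choices[0].message.content
print(script)  # then feed each line to a TTS voice, as in step 4
```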

Things I know and things I don't

We all have stuff we know well and stuff we don't. I know the % charge on my phone to within a few percent and the current time to within a few minutes -- no matter when you ask. But I have no idea how much money there is in my pocket.

| Things I know within 5% error | Things I get wrong by ~500% |
|---|---|
| Current time | Time to finish my current task |
| My weight | My next meal |
| % charge on my phone | How long the phone will last |
| Number of unread emails | Time to clear them |
| Price of my next earphones | Cash in my wallet |
| Length of this list | How long it will go before I stop |

But this apart, it turns out generating xkcd-style comic strips is harder than I thought. ...

What is the capital of France?

Qwen3-30b-a3b: <think> Okay, so I need to figure out what the capital of France is. Let me start by recalling what I know about countries and their capitals. France is a country in Europe, right? I think it’s known for things like the Eiffel Tower, Paris, and maybe some famous foods like croissants or wine. Wait, but the question is specifically asking for the capital. The capital of a country is usually the city where the government is based. So, for example, the capital of the United States is Washington, D.C. So applying that logic to France, I need to think about which city is the seat of the French government. ...

Are LLMs any good at mental math?

I asked 50 LLMs to multiply 2 numbers:

- 12 x 12
- 123 x 456
- 1,234 x 5,678
- 12,345 x 6,789
- 123,456 x 789,012
- 1,234,567 x 8,901,234
- 987,654,321 x 123,456,789

LLMs aren't good tools for math and this is just an informal check. But the results are interesting:

| Model | %Win | Q1 | Q2 | Q3 | Q4 | Q5 | Q6 | Q7 |
|---|---|---|---|---|---|---|---|---|
| openai:o3 | 86% | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
| openrouter:openai/o1-mini | 86% | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
| openrouter:openai/o3-mini-high | 86% | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
| openrouter:openai/o4-mini | 86% | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
| openrouter:openai/o4-mini-high | 86% | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
| deepseek/deepseek-chat-v3-0324 | 71% | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ |
| openai/gpt-4.1-mini | 71% | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ |
| openai/gpt-4.5-preview | 71% | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ |
| openai/gpt-4o | 71% | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ |
| openrouter:openai/o3-mini | 71% | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ |
| anthropic/claude-3-opus | 57% | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ |
| anthropic/claude-3.5-haiku | 57% | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ |
| anthropic/claude-3.7-sonnet:thinking | 57% | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ |
| google/gemini-2.0-flash-001 | 57% | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ |
| google/gemini-2.0-flash-lite-001 | 57% | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ |
| google/gemini-2.5-flash-preview | 57% | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ |
| google/gemini-2.5-flash-preview:thinking | 57% | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ |
| google/gemini-2.5-pro-preview-03-25 | 57% | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ |
| google/gemini-flash-1.5 | 57% | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ |
| google/gemini-pro-1.5 | 57% | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ |
| google/gemma-3-12b-it | 57% | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ |
| google/gemma-3-27b-it | 57% | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ |
| meta-llama/llama-4-maverick | 57% | ✅ | ✅ | ✅ | ❌ | ✅ | ❌ | ❌ |
| meta-llama/llama-4-scout | 57% | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ |
| openai/gpt-4-turbo | 57% | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ |
| openai/gpt-4.1 | 57% | ✅ | ✅ | ✅ | ❌ | ✅ | ❌ | ❌ |
| amazon/nova-lite-v1 | 43% | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ |
| amazon/nova-pro-v1 | 43% | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ |
| anthropic/claude-3-haiku | 43% | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ |
| anthropic/claude-3.5-sonnet | 43% | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ |
| meta-llama/llama-3.1-405b-instruct | 43% | ✅ | ✅ | ❌ | ✅ | ❌ | ❌ | ❌ |
| meta-llama/llama-3.1-70b-instruct | 43% | ✅ | ✅ | ❌ | ✅ | ❌ | ❌ | ❌ |
| meta-llama/llama-3.2-3b-instruct | 43% | ✅ | ✅ | ❌ | ✅ | ❌ | ❌ | ❌ |
| meta-llama/llama-3.3-70b-instruct | 43% | ✅ | ✅ | ❌ | ✅ | ❌ | ❌ | ❌ |
| openai/gpt-4.1-nano | 43% | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ |
| openai/gpt-4o-mini | 43% | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ |
| qwen/qwen-2-72b-instruct | 43% | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ |
| anthropic/claude-3-sonnet | 29% | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ |
| deepseek/deepseek-r1 | 29% | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ |
| google/gemini-flash-1.5-8b | 29% | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ |
| google/gemma-3-4b-it | 29% | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ |
| meta-llama/llama-3-8b-instruct | 29% | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ |
| meta-llama/llama-3.1-8b-instruct | 29% | ✅ | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ |
| openai/gpt-3.5-turbo | 29% | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ |
| amazon/nova-micro-v1 | 14% | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| meta-llama/llama-2-13b-chat | 14% | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| meta-llama/llama-3-70b-instruct | 14% | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| meta-llama/llama-3.2-1b-instruct | 14% | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| google/gemma-3-1b-it:free | 0% | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| meta-llama/llama-2-70b-chat | 0% | ❌ | ❌ | - | - | ❌ | ❌ | ❌ |
| Average | | 96% | 86% | 66% | 58% | 24% | 10% | 0% |

OpenAI's reasoning models cracked it, scoring 6/7 and stumbling only on the 9-digit multiplication:

- openai/o1-mini
- openai/o3
- openai/o3-mini-high
- openai/o4-mini
- openai/o4-mini-high

Models use human-like mental math tricks. For example, O3-Mini-High calculated 1234567 × 8901234 using a recursive strategy. ...
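That recursive strategy is essentially place-value decomposition: split one factor digit by digit and add the partial products. A sketch of the same idea in code; this breakdown is my illustration, not the model's verbatim steps:

```python
# Multiply by splitting one factor into place-value parts, the way the
# models (and humans) break a big product into easy partial products.
def place_value_multiply(a: int, b: int) -> int:
    total, place = 0, 1
    while b:
        b, digit = divmod(b, 10)
        total += a * digit * place  # partial product, e.g. 1234567 * 900000
        place *= 10
    return total

assert place_value_multiply(1234567, 8901234) == 1234567 * 8901234  # 10989169755678
```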

How to Create a Data Visualization Without Coding

After seeing David McCandless’ post “Which country is across the ocean?” I was curious which country you would reach if you tunneled below in a straight line (the antipode). This is a popular visualization, but I wanted to see if I could get the newer OpenAI models to create the visual without me *running* any code (i.e., I just want the answer). After a couple of iterations, O3 did a great job with this prompt: ...
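The antipode itself is simple arithmetic: negate the latitude and shift the longitude by 180°. A quick sketch (the city coordinates are approximate):

```python
def antipode(lat: float, lon: float) -> tuple[float, float]:
    """Point diametrically opposite (lat, lon) on the globe."""
    return -lat, lon - 180 if lon > 0 else lon + 180

# Singapore (approx. 1.35 N, 103.82 E) -> about 1.35 S, 76.18 W, near Ecuador
print(antipode(1.35, 103.82))  # (-1.35, -76.18)
```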

O3 Is Now My Personalized Learning Coach

I use Deep Research to explore topics. For example:

- Text To Speech Engines: Tortoise TTS leads the open-source TTS options.
- Open-Source HTTP Servers: Caddy wins.
- Public API-Based Data Storage Options: Supabase wins.
- etc.

But these reports are very long. With O3 and O4 Mini supporting thinking with search, we can do quick research instead of deep research. One minute, not ten. One page, not ten. ...

How to Use the New O4 Mini for Data Visualization

O3/O4 Mini are starting to replace Excel (or Tableau/Power BI) for quick analysis and visualizations. At least for me. I normally open Excel when I need a fast chart or pivot. For instance, we track outages of our semi-internal server, LLM Foundry. To grab the data, I ran one line in the browser console:

```js
$$(".lh-base").map(d => d.textContent.trim()).filter(d => d.includes("From"));
```

This produced lines like:

```
Apr 20, 2025 03:11:27 PM +08 to Apr 20, 2025 03:27:12 PM +08 (15 mins 45 secs)
Apr 19, 2025 10:03:15 PM +08 to Apr 19, 2025 10:05:45 PM +08 (2 mins 30 secs)
Apr 19, 2025 09:47:13 PM +08 to Apr 19, 2025 09:49:45 PM +08 (2 mins 32 secs)
Apr 19, 2025 08:49:00 PM +08 to Apr 19, 2025 08:51:51 PM +08 (2 mins 51 secs)
Apr 19, 2025 08:13:02 PM +08 to Apr 19, 2025 08:15:35 PM +08 (2 mins 33 secs)
...
```

Then I told O4-Mini-High: ...
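If you did want to crunch those lines yourself instead of handing them to the model, the parsing is a one-regex job. A minimal sketch, assuming every line ends with the "(X mins Y secs)" suffix shown above:

```python
# Parse "(15 mins 45 secs)" outage durations and total them.
import re

lines = [
    "Apr 20, 2025 03:11:27 PM +08 to Apr 20, 2025 03:27:12 PM +08 (15 mins 45 secs)",
    "Apr 19, 2025 10:03:15 PM +08 to Apr 19, 2025 10:05:45 PM +08 (2 mins 30 secs)",
]
durations = [
    int(m[1]) * 60 + int(m[2])  # seconds per outage
    for line in lines
    if (m := re.search(r"\((\d+) mins (\d+) secs\)", line))
]
print(sum(durations), "seconds of outage")  # 1095
```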

The Magic of Repeated ‘Improve It’ Prompts

What if you keep asking an LLM to “Improve the code - dramatically!”? We used the new GPT 4.1 Nano, a fast, cheap, and capable model, to write code for simple tasks like “Draw a circle”. Then we fed the output back and asked again: Improve the code - dramatically! Here are the results:

- Draw a circle rose from a fixed circle to a full tool: drag it around, tweak its size and hue, and hit “Reset” to start fresh.
- Animate shapes and patterns turned simple circles and squares into a swarm of colored polygons that spin, pulse, and link up by distance.
- Draw a fully functional analog clock grew from a bare face to one that builds all 60 tick marks in code: no manual copy-paste needed.
- Create an interactive particle simulation went from plain white dots on black to hundreds of bright, color-shifting balls that bounce, die, and come back to life.
- Generate a fractal changed from a single Mandelbrot image to an explorer you can zoom, drag, and reset with sliders and the mouse wheel.
- Generate a dashboard jumped from static charts to a live page with smooth card animations, modern fonts, and a real-time stats box.

A few observations. ...
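The loop itself is trivial. A minimal sketch, assuming the OpenAI chat API and the gpt-4.1-nano model; the initial prompt wording, iteration count, and file handling are my guesses, not the exact experiment harness:

```python
# Repeatedly feed generated code back with "Improve the code - dramatically!"
# Assumes: pip install openai; OPENAI_API_KEY set.
from openai import OpenAI

client = OpenAI()
TASK = "Draw a circle"  # one of the tasks from the post
IMPROVE = "Improve the code - dramatically!"

prompt = f"Write a single self-contained HTML page that can: {TASK}. Respond with code only."
for i in range(5):  # iteration count is arbitrary here
    resp = client.chat.completions.create(
        model="gpt-4.1-nano",
        messages=[{"role": "user", "content": prompt}],
    )
    code = resp.choices[0].message.content
    with open(f"round-{i}.html", "w") as f:
        f.write(code)  # inspect each round's output in a browser
    prompt = f"{code}\n\n{IMPROVE}"
```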

How to Visualize Data Stories with AI: Lessons

I tried 2 experiments:

1. Can I code a visual data story using only LLMs? Does this make me faster? By how much?
2. Has GitHub Copilot caught up with Cursor? How far behind is it? Can I recommend it?

So I built a visual story for Lech Mazur’s elimination game benchmark (it’s like LLMs playing Survivor) using only the free GitHub Copilot as the AI code editor. SUMMARY: using LLMs and AI code editors makes me a bit faster. It took me 7 hours instead of 10-12. But more importantly: ...

A Game of Bots: How LLMs Betray Each Other

@lechmazur built an elimination game benchmark that’s like LLMs playing Survivor. This is a treasure trove of information – insight into how they’d game the system if told to survive. You can quickly sample 100 messages from the logs with:

```sh
jq -r 'select(.message != null) | .message | gsub("\n"; " ")' *.jsonl | shuf -n 100
```

… and share it with an LLM, asking: Here are lines from conversations between LLMs in a “Survivor”-like game. Pick the 3 scariest ones. ...

How to Organize Browser Workspaces with LLMs and Data

Here’s an example of how I am using LLMs to solve a day-to-day workflow problem. Every day, I interact with a barrage of websites: emails, news, social media, and work tools across multiple devices. Microsoft Edge’s Workspaces feature syncs groups of websites across devices. I’d never tried it, so I started today and wondered: how should I organize my workspaces? Rather than think (thinking is outdated), I used LLMs. ...

LLMs think alike about how aliens draw

While LLMs seem good at inventing alien languages, they’re not so good at inventing alien drawing forms, in my opinion. When I told Grok, DeepSeek, and Gemini: Invent a new, alien drawing form. Use it to draw something never seen before by explaining it step by step for a person to reproduce that drawing. … and asked ChatGPT ImageGen to draw them, here are the results: ...