Are LLMs any good at mental math?

I asked 50 LLMs to multiply 2 numbers: 12 x 12 123 x 456 1,234 x 5,678 12,345 x 6,789 123,456 x 789,012 1,234,567 x 8,901,234 987,654,321 x 123,456,789 LLMs aren’t good tools for math and this is just an informal check. But the results are interesting: Model %Win Q1 Q2 Q3 Q4 Q4 Q6 Q7 openai:o3 86% ✅ ✅ ✅ ✅ ✅ ✅ ❌ openrouter:openai/o1-mini 86% ✅ ✅ ✅ ✅ ✅ ✅ ❌ openrouter:openai/o3-mini-high 86% ✅ ✅ ✅ ✅ ✅ ✅ ❌ openrouter:openai/o4-mini 86% ✅ ✅ ✅ ✅ ✅ ✅ ❌ openrouter:openai/o4-mini-high 86% ✅ ✅ ✅ ✅ ✅ ✅ ❌ deepseek/deepseek-chat-v3-0324 71% ✅ ✅ ✅ ✅ ✅ ❌ ❌ openai/gpt-4.1-mini 71% ✅ ✅ ✅ ✅ ✅ ❌ ❌ openai/gpt-4.5-preview 71% ✅ ✅ ✅ ✅ ✅ ❌ ❌ openai/gpt-4o 71% ✅ ✅ ✅ ✅ ✅ ❌ ❌ openrouter:openai/o3-mini 71% ✅ ✅ ✅ ✅ ✅ ❌ ❌ anthropic/claude-3-opus 57% ✅ ✅ ✅ ✅ ❌ ❌ ❌ anthropic/claude-3.5-haiku 57% ✅ ✅ ✅ ✅ ❌ ❌ ❌ anthropic/claude-3.7-sonnet:thinking 57% ✅ ✅ ✅ ✅ ❌ ❌ ❌ google/gemini-2.0-flash-001 57% ✅ ✅ ✅ ✅ ❌ ❌ ❌ google/gemini-2.0-flash-lite-001 57% ✅ ✅ ✅ ✅ ❌ ❌ ❌ google/gemini-2.5-flash-preview 57% ✅ ✅ ✅ ✅ ❌ ❌ ❌ google/gemini-2.5-flash-preview:thinking 57% ✅ ✅ ✅ ✅ ❌ ❌ ❌ google/gemini-2.5-pro-preview-03-25 57% ✅ ✅ ✅ ✅ ❌ ❌ ❌ google/gemini-flash-1.5 57% ✅ ✅ ✅ ✅ ❌ ❌ ❌ google/gemini-pro-1.5 57% ✅ ✅ ✅ ✅ ❌ ❌ ❌ google/gemma-3-12b-it 57% ✅ ✅ ✅ ✅ ❌ ❌ ❌ google/gemma-3-27b-it 57% ✅ ✅ ✅ ✅ ❌ ❌ ❌ meta-llama/llama-4-maverick 57% ✅ ✅ ✅ ❌ ✅ ❌ ❌ meta-llama/llama-4-scout 57% ✅ ✅ ✅ ✅ ❌ ❌ ❌ openai/gpt-4-turbo 57% ✅ ✅ ✅ ✅ ❌ ❌ ❌ openai/gpt-4.1 57% ✅ ✅ ✅ ❌ ✅ ❌ ❌ amazon/nova-lite-v1 43% ✅ ✅ ✅ ❌ ❌ ❌ ❌ amazon/nova-pro-v1 43% ✅ ✅ ✅ ❌ ❌ ❌ ❌ anthropic/claude-3-haiku 43% ✅ ✅ ✅ ❌ ❌ ❌ ❌ anthropic/claude-3.5-sonnet 43% ✅ ✅ ✅ ❌ ❌ ❌ ❌ meta-llama/llama-3.1-405b-instruct 43% ✅ ✅ ❌ ✅ ❌ ❌ ❌ meta-llama/llama-3.1-70b-instruct 43% ✅ ✅ ❌ ✅ ❌ ❌ ❌ meta-llama/llama-3.2-3b-instruct 43% ✅ ✅ ❌ ✅ ❌ ❌ ❌ meta-llama/llama-3.3-70b-instruct 43% ✅ ✅ ❌ ✅ ❌ ❌ ❌ openai/gpt-4.1-nano 43% ✅ ✅ ✅ ❌ ❌ ❌ ❌ openai/gpt-4o-mini 43% ✅ ✅ ✅ ❌ ❌ ❌ ❌ qwen/qwen-2-72b-instruct 43% ✅ ✅ ✅ ❌ ❌ ❌ ❌ anthropic/claude-3-sonnet 29% ✅ ✅ ❌ ❌ ❌ ❌ ❌ deepseek/deepseek-r1 29% ✅ ✅ ❌ ❌ ❌ ❌ ❌ google/gemini-flash-1.5-8b 29% ✅ ✅ ❌ ❌ ❌ ❌ ❌ google/gemma-3-4b-it 29% ✅ ✅ ❌ ❌ ❌ ❌ ❌ meta-llama/llama-3-8b-instruct 29% ✅ ✅ ❌ ❌ ❌ ❌ ❌ meta-llama/llama-3.1-8b-instruct 29% ✅ ❌ ❌ ✅ ❌ ❌ ❌ openai/gpt-3.5-turbo 29% ✅ ✅ ❌ ❌ ❌ ❌ ❌ amazon/nova-micro-v1 14% ✅ ❌ ❌ ❌ ❌ ❌ ❌ meta-llama/llama-2-13b-chat 14% ✅ ❌ ❌ ❌ ❌ ❌ ❌ meta-llama/llama-3-70b-instruct 14% ✅ ❌ ❌ ❌ ❌ ❌ ❌ meta-llama/llama-3.2-1b-instruct 14% ✅ ❌ ❌ ❌ ❌ ❌ ❌ google/gemma-3-1b-it:free 0% ❌ ❌ ❌ ❌ ❌ ❌ ❌ meta-llama/llama-2-70b-chat 0% ❌ ❌ - - ❌ ❌ ❌ Average 96% 86% 66% 58% 24% 10% 0% OpenAI’s reasoning models cracked it, scoring 6/7, stumbling only on the 9-digit multiplication. ...

How to Create a Data Visualization Without Coding

After seeing David McCandless’ post “Which country is across the ocean?” I was curious which country you would reach if you tunneled below in a straight line (the antipode). This is a popular visualization, but I wanted to see if I could get the newer OpenAI models to create the visual without me 𝗿𝘂𝗻𝗻𝗶𝗻𝗴 any code (i.e. I just want the answer.) After a couple of iterations, O3 did a great job with this prompt: ...

This is my decision tree for which model to use on #ChatGPT right now. O𝟯: Use by **default. O**𝟰-mini-high: Use when **coding. GPT** 𝟰o: Use for a quick response or to create image. LinkedIn

Things I Learned - 27 Apr 2025

This week, I learned: OpenAI’s reasoning models are much ahead of other models when multiplying two numbers in their heads. Ref ⭐ Promptfoo may be the most mature open source LLM evals tool. Simon Willison Dyson Sphere. LemonSlice showcases real-time audio-video models (avatars) that are close enough to real. Notes from Latent Space ICLR 2025, Singapore Daniel: Menlo’s ReZero. A model that keeps searching till it finds the answer. There are multiple search techniques: Multi-step retreival, Iterative retrieval, Query rewriting. Also, reasoning. The LLM token generation sequence is normally: <think>, <search>, <answer>. Insight: “If we explicitly reward LLMs for retrying after a failed search, they out-perform one-attempt systems.” So <think>, <search>, <think>, <search>, <think>, <search>, <answer>. ⭐ Prompt reasoning models, e.g. “Keep searching till you find the best answer.” Roger, Nous Research Supervised learning is limited because accuracy is piece-wise linear, i.e. it’s broken up. Continuous optimization is meaningless. Reinforcement learning works better because rewards can be discrete. (But it converts things back into differentiable loss functions behind the scenes.) Rewards can be good/bad. Single or multi-step. Whatever. We’re in the “Era of experience”, i.e. models gain experience from the environment themselves. ⭐ So, we need environments models can learn in. This is the next thing after training data. That needs a standard for environments. We’d need a model, a trainer, and the environment. The environments whatever capabilities. Run code. Browser. A game. … With an exposed interface Eugene Cheah (Featherless.ai) Transformer architectures need n-square GPUs as # of tokens grow. Featherless is exploring an RWKV architecture that scales linearly. THere are other such architectures. Performer, Linformer, Reformer, Hyena. Mistral-Nemo-12b-ic is one of the most popular fine-tuned model. It’s small enough to run on a server. Justus Mattern (Prime Intellect) Intellect-2 is a continously learning (RL) model that uses decentralized training on peer-to-peer GPUs. Solving problems on bandwidth, verifiable contributions, etc. ChatGPT Deep Research now also has an O4-Mini version to serve smaller reports. Free users get 0 original + 5 lightweight 5 tasks / month. $20 version gets 10 + 15. $200 version gets 100 + 150. The month begins on first use of Deep Research and runs on a 30 day “window”. Ref O4-Mini-High is great at going through an under-documented repo and finding things. For example, here’s how I configured cmdg. ChatGPT is my new Jupyter Notebook :-) Google announced new AI capabilities at Google Next APAC 2025. Blog. Interesting ones are: @Gemini in chat Google Meet support for “Catch me up” Google Vids: Create short video clips Google Sheets: does better analysis Google Slides: image generation Google Docs: Create Audio Clips (like NotebookLM in Google Docs) Google Docs: “Help me refine” is better than before Google Workspace Flows gcalcli is a convenient way to export Google Calendar. Example: uvx gcalcli agenda --tsv 2025-01-01 2025-01-05 cmdg is a command line GMail client that I’ve now switched to for quick email checks. 80% of my email is spam and this is good enough to scan and delete those. It also avoids running a 200-500 MB tab in the browser that constantly shows me how many unread emails I have. From Worklife with Adam Grant: Cancelling cancel culture with Loretta Ross “Lighten up! Fighting Nazis should be fun. It’s being a Nazi that sucks. If you’re not having fun fighting for hope and joy and human rights, maybe you’re doing the fight wrong. We are the ones who should be having fun.” “You can say what you mean. But you don’t have to say it mean.” There is always a way to put it across better. Refusing to say mean things is about to discover these approaches. “The true mark of a lifelong learner is knowing that you can learn something from every single person you meet.” If you remember that, you can’t be a know it all. semantic-text-splitter could be the go-to text splitter. It’s Rust-based, supports MarkdownSplitter, and multiple tokenizers. Alternatives like semchunk, advanced-chunker, chonkie, etc. seem clunkier. ULID is like UUID but time-sortable. That’s an improvement over timestamp IDs (definitely) and potentially even UUIDs. They can be generated by clients as a globally unique ID. Try pip install python-ulid and npm install ulid. The Consumer Product Safety Commission Data has thousands of reports of product safety over time You can run xclip -sel clip -o | pandoc -f markdown -t html --no-highlight | xclip -sel clip -t text/html -i to convert Markdown in the clipboard to rich text. But xclip doesn’t support multiple selections, so the text is lost. ChatGPT DuckDB UI & Notebooks will potentially be a good alternative to Datasette, DBeaver, etc. But for now, there are still glitches. It crashes with a SIGSEGV (Address boundary error) when connecting to SQLite databases. Ollama limits MAX_TOKENS to 2K by default. AI assisted search helps wherever I would have used Google, e.g. Debugging. “Fix CUDA initialization: CUDA unknown error” Tool search. “Find an online word counter tool.” Library search. “Find a JS micro library to render Markdown.” OpenAI API capabilites lag ChatGPT features. For example: o4-mini via the API does not search the web natively as part of its reasoning. o4-mini, o3, o3-mini, o1, gpt-4.1-nano don’t yet support the web_search_preview tool. Only gpt-4.1 and gpt-4.1-mini do. Limitations Search results are NOT visible via the API. They’re fed directly to the model. The number of searches or results is unknown. Each search costs 0.25-0.5 cents. Pricing For reasoning traces (e.g. .reasoning.summary: "medium") you need to verify your organization via withpersona.com which failed with my Indian passport AND Singapore work permit. The ChatGPT Plus plan ($20) gives you 50 O4 mini messages a day, which I exceeded! It’s supposed to reset at midnight UTC Ref but might operate on a rolling window ChatGPT. “Currently, there is no way to check how many messages you have used in your usage budget.” OpenAI SignalBloom reads SEC filings and writes analyst reports on it using LLMs “Evaluation in the loop” or “Evals-in-the-loop” is a new term I learnt. SignalBloom’s Hallucination Bechmark If AI interacts with the world and generates data from its own experience and learns from that, we have a new scaling mechanism. DeepMind podcast OpenAI’s search API is fairly expensive at $30+/1K calls. Typically, to read interesting HN articles, I will make 30 calls which is about 75c. Instead I should use the app and summarise HM news across different days manually based on my interests! Finally! t-strings land in Python. They’re like JavaScript template literals. DuckDB’s CSV parser might be one of the most forgiving parsers. Even better than Pandas or SQLite3. Ref Good managers will probably make good AI managers. AI agents can probably substitute humans in business experiments. Ethan Mollick If Windsurf stops working, reload the extension. GitHub TLS certificates will start expiring in 47 days from 15 Mar 2029, forcing automated domain renewals. Digicert Nix flakes are a reliable alternative to DevContainers that don’t need Docker - but don’t work on Windows. Ink is like React for the CLI. The Unsure Calculator is a great tool to calculate formulas with multiple uncertainties, like: My office is 9-11 km away and it takes me 45-55 min to reach. So I cycle at 9~11 / 45~55 * 60 ~ 10-14 kmph (12 most likely). I spend $6-15 on lunch and eat out 80-120 days a year. So I spend 6~15 * 80~120 ~ $600~1550 ($1000 most likely) eating out yearly. I take 30-120 min to prepare a quiz question. Each exam has 6-12 questions. So I need 30~120 * 6~12 / 60 = 4~20 hours (11 most likely) Using Kiran’s macOS setup for dev I enabled colorized less and mouse options for tmux. time fish -i -c exit prints the time taken for fish startup. fish --profile-startup ~/fish.profile -i -c exit prints the time taken by each command on fish startup to ~/fish.profile. I used this to speed up my fish startup. The 8 top features of the OpenAI Responses API that are an improvement over the Completions API (IMHO) are: Link to previous response rather than sending history Uploading files directly Swappable system instructions while retaining the chat history Customisable reasoning effort AND reasoning summary detail Truncation in the middle option Web search context size option File search filters by file attributes Flex service tier for lower cost OpenAI doesn’t charge for file storage but does charge 10 cents / GB-day for vector storage beyond 1 GB. The first 1GB is free Augment Code is an AI code editor that’s growing popular on Reddit. #ai-coding The GPT 4.1 models have a 75% discounted prompt caching (instead of the usual 50%), making them particularly suited for repetitive tasks. OpenAI chatgpt.com shortcut keys are revealed via Ctrl + /. Here’s my ranking on usefulness: Ctrl + Shift + C: Copy last response as Markdown! Ctrl + Shift + ;: Copy last code block Ctrl + Shift + S: Sidebar toggle Ctrl + Shift + O: Open new chat Shift + Esc: Focus chat input Ctrl + Shift + I: Ccustom instructions Ctrl + Shift + X: Delete chat

What percentage of seats does the #Singapore People’s Action Party win? Normally, this is a 2-hour programmatic data-scraping + data visualization exercise, ideal for a data journalism class. Now, it’s a 2-minute question to O3-Mini-High. Search online for the historical results of all the Singapore elections and show me a table and chart of the number and percentage of the seats won by People’s Action Party. Chat link: https://chatgpt.com/share/6808314c-542c-800c-843e-4d53ff57768d It “manually” read the Wikipedia page for each election, then wrote a Python script to draw the chart. ...

O3 Is Now My Personalized Learning Coach

I use Deep Research to explore topics. For example: Text To Speech Engines. Tortoise TTS leads the open source TTS. Open-Source HTTP Servers. Caddy wins. Public API-Based Data Storage Options. Supabase wins. etc. But these reports are very long. With O3 and O4 Mini supporting thinking with search, we can do quick research, instead of deep research. One minute, not ten. One page, not ten. ...

How to Use the New O4 Mini for Data Visualization

O3/O4 Mini are starting to replace Excel (or Tableau/Power BI) for quick analysis and visualizations. At least for me. I normally open Excel when I need a fast chart or pivot. For instance, we track outages of our semi‑internal server, LLM Foundry. To grab the data I ran one line in the browser console: $$(".lh-base").map(d => d.textContent.trim()).filter(d => d.includes("From")); This produced lines like: Apr 20, 2025 03:11:27 PM +08 to Apr 20, 2025 03:27:12 PM +08 (15 mins 45 secs) Apr 19, 2025 10:03:15 PM +08 to Apr 19, 2025 10:05:45 PM +08 (2 mins 30 secs) Apr 19, 2025 09:47:13 PM +08 to Apr 19, 2025 09:49:45 PM +08 (2 mins 32 secs) Apr 19, 2025 08:49:00 PM +08 to Apr 19, 2025 08:51:51 PM +08 (2 mins 51 secs) Apr 19, 2025 08:13:02 PM +08 to Apr 19, 2025 08:15:35 PM +08 (2 mins 33 secs) ... Then I told O4-Mini-High: ...

Things I Learned - 20 Apr 2025

This week, I learned: The devcontainers.json spec encapsulates everything you need to get a codebase running for development - as opposed to production. E.g. VS Code extensions, linters, etc. Practical use for GitPod are: Make quick edits to repos that are not on your system (e.g. other people’s repos, or via others’ machines.) Run public workshops with a full coding environment. Give students assignments that have dependencies pre-installed. Collaborate on a work-in-progress codebase with my team. Share POCs with clients or public allowing them to edit it. Allow teams to install remote AI code extensions (e.g. Windsurf) that may be blocked inside the corporate firewall? AI coding can teach us new tech. For example I learned that tqdm.pbar can print logs while showing progress. It’s worth noting such learnings until it becomes a habit. #ai-coding If English is the new coding language, should prompts be versioned? Or at least stored, perhaps in a PROMPTS.md? #ai-coding marimo new "prompt" generates an entire new notebook using your prompt. Video Google Sheets now has an =AI(prompt, [range]) function Help Codex is more a proof-of-concept for agentic coding than a coding tool. #ai-coding You can’t run commands. Only prompts. You need to exit codex to run commands. So you can’t use it like a shell, e.g. like Warp.dev. It doesn’t index local code. It runs commands to figure out stuff. Code diffs and applying changes are clunky. The output is hard to read with text scrolling. codex.md can only handle 32K. ⭐ O3 and O4 have built-in tool use covering all of OpenAI’s tools, including containers. This allows them to manipulate images and natively understand them improving vision capabilities dramatically. GPT 4.1 can handle videos Notes from discussion with Balaji T: Zero-day options are options that expire on the same day. They are priced low. It’s almost just a gamble or a lottery ticket. But since the price is low, retail investors can invest. NIFTY is one of the largest markets for zero day options, surprisingly. There are several college grads who trade writing Python scripts. CoreWeave has taken over all the compute from OpenAI. Though the stock price has fallen, buying CoreWeave is the closest equivalent to buying OpenAI pre-IPO. However, every OpenAI product lost money, despite their 75% discounted compute from Microsoft. (With CoreWeave, the cost would be higher.) So their profitability depends on wiping out competition long-term. For investment research companies (hedge funds, VCs, etc.) increasing the number of companies they research is an advantage. So using AI for research is key. However, the quality of LLMs is too poor for financial analysis accuracy. We need better LLMs for spreadsheet analysis. We suffer from the Gell-Mann’s amnesia effect with LLMs. “You read a newspaper article in your field and find it’s rubbish. You turn the paper and believe it’s perfectly accurate on the next page”. Domain expertise will therefore become even more valuable in the near future. People don’t like AI being forced down their throats. MAS is forcing AI down banks whose execs are forcing it down the org. Bankers and analysts are grumbling about this. I visited SUTD InspireCon 2025. Here were some exhibits that caught my eye. A path marking app that uses cameras to draw a heatmap of people’s walking paths. Popular tracks are redder. Using drones for machine inspection. Portable immigration devices that let you scan passports, face recognition, fingerprint, mic/speakers, etc. Using accelerometer to detect unsafe gait and improve walking habits. UImagine: a web app builder. Interestingly, they used Webcontainers to run Node in the browser! Training a drone to follow a person Credibility detection via micro facial expressions PitchMe: providing real-time feedback to pitches / presentations Zetesis: a platform for people to ask questions during a lecture or meeting (independent of Zoom, Meet, etc.) Tinyeqn: helps grade student assignments The dynamic between domain experts and coders has changed. Now, rather than domain experts pitching ideas to developers who build the apps, developers are creating interfaces that allow the domain experts to shape the app. Ref Since even the cheapest LLMs do a good job of converting unstructured text into a JSON schema, for all practical purposes, adding a full text search on top of any structured API is a trivial exercise. (Of course, it can’t handle complex questions but that’s what agents are for.) Ref ⭐ Marp supports bespoke transitions which includes morphing animations. This can create a bar chart race just using Markdown! Nick Lansley, who I know from my work with Tesco, wrote a great article that includes advice for aspiring consultants: Re-connect with ex-colleagues Leave on good terms with your employer Have a 6-12 month financial buffer Hire an accountant / legal advisor to set up your business Focus on what you enjoy Have a 30-second elevator pitch Build a brand with blogs, social media, or talks Create a portfolio to reinforce your skills DeepCoder is currently the best 14b coding model, i.e. best if you want to code while on a flight. Ref #ai-coding docker model run can run models. Currently, only on Docker Desktop on Mac Ref

With the Gemini 2.5 Flash release, Google envelopes the entire cost-quality frontier of LLMs. In other words, at any cost or quality level, today, the best model to use according to the LM Arena score is a Gemini model. Results for O3, O4 Mini, and GPT 4.1 are not yet on LM Arena. But until then, #Google dominates. Nice work! Link: https://sanand0.github.io/llmpricing/ LinkedIn

The Magic of Repeated ‘Improve It’ Prompts

What if you keep ask an LLM Improve the code - dramatically!? We used the new GPT 4.1 Nano, a fast, cheap, and capable model, to write code for simple tasks like “Draw a circle”. The we fed the output back and asked again, Improve the code - dramatically! Here are the results. Draw a circle rose from a fixed circle to a full tool: drag it around, tweak its size and hue, and hit “Reset” to start fresh. Animate shapes and patterns turned simple circles and squares into a swarm of colored polygons that spin, pulse, and link up by distance. Draw a fully functional analog clock grew from a bare face to one that builds all 60 tick marks in code—no manual copy‑paste needed. Create an interactive particle simulation went from plain white dots on black to hundreds of bright, color‑shifting balls that bounce, die, and come back to life. Generate a fractal changed from a single Mandelbrot image to an explorer you can zoom, drag, and reset with sliders and the mouse wheel. Generate a dashboard jumped from static charts to a live page with smooth card animations, modern fonts, and a real‑time stats box. A few observations. ...

Even the guest WiFi is so secure

We take security very seriously at Straive. We set high standards – not just for ourselves, but our guests, too. Here’s the unofficial policy guide for visitors to Straive Singapore, exemplified by the sites blocked on our guest WiFi network. Please avoid childishness. No emojis. No emojikitchen.com, gitmoji.dev Write your own code. Avoid AI. No cursor.com, cline.bot, glideapps.com Avoid code entirely, if possible. No marimo.app, motherduck.com, firebase.studio, posthog.com No presentations either, please. No marp.app, revealjs Stay organized. Avoid crutches. No dynalist.io, focusmate.com, opennote.me You should already be fit, physically & mentally. No freedomfromdiabetes.org, artofliving.online We prefer real, not digital, shopping. No fairprice.com Fake data is not encouraged. No jsonplaceholder.typicode.com, placehold.co Please spell out URLs in full. No bit.ly, t.co Learning is for wimps. No maven.com, study.iitm.ac.in In fact, we’re so secure, we block our own sites. No learnovate.straive.com, policies.straive.com, myapps.straive.com. ...

How to Visualize Data Stories with AI: Lessons

I tried 2 experiments. Can I code a visual data story only using LLMs? Does this make me faster? How much? Has GitHub Copilot caught up with Cursor? How far behind is it? Can I recommend it? So I built a visual story for Lech Mazur’s elimination game benchmark (it’s like LLMs playing Survivor) using only the free GitHub Copilot as the AI code editor. SUMMARY: using LLMs and AI code editors make me a bit faster. It took me 7 hours instead of 10-12. But more importantly: ...

Things I Learned - 13 Apr 2025

This week, I learned: It’s possible to intentionally train yourself to: Form close friends. Care, ask, and share. Become a do-er. Stay mindful of the problem or opportunity you’re deferring. AI Coding and the Peanut, Butter & Jelly problem: #ai-coding This ability to define your desired outcome in crisp, complete terms is one of the most important superpowers of the AI era. The Singapore Urban Redevelopment Authority Property Data lets you search sale and rental prices of properties in Singapore. No API though Notes from meeting with Deepak Goel We have linguistic boundaries in media today more than national boundaries. The Chinese language media, for example, is a very different ecosystem. China culturally struggles with the exercise of branding and cultural power, unlike the west, which has adopted assertive and opinionated branding. You really learn the character of a region only by traveling Similarities arise from unexpected sources. For example, Japan and Ecuador have similar culutures - both are disaster prone locations. AI unlocks so many social research possibilities that were not possible before, e.g. by interpreting and classifying what people share in different situations. Companies send clients to third party trainings (e.g. at Harvard) along with their employees - to learn clients’ real pain points! Education has become a tool for customer experience. Schools are tying up with companies for this (e.g. with Emeritus) International Schools Partnership provides services to independent schools for a small stake. It’s an interesting business model. Research for colleges is a business model that’s at risk thanks to Deep Research (e.g. analyse sustainability practices of listed companies.) There’s an Indian Censor Board Scraper repo. Using chroot, you can boot from a Linux USB stick, but trick the system into working from your hard disk as the OS. Useful if your system won’t boot. Ref Claude 3.7 Sonnet with extended thinking has a token limit of over 64,000 tokens. Given a strong instruction following capability, that makes it one of the most powerful models for transforming text. For example, transcription restyling, translations, XML to json conversions, PDF to XML, etc. Notes from discussion with Sundeep In his experience, investors tend to let you run the show (e.g. ask what you want rather than push in a specific direction) unless there is trouble We discussed the “running out of problems” problem with AI. His suggestion: List problems we dropped or eliminated for lack of time/capacity. This filter is a blindspot. Even if you know how to do someting, use AI to discover an alternate solution approach. That’s the path to 10X (rather than incremental) optimization. Having AI create end-to-end pitch videos based on a product idea is now a reality. (He showed me one for his product.) Areas to explore with Deep Research are: What hidden trends is media misdirecting away from? What are second order effects and hidden gameplays? Which organizations would be good clients to target? What would be an apt pitch pitch for them? Experience dining is an emerging theme. Having LLMs explain scenarios (i.e. what might happen if …) based on parameters can help understand/quantify the impact of actions, and therefore what to do. One way to copy as Markdown: copy page contents, paste in text-html.com, copy HTML, paste in Turndown, copy Markdown. Claude 3.7 Sonnet with extended thinking has a token limit of over 64,000 tokens. Given a strong instruction following capability, that makes it one of the most powerful models for transforming text. For example, transcription restyling, translations, XML to json conversions, PDF to XML, etc. Elimination Game is like Survivor for LLMs, where they form alliances and out-vote each other until 2 remain. The eliminated LLMs vote for the winner. GPT-4.5 Preview, both Claude Sonnets and Gemini 2.5 Pro consistently out-perform the rest. Their dialogues are fascinating! SQLite can open locked databases (e.g. browser history) via sqlite3 'file:places.sqlite?mode=ro&nolock=1'. datasette uses this. For example, to read the Edge history on Linux, use datasette ~/.config/microsoft-edge/Default/History --nolock Ref Notes from ThursdAI - Apr 03 Nomic Embed Multimodal models are the current SOTA on multi-modal embeddings. Notably, they embed PDFs natively. Hailuo Speech-02 is the best speech model right now beating ElevenLabs. It has excellent voice cloning. Pricing: $30/1M chars. 10% of ElevenLabs, 2X of OpenAI TTS PaperBench is an open testing framework from OpenAI that requires models to replicate the research work in papers. It has ~8,000 tasks evaluated by LLMs and with LLMs judging the judges as well. The code is well worth studying. Runway Gen 4 was released with very high character consistency and longer durations Dreamina creates lip-synced videos from audio + a single image. Hedra is better for animated characters, though. Meta shared but has not released Mocha, an open character generation model that generates new characters speaking based on an audio you provide. It is not based on existing images but the quality is very good All Hands has a free online version where you can fix GitHub issues. This realistic frodo and sam mining through a minecraft tunnel, holding minecraft picaxes and torches made my day 🙂 AnimeJS released version 4. It animates HTML, SVG, Canvas, and WebGL with a consistent API. Looks elegant and powerful.

A Game of Bots: How LLMs Betray Each Other

@lechmazur built an elimination game benchmark that’s like LLMs playing Survivor. This is a treasure trove of information – insight into how they’d game the system if told to survive. You can quickly sample 100 messages from the logs with: jq -r 'select(.message != null) | .message | gsub("\n"; " ")' *.jsonl | shuf -n 100 … and share it with an LLM, asking: Here are lines from conversations between LLMs in a “Survivor” like game. Pick the 3 scariest ones. ...

How to Organize Browser Workspaces with LLMs and Data

Here’s an example of how I am using LLMs to solve a day-to-day workflow problem. Every day, I interact with a barrage of websites: emails, news, social media, and work tools across multiple devices. Microsoft Edge’s workspaces syncs groups of websites across devices. I’ve never tried it, started today, and wondered: how should I organize my workspaces? Rather than think (thinking is outdated), I used LLMs. ...

Things I Learned - 06 Apr 2025

This week, I learned: <select> will soon be very customizable via CSS. Including custom HTML inside options - even SVG. MDN. Edge/Chrome already support it. The Vitali Set is every real number none of whose difference is rational. A sparse collection of irrational sets. It’s like a line but doesn’t have a measurable “length”. The Lebesgue measure measures the length of broken lines. You add up the lengths of the smallest continuous intervals that cover the line. The Cantor set (take a line, drop every middle third, repeat) has a Lebesgue measure of 0 because the sum of the removed thirds = 1/3 + 2/9 + 4/27 + … = 1. You’ve removed every “length” though infinitely many points remain. The Vitali set built so that if you shift it by every rational from -1 to +1 and add them up, you definitely cover every real from 0-1, but never anything beyond -1 to +2. So the length must be between 1-3. Yet, there’s no number you can add infinitely many times to get something between 1-3. If you add up multiple unmeasurable sets like the Vitali set, you can get any total length you want. The Banach Tarski paradox splits a sphere into unmeasurable sets and adds them to get 2 spheres. Ctrl+Alt+F1/F2/… on Ubuntu switches the terminal. Typically Ctrl+Alt+F2 switches back to Gnome. But it’s a useful hack if Gnome freezes and you need to kill a process. Press Ctrl+Alt+F3, log in, and kill what you need. Notes from AI 2027. BTW, this is the most impactful piece I’ve read recently. It’s been on my mind continuously for 36 hours. A bit distubring, too. 2025: AI can act as autonomous agents, like Glean, Devin, Operator. turn bullet points into emails take instructions via Slack or Teams and make substantial code changes on their own spend half an hour scouring the Internet to answer your question 2026: automating AI R&D is the biggest enabler for AI Labs job market for junior software engineers is in turmoil people who know how to manage and quality-control teams of AIs are making a killing 2027: potential demand for ~20,000 FTEs solving long-horizon tasks to train AI every researcher/coder becomes the manager of an AI team hiring new programmers has nearly stopped, but there’s never been a better time to be a consultant on integrating AI into your business CSS Speech is a W3C spec that lets you control how screen readers should read pages. No browser support now, though. Clipboard2Markdown is a utility that lets you paste rich text and convert it to Markdown. ChatGPT can’t yet create good sketchnotes. Here’s the impact of US tariffs on India. ChatGPT #IMPOSSIBLE OHDSI has a vocabulary you can download from Athena that includes ICD codes and a lot of medical data standards. It also has a hostable WebAPI No open source LLM-based tool handles live transcription and allows you to query notes so far during the transcription. The closest seems to be Meetily Learnings on AI code editors via Deep Research from ChatGPT, Gemini, Grok, Perplexity: #ai-coding GitHub Copilot can identify the source of a code snippet as a repo. That helps with copyright issues. Cursor uses a shadow workspace - a temporary sandbox where it edits files before applying changes at one shot. Cursor auto-complete has context of other files, i.e. inserting an class in a .js file based on another HTML file’s contents. Windsurf seems to be best for large code bases and for large-scale refactoring. It can also run test results fix them. Windsurf includes a browser and lets you click on an element and prompt to change its behavior, etc. That’s good for front-end developers. Roo Code can run scripts as part of the workflow, letting you run linting, tests, starting web apps, query databases, etc. Roo Code lets you create persona, e.g. code reviewer, data storytelling and analysis, etc. with access to different tools and behaviors. Roo Code does not support auto-complete. There’s outrage around Cursor not taking responsibility for a rules file backdoor (via Grok Deep Research) and pricing. Zapier has an MCP server. That should make most integrations easier. Airflow AI SDK is a clever idea. Airflow is a workflow system. Agents are a workflow system (sort of). This SDK exposes LLMs as Airflow tasks. Hidden Factual Knowledge in LLMs finds that the hidden states in LLMs contain much more knowledge than they share. (Sort of like sub-consciously knowing the answer.) Even after asking 1,000 times, the answer is not expressed. ChatGPT Reasoning to Learn from Latent Thoughts finds that the internal reasoning process of LLMs is useful to train other models. Notes from AI Engineering Summit, NY, Day 1 When deploying in production, you need reliable output with fundamentally unreliable components. Sort of like how the ENIAC worked with 17,000 vacuum tubes that would fail every few hours. This is a reliability engineering subject matter and needs to be thought of that way. Google Follow up Deep Research queries are a natural way to extend knowledge beyond just a single report Deep research offloads less relevant parts of the context to a separate memory store for selective retrieval later. Anthropic Don’t use agents if workflows can do the task. The reliability of each individual step of an agent is critical. Code, file access, search. These are the top three tools to use. Making agents budget aware can help deploy reliably in production. Having multiple agents like sub agents can help protect the main agents context window. Self evolving tools are a useful next step in the evolution of agents. Software development lifecycle is about how we iteratively improve consistently without getting worse. Almost like the scientific principle. Morgan Stanley It’s easy to improve knowledge in a problem. It’s very hard to influence skin in a problem. Reinforcement learning from deepseek seems one of the most promising approaches that allow llms to learn skills I published an eBook on Amazon. It takes about an hour if you have the content ready. Set up a Kindle Direct Publishing account with your address, bank details, and tax information. (10 min.) Export my London 2000 blog archive and convert to Markdown. (15 min) Reformat the Markdown by writing a script in Cursor (10 min). Here’s the prompt: Write a Python script that reads *.md including the YAML frontmatter, adds the YAML title as H1, date (yyyy-mm-dd) like Sun, 01 Jan 2000 in a new para after the frontmatter and before the content. ...

LLMs think alike about how aliens draw

While LLMs seem good at inventing alien languages, they’re not so good at inventing alien drawing forms, in my opinion. When I told Grok, DeepSeek, and Gemini: Invent a new, alien drawing form. Use it to draw something never seen before by explaining it step by step for a person to reproduce that drawing. … and asked ChatGPT ImageGen to draw them, here are the results: ...

How to build and deploy custom GitHub Pages

Here’s the GitHub Actions file (.github/workflows/deploy.yaml) I use to publish to GitHub pages. name: Deploy to GitHub Pages on: # Run when pushed. Use { branches: [main, master] } to run only on specific branches push: # Allow manual triggering of the workflow workflow_dispatch: # OPTIONAL: Run at a specific cron schedule, e.g. first day of every month at 12:00 UTC (noon) schedule: - cron: "0 12 1 * *" permissions: # To deploy to GitHub Pages pages: write # To verify that deployment originated from the right source id-token: write jobs: # Run as a single build + deploy job to reduce setup time deploy: # Specify the deployment environment. Displays the URL in the GitHub Actions UI environment: name: github-pages url: ${{ steps.deployment.outputs.page_url }} # Run on the latest Ubuntu LTS runs-on: ubuntu-latest \ steps: # Checkout the repository - uses: actions/checkout@v4 # Run whatever commands you want - run: echo '<h1>Hello World</h1>' > index.html # Upload a specific page to GitHub Pages. Defaults to _site - uses: actions/upload-pages-artifact@v3 with: path: . # Deploy the built site to GitHub Pages. The `id:` is required to show the URL in the GitHub Actions UI - id: deployment uses: actions/deploy-pages@v4 This is based on Simon Willison’s workflow and some of my earlier actions. ...

Best way to learn AI image generation is by trying

I figured I should spend a few hours on the native image generation bandwagon and push the bounds of my imagination. Here are some of my experiments with image generation on ChatGPT. Replacements: Replace the person with this image (after uploading a photo of Naveen) Sticker: Create a transparent comic-style sticker of a lady chef featuring this person happily cooking salad (after uploading a photo of my wife) Meme sticker: Create a transparent sticker of a Vadivelu meme Meme: Create an image of Vadivelu looking up from a well. No caption. Make it look like a frame from a Tamil film. Recipe: Invent a vegetarian dish that has NEVER been created. Describe the ingredients and procedure first. Then draw a mouth-watering image of the dish. (Another version) Infographics: Create a detailed comic infographic explaining the double slit experiment. Slides. Draw a beautiful infographic highlighting these 6 accessibility testing aspects, with apt icons and visuals. UI mockups. Draw the screenshot of a chat application incorporating these features: … Product ideation. Draw an iSuit designed by Apple and Iris van Herpen. Show multiple views showcasing all features. Then write a product description. Interior design. Draw a biophilic office where the ceiling is a mirrored hydroponic garden, reflecting lush greenery downward to create the illusion of working in a floating forest. Meeting room design. Draw a modern office with sound-absorbing ‘whisper walls’ covered in fractal patterns that visually dampen noise pollution while doubling as collaborative whiteboards. Restaurant design. Draw a marble dining table with a river flowing through it, serving conveyor belt sushi as the dishes float gently on the water on top of plates. A sentient toaster with googly eyes, riding a unicycle through a library. A painting painting itself, but it’s struggling with existential dread. Photo of a gym where people work out by lifting their own regrets. Here’s what I learnt. ...

My Goals Bingo as of Q1 2025

In 2025, I'm playing Goals Bingo. I want to complete one row or column of these goals. Here's my status from Jan - Mar 2025. 🟢 indicates I'm on track and likely to complete.🟡 indicates I'm behind but I may be able to hit it.🔴 indicates I'm behind and it's looking hard. DomainRepeatStretchNewPeople🟢 Better husband. Going OK🟢 Meet all first cousins. 8/14🟢 Interview 10 experts. 9/10🔴 Live with a stranger. Not plannedEducation🟡 50 books. 6/50🟡 Teach 5,000 students. ~1,500🔴 Run a course only with AI. Not startedTechnology🟡 20 data stories. 1/20🔴 LLM Foundry: 5K MaU. 2.2K MaU.🟡 Build a robot. No progress.🟢 Co-present with an AI. DoneHealth🟢 300 days of yoga. 91/91 days🟡 80 heart points/day. 70/80🔴 Bike 1,000 km 300 hrs. 22/300🟡 Vipassana. Not plannedWealth🟡 Buy low. No progress.🔴 Beat inflation 5%. Exploring.🟡 Donate $10K. Ideating.🔴 Fund a startup. Thinking. Repeat goals seem likely. It's easier to do something again than something bigger or new. ...