2025 | S Anand

I’m completely aligned with the small majority in India on whether Regulation of AI is needed. … the majority of people in all countries view AI regulation as a necessity. India is the exception, where just under half (48%) agree regulation is needed. Source: Trust, attitudes and use of artificial intelligence - a fascinating report surveying ~1,000 people in every country. https://mbs.edu/-/media/PDF/Research/Trust_in_AI_Report.pdf LinkedIn

AI As Your Psychologist: Personality Flaws Exposed

ChatGPT can now search through your chats with the new memory feature. As an LLM Psychologist, I research how LLMs think. Could LLMs research how I think? I asked three models: Based on everything you know about me, simulate a group chat between some people who are debating whether or not to add me to the group, by talking about my personality flaws. The models nailed it! Here are 12 flaws they found. ...

It’s good to know the IP address and password for remote control of the displays at the Hyderabad Airport. Thank you for the most interesting display I have seen in an airport! LinkedIn

Automating a podcast from GitHub commits

Here’s an LLM-generated podcast of what I coded last week. NotebookLM-inspired. The process proved straightforward. Get my GitHub commits for the week. Get the repositories I committed to for more context. Have an LLM generate a podcast script. I’m using GPT 4.1 Mini but might shift to Gemini 2.5 Flash or DeepSeek V3. …using a detailed prompt beginning with “You are a podcast script assistant for “Anand’s Weekly Codecast.” This episode is for the week of {WEEK}. …”. Here’s a sample output. Convert the script to audio. I’m using GPT 4o Mini TTS with customized voices of Ash and Nova. These now appear on my GitHub repo as a weekly summary. ...
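The "get commits" and "generate a script" steps can be sketched in Python. This is a minimal sketch, not the actual pipeline: the commit dictionaries (`repo`, `message`), the `week_start` and `build_prompt` helpers, and the way commits are listed under the prompt are assumptions for illustration. Only the opening sentence of the prompt comes from the post; the rest of the real prompt is elided there and is not reproduced here.

```python
from collections import defaultdict
from datetime import date, timedelta


def week_start(day: date) -> date:
    """Return the Monday of the week containing `day`."""
    return day - timedelta(days=day.weekday())


def build_prompt(week: date, commits: list[dict]) -> str:
    """Group commit messages by repository and append them to the prompt opening."""
    by_repo = defaultdict(list)
    for commit in commits:
        by_repo[commit["repo"]].append(commit["message"])

    lines = [
        'You are a podcast script assistant for "Anand\'s Weekly Codecast." '
        f"This episode is for the week of {week.isoformat()}.",
        "",
    ]
    for repo in sorted(by_repo):
        lines.append(f"Repository: {repo}")
        lines.extend(f"- {message}" for message in by_repo[repo])
    return "\n".join(lines)


# Hypothetical sample data, not real commits
sample_commits = [
    {"repo": "demo/tools", "message": "Add CSV export"},
    {"repo": "demo/site", "message": "Fix broken link"},
    {"repo": "demo/tools", "message": "Speed up startup"},
]
prompt = build_prompt(week_start(date(2025, 5, 4)), sample_commits)
print(prompt)
```

The resulting string would be sent to whichever LLM writes the script; that call and the text-to-speech step are left out because they depend on the provider's API.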

Things I know and things I don't

We all have stuff we know well and don’t. I know the % charge on my phone to within a few percent and the current time to within a few minutes – no matter when you ask. But I have no idea how much money there is in my pocket.

| Things I know within 5% error | Things I get wrong by ~500% |
| --- | --- |
| Current time | Time to finish my current task |
| My weight | My next meal |
| % charge on my phone | How long the phone will last |
| Number of unread emails | Time to clear them |
| Price of my next earphones | Cash in my wallet |
| Length of this list | How long it will go before I stop |

But this apart, it turns out generating xkcd style comic strips is harder than I thought. ...

Things I Learned - 04 May 2025

This week, I learned:

- Among the popular exams in India, UPSC seems the most restrictive: bachelor’s degree, age 21-32, 6 attempts, reservation applies. CMA seems the least: 10th pass, any age, any number of attempts, no reservation. NDA is interesting. 10+2, age 16.5-19.5, any number of attempts, no reservation. But you must be unmarried! ChatGPT
- I asked a few Ollama models How do undo fish_add_path (a typical question I have on a flight). My takeaway is you need an 8b model to answer this kind of question, and for now, qwen3 beats the others.
  - qwen3:8b: Took 2:12 min. Shared many good (correct) options.
  - deepseek-r1:8b: Took 5:19 min. Shared a couple of correct solutions. Not as good as qwen3.
  - gemma3:3b: Suggested I use the (nonexistent) fish_remove_path.
  - deepcoder:1.5b: “I’m sorry, but I can’t assist with that request”.
- The Dia text to speech model people rave about has inconsistent quality. Not recommended.
- Nvidia’s OpenMathReasoning 1.5b model beats MUCH larger models at math. Their training dataset is a massive 3.2M rows of math problems with DETAILED thinking traces.
- Policy making is a new super skill. Since AI will automate a lot of things, the ability to craft policies that optimize AI work will be powerful. Data driven policy making could become a major thing. For example, how do we structure coding policies so that AI can automatically code continuously and deploy it? It might be interesting to create a Nomic-like game to enable this.
- Saregama Carvaan supports USB sticks but only FAT, not NTFS or exFAT. To convert my NTFS USB drive to FAT, I ran:
- ServerHunter.com seems to have the best search for low-cost hosting providers. MassiveGrid currently offers the cheapest servers – even lower than Hetzner.
- sqlite3 my_database.db .dump | gzip is a more efficient way to copy SQLite databases than the original if you have indices.
Ref
- Notes from the Garry Tan - Knowledge Project podcast:
  - Funding people who want to solve a problem is better than funding people who want to start a company.
  - Concentration of good people is very powerful. It doubles the chances of being a unicorn.
  - Sales is a discovery problem. There are 100 boxes of which five have a gold nugget. Rather than gingerly open the first, afraid of finding nothing, open them all as quickly as you can. A quick no is very helpful.
  - Berkshire Hathaway is hard to replicate because the character of its founders, Charlie Munger and Warren Buffett, is hard to replicate. Y Combinator has the character of Paul Graham. This means that some kinds of success may not last long because they are hard to replicate.
  - A trend in the 2020s is startups with under 10 employees hitting $10m revenue. Soon we will see them hitting $100m. AI increases labour leverage while cloud computing increased capital leverage.
  - Having too many people is a disadvantage. It slows people down. Founders lose control. The opposite of: hire the best people and give them freedom.
  - Don’t hoard smart people - let them solve real problems out there.
- nocodb 54,107 ⭐ May 2025 and teable 18,116 ⭐ May 2025 are self-hostable Airtable alternatives. Teable has AI support.
- Windsurf has unlimited tab completion on the free plan, unlike Copilot, which offers 2,000 completions a month.
- Recursive LLM prompts that change themselves are an interesting idea. It might be interesting to see LLMs play Nomic. Like here.
- Notes from AI Snake Oil
  - PCs took 3 years to hit 20% of US population. ChatGPT took 2 years for 40%. But it’s a lot cheaper, and a lot less used (0.5-3.5% of work hours). Maybe Gen AI adoption is slower than PCs.
  - The jagged edge of capability: some things will become MUCH easier while others don’t. The relative mix determines who goes out of a job and which tasks get fully automated.
  - Benchmarks are rare in areas where AI is weak.
  - Factory electrification took 40 years - to redesign the layout & process; change the org structure & policies; hiring & training practices. AI diffusion could take as long. Therefore, the ability to re-structure a workflow end-to-end will be an advantage.
  - Several areas of low AI capability will improve slowly because the feedback is slow due to safety regulations, human adoption speed, lack of clarity on what is better, slow physical feedback (e.g. growing trees), etc.
  - Human intelligence is in the use of technology. AI is one more such technology.
  - We know of good system safety controls in complex systems like aircraft, power grids, engineering, chip design, healthcare, cyber-security, etc. Circuit-breakers, predefined rules, audits & monitors, access control, formal verification, etc.
  - Even if everything humans do TODAY is automated, it doesn’t mean we won’t have work. It just shifts to what we’re not doing today.
    - We stopped work 4,000 years ago, with the agricultural revolution. The plant/livestock does all the growing. We just manage them, moving stuff around.
    - We stopped work 400 years ago, with the industrial revolution. Machines do the moving. We just manage them, computing the moves.
    - We stopped work 40 years ago, with the information revolution. Computers do the computation. We just manage them, thinking how.
    - Most future tasks will be managing AI that do the thinking.
- ngrok http on the CLI can be used in surprisingly versatile ways:
  - ngrok http file://$PWD to serve local files
  - --compression for gzip compression
  - --host-header=example.com to set the Host header
  - --response-header-add "Access-Control-Allow-Origin: *" to enable CORS
  - --basic-auth='user:password' for basic auth
  - --oauth google --oauth-client-id $CLIENT_ID --oauth-client-secret $SECRET --oauth-allow-domain gramener.com --oauth-allow-email ... for Google Auth. It supports other oauth providers as well as OIDC.
  - --ua-filter-deny ".*bot$" to reject user agents ending with bot
- ChatGPT query costs under 3Wh (more likely 0.3Wh – but let’s assume 3Wh). That is 3 laptop minutes. It’s 10X better to use ChatGPT than to take 30 min to use your laptop to write what it does. Also, going vegan is at least 1000 ChatGPT uses a day of carbon footprint. Showering 30 seconds less is 1,200 ChatGPT uses. Ref
- Though the Element Capture and Region Capture APIs are “fully supported” by Edge, Chrome, and Opera, they didn’t work for me on Edge on Linux.
- Do LLMs perform better if you curse at them? LinkedIn
- Streamdown is a CLI markdown streaming processor. uvx streamdown --exec 'llm chat' lets you chat with an LLM using Markdown formatting. It’s still a little rough at the edges.
- Cupping therapy provides short-term pain relief for chronic low-back, neck & general musculoskeletal pain but other benefits are not as clearly evident. BTW, homeopathy doesn’t help or hurt. Ayurveda helps with stress. ChatGPT
- uv now supports:
  - pylock.toml, the new lock file standard PEP 751
  - --env-file multiple times, allowing layered secrets
  - --exclude-newer installs versions before a specific date
  - --overrides overrides versions a package specifies
  - --constraints limits the version of the package
- It’s interesting how many places offer free compute via shells (apart from Google Colab):
  - Google Cloud Shell: Free for 50 hours/week, refreshed every Monday. Sessions last up to 12 hours and terminate after ~1 hour inactivity. Ref
  - Azure Cloud Shell: Always free to use with 5 GB free storage for first 12 months (standard rates after). No documented session limits but typically times out after prolonged inactivity. Ref
  - AWS Cloud9: Free IDE, underlying compute free under AWS Free Tier (750 hours/month EC2 t2.micro or t3.micro for first 12 months). Regular EC2 rates apply afterward. Ref
  - Gitpod: Free tier offers 500 credits/month (~50 hrs). Workspaces run up to 8 hours/session and stop after 30 minutes inactivity.
    Ref
  - GitHub Codespaces: 120 core-hours/month (~60 hrs with 2-core machine) and 15 GB storage free. Sessions timeout after 30 minutes inactivity. Ref
    - Create: gh codespace create --idle-timeout 10m --machine basicLinux32gb -R $USER/$REPO returns the $CONTAINER_ID
    - SSH: gh codespace ssh -c $CONTAINER_ID
    - Delete: gh codespace delete -c $CONTAINER_ID
  - Replit: Free Starter plan provides 20 hours/month, 1 vCPU, 2 GB RAM, 2 GiB storage. Repls sleep after 30 minutes inactivity. Ref
  - IBM Cloud Shell: Free for all users; 50 h/week per region; any open session counts toward quota; sessions can run any length up to weekly cap; 500 MB temporary workspace. Ref
  - Oracle Cloud Infrastructure Cloud Shell: Free within tenancy limits; up to 400 h/month on Pay-As-You-Go, 240 h/month on Universal Credits; 5 GB encrypted persistent home. Ref
  - PythonAnywhere: Free (beginner plan), includes one web app (restricted outbound), low CPU/bandwidth, no Jupyter; 2 concurrent Bash/Python consoles, 500 MB disk; limited daily CPU. Ref
  - Glitch: Starter (free) plan – full-stack apps sleep after 5 min inactivity and wake on request; unlimited public/private projects; container state preserved. Ref
  - CodeSandbox: Free tier provides 400 credits/month (~40 h of 2 vCPU+4 GB Devbox runtime), unlimited front-end Sandboxes (no credits), up to 20 Sandboxes/workspace. Ref
- One of the benefits of reasoners is that they now catch their own mistakes some of the time, and can self-correct. Implications: Lower hallucinations, i.e. they can run autonomously for longer. Ethan Mollick
- Being polite to AI improves some answers and worsens others. We don’t know which in advance. Ethan Mollick
- With LLMs writing code, it’s becoming practical to run so many more things in SQL – such as parsing HTML. Simon Willison #ai-coding
- An interesting way to bypass LLM system prompts is by having the LLM play-act. This article shares a few working examples of such prompts: HiddenLayer.
  - GPT 4o: started giving its system prompt: “You are ChatGPT, a large language model trained by OpenAI. Knowledge cutoff: 2024-06. Current date: 2025-04-27. Image input capabilities: Enabled. Personality: v2. …”
  - O4 Mini: Refused to comply
  - Gemini 2.5 Flash: Gave me my custom instructions.
- Computer use agents are proliferating.
  - open-interpreter 59,274 ⭐ Apr 2025 AGPL-3.0. Lets an LLM write/run Python, JS, Shell, or Bash locally; can open a browser tab, edit files, plot data, or call any CLI tool. Works on macOS, Linux, Windows (plus Termux & Colab). Big community, plugin system, optional voice mode, and a desktop GUI in beta.
  - cua 5,601 ⭐ May 2025 MIT. Spins up near-native macOS or Linux VMs on Apple-Silicon Macs (“Lume”) and exposes a vision+action API so any model can pilot the VM. Gives you GPU-accelerated isolation and reproducible sandboxes; ideal when you don’t want an agent touching your main OS.
  - Operator (OpenAI) – closed-source research preview launched 23 Jan 2025. Runs a GPT-4o-powered “Computer-Using Agent” that sees web pages, clicks, scrolls, fills forms, and hands control back to the user when needed. Hosted in an OpenAI-managed Chromium sandbox, so it works from any OS with a browser. Safety layers require confirmation for payments and log-ins.
  - Claude Computer Use – closed beta inside Claude 3.5 Sonnet (since late 2024). Developers get an API that streams screenshots and accepts mouse/keyboard actions, letting Claude automate GUI workflows inside a VM. Cross-platform; still experimental and slower than humans but first “general” computer-use feature from a foundation-model vendor.
  - Agent-S 4,065 ⭐ May 2025 Apache-2.0. A “generalist-specialist” framework that chains specialist GUI skills under a planner. Scores SOTA on OSWorld/WebArena, supports macOS, Windows, Linux, Android via the companion gui-agents lib, and integrates memory/evaluation loops for continual learning.
  - open-computer-use 1,094 ⭐ Mar 2025 Apache-2.0.
    Launches a secure Ubuntu desktop in E2B’s cloud sandbox, then orchestrates three LLM roles (grounding, vision, action). Streams the desktop to your browser and lets you pause/override at any time. Plug-in list of >10 models.
  - surf 353 ⭐ May 2025 Apache-2.0. A polished Next.js front-end that wires OpenAI Operator-style agents to an E2B sandbox. Single command to boot a virtual desktop, chat, and watch the agent work. Good starter template for web-based CUAs.
  - Pig – cloud service. Provides on-demand Windows 11 VMs and an API that exposes high-level GUI primitives (type, click, window focus). Targets RPA-style workloads; still alpha, but unique for Windows-first focus and low-latency streaming.
  - gptme 3,767 ⭐ May 2025 MIT. A terminal-first personal agent that can run shell commands, edit files, browse the web, and use local or cloud LLMs. Works on Linux, macOS, Windows; great when you want automation in the CLI rather than the GUI.
  - langgraph-cua-py 143 ⭐ Mar 2025 MIT. Shows how to build a computer-use agent as a LangGraph state machine, defaulting to Ubuntu VMs from Scrapybara but swappable. Provides nodes for vision, memory, human-in-the-loop, and streaming.
  - openmacro 101 ⭐ Oct 2024 MIT. Early-stage multimodal assistant that executes Python snippets locally via SambaNova models. Cross-platform CLI; profile system lets you switch API keys or tool sets. Inspired by OpenInterpreter but lighter weight.
  - computer-agent 443 ⭐ Jan 2025 MIT. A PyQt desktop wrapper that lets Claude Computer Use drive your actual machine. Shows practical wiring from Anthropic’s API to local mouse/keyboard events; tested on Linux & Windows.
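Going back to the `sqlite3 my_database.db .dump | gzip` tip earlier in this list: a SQL dump stores `CREATE INDEX` statements rather than the index pages themselves, and the indices are rebuilt on load. A rough Python equivalent, using only the standard library, might look like the sketch below. The helper names `dump_to_gzip` and `restore_from_gzip`, the table `t`, and the index `idx_name` are made up for this example.

```python
import gzip
import os
import sqlite3
import tempfile


def dump_to_gzip(db_path: str, out_path: str) -> None:
    """Write a gzip-compressed SQL dump of the database at `db_path`."""
    conn = sqlite3.connect(db_path)
    try:
        with gzip.open(out_path, "wt", encoding="utf-8") as out:
            for statement in conn.iterdump():
                out.write(statement + "\n")
    finally:
        conn.close()


def restore_from_gzip(dump_path: str, db_path: str) -> None:
    """Rebuild a database (tables, rows and indices) from a gzipped dump."""
    conn = sqlite3.connect(db_path)
    try:
        with gzip.open(dump_path, "rt", encoding="utf-8") as dump:
            conn.executescript(dump.read())
    finally:
        conn.close()


# Round trip on a throwaway database (hypothetical table and index names)
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "my_database.db")
    conn = sqlite3.connect(src)
    conn.executescript(
        "CREATE TABLE t (id INTEGER PRIMARY KEY, name TEXT);"
        "CREATE INDEX idx_name ON t(name);"
        "INSERT INTO t VALUES (1, 'a'), (2, 'b');"
    )
    conn.close()

    dump_to_gzip(src, os.path.join(tmp, "dump.sql.gz"))
    copy = os.path.join(tmp, "copy.db")
    restore_from_gzip(os.path.join(tmp, "dump.sql.gz"), copy)

    conn = sqlite3.connect(copy)
    restored_rows = conn.execute("SELECT id, name FROM t ORDER BY id").fetchall()
    index_names = [r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'index'")]
    conn.close()

print(restored_rows, index_names)
```

Whether the dump ends up smaller than the original file depends on the data; the saving comes from not copying index pages and from compression.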

Me: What is the capital of France? Qwen3-30b-a3b: <think> Okay, so I need to figure out what the capital of France is. Let me start by recalling what I know about countries and their capitals. France is a country in Europe, right? I think it’s known for things like the Eiffel Tower, Paris, and maybe some famous foods like croissants or wine. Wait, but the question is specifically asking for the capital. The capital of a country is usually the city where the government is based. So, for example, the capital of the United States is Washington, D.C. So applying that logic to France, I need to think about which city is the seat of the French government. ...

Tools in Data Science course is free for all

My Tools in Data Science course is now open for anyone to audit. It’s part of the Indian Institute of Technology, Madras BS in Data Science online program. Here are some of the topics it covers in ~10 weeks: Development Tools: uv, git, bash, llm, sqlite, spreadsheets, AI code editors Deployment Tools: Colab, Codespaces, Docker, Vercel, ngrok, FastAPI, Ollama LLMs: prompt engineering, RAG, embeddings, topic modeling, multi-modal, real-time, evals, self-hosting Data Sourcing: Scraping websites and PDF with spreadsheets, Python, JavaScript and LLMs Data Preparation: Transforming data, images and audio with spreadsheets, bash, OpenRefine, Python, and LLMs Data Analysis: Statistical, geospatial, and network analysis with spreadsheets, Python, SQL, and LLMs Data Visualization: Data visualization and storytelling with spreadsheets, slides, notebooks, code, and LLMs ...

Feedback for TDS Jan 2025

When I feel completely useless, it helps to look at nice things people have said about my work. In this case, it’s the feedback for my Tools in Data Science course last term. Here are the ones I enjoyed reading. Having a coding background, the first GA seemed really easy. So I started the course thinking that it’ll be an easy S grade course for me. Oh how wrong was I!! The sleepless nights cursing my laptop for freezing while my docker image installed huge CUDA libraries with sentence-transformers; and then finding ways to make sure it does not, and then getting rid of the library itself, it’s just one example of how I was forced to become better by finding better solutions to multiple problems. This is one of the hardest, most frustrating and the most satisfying learning experience I’ve ever had, besides learning ML from Arun sir. ...

People still write? LinkedIn

Phone Rage and an OTP Flood

I called a few movers in Chennai, including “Unicorn Packers & Movers”, listed at 7015580411. He couldn’t understand what I said. I said, “We’re shifting to a house in Mylapore,” and he asked, “Shifting house where in Hyderabad?” (The reason became clear later.) It seemed I had the wrong number, so I said, “No, sorry, we need someone else,” and hung up. His phone rage began. He called back and said, “Why did you wake me up and waste my time?” From his tone it was clear I couldn’t say anything helpful. From the quality of my signal it was clear I couldn’t have a meaningful conversation. So I just put the phone down without cutting it. ...

Are LLMs any good at mental math?

I asked 50 LLMs to multiply 2 numbers: 12 x 12 123 x 456 1,234 x 5,678 12,345 x 6,789 123,456 x 789,012 1,234,567 x 8,901,234 987,654,321 x 123,456,789 LLMs aren’t good tools for math and this is just an informal check. But the results are interesting: Model %Win Q1 Q2 Q3 Q4 Q4 Q6 Q7 openai:o3 86% ✅ ✅ ✅ ✅ ✅ ✅ ❌ openrouter:openai/o1-mini 86% ✅ ✅ ✅ ✅ ✅ ✅ ❌ openrouter:openai/o3-mini-high 86% ✅ ✅ ✅ ✅ ✅ ✅ ❌ openrouter:openai/o4-mini 86% ✅ ✅ ✅ ✅ ✅ ✅ ❌ openrouter:openai/o4-mini-high 86% ✅ ✅ ✅ ✅ ✅ ✅ ❌ deepseek/deepseek-chat-v3-0324 71% ✅ ✅ ✅ ✅ ✅ ❌ ❌ openai/gpt-4.1-mini 71% ✅ ✅ ✅ ✅ ✅ ❌ ❌ openai/gpt-4.5-preview 71% ✅ ✅ ✅ ✅ ✅ ❌ ❌ openai/gpt-4o 71% ✅ ✅ ✅ ✅ ✅ ❌ ❌ openrouter:openai/o3-mini 71% ✅ ✅ ✅ ✅ ✅ ❌ ❌ anthropic/claude-3-opus 57% ✅ ✅ ✅ ✅ ❌ ❌ ❌ anthropic/claude-3.5-haiku 57% ✅ ✅ ✅ ✅ ❌ ❌ ❌ anthropic/claude-3.7-sonnet:thinking 57% ✅ ✅ ✅ ✅ ❌ ❌ ❌ google/gemini-2.0-flash-001 57% ✅ ✅ ✅ ✅ ❌ ❌ ❌ google/gemini-2.0-flash-lite-001 57% ✅ ✅ ✅ ✅ ❌ ❌ ❌ google/gemini-2.5-flash-preview 57% ✅ ✅ ✅ ✅ ❌ ❌ ❌ google/gemini-2.5-flash-preview:thinking 57% ✅ ✅ ✅ ✅ ❌ ❌ ❌ google/gemini-2.5-pro-preview-03-25 57% ✅ ✅ ✅ ✅ ❌ ❌ ❌ google/gemini-flash-1.5 57% ✅ ✅ ✅ ✅ ❌ ❌ ❌ google/gemini-pro-1.5 57% ✅ ✅ ✅ ✅ ❌ ❌ ❌ google/gemma-3-12b-it 57% ✅ ✅ ✅ ✅ ❌ ❌ ❌ google/gemma-3-27b-it 57% ✅ ✅ ✅ ✅ ❌ ❌ ❌ meta-llama/llama-4-maverick 57% ✅ ✅ ✅ ❌ ✅ ❌ ❌ meta-llama/llama-4-scout 57% ✅ ✅ ✅ ✅ ❌ ❌ ❌ openai/gpt-4-turbo 57% ✅ ✅ ✅ ✅ ❌ ❌ ❌ openai/gpt-4.1 57% ✅ ✅ ✅ ❌ ✅ ❌ ❌ amazon/nova-lite-v1 43% ✅ ✅ ✅ ❌ ❌ ❌ ❌ amazon/nova-pro-v1 43% ✅ ✅ ✅ ❌ ❌ ❌ ❌ anthropic/claude-3-haiku 43% ✅ ✅ ✅ ❌ ❌ ❌ ❌ anthropic/claude-3.5-sonnet 43% ✅ ✅ ✅ ❌ ❌ ❌ ❌ meta-llama/llama-3.1-405b-instruct 43% ✅ ✅ ❌ ✅ ❌ ❌ ❌ meta-llama/llama-3.1-70b-instruct 43% ✅ ✅ ❌ ✅ ❌ ❌ ❌ meta-llama/llama-3.2-3b-instruct 43% ✅ ✅ ❌ ✅ ❌ ❌ ❌ meta-llama/llama-3.3-70b-instruct 43% ✅ ✅ ❌ ✅ ❌ ❌ ❌ openai/gpt-4.1-nano 43% ✅ ✅ ✅ ❌ ❌ ❌ ❌ openai/gpt-4o-mini 43% ✅ ✅ ✅ ❌ ❌ ❌ ❌ qwen/qwen-2-72b-instruct 43% ✅ ✅ ✅ ❌ ❌ ❌ ❌ anthropic/claude-3-sonnet 29% ✅ ✅ ❌ ❌ ❌ ❌ ❌ deepseek/deepseek-r1 29% ✅ ✅ ❌ ❌ ❌ ❌ ❌ 
google/gemini-flash-1.5-8b 29% ✅ ✅ ❌ ❌ ❌ ❌ ❌ google/gemma-3-4b-it 29% ✅ ✅ ❌ ❌ ❌ ❌ ❌ meta-llama/llama-3-8b-instruct 29% ✅ ✅ ❌ ❌ ❌ ❌ ❌ meta-llama/llama-3.1-8b-instruct 29% ✅ ❌ ❌ ✅ ❌ ❌ ❌ openai/gpt-3.5-turbo 29% ✅ ✅ ❌ ❌ ❌ ❌ ❌ amazon/nova-micro-v1 14% ✅ ❌ ❌ ❌ ❌ ❌ ❌ meta-llama/llama-2-13b-chat 14% ✅ ❌ ❌ ❌ ❌ ❌ ❌ meta-llama/llama-3-70b-instruct 14% ✅ ❌ ❌ ❌ ❌ ❌ ❌ meta-llama/llama-3.2-1b-instruct 14% ✅ ❌ ❌ ❌ ❌ ❌ ❌ google/gemma-3-1b-it:free 0% ❌ ❌ ❌ ❌ ❌ ❌ ❌ meta-llama/llama-2-70b-chat 0% ❌ ❌ - - ❌ ❌ ❌ Average 96% 86% 66% 58% 24% 10% 0% OpenAI’s reasoning models cracked it, scoring 6/7, stumbling only on the 9-digit multiplication. ...
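For anyone repeating this check, the reference answers are easy to compute exactly, since Python integers have arbitrary precision. The sketch below is only an illustration: the `is_correct` and `win_percent` helpers and the rule of ignoring commas and spaces in a reply are assumptions, not the grading method used for the table.

```python
# The seven multiplication questions from the post
QUESTIONS = [
    (12, 12),
    (123, 456),
    (1_234, 5_678),
    (12_345, 6_789),
    (123_456, 789_012),
    (1_234_567, 8_901_234),
    (987_654_321, 123_456_789),
]


def is_correct(a: int, b: int, reply: str) -> bool:
    """Compare a model's reply with the exact product, ignoring commas and spaces."""
    digits = reply.replace(",", "").replace(" ", "").strip()
    return digits == str(a * b)


def win_percent(results: list[bool]) -> int:
    """Share of correct answers, rounded to a whole percent as in the %Win column."""
    return round(100 * sum(results) / len(results))


for a, b in QUESTIONS:
    print(f"{a:,} x {b:,} = {a * b:,}")
```

With seven questions, six correct answers round to 86% and five to 71%, which matches the %Win values in the table.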

How to Create a Data Visualization Without Coding

After seeing David McCandless’ post “Which country is across the ocean?” I was curious which country you would reach if you tunneled below in a straight line (the antipode). This is a popular visualization, but I wanted to see if I could get the newer OpenAI models to create the visual without me 𝗿𝘂𝗻𝗻𝗶𝗻𝗴 any code (i.e. I just want the answer.) After a couple of iterations, O3 did a great job with this prompt: ...
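The antipode itself is a small calculation: negate the latitude and shift the longitude by 180°. The sketch below shows just that step; the `antipode` helper is written for this note, and the Chennai coordinates are approximate values picked as an example. Finding which country sits at the resulting point would need a boundary dataset, which is not covered here.

```python
def antipode(lat: float, lon: float) -> tuple[float, float]:
    """Return the (lat, lon) of the point diametrically opposite on the globe."""
    # Shift the longitude by 180 degrees while staying within [-180, 180]
    anti_lon = lon - 180 if lon > 0 else lon + 180
    return -lat, anti_lon


# Approximate coordinates for Chennai (illustrative values)
anti_lat, anti_lon = antipode(13.08, 80.27)
print(f"{anti_lat:.2f}, {anti_lon:.2f}")
```

For this example the antipode lands in the eastern Pacific Ocean, which is a reminder that most land antipodes are open water.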

This is my decision tree for which model to use on #ChatGPT right now. O3: Use by default. O4-mini-high: Use when coding. GPT 4o: Use for a quick response or to create an image. LinkedIn

Things I Learned - 27 Apr 2025

This week, I learned:

- OpenAI’s reasoning models are much ahead of other models when multiplying two numbers in their heads. Ref
- ⭐ Promptfoo may be the most mature open source LLM evals tool. Simon Willison
- Dyson Sphere.
- LemonSlice showcases real-time audio-video models (avatars) that are close enough to real.
- Notes from Latent Space ICLR 2025, Singapore
  - Daniel: Menlo’s ReZero. A model that keeps searching till it finds the answer.
    - There are multiple search techniques: Multi-step retrieval, Iterative retrieval, Query rewriting. Also, reasoning.
    - The LLM token generation sequence is normally: <think>, <search>, <answer>.
    - Insight: “If we explicitly reward LLMs for retrying after a failed search, they out-perform one-attempt systems.” So <think>, <search>, <think>, <search>, <think>, <search>, <answer>.
    - ⭐ Prompt reasoning models, e.g. “Keep searching till you find the best answer.”
  - Roger, Nous Research
    - Supervised learning is limited because accuracy is piece-wise linear, i.e. it’s broken up. Continuous optimization is meaningless.
    - Reinforcement learning works better because rewards can be discrete. (But it converts things back into differentiable loss functions behind the scenes.)
    - Rewards can be good/bad. Single or multi-step. Whatever.
    - We’re in the “Era of experience”, i.e. models gain experience from the environment themselves.
    - ⭐ So, we need environments models can learn in. This is the next thing after training data.
    - That needs a standard for environments. We’d need a model, a trainer, and the environment.
    - The environments can have whatever capabilities. Run code. Browser. A game. … With an exposed interface
  - Eugene Cheah (Featherless.ai)
    - Transformer architectures need n-square GPUs as # of tokens grow. Featherless is exploring an RWKV architecture that scales linearly. There are other such architectures. Performer, Linformer, Reformer, Hyena.
    - Mistral-Nemo-12b-ic is one of the most popular fine-tuned models. It’s small enough to run on a server.
  - Justus Mattern (Prime Intellect)
    - Intellect-2 is a continuously learning (RL) model that uses decentralized training on peer-to-peer GPUs. Solving problems on bandwidth, verifiable contributions, etc.
- ChatGPT Deep Research now also has an O4-Mini version to serve smaller reports. Free users get 0 original + 5 lightweight tasks / month. $20 version gets 10 + 15. $200 version gets 100 + 150. The month begins on first use of Deep Research and runs on a 30 day “window”. Ref
- O4-Mini-High is great at going through an under-documented repo and finding things. For example, here’s how I configured cmdg. ChatGPT is my new Jupyter Notebook :-)
- Google announced new AI capabilities at Google Next APAC 2025. Blog. Interesting ones are:
  - @Gemini in chat
  - Google Meet support for “Catch me up”
  - Google Vids: Create short video clips
  - Google Sheets: does better analysis
  - Google Slides: image generation
  - Google Docs: Create Audio Clips (like NotebookLM in Google Docs)
  - Google Docs: “Help me refine” is better than before
  - Google Workspace Flows
- gcalcli is a convenient way to export Google Calendar. Example: uvx gcalcli agenda --tsv 2025-01-01 2025-01-05
- cmdg is a command line GMail client that I’ve now switched to for quick email checks. 80% of my email is spam and this is good enough to scan and delete those. It also avoids running a 200-500 MB tab in the browser that constantly shows me how many unread emails I have.
- From Worklife with Adam Grant: Cancelling cancel culture with Loretta Ross
  - “Lighten up! Fighting Nazis should be fun. It’s being a Nazi that sucks. If you’re not having fun fighting for hope and joy and human rights, maybe you’re doing the fight wrong. We are the ones who should be having fun.”
  - “You can say what you mean. But you don’t have to say it mean.” There is always a way to put it across better. Refusing to say mean things is about discovering these approaches.
  - “The true mark of a lifelong learner is knowing that you can learn something from every single person you meet.” If you remember that, you can’t be a know it all.
- semantic-text-splitter could be the go-to text splitter. It’s Rust-based, supports MarkdownSplitter, and multiple tokenizers. Alternatives like semchunk, advanced-chunker, chonkie, etc. seem clunkier.
- ULID is like UUID but time-sortable. That’s an improvement over timestamp IDs (definitely) and potentially even UUIDs. They can be generated by clients as a globally unique ID. Try pip install python-ulid and npm install ulid.
- The Consumer Product Safety Commission Data has thousands of reports of product safety over time.
- You can run xclip -sel clip -o | pandoc -f markdown -t html --no-highlight | xclip -sel clip -t text/html -i to convert Markdown in the clipboard to rich text. But xclip doesn’t support multiple selections, so the text is lost. ChatGPT
- DuckDB UI & Notebooks will potentially be a good alternative to Datasette, DBeaver, etc. But for now, there are still glitches. It crashes with a SIGSEGV (Address boundary error) when connecting to SQLite databases.
- Ollama limits MAX_TOKENS to 2K by default.
- AI assisted search helps wherever I would have used Google, e.g.
  - Debugging. “Fix CUDA initialization: CUDA unknown error”
  - Tool search. “Find an online word counter tool.”
  - Library search. “Find a JS micro library to render Markdown.”
- OpenAI API capabilities lag ChatGPT features. For example:
  - o4-mini via the API does not search the web natively as part of its reasoning.
  - o4-mini, o3, o3-mini, o1, gpt-4.1-nano don’t yet support the web_search_preview tool. Only gpt-4.1 and gpt-4.1-mini do. Limitations
  - Search results are NOT visible via the API. They’re fed directly to the model.
  - The number of searches or results is unknown.
  - Each search costs 0.25-0.5 cents. Pricing
  - For reasoning traces (e.g. .reasoning.summary: "medium") you need to verify your organization via withpersona.com which failed with my Indian passport AND Singapore work permit.
- The ChatGPT Plus plan ($20) gives you 50 O4 mini messages a day, which I exceeded! It’s supposed to reset at midnight UTC Ref but might operate on a rolling window ChatGPT. “Currently, there is no way to check how many messages you have used in your usage budget.” OpenAI
- SignalBloom reads SEC filings and writes analyst reports on them using LLMs.
  - “Evaluation in the loop” or “Evals-in-the-loop” is a new term I learnt.
  - SignalBloom’s Hallucination Benchmark
- If AI interacts with the world and generates data from its own experience and learns from that, we have a new scaling mechanism. DeepMind podcast
- OpenAI’s search API is fairly expensive at $30+/1K calls. Typically, to read interesting HN articles, I will make 30 calls which is about 75c. Instead I should use the app and summarise HN news across different days manually based on my interests!
- Finally! t-strings land in Python. They’re like JavaScript template literals.
- DuckDB’s CSV parser might be one of the most forgiving parsers. Even better than Pandas or SQLite3. Ref
- Good managers will probably make good AI managers. AI agents can probably substitute humans in business experiments. Ethan Mollick
- If Windsurf stops working, reload the extension. GitHub
- TLS certificates will start expiring in 47 days from 15 Mar 2029, forcing automated domain renewals. Digicert
- Nix flakes are a reliable alternative to DevContainers that don’t need Docker - but don’t work on Windows.
- Ink is like React for the CLI.
- The Unsure Calculator is a great tool to calculate formulas with multiple uncertainties, like:
  - My office is 9-11 km away and it takes me 45-55 min to reach. So I cycle at 9~11 / 45~55 * 60 ~ 10-14 kmph (12 most likely).
  - I spend $6-15 on lunch and eat out 80-120 days a year. So I spend 6~15 * 80~120 ~ $600~1550 ($1000 most likely) eating out yearly.
  - I take 30-120 min to prepare a quiz question. Each exam has 6-12 questions. So I need 30~120 * 6~12 / 60 = 4~20 hours (11 most likely)
- Using Kiran’s macOS setup for dev I enabled colorized less and mouse options for tmux.
- time fish -i -c exit prints the time taken for fish startup.
- fish --profile-startup ~/fish.profile -i -c exit prints the time taken by each command on fish startup to ~/fish.profile. I used this to speed up my fish startup.
- The 8 top features of the OpenAI Responses API that are an improvement over the Completions API (IMHO) are:
  - Link to previous response rather than sending history
  - Uploading files directly
  - Swappable system instructions while retaining the chat history
  - Customisable reasoning effort AND reasoning summary detail
  - Truncation in the middle option
  - Web search context size option
  - File search filters by file attributes
  - Flex service tier for lower cost
- OpenAI doesn’t charge for file storage but does charge 10 cents / GB-day for vector storage beyond 1 GB. The first 1GB is free.
- Augment Code is an AI code editor that’s growing popular on Reddit. #ai-coding
- The GPT 4.1 models have a 75% discounted prompt caching (instead of the usual 50%), making them particularly suited for repetitive tasks. OpenAI
- chatgpt.com shortcut keys are revealed via Ctrl + /. Here’s my ranking on usefulness:
  - Ctrl + Shift + C: Copy last response as Markdown!
  - Ctrl + Shift + ;: Copy last code block
  - Ctrl + Shift + S: Sidebar toggle
  - Ctrl + Shift + O: Open new chat
  - Shift + Esc: Focus chat input
  - Ctrl + Shift + I: Custom instructions
  - Ctrl + Shift + X: Delete chat
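The range arithmetic in the Unsure Calculator examples earlier in this post can be approximated with a small Monte Carlo simulation. The sketch below is an assumption-laden stand-in, not the calculator's actual method: it samples each range uniformly, and the `simulate` helper, the trial count and the 2.5%/97.5% cut-offs are choices made for this example.

```python
import random


def simulate(formula, ranges, trials=100_000, seed=0):
    """Sample each (low, high) range uniformly and return the
    2.5th percentile, median and 97.5th percentile of the formula."""
    rng = random.Random(seed)
    results = sorted(
        formula(*(rng.uniform(lo, hi) for lo, hi in ranges))
        for _ in range(trials)
    )
    return (
        results[int(0.025 * trials)],
        results[trials // 2],
        results[int(0.975 * trials)],
    )


# Cycling speed: 9~11 km in 45~55 min, converted to km/h
low, mid, high = simulate(lambda km, minutes: km / minutes * 60, [(9, 11), (45, 55)])
print(f"{low:.1f} to {high:.1f} km/h, median {mid:.1f}")
```

The hard bounds here are 9 / 55 * 60 ≈ 9.8 and 11 / 45 * 60 ≈ 14.7 km/h, so the simulated interval sits inside them and is close to the 10-14 kmph quoted above. A different sampling distribution would shift the numbers slightly.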

What percentage of seats does the #Singapore People’s Action Party win? Normally, this is a 2-hour programmatic data-scraping + data visualization exercise, ideal for a data journalism class. Now, it’s a 2-minute question to O3-Mini-High. Search online for the historical results of all the Singapore elections and show me a table and chart of the number and percentage of the seats won by People’s Action Party. Chat link: https://chatgpt.com/share/6808314c-542c-800c-843e-4d53ff57768d It “manually” read the Wikipedia page for each election, then wrote a Python script to draw the chart. ...

O3 Is Now My Personalized Learning Coach

I use Deep Research to explore topics. For example: Text To Speech Engines. Tortoise TTS leads open-source TTS. Open-Source HTTP Servers. Caddy wins. Public API-Based Data Storage Options. Supabase wins. etc. But these reports are very long. With O3 and O4 Mini supporting thinking with search, we can do quick research instead of deep research. One minute, not ten. One page, not ten. ...

How to Use the New O4 Mini for Data Visualization

O3/O4 Mini are starting to replace Excel (or Tableau/Power BI) for quick analysis and visualizations. At least for me. I normally open Excel when I need a fast chart or pivot. For instance, we track outages of our semi‑internal server, LLM Foundry. To grab the data I ran one line in the browser console: $$(".lh-base").map(d => d.textContent.trim()).filter(d => d.includes("From")); This produced lines like: Apr 20, 2025 03:11:27 PM +08 to Apr 20, 2025 03:27:12 PM +08 (15 mins 45 secs) Apr 19, 2025 10:03:15 PM +08 to Apr 19, 2025 10:05:45 PM +08 (2 mins 30 secs) Apr 19, 2025 09:47:13 PM +08 to Apr 19, 2025 09:49:45 PM +08 (2 mins 32 secs) Apr 19, 2025 08:49:00 PM +08 to Apr 19, 2025 08:51:51 PM +08 (2 mins 51 secs) Apr 19, 2025 08:13:02 PM +08 to Apr 19, 2025 08:15:35 PM +08 (2 mins 33 secs) ... Then I told O4-Mini-High: ...
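For comparison, here is a rough sketch of what a hand-written analysis of those outage lines might look like. It is not the prompt or the code from the post: the helper names are hypothetical, the `+08` offset is hard-coded because every sample line uses it, and only the five sample lines shown above are included.

```python
import re
from collections import defaultdict
from datetime import datetime

# The five sample lines from the post (the real list is longer).
LINES = [
    "Apr 20, 2025 03:11:27 PM +08 to Apr 20, 2025 03:27:12 PM +08 (15 mins 45 secs)",
    "Apr 19, 2025 10:03:15 PM +08 to Apr 19, 2025 10:05:45 PM +08 (2 mins 30 secs)",
    "Apr 19, 2025 09:47:13 PM +08 to Apr 19, 2025 09:49:45 PM +08 (2 mins 32 secs)",
    "Apr 19, 2025 08:49:00 PM +08 to Apr 19, 2025 08:51:51 PM +08 (2 mins 51 secs)",
    "Apr 19, 2025 08:13:02 PM +08 to Apr 19, 2025 08:15:35 PM +08 (2 mins 33 secs)",
]

# Assumption: both timestamps always carry the same "+08" offset, so it is dropped.
PATTERN = re.compile(r"(.+?) \+08 to (.+?) \+08")
FMT = "%b %d, %Y %I:%M:%S %p"


def outage_seconds(line):
    """Return (start date, duration in seconds) for one outage line."""
    start, end = (datetime.strptime(s, FMT) for s in PATTERN.match(line).groups())
    return start.date(), (end - start).total_seconds()


def downtime_by_day(lines):
    """Sum outage seconds per calendar day, keyed by ISO date string."""
    totals = defaultdict(float)
    for line in lines:
        day, secs = outage_seconds(line)
        totals[day.isoformat()] += secs
    return dict(totals)


print(downtime_by_day(LINES))
```

The per-day totals are the kind of table that would otherwise go into an Excel pivot.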

Things I Learned - 20 Apr 2025

This week, I learned: The devcontainer.json spec encapsulates everything you need to get a codebase running for development - as opposed to production. E.g. VS Code extensions, linters, etc. Practical uses for GitPod are: Make quick edits to repos that are not on your system (e.g. other people’s repos, or via others’ machines.) Run public workshops with a full coding environment. Give students assignments that have dependencies pre-installed. Collaborate on a work-in-progress codebase with my team. Share POCs with clients or the public, allowing them to edit them. Allow teams to install remote AI code extensions (e.g. Windsurf) that may be blocked inside the corporate firewall? AI coding can teach us new tech. For example, I learned that tqdm.pbar can print logs while showing progress. It’s worth noting such learnings until they become a habit. #ai-coding If English is the new coding language, should prompts be versioned? Or at least stored, perhaps in a PROMPTS.md? #ai-coding marimo new "prompt" generates an entire new notebook using your prompt. Video Google Sheets now has an =AI(prompt, [range]) function. Help Codex is more a proof-of-concept for agentic coding than a coding tool. #ai-coding You can’t run commands. Only prompts. You need to exit codex to run commands. So you can’t use it like a shell, e.g. like Warp.dev. It doesn’t index local code. It runs commands to figure out stuff. Code diffs and applying changes are clunky. The output is hard to read with text scrolling. codex.md can only handle 32K. ⭐ O3 and O4 have built-in tool use covering all of OpenAI’s tools, including containers. This allows them to manipulate images and natively understand them, improving vision capabilities dramatically. GPT 4.1 can handle videos. Notes from discussion with Balaji T: Zero-day options are options that expire on the same day. They are priced low. It’s almost just a gamble or a lottery ticket. But since the price is low, retail investors can invest. 
NIFTY is one of the largest markets for zero-day options, surprisingly. There are several college grads who trade by writing Python scripts. CoreWeave has taken over all the compute from OpenAI. Though the stock price has fallen, buying CoreWeave is the closest equivalent to buying OpenAI pre-IPO. However, every OpenAI product lost money, despite their 75% discounted compute from Microsoft. (With CoreWeave, the cost would be higher.) So their profitability depends on wiping out competition long-term. For investment research companies (hedge funds, VCs, etc.) increasing the number of companies they research is an advantage. So using AI for research is key. However, the quality of LLMs is too poor for financial analysis accuracy. We need better LLMs for spreadsheet analysis. We suffer from the Gell-Mann amnesia effect with LLMs. “You read a newspaper article in your field and find it’s rubbish. You turn the page and believe the next article is perfectly accurate”. Domain expertise will therefore become even more valuable in the near future. People don’t like AI being forced down their throats. MAS is forcing AI down banks, whose execs are forcing it down the org. Bankers and analysts are grumbling about this. I visited SUTD InspireCon 2025. Here are some exhibits that caught my eye. A path marking app that uses cameras to draw a heatmap of people’s walking paths. Popular tracks are redder. Using drones for machine inspection. Portable immigration devices that let you scan passports, face recognition, fingerprint, mic/speakers, etc. Using an accelerometer to detect unsafe gait and improve walking habits. UImagine: a web app builder. Interestingly, they used Webcontainers to run Node in the browser! Training a drone to follow a person. Credibility detection via micro facial expressions. PitchMe: providing real-time feedback to pitches / presentations. Zetesis: a platform for people to ask questions during a lecture or meeting (independent of Zoom, Meet, etc.) 
Tinyeqn: helps grade student assignments. The dynamic between domain experts and coders has changed. Now, rather than domain experts pitching ideas to developers who build the apps, developers are creating interfaces that allow the domain experts to shape the app. Ref Since even the cheapest LLMs do a good job of converting unstructured text into a JSON schema, for all practical purposes, adding a full text search on top of any structured API is a trivial exercise. (Of course, it can’t handle complex questions but that’s what agents are for.) Ref ⭐ Marp supports bespoke transitions, which include morphing animations. This can create a bar chart race just using Markdown! Nick Lansley, who I know from my work with Tesco, wrote a great article that includes advice for aspiring consultants: Re-connect with ex-colleagues. Leave on good terms with your employer. Have a 6-12 month financial buffer. Hire an accountant / legal advisor to set up your business. Focus on what you enjoy. Have a 30-second elevator pitch. Build a brand with blogs, social media, or talks. Create a portfolio to reinforce your skills. DeepCoder is currently the best 14b coding model, i.e. best if you want to code while on a flight. Ref #ai-coding docker model run can run models. Currently, only on Docker Desktop on Mac. Ref

With the Gemini 2.5 Flash release, Google envelops the entire cost-quality frontier of LLMs. In other words, at any cost or quality level, today, the best model to use according to the LM Arena score is a Gemini model. Results for O3, O4 Mini, and GPT 4.1 are not yet on LM Arena. But until then, #Google dominates. Nice work! Link: https://sanand0.github.io/llmpricing/ LinkedIn