October 2024

Are scientific discoveries more a product of the person or their time? It’s usually their time, but in my conversation with ChatGPT, I found four that were mostly person-driven: (1) Newton’s laws of gravitation. (2) Einstein’s general relativity. I knew these. Both were far ahead of their times. In contrast, Newton’s laws of motion and Einstein’s special relativity weren’t. (3) McClintock’s discovery of Transposable Elements: genes that can turn physical characteristics on and off. Her work was dismissed for decades. (4) Mullis’ invention of the PCR that makes billions of DNA copies rapidly. Other scientists were using very different methods. I didn’t know these. Both are in biology - a rapidly advancing field. ...

Villager trading is the fastest way to Fortune III

I asked o1-preview what the fastest way to get to a Fortune III enchantment was. My options were: (1) Using a Fishing Rod with Luck of the Sea III + Lure III and repeatedly fishing. (2) Using an Enchanting Table repeatedly until I get Fortune III, factoring in the time it would take to get the experience for these attempts. (3) Making a Villager a Librarian, then breaking their Lectern and setting it up again. In short: ...

Things I Learned - 27 Oct 2024

This week, I learned: LanceDB is a more scalable alternative to ChromaDB. Written in Rust. Does not require a separate HNSW library. Meta has a bunch of image embedding models: DINOv2 creates image embeddings (Apr 2023). ImageBind is an embedding model for text, images, audio, and more (Jun 2023). Gemini has a code execution API! 0x0.st is an open API-based file upload + URL shortening service. You can dump files there temporarily. noVNC is a JavaScript VNC client. You can control a remote (virtual) machine from your browser. Friend is an always-recording pendant that you can ask questions to. Anthropic’s new Sonnet model is even better at code. Plus it has the ability to extract coordinates from images. Ref. Gemini sort-of supports diarization. Ref. I tried it and it’s OK but not perfect. #IMPOSSIBLE LLMs cannot diarize reliably yet. (Gemini just guesses the speaker differences.) Replit is good for hobbyists, Cursor for developers, and Pythagora & Bolt for non-developers building business apps. Ref

How does Gemini process videos?

The Gemini documentation is clear: The File API service extracts image frames from videos at 1 frame per second (FPS) and audio at 1Kbps, single channel, adding timestamps every second. These rates are subject to change in the future for improvements in inference. Note: The details of fast action sequences may be lost at the 1 FPS frame sampling rate. Consider slowing down high-speed clips for improved inference quality. Individual frames are 258 tokens, and audio is 32 tokens per second. With metadata, each second of video becomes ~300 tokens, which means a 1M context window can fit slightly less than an hour of video. ...
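The arithmetic behind that estimate can be checked with a short sketch. The per-frame and per-second figures are the ones quoted above; the 300 tokens/second figure (with metadata) is the rounded estimate from this post, not an official number, and the function names are made up for illustration.

```python
# Figures quoted from the Gemini documentation above.
FRAME_TOKENS = 258            # tokens per sampled frame
FRAMES_PER_SECOND = 1         # frame sampling rate
AUDIO_TOKENS_PER_SECOND = 32  # audio tokens per second

# Assumption: roughly 300 tokens/second once timestamps and metadata are added.
TOKENS_PER_SECOND_WITH_METADATA = 300


def raw_tokens_per_second() -> int:
    """Frame plus audio tokens for one second of video, before metadata."""
    return FRAME_TOKENS * FRAMES_PER_SECOND + AUDIO_TOKENS_PER_SECOND


def max_video_minutes(
    context_tokens: int,
    tokens_per_second: int = TOKENS_PER_SECOND_WITH_METADATA,
) -> float:
    """Minutes of video that fit in a context window of the given size."""
    return context_tokens / tokens_per_second / 60


print(raw_tokens_per_second())                 # 290
print(round(max_video_minutes(1_000_000), 1))  # 55.6
```

So a 1M-token window holds about 55 minutes of video under these assumptions, which matches "slightly less than an hour".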

How to recruit based on IIT JEE Rank vs GPA

Preserving this post by Daniel George showing the IIT Bombay 2014 GPA vs JEE Rank on a log scale. What I found interesting was: A better JEE rank (a smaller rank number) generally means you won’t score too low, but you needn’t score too high. The worse the JEE rank (a larger rank number), the greater the spread of GPA. A high GPA can come from any rank (8+ GPA is uniformly distributed across ranks), but a low GPA is generally only from the lower rankers (6- GPA is mostly from 500+ rank.) So, it’s better to recruit based on GPA rather than JEE rank, unless you’re going after the very best students (where it makes less difference.)

Clone any voice with a 15-second sample

It's surprisingly easy to clone a voice using F5-TTS: "A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching". Here's a clip of me, saying: I think Taylor Swift is the best singer. I've attended every one of her concerts and in fact, I've even proposed to her once. Don't tell anyone. (Which is ironic since I didn't know who she was until this year and I still haven't seen or heard her.) ...

How can non-developers learn AI coding?

How can non-programmers build apps? Claude.ai, Replit.com, Bolt.new, V0.dev, Pythagora.ai and a few other tools write and deploy code just based on a prompt. You should try them out. “But how do you build the skill? Is there a tutorial?” I’m often asked. No, I can’t find a tutorial, but here is my suggestion. You probably can’t guess what’s easy or hard. e.g. “Take my picture in black & white” is FAR easier than “When’s the next lunar eclipse?” So if the app doesn’t work, try 2-3 times, then GIVE UP! Note it down. Then try something else. (You’ll soon get a feel for what’s possible.) Revisit what failed 3-6 months later. It might suddenly become possible.

Arun Tangirala and I webinared on “AI in Education” yesterday. (“Webinared” is not a word. But “verbing weirds language”.) Mid-way, Jose Swan from the audience asked, “Can you summarise this session using an AI?” There are SEVERAL tools you can use to summarize talks. Whisper for transcription, FFmpeg for keyframe extraction, #NotebookLM for podcast generation, text-embedding-3-small for topic modelling, and of course, any regular LLM including #ChatGPT for summarization or translation. ...

Tools to publish annotated talks from videos

Arun Tangirala and I webinared on “AI in Education” yesterday. This post isn’t about the webinar, which went on for an hour and was good fun. This post isn’t about my preparation for the webinar, which happened frantically 15 minutes before it started. This post is about how I created the annotated talk at https://github.com/sanand0/ai-in-education-webinar (inspired by Simon Willison’s annotated presentations process) – a post-processing step that took ~3 hours – and the tools I used for this. ...

Things I Learned - 20 Oct 2024

This week, I learned: SQLite optimizations for multi-threaded web applications. Ref. PRAGMA journal_mode = WAL. Improves performance for frequent writes. It allows concurrent reads and writes. PRAGMA synchronous = NORMAL. Improves performance. We might lose a few transactions but won’t corrupt the database. PRAGMA mmap_size = 128000000. Set global memory map for processes to share data. PRAGMA journal_size_limit = 64000000. Limit WAL file to prevent unlimited growth. BEGIN IMMEDIATE instead of BEGIN. Acquires the write lock at the start of the transaction rather than at the first write, so lock conflicts surface up front. Improves concurrency. AI seems to be slowing down apprenticeship since experts would rather use an AI than train an apprentice. Example: Robotic surgery. Ref. How AI can improve education performance and engagement. Ref. Student: Create study plan based on course and schedule. Student: Focus on what you need to learn more. Student: Align with your study style and pace. Teacher: Grading MCQs. Teacher: Writing conceptual guides. “New collar workers” was coined by Ginni Rometty. Embed tutor or document in video and ask for clarification! This is a new embedded interface. #TODO Playing Bad Apple in Minecraft. Ultra cool! OpenAI has a prompt generator. Currently it uses a meta-prompt but may later move to DSPy or Gradient Descent. Ref. Great demo of using the Realtime API to read the latest Hacker News. Ref. LLMs have reached the point where they can show a world, like Counter-Strike, in near real time. Ref
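The SQLite settings above can be applied from Python's built-in sqlite3 module roughly like this. It is a minimal sketch, not a tuned production setup: the `notes` table, the `open_tuned` and `add_note` helpers, and the temporary file path are all made up for illustration.

```python
import os
import sqlite3
import tempfile


def open_tuned(path: str) -> sqlite3.Connection:
    """Open a SQLite database with the pragmas listed above."""
    # isolation_level=None puts the sqlite3 module in autocommit mode,
    # so we issue BEGIN IMMEDIATE / COMMIT ourselves.
    conn = sqlite3.connect(path, isolation_level=None)
    conn.execute("PRAGMA journal_mode = WAL")
    conn.execute("PRAGMA synchronous = NORMAL")
    conn.execute("PRAGMA mmap_size = 128000000")
    conn.execute("PRAGMA journal_size_limit = 64000000")
    return conn


def add_note(conn: sqlite3.Connection, text: str) -> None:
    """Insert one row inside a transaction that takes the write lock up front."""
    conn.execute("BEGIN IMMEDIATE")
    try:
        conn.execute("INSERT INTO notes (body) VALUES (?)", (text,))
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise


# WAL needs a real file; an in-memory database cannot use it.
db_path = os.path.join(tempfile.mkdtemp(), "demo.db")
conn = open_tuned(db_path)
conn.execute("CREATE TABLE IF NOT EXISTS notes (id INTEGER PRIMARY KEY, body TEXT)")
add_note(conn, "hello")

journal_mode = conn.execute("PRAGMA journal_mode").fetchone()[0]
note_count = conn.execute("SELECT COUNT(*) FROM notes").fetchone()[0]
print(journal_mode, note_count)
```

journal_mode is stored in the database file, while synchronous, mmap_size and journal_size_limit apply per connection, so a helper like `open_tuned` would run on every new connection.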

Leaning into the power of AI coding

Yesterday (15 Oct 2024), I used Cursor to code more than I ever have. (Doing's how we learn, I guess. Not just reading.)

Date | Usage
05-10-2024 | 15
06-10-2024 | 27
07-10-2024 | 87
08-10-2024 | 16
09-10-2024 |
10-10-2024 | 42
11-10-2024 | 24
12-10-2024 | 57
13-10-2024 | 15
14-10-2024 | 28
15-10-2024 | 186

This was mainly to create and publish 2 libraries on npm over 6 hours: ...

Bad Apple in #Minecraft? Turns out it can be done. At 20 fps on the original resolution (512x384). This is the most mind-blowing piece of engineering I’ve seen in some time! https://purplesyringa.moe/blog/we-built-the-best-bad-apple-in-minecraft/

Things I Learned - 13 Oct 2024

This week, I learned: DuckDB supports function chaining. DuckDB lets you create functions = macros. HTML for People is a nice introduction to HTML. FlightRadar24 lets you watch airplanes live. sq is like jq but for SQL. Deno 2 is fully backward compatible with Node! via. O1 is good at solving problems where the solution is easy to verify and generating options helps get closer to the solution. Reverb ASR does diarization as well as transcription. It seems the state of the art right now. Gemini Flash and Gemini Flash 8b can be fine-tuned at zero cost. Inference is at the same price! Ref. Flux 1.1 Pro is released. I tried my Calvin & Hobbes test on it. Not great. ImageGen3 is better, ChatGPT is the best. Ref. Revisiting text to speech models. Nothing much has changed since July 2024. OpenAI TTS: $15/1M chars. Ref. Deepgram Aura: $15/1M chars. Ref. Azure AI Speech: $15/1M chars. Ref. Google TTS Neural2: $16/1M chars. Ref. AWS Polly Neural TTS: $16/1M chars. Ref. Cartesia Pro: $50/1M chars. Ref. Elevenlabs Scale: $300/1M chars. Ref. GitHub Copilot workspaces let you code using your mobile with AI and deploy it at one shot. If you need an Ubuntu Docker container with Python, install it via uv rather than compiling from source. via. VTracer is an open source library (and tool) to convert raster images to SVGs. via. If you want to create a console.llm() function, a browser extension is the best way, because some pages have a Content-Security-Policy that blocks eval, form submission, fetch from other domains, and script execution. PyPI lets you publish from GitHub Actions without a token. Also from Gitlab.com CI/CD and Google Cloud. ActiveState, which made ActivePython, ActivePerl, etc., made these products paid for commercial use around 2013 after a series of acquisitions. Marimo supports: Publishing any notebook to static.marimo.app as a static app. Creating a SINGLE link that embeds the ENTIRE notebook in the URL!
Runnable via uvx marimo edit. Parables on the Power of Planning in AI: Giving models about 30 seconds of thinking time consistently improves results - as much as increasing parameter size by a factor of 1,000 to 100,000! This works particularly well for verifiable results (code, math, etc.) Technique: Ask an LLM hundreds of times at low temperature and pick the most common one. (Google’s Minerva used this on the MATH dataset.) Better Technique: Ask an LLM hundreds of times. Pick the best solution based on an evaluation metric (reward model). Better Technique: Apply a reward model at EACH step of the process. OpenAI’s “Let’s Verify Step by Step”. Late chunking is an interesting approach to adding context to embeddings. (I don’t understand it, but it’s cheap and effective.) DeepInfra offers embedding models as APIs at about 0.5 to 1 cent per MTok in an OpenAI compatible API. It also supports text-to-image models like flux.dev and speech recognition models like Whisper. Jake Heller: “One of the things we learned is (an LLM app) after it passes frankly even 100 tests, the odds that it will do, on any random distribution of user inputs, the next 100,000 100% accurately is very high.” OpenAI’s O1 is like Daniel Kahneman’s System 2 thinking - as against other LLMs’ System 1 thinking. Continue.dev is another AI coding editor. It supports OpenRouter. So now I have heard good things about: GitHub Copilot, Cursor, Cody, Continue.dev (supports OpenRouter), Aider (supports OpenRouter). Maybe: Codeium. Not: Amazon Q Developer
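The first two sampling techniques from the planning notes above (pick the most common answer, or pick the answer a reward function scores highest) can be sketched in a few lines. Everything here is a toy: `fake_llm` is a hypothetical stand-in for a real model call, and `toy_reward` is a hypothetical verifier for one arithmetic question, not a trained reward model. The third technique, scoring each intermediate step, is not shown.

```python
import random
from collections import Counter
from typing import Callable, List


def majority_vote(answers: List[str]) -> str:
    """Return the most common answer among the samples."""
    return Counter(answers).most_common(1)[0][0]


def best_of_n(answers: List[str], reward: Callable[[str], float]) -> str:
    """Return the answer that the reward function scores highest."""
    return max(answers, key=reward)


def fake_llm(rng: random.Random) -> str:
    """Hypothetical stand-in for an LLM call: right about 60% of the time."""
    return "42" if rng.random() < 0.6 else str(rng.randint(0, 99))


def toy_reward(answer: str) -> float:
    """Hypothetical verifier for the question 'what is 6 * 7?'."""
    return -abs(int(answer) - 42)


rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [fake_llm(rng) for _ in range(200)]

print(majority_vote(samples))
print(best_of_n(samples, toy_reward))
```

Majority voting needs no verifier but only works when the right answer is also the most frequent one. Best-of-N can surface a rare correct answer, but it is only as good as the reward function.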

Challenge: code in 10 minutes with only an LLM

I gave a bonus assignment in LLM coding to ~1,000 students at the Tools in Data Science course at IITM. Here is an OPTIONAL project: Record a 10-minute video in which you create an application entirely using LLMs and deploy it. Any app is fine. Any language. Simple or complex. Business or gaming. Anything is fine. Your choice. Create the app only using LLMs. You can use an LLM (ChatGPT, Claude.ai, Gemini, Cursor, Cody, etc.) but you can only prompt it to write code. You can copy-paste code and run code, but don’t write or edit even a single line of code directly. Use LLMs to debug and edit. Code completion is NOT allowed – only prompting/chatting. Record the entire process in 10 min. Don’t edit, trim, enhance, or annotate the video. You should record yourself creating the entire app from start to finish. Practice beforehand if you like. Record in 1 take. Share the video and app. Publish the video publicly anywhere (e.g. YouTube) and share the link. Publish the app publicly anywhere (e.g. GitHub pages, Glitch.me, Heroku, etc.) or upload a ZIP file with the code (for slightly lower marks.) Submit via a reply to this thread. Multiple submissions per person are fine. Work in groups if you like but only the submitter gets marks. ...

Which is the most neurotic / emotional #LLM? I ran the Big 5 personality test on a bunch of LLMs (for my TEDx MDI Gurgaon talk in August.) Here are the results. https://sanand0.github.io/llmpersonality/ Claude 3 Haiku and Llama 3 8b consider themselves the most emotional models. In fact, some of Llama 3 8b’s quotes are hilarious: Get stressed out easily. - 4. Moderately Accurate (I can get stressed, but I’m working on managing my stress levels) ...

Things I Learned - 06 Oct 2024

This week, I learned: ffmpeg on WASM works but is unstable and hard to use. You can’t use it in a CDN without CORS issues, since it loads ffmpeg-core via a worker. It often runs into buffer allocation issues. Exotel and Plivo provide voice & SMS services in India (like Twilio). Plivo is more customer friendly. Uber’s H3, Google’s S2, and GeoHash are geocoding systems. H3 offers uniform cell sizes and better distance measurement. S2 offers higher precision (factoring in Earth’s curvature) for exact location matches. GeoHash is the simplest. There’s a movement towards embeddable databases on the cloud. MotherDuck is hosted DuckDB. Turso is hosted SQLite (with local sync, multi-tenant). StarBase DB is SQLite with an API on top of Cloudflare Durable Objects. Software 2.0 by Andrej Karpathy. This is fundamentally altering the programming paradigm by which we iterate on our software, as the teams split in two: the 2.0 programmers (data labelers) edit and grow the datasets, while a few 1.0 programmers maintain and iterate on the surrounding training code infrastructure, analytics, visualizations and labeling interfaces. Adaptive UI ideas: Adaptive Fields: Show only required fields based on what the user has filled so far. Smart Inputs: Dropdowns and auto-complete based on user’s context. Smart Themes: Change font size, contrast, theme guessing the user’s age and preferences. Dynamic Menus: Show what they might need to do next. Like Nokia’s right button, but using LLMs. Smart Tooltips: Check what the user’s doing (delays, confusions, previous clicks, current actions) and show relevant tips. Personalized Layout: Show only the relevant sections of the app. E.g. based on what they’re doing. Smart Charts: Create the right chart that solves the user’s question.
Adaptive Back-end: Dynamic APIs: Create endpoints on the fly based on user needs. Dynamic Indexing: Create & update indices on the fly based on user needs. Dynamic Schema: Create & update schema on the fly based on user needs. Dynamic Migration: Migrate to a new database or OS or language as required. Dynamic Queries: Create SQL/NoSQL queries to solve the user problem. Dynamic RBAC: Figure out who needs permissions and why. Add OR REMOVE access as required. Dynamic Logging. Log what’s required. Explain why it’s logged and what’s happening. Fix code that raised the error. Dynamic Caching. Cache what’s likely to be required. Evict what may not be required. Figure out cache keys. Aider LLM Leaderboards show which LLMs code better. As of now: o1-preview > claude-3.5-sonnet on code editing; claude-3-opus > claude-3.5-sonnet on code refactoring; deepseek-coder-v gpt-4o-mini sucks. Jaro-Winkler Distance is a string matching algorithm that weights the start of a string higher. Passing the feed of the following to NotebookLM is a good way to get caught up with news and summaries. A blog / WhatsApp group (e.g. The Generative AI Group, Sithamalli, etc.) A Google Group / mailing list (e.g. genainews, datameet) YouTube channels (e.g. Veritasium, GitHub) Hacker News top stories. Research papers. Emails (skipping marketing emails). OpenAI Evals and Distillation has a clever design. They just convert filtered history to .JSONL files that can be an input to either. Speak is a language learning app based on OpenAI’s Realtime API. OpenAI’s Realtime API can be used in a text-to-text chat mode without needing to send the entire context. If the pricing works out right, this can be far cheaper than sending the entire conversation context. Ref. Matching addresses with just embeddings works well. Combine it with simple hard rules. Ref. OpenAI’s prompt caching works for images too – both linked and embedded. Quotes on Graph RAG from a Generative AI WhatsApp Group.
“Damn so literally nobody uses Graph RAG yet. Good to know.” ~Sumba “A big four consulting firm uses GraphRAG to retrieve related documents and excerpts from governance and compliance docs.” ~Vinayak Hegde (Microsoft) “Graph RAG is expensive and unnecessary in most of the cases.” ~Utkarsh Saxena ChatGPT’s advanced mode includes: “…you can use various regional accents and dialects.” Ref Source But the API can “laugh, whisper, and adhere to tone direction.” Ref Hume API (INR 6/min) is far cheaper than OpenAI’s real-time chat (6c/min input + 24c/min output) Devika is an open-source clone of Devin. DuckDB runs inside Pyodide Hungarian Jews have genetic diseases that increase their IQ. Gaucher’s disease, Torsion dystonia. People don’t like hard stuff like maths or science, so richer societies have fewer scientists Ethan Mollick feels Claude 3.5 Sonnet is better at style and critiquing blog posts than OpenAI’s o1 (which is better at reasoning.) News is going to be crazily disrupted again with voice mode. I can just listen to the topic I want In Singapore Airlines, You can’t wear your seatbelt loose You have to keep the laptop in the pocket in front, not on your lap, during takeoff You can’t charge during takeoff They verify if you ask for a veg meal and place a sticker on your seat Coders are more likely to edit LLM code. Non-coders don’t have that bad habit. Vaishnavi and Ranjeet edited code Indal and Koustav didn’t Coders are likely to get more out of an LLM because they know what it can do. But some non-coders will get more out of an LLM because they don’t know what it can’t do. E.g. Indal trying for a confetti animation, which is hard but do-able “You have to put in a lot of work to become productive at AI coding.” Simon Willison
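To make the Jaro-Winkler note from this week's list concrete, here is a small pure-Python sketch of the standard algorithm. The 0.1 prefix scale and the 4-character prefix cap are the commonly used defaults, assumed here rather than taken from the post. Strictly speaking this computes a similarity (1.0 means identical); the distance is one minus that.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: 1.0 for identical strings, 0.0 for no matching characters."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only if they are within this many positions.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (
        matches / len1 + matches / len2 + (matches - transpositions) / matches
    ) / 3


def jaro_winkler(
    s1: str, s2: str, prefix_scale: float = 0.1, max_prefix: int = 4
) -> float:
    """Jaro similarity boosted by the length of the shared prefix."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1 - score)


# Same single-character edit, but at the end vs. at the start of the string.
print(round(jaro_winkler("johnson", "johnsen"), 4))  # 0.9429
print(round(jaro_winkler("johnson", "bohnson"), 4))  # 0.9048
```

Both pairs differ by one character, but the pair that shares the prefix "john" scores higher. That is the "weights the start of a string higher" behaviour.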

LLM escapades in a toilet

I was in Seoul for KHF 2024, a healthcare event, staying at Hotel in 9. The hotel was great. The toilet was hi-tech. Perhaps a bit too high-tech for me. I couldn’t figure out how to let the water through on the sink. After 15 minutes of a hard struggle, I finally asked ChatGPT “How do I open the thing that’s closing the sink to allow the water to go down?” ...

I accidentally pressed the emergency button in the toilet. I was smarter this time, unlike earlier. https://www.linkedin.com/posts/sanand0_chatgpt-llm-activity-7246836804249628672-5QXy/ I asked #ChatGPT which (unhelpfully) told me that “Typically, these buttons cannot be turned off”. I called the reception who couldn’t understand a word of what I said. “Do you want water?” they asked when I told them “I pressed the emergency button in the bathroom.” So, I went to ChatGPT’s advanced voice mode (I’m so grateful it was enabled last week) and said, “Translate everything I say into Korean. ...

After 15 minutes of a hard struggle, I finally asked #ChatGPT “How do I open the thing that’s closing the sink to allow the water to go down?” Here’s the thing with “maturity” (aka age, wisdom, experience, grey hair). It took me 15 minutes to realize I could use an #LLM to solve this problem. Despite me supposedly being an “LLM psychologist.” I suspect the school children of today won’t waste even a minute before checking ChatGPT. ...