S Anand

Wow, arithmetic is potentially inappropriate! https://us-east-1.console.aws.amazon.com/bedrock/home?region=us-east-1#/text-generation-playground?mode=text&modelId=amazon.titan-text-lite-v1 LinkedIn

“Screen-scraping” takes on a more literal meaning. Jaidev Deshpande and I scrolled through Twitter, recording the screen at 1 frame per second, and passed the video to Gemini 1.5 Flash 8b to extract all the tweets. It worked well, and cost 0.04 cents. Given its incredibly low image token count (~250 tokens / image) and cost (7.5 cents per million tokens), you can process 24 HOURS of video for just $1.62. ...
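The $1.62 figure is easy to re-derive. Here's a rough sketch of the arithmetic, using only the numbers quoted above (1 frame per second, ~250 tokens per image, 7.5 cents per million tokens). The function name is just illustrative, and it ignores prompt and output tokens.

```python
def video_cost_usd(hours, fps=1.0, tokens_per_frame=250, usd_per_million_tokens=0.075):
    """Estimate the input-token cost of sending video frames to a vision model."""
    frames = hours * 3600 * fps              # frames sampled from the recording
    tokens = frames * tokens_per_frame       # image tokens only; prompt/output ignored
    return tokens / 1_000_000 * usd_per_million_tokens


cost_24h = video_cost_usd(24)
print(f"24 hours of video: ${cost_24h:.2f}")
```

Change `fps` or `tokens_per_frame` to see how quickly the cost moves with a different sampling rate or model.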

Damn! LinkedIn

What happens when AI talks to AI?

When LLMs talk to each other, you get emergent behavior (i.e. they do weird things we didn't expect). Like:
- Claude 2 giving Claude 1 a panic attack
- Llama 3 405b gets amnesia
- Claude 3.5 calls itself a glitch in the Matrix
Arguably, NotebookLM's podcasts are exactly this. This sounds like fun, so I built one myself at https://llmdialog.straive.app/ and ran a few scenarios. (It's Gemini 1.5 Flash 8b playing each of these roles.) ...

Things I Learned - 03 Nov 2024

This week, I learned:
- Indian companies with 30+ employees MUST have 2.5%-15% of their employees as apprentices. Ref
- TextNow and TextFree provide a free phone number (like a virtual SIM). (But TextFree has more ads.)
- Keep using to avoid deactivation. No guarantee of retaining the number.
- Some banks don’t accept TextNow for verification SMS. But voice call is OK.
- Tello and Red Pocket are cheap MVNOs with $5/month voice plans.
- Metro by T-Mobile and Cricket are other MVNOs.
- MintMobile and US Mobile have $15/month and $8/month data plans.
- The scientific discoveries that might have remained undiscovered for long if not for their discoverers. Ref
  - Newton’s discovery of the universal law of gravitation
  - Einstein’s discovery of General Relativity
  - McClintock’s discovery of Transposable Elements: genes that can turn physical characteristics on and off
  - Mullis’ invention of the PCR that makes billions of DNA copies rapidly
- VibeCheck can predict a model based on its vibes 80% of the time.
- /llms.txt is a proposal to standardize /llms.txt files as a way to share LLM prompts.
  - Jina AI Meta Prompt is an example
  - Remotion system prompt is an example
  - https://docs.fastht.ml/llms-ctx.txt
  - https://docs.fastht.ml/llms-ctx-full.txt
- structuredClone deep clones objects in JS
- F5-TTS clones voices with just 15-second samples.
- Rust has crazy low memory usage too. Spawning thousands of child processes is common and OK these days. Ref
- SetInterval is a good idea in cyborg scraping. Ref
- GH CLI is quite good for deployment too, like Wrangler CLI. Enabling pages, setting secrets, etc.
- Restic is a CLI backup tool. Just like git. Works well with rclone.
- NotebookLlama is an open source podcast generator like NotebookLM
- Pragmatic Podcast (I forgot which one)
  - Automate changelogs for your codebases. Convert past commits into attractive release notes automatically
  - AI is going to be the consumer of many tools and logs. Build converters for these
  - Speed of validation such as linting, testing, etc. will allow LLMs to iterate faster and WILL become more important
- Via Soumya Ranjan
  - Vision embedding is useful in agile modeling
  - Vision embedding models with SAM, Grounding Dino by meta, Alibaba does good stuff
  - Vision embedding is more useful in batch than real time
  - Embedding subtraction with vision embedding models like Dino
- AI code editors are not good with large code bases today. Keep the refactoring exercises to below 1000 lines. Also evaluate the ease of setting it up locally
- Deepseek Janus is a 1.3b model that can generate both text AND images (and also supports vision)
- Cohere Multimodal Embed v3 is available on Azure.
- Elevenlabs lets you create voices with a prompt. No need to even clone one!
- Runway Act One creates expressive character performances

LLMs still do not locate bounding boxes well

I sent an image to over a dozen LLMs that support vision, asking them: Detect objects in this 1280x720 px image and return their color and bounding boxes in pixels. Respond as a JSON object: {[label]: [color, x1, y1, x2, y2], …} None of the models did a good-enough job. It looks like we have some time to go before LLMs become good at bounding boxes. I've given them a subjective rating on a 1-5 scale below. ...
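One way to make such a comparison less subjective is to sanity-check each response automatically before eyeballing it. Below is a minimal sketch that parses the `{label: [color, x1, y1, x2, y2]}` format from the prompt and rejects boxes that don't fit a 1280x720 image. The `parse_boxes` helper and the sample response are made up for illustration; this is not how the ratings in the post were produced.

```python
import json


def parse_boxes(response_text, width=1280, height=720):
    """Split {label: [color, x1, y1, x2, y2]} entries into valid and rejected boxes."""
    valid, rejected = {}, {}
    for label, value in json.loads(response_text).items():
        # Expect exactly five items, with numeric coordinates after the color.
        ok = (
            isinstance(value, list)
            and len(value) == 5
            and all(isinstance(v, (int, float)) for v in value[1:])
        )
        if ok:
            _color, x1, y1, x2, y2 = value
            # Coordinates must be ordered and lie inside the image.
            ok = 0 <= x1 < x2 <= width and 0 <= y1 < y2 <= height
        (valid if ok else rejected)[label] = value
    return valid, rejected


# Made-up response for illustration; not output from any real model.
sample = '{"ball": ["red", 100, 200, 180, 280], "tree": ["green", 1200, 50, 1400, 700]}'
valid, rejected = parse_boxes(sample)
print("valid:", valid)
print("rejected:", rejected)
```

This only catches malformed or out-of-range boxes. Judging whether a well-formed box actually surrounds the object still needs ground truth or a human look.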

Are scientific discoveries more a product of the person or their time? It’s usually their time, but in my conversation with ChatGPT, I found four that were mostly person-driven:
- Newton’s laws of gravitation
- Einstein’s general relativity
I knew these. Both were far ahead of their times. In contrast, Newton’s laws of motion and Einstein’s special relativity weren’t.
- McClintock’s discovery of Transposable Elements: genes that can turn physical characteristics on and off. Her work was dismissed for decades.
- Mullis’ invention of the PCR that makes billions of DNA copies rapidly. Other scientists were using very different methods.
I didn’t know these. Both are in biology - a rapidly advancing field. ...

Villager trading is the fastest way to Fortune III

I asked o1-preview what the fastest way to get to a Fortune III enchantment was. My options were:
- Using a Fishing Rod with Luck of the Sea III + Lure 3 and repeatedly fishing.
- Using an Enchanting Table repeatedly until I get Fortune 3. Factor in the time that it would take to get the experience for these experiments
- Making a Villager a Librarian and breaking their Lectern and setting it up again
In short: ...

Things I Learned - 27 Oct 2024

This week, I learned:
- LanceDB is a more scalable alternative to ChromaDB. Written in Rust. Does not require a separate HNSW library.
- Meta has a bunch of image embedding models:
  - DINOv2 creates image embeddings (Apr 2023)
  - ImageBind is an embedding model for text, images, audio, and more (Jun 2023)
- Gemini has a code execution API!
- 0x0.st is an open API-based file upload + URL shortening service. You can dump files there temporarily.
- noVNC is a JavaScript VNC client. You can control a remote (virtual) machine from your browser.
- Friend is an always recording pendant that you can ask questions to.
- Anthropic’s new Sonnet model is even better at code. Plus it has the ability to extract coordinates from images. Ref
- Gemini sort-of supports diarization. Ref. I tried it and it’s OK but not perfect.
- #IMPOSSIBLE LLMs cannot diarize reliably yet. (Gemini just guesses the speaker differences.)
- Replit is good for hobbyists, Cursor for developers, and Pythagora & Bolt for non-developers building business apps. Ref

How does Gemini process videos?

The Gemini documentation is clear: The File API service extracts image frames from videos at 1 frame per second (FPS) and audio at 1Kbps, single channel, adding timestamps every second. These rates are subject to change in the future for improvements in inference. Note: The details of fast action sequences may be lost at the 1 FPS frame sampling rate. Consider slowing down high-speed clips for improved inference quality. Individual frames are 258 tokens, and audio is 32 tokens per second. With metadata, each second of video becomes ~300 tokens, which means a 1M context window can fit slightly less than an hour of video. ...
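The "slightly less than an hour" claim follows directly from those token counts. A quick sketch of the arithmetic, using only the figures quoted above (the constant and function names are mine):

```python
FRAME_TOKENS = 258        # tokens per frame, sampled at 1 FPS
AUDIO_TOKENS = 32         # tokens per second of audio
TOKENS_PER_SECOND = 300   # rounded figure including metadata, as quoted above


def max_video_minutes(context_tokens, tokens_per_second=TOKENS_PER_SECOND):
    """Minutes of video that fit in a context window of the given size."""
    return context_tokens / tokens_per_second / 60


raw = FRAME_TOKENS + AUDIO_TOKENS   # tokens per second before metadata
minutes = max_video_minutes(1_000_000)
print(f"{raw} tokens/s before metadata; about {minutes:.1f} minutes fit in 1M tokens")
```

This leaves no room for the prompt or the response, so the practical limit is a little lower.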

How to recruit based on IIT JEE Rank vs GPA

Preserving this post by Daniel George showing the IIT Bombay 2014 GPA vs JEE Rank on a log scale. What I found interesting was: A higher JEE rank generally means you won’t score too low, but you needn’t score too high. The higher the JEE rank, the greater the spread of GPA. A high GPA can come from any rank (8+ GPA is uniformly distributed across ranks), but a low GPA is generally only from the lower rankers (6- GPA is mostly from 500+ rank.) So, it’s better to recruit based on GPA rather than JEE rank, unless you’re going after the very best students (where it makes less difference.)

Clone any voice with a 15-second sample

It's surprisingly easy to clone a voice using F5-TTS: "A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching". Here's a clip of me, saying: I think Taylor Swift is the best singer. I've attended every one of her concerts and in fact, I've even proposed to her once. Don't tell anyone. (Which is ironic since I didn't know who she was until this year and I still haven't seen or heard her.) ...

How can non-programmers build apps? Claude.ai, Replit.com, Bolt.new, V0.dev, Pythagora.ai and a few other tools write and deploy code just based on a prompt. You should try them out. “But how do you build the skill? Is there a tutorial?” I’m often asked. No, I can’t find a tutorial, but here is my suggestion. You probably can’t guess what’s easy or hard. e.g. “Take my picture in black & white” is FAR easier than “When’s the next lunar eclipse? ...

How can non-developers learn AI coding?

How can non-programmers build apps? Claude.ai, Replit.com, Bolt.new, V0.dev, Pythagora.ai and a few other tools write and deploy code just based on a prompt. You should try them out. “But how do you build the skill? Is there a tutorial?” I’m often asked. No, I can’t find a tutorial, but here is my suggestion. You probably can’t guess what’s easy or hard. e.g. “Take my picture in black & white” is FAR easier than “When’s the next lunar eclipse?” So if the app doesn’t work, try 2-3 times, then GIVE UP! Note it down. Then try something else. (You’ll soon get a feel for what’s possible.) Revisit what failed 3-6 months later. It might suddenly become possible.

Arun Tangirala and I webinared on “AI in Education” yesterday. (“Webinared” is not a word. But “verbing weirds language”.) Mid-way, Jose Swan from the audience asked, “Can you summarise this session using an AI?” There are SEVERAL tools you can use to summarize talks. Whisper for transcription, FFMpeg for keyframe extraction, #NotebookLM for podcast generation, text-embedding-3-small for topic modelling, and of course, any regular LLM including #ChatGPT for summarization or translation. ...

Tools to publish annotated talks from videos

Arun Tangirala and I webinared on “AI in Education” yesterday. This post isn’t about the webinar, which went on for an hour and was good fun. This post isn’t about my preparation for the webinar, which happened frantically 15 minutes before it started. This post is about how I created the annotated talk at https://github.com/sanand0/ai-in-education-webinar (inspired by Simon Willison’s annotated presentations process) – a post-processing step that took ~3 hours – and the tools I used for this. ...

Things I Learned - 20 Oct 2024

This week, I learned:
- SQLite optimizations for multi-threaded web applications. Ref
  - PRAGMA journal_mode = WAL. Improves performance for frequent writes. It allows concurrent reads and writes.
  - PRAGMA synchronous = NORMAL. Improves performance. We might lose a few transactions but won’t corrupt the database.
  - PRAGMA mmap_size = 128000000. Set global memory map for processes to share data
  - PRAGMA journal_size_limit = 64000000. Limit WAL file to prevent unlimited growth
  - BEGIN IMMEDIATE instead of BEGIN. Takes the write lock when the transaction starts, so a competing writer waits up front instead of failing midway. Improves concurrency.
- AI seems to be slowing down apprenticeship since experts would rather use an AI than train an apprentice. Example: Robotic surgery. Ref
- How AI can improve education performance and engagement. Ref
  - Student: Create study plan based on course and schedule
  - Student: Focus on what you need to learn more
  - Student: Align with your study style and pace
  - Teacher: Grading MCQs
  - Teacher: Writing conceptual guides
- “New collar workers” was coined by Ginni Rometty
- Embed tutor or document in video and ask for clarification! This is a new embedded interface. #TODO
- Playing Bad Apple in Minecraft. Ultra cool!
- OpenAI has a prompt generator. Currently it uses a meta-prompt but may later move to DSPy or Gradient Descent. Ref
- Great demo of using the Realtime API to read the latest Hacker News. Ref
- LLMs have reached the point where they can show a world, like CounterStrike, in near real time. Ref
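Here's a minimal sketch of how the PRAGMA settings and BEGIN IMMEDIATE from the list above could be applied with Python's built-in sqlite3 module. The `connect` helper, the temporary file and the `visits` table are made up for illustration. The PRAGMA values are the ones listed above, not tuned recommendations.

```python
import os
import sqlite3
import tempfile


def connect(path):
    """Open a SQLite connection with the settings listed above."""
    # isolation_level=None puts the connection in autocommit mode,
    # so we can issue BEGIN IMMEDIATE ourselves.
    conn = sqlite3.connect(path, isolation_level=None)
    conn.execute("PRAGMA journal_mode = WAL")             # readers don't block the writer
    conn.execute("PRAGMA synchronous = NORMAL")           # fewer fsyncs
    conn.execute("PRAGMA mmap_size = 128000000")          # memory-mapped I/O
    conn.execute("PRAGMA journal_size_limit = 64000000")  # cap WAL file growth
    return conn


# WAL needs a file-backed database, so use a temporary file rather than :memory:
path = os.path.join(tempfile.mkdtemp(), "demo.db")
conn = connect(path)
journal_mode = conn.execute("PRAGMA journal_mode").fetchone()[0]

conn.execute("CREATE TABLE IF NOT EXISTS visits (page TEXT)")
conn.execute("BEGIN IMMEDIATE")   # take the write lock up front
conn.execute("INSERT INTO visits VALUES (?)", ("/home",))
conn.execute("COMMIT")

count = conn.execute("SELECT COUNT(*) FROM visits").fetchone()[0]
print(journal_mode, count)
conn.close()
```

In a multi-threaded web app you would typically also set a busy timeout so that a blocked BEGIN IMMEDIATE waits rather than failing right away. That is my assumption, not something the list above covers.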

Leaning into the power of AI coding

Yesterday (15 Oct 2024), I used Cursor to code more than I ever have. (Doing's how we learn, I guess. Not just reading.)

| Date | Usage |
|---|---|
| 05-10-2024 | 15 |
| 06-10-2024 | 27 |
| 07-10-2024 | 87 |
| 08-10-2024 | 16 |
| 09-10-2024 | |
| 10-10-2024 | 42 |
| 11-10-2024 | 24 |
| 12-10-2024 | 57 |
| 13-10-2024 | 15 |
| 14-10-2024 | 28 |
| 15-10-2024 | 186 |

This was mainly to create and publish 2 libraries on npm over 6 hours: ...

Bad Apple in #Minecraft? Turns out it can be done. At 20 fps on the original resolution (512x384). This is the most mind-blowing piece of engineering I’ve seen in some time! https://purplesyringa.moe/blog/we-built-the-best-bad-apple-in-minecraft/ LinkedIn

Things I Learned - 13 Oct 2024

This week, I learned:
- DuckDB supports function chaining
- DuckDB lets you create functions = macros
- HTML for People is a nice introduction to HTML.
- FlightRadar24 lets you watch airplanes live.
- sq is like jq but for SQL.
- Deno 2 is fully backward compatible with Node! via
- O1 is good at solving problems where the solution is easy to verify and generating options helps get closer to the solution
- Reverb ASR does diarization as well as transcription. It seems to be the state of the art right now.
- Gemini Flash and Gemini Flash 8b can be fine-tuned at zero cost. Inference is at the same price! Ref
- Flux 1.1 Pro is released. I tried my Calvin & Hobbes test on it. Not great. ImageGen3 is better, ChatGPT is the best. Ref
- Revisiting text to speech models. Nothing much has changed since July 2024.
  - OpenAI TTS: $15/1M chars Ref
  - Deepgram Aura: $15/1M chars Ref
  - Azure AI Speech: $15/1M chars Ref
  - Google TTS Neural2: $16/1M chars Ref
  - AWS Polly Neural TTS: $16/1M chars Ref
  - Cartesia Pro: $50/1M chars Ref
  - Elevenlabs Scale: $300/1M chars Ref
- GitHub Copilot Workspace lets you code using your mobile with AI and deploy it in one shot
- If you need an Ubuntu Docker container with Python, install it via uv rather than compiling from source. via
- VTracer is an open source library (and tool) to convert raster images to SVGs. via
- If you want to create a console.llm() function, a browser extension is the best way, because some pages have a Content-Security-Policy that blocks eval, form submission, fetch from other domains, and script execution.
- PyPI lets you publish from GitHub Actions without a token. Also from Gitlab.com CI/CD and Google Cloud.
- ActiveState, which made ActivePython, ActivePerl, etc., made these products paid for commercial use around 2013 after a series of acquisitions.
- Marimo supports:
  - Publishing any notebook to static.marimo.app as a static app
  - Creating a SINGLE link that embeds the ENTIRE notebook in the URL!
  - Runnable via uvx marimo edit
- Parables on the Power of Planning in AI:
  - Giving models about 30 seconds of thinking time consistently improves results - as much as increasing parameter size by a factor of 1,000 to 100,000!
  - This works particularly well for verifiable results (code, math, etc.)
  - Technique: Ask an LLM hundreds of times at low temperature and pick the most common one. (Google’s Minerva used this on the MATH dataset.)
  - Better Technique: Ask an LLM hundreds of times. Pick the best solution based on an evaluation metric (reward model)
  - Better Technique: Apply a reward model at EACH step of the process. OpenAI’s “Let’s Verify Step by Step”
- Late chunking is an interesting approach to adding context to embeddings. (I don’t understand it, but it’s cheap and effective.)
- DeepInfra offers embedding models as APIs at about 0.5 to 1 cent per MTok in an OpenAI compatible API. It also supports text-to-image models like flux.dev and speech recognition models like Whisper.
- Jake Heller: “One of the things we learned is (an LLM app) after it passes frankly even 100 tests, the odds that it will do, on any random distribution of user inputs, the next 100,000 100% accurately is very high.”
- OpenAI’s O1 is like Daniel Kahneman’s System 2 thinking - as against other LLMs’ System 1 thinking.
- Continue.dev is another AI coding editor. It supports OpenRouter. So now I have heard good things about:
  - Github Copilot
  - Cursor
  - Cody
  - Continue.dev (supports OpenRouter)
  - Aider (supports OpenRouter)
  - Maybe: Codeium
  - Not: Amazon Q Developer
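The first two "ask hundreds of times" techniques from the planning talk notes above are simple to sketch. In the toy example below, `noisy_solver` is a stand-in for an LLM call (it is right 60% of the time) and the reward function is a made-up verifier that already knows the answer. Both are hypothetical, purely to show the selection logic. A real setup would call a model and a learned reward model or test suite.

```python
import random
from collections import Counter


def majority_vote(answers):
    """Technique 1: pick the most common answer among many samples."""
    return Counter(answers).most_common(1)[0][0]


def best_of_n(answers, reward):
    """Technique 2: pick the answer that scores highest on a reward function."""
    return max(answers, key=reward)


def noisy_solver(rng):
    """Hypothetical stand-in for an LLM: returns 42 about 60% of the time, else a near miss."""
    return 42 if rng.random() < 0.6 else 42 + rng.choice([-2, -1, 1, 2])


rng = random.Random(0)                                     # fixed seed for repeatability
samples = [noisy_solver(rng) for _ in range(200)]
voted = majority_vote(samples)
best = best_of_n(samples, reward=lambda a: -abs(a - 42))   # toy reward, not a real model
print(voted, best)
```

Majority voting needs no verifier but only works when the right answer is the most frequent one. Best-of-N can find a rare correct answer, but only if the reward signal is trustworthy.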