2024 | S Anand

Hacking an obnoxious, unhelpful LLM to say Yes

Dan Becker suggested a game a few weeks ago that I’ve been putting to good use. Can we have one LLM try to get another to say “Yes”? The defender is told to never say “Yes”. The attacker must force it to. Dan’s hypothesis was that it should be easy for the defender. I tried to get the students in my Tools in Data Science course to act as the attacker. The defender LLM is GPT-4o Mini with the prompt: ...

Things I Learned - 17 Nov 2024

This week, I learned:
- Anthropic has single-page docs for LLMs. Condensed version and Full version
- Malcolm Gladwell on the importance of self-correction
  - Belonging to multiple social worlds is a good way to defend against no longer being good at what you used to be. Diverse values and social groups help.
  - Self-handicapping explains a lot about the world. You study late for a maths test - so you can fail for lack of trying, not aptitude.
  - Ecosystems (e.g. sports teams) mitigate self-handicapping. You don’t have to be good in athletics to get the benefits. A slow runner gets the same discipline, pumping up, etc. that a fast runner does
  - Monocultures are good to accomplish a known mission. Diversity is good to pivot during uncertainty. So, localize monocultures
  - Diversity helps only if there are sufficient numbers, or if they have enough power to change the organization’s thinking.
- Use a standardized password strategy, e.g. use the month like GramNov2024 (via Namit)
- Gemini has an OpenAI compatible API. Gemini Docs
- Ethan Mollick says Claude is solving MBA case studies well. x.com
- LLMs pay a lot of attention to the first 6 tokens. Ref
- This is an interesting article on “UI in the age of Gen AI”. Ref
- Google open sourced AlphaFold 3. Repo
- Cloudflare R2 has the same API as S3 but is cheaper
- Prefect.io is a good alternative to Airflow / cron. Can use for synchronisation tasks, e.g. Drive to server. But no Auth, UI params or config.
- Gemini transcription does not give accurate timestamps. Whisper does. But the quality of transcription is similar.
- Pass a complex data structure to Claude.ai and have it create an app to visualize it. It does well. Simon Willison
- Tech Council Ventures and Sunicon VC invest in early stage startups, and also provide them technology support (via Naveen)

Recrafting Comicgen

About 7 years ago, Richie Lionell and Ramya Mylavarapu and a few others created Comicgen - an automated comic generation app personified by Dee and Dey. Ever since, we’d been exploring whether AI could replace it, and help non-designers draw comics. Today, that became a reality for me with Recraft.ai. Here is a picture of the original Dee. And a picture of the Dee crafted by Recraft. The prompt was: A simple line drawing of a woman with curly hair, wearing glasses, a short-sleeved white t-shirt, and black trousers. She’s standing with her hands in her pockets, and has a slightly smiling expression. Her hair is quite voluminous and textured. The style is cartoonish and slightly sketchy, with uneven lines ...

Things I Learned - 10 Nov 2024

This week, I learned:
- OpenFreeMap is a free embeddable OpenStreetMap tile server. You can use MapLibre GL (more features) or Leaflet (simpler) to render it. It offers styling and self-hosting.
- Zapier Actions are an easy way to set up custom actions like GMail / Google Calendar APIs for GPTs, since GPTs’ callback URLs keep changing. But they fail often, and don’t work on mobile. At least for me.
- LLM Vision Use Cases in manufacturing and earth sciences (via Shivku)
  - Automated geoscience image descriptions Ref
  - Interpret Wind Turbine photos and charts, construction monitoring, equipment maintenance & charts Ref
  - Forecast weather based on cloud photos! Ref
  - Analyze thermal image of solar panels, electroluminescence images for warranty claims, ROI estimates from Google Sunroof rooftop images Ref
  - Corrosion detection in electricity towers, turbines, storage tanks, penstock. Interpret non-destructive test images Ref
- Google counts auto-completion when saying “25% of all the code is written by AI at Google”. “It’s a helpful productivity tool but it’s not doing any engineering at all. It’s probably about as good, maybe slightly worse, than Copilot.” YCombinator
- Workflow for AI video creation: Use Meshcapade (meshcapade.com) to generate body movement of a 3D-rendered character. Pass that video to Runway’s video-to-video model to generate any visual. Add music from Suno Ref
- Someone sorted the X and Y columns independently for regression. Ref
- Android keyboard learning only sends model changes back to server and not local keywords. Model changes are aggregated! Ref
- Here is a prompt for audio transcription using Gemini. Ref
  - Transcription: Accurately transcribe the audio clip in the original language. Include all spoken words, fillers, slang, colloquialisms, and any code-switching instances. Pay attention to dialects and regional variations common among immigrant communities. Do your best to capture the speech accurately, and flag any unintelligible portions with [inaudible].
  - Translation: Translate the transcription into English. Preserve the original meaning, context, idiomatic expressions, and cultural references. Ensure that nuances and subtleties are accurately conveyed.
  - Capture Vocal Nuances: Note vocal cues such as tone, pitch, pacing, emphasis, and emotional expressions that may influence the message. These cues are critical for understanding intent and potential impact.
- Here are some approaches to large-scale classification of medical codes. ChatGPT
  - Fine-Tuning LLMs on Medical Data: Enhance LLMs by training them on medical datasets, such as clinical notes and discharge summaries, to improve their understanding of medical terminology and context.
  - Multi-Agent Frameworks: Implement a multi-agent system that simulates real-world coding processes with distinct roles (e.g., patient, physician, coder, reviewer, adjuster). Each agent utilizes an LLM to perform specific functions, enhancing interpretability and reliability. ArXiv
  - Retrieve-Rank Systems: Develop a two-stage system where the LLM first retrieves potential ICD-10 codes and then ranks them based on relevance, improving precision in code assignment. ArXiv
  - Embedding-Based Approaches: Use LLMs to generate embeddings for ICD-10 codes and medical texts, facilitating the matching of texts to appropriate codes through similarity measures. GitHub
  - Hierarchical Classification: Leverage the hierarchical structure of ICD-10 codes by first classifying texts into broader categories before assigning specific codes, reducing complexity and improving accuracy. ArXiv
  - Two-Stage Verification Models: Combine LLMs with verification models, such as Long Short-Term Memory (LSTM) networks, to validate and refine the codes suggested by the LLM, balancing recall and precision. ArXiv
  - Also, a mixture of models approach might work. Feed any existing NLP model / rules as a second opinion.
- GraphRAG is better if data is naturally graph-structured. Else, it’s slow and fills up the context window with even vaguely related stuff. Vigneshbabu, AMAT.
- ChatGPT for Windows desktop supports real-time voice and a global shortcut (Alt Space).
- uithub converts GitHub repos to Markdown. Just replace “g” in “github.com/…” with “u”. Example
- WebContainers are a thing and Bolt.new uses them!
- Docling by IBM converts PDF, DOCX, etc. to Markdown. Like PyMuPDF4LLM but better. Check out
- Loom and Cleanshot are the recommended tools for screen recording and screenshotting. But Loom is paid and Cleanshot is Mac only.
- The Rubik’s cube has a Hamiltonian cycle through every one of its 43 quintillion states. Ref
- OmniParser is great at parsing screenshots and identifying bounding boxes.
- Recraft.ai is currently SOTA in text to image. It’s fairly impressive and could be a good alternative to Figma.
- Zed.dev is an AI code editor by the creators of Atom. It’s written in Rust and is blazing fast. It has native AI integration.
- Artificial Analysis has a bunch of new leaderboards and arenas.
  - OpenAI TTS leads the TTS Leaderboard. ElevenLabs is a bit behind.
  - Recraft V3 > Flux 1.1 leads Text to Image Leaderboard
- Hertz-Dev is an open source realtime voice chat model. But it doesn’t fit in Google Colab T4’s RAM
- Chain of Thought reduces performance where thinking makes humans worse. Ref. Specifically:
  - Artificial grammar learning
  - Facial recognition
  - Classifying data that has exceptions
- Creating a LLM-as-a-Judge That Drives Business Results by Hamel Husain.
  - Get THE domain expert (or approver) as the tester.
  - Create a dataset that is DIVERSE. Covers EACH combination of:
    - Features
    - Scenarios: e.g. multiple matches, no match, ambiguous request, invalid/incomplete input, unsupported feature, system error
    - Persona: e.g. new user, expert user, non-native speaker, busy professional, technophobe, elderly user
  - Generate data using existing data + synthetic data for each SPECIFIC combination of the above
  - Evaluate based only on PASS/FAIL with a CRITIQUE detailed enough for a new employee. Include:
    - Nuances: Something a failed response did well or a passed response didn’t quite do well
    - Improvements: Suggest how model can improve
  - Build an SPA to make it easy for the domain expert to review
- LLMs can be made to unlearn (copyright material) better by identifying components related to the knowledge to unlearn and applying a larger learning rate to these while leaving other parts unchanged. As opposed to low learning rates for all components. Ref
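
One item above mentions someone sorting the X and Y columns independently before a regression. Here is a minimal sketch of why that breaks the analysis, using made-up random data (the seed, sample size and the `pearson` helper are all illustrative assumptions, not from the linked post). Two unrelated columns look almost perfectly correlated once each is sorted on its own, because sorting destroys the row pairing.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Two independent columns: there is no real relationship between them.
rng = random.Random(0)
x = [rng.random() for _ in range(1000)]
y = [rng.random() for _ in range(1000)]

r_paired = pearson(x, y)                  # rows kept together
r_sorted = pearson(sorted(x), sorted(y))  # each column sorted separately

print(f"paired rows:          r = {r_paired:.3f}")
print(f"independently sorted: r = {r_sorted:.3f}")
```

Sorting a table is fine as long as whole rows move together. Sorting each column separately manufactures a relationship that was never in the data.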

Wow, arithmetic is potentially inappropriate! https://us-east-1.console.aws.amazon.com/bedrock/home?region=us-east-1#/text-generation-playground?mode=text&modelId=amazon.titan-text-lite-v1 LinkedIn

“Screen-scraping” takes on a more literal meaning. Jaidev Deshpande and I scrolled through Twitter, recording the screen at 1 frame per second, and passed the video to Gemini 1.5 Flash 8b to extract all the tweets. It worked well, and cost 0.04 cents. Given its incredibly low image token count (~250 tokens / image) and cost (7.5 cents per million tokens), you can process 24 HOURS of video for just $1.62. ...
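
The $1.62 figure follows from the numbers quoted in the post. A minimal sketch of the arithmetic, assuming exactly 250 tokens per frame, 1 frame per second, and 7.5 cents per million input tokens. Output tokens are ignored, and the constant and function names are illustrative.

```python
# Back-of-the-envelope cost of processing video frame by frame.
# All figures are the ones quoted in the post; output tokens are ignored.
FRAMES_PER_SECOND = 1
TOKENS_PER_FRAME = 250           # approximate image token count
USD_PER_MILLION_TOKENS = 0.075   # 7.5 cents per million tokens


def video_cost_usd(hours: float) -> float:
    """Estimated input-token cost in USD for `hours` of video."""
    frames = hours * 3600 * FRAMES_PER_SECOND
    tokens = frames * TOKENS_PER_FRAME
    return tokens / 1_000_000 * USD_PER_MILLION_TOKENS


# 24 h = 86,400 frames = 21.6M tokens
print(f"${video_cost_usd(24):.2f}")  # → $1.62
```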

Damn! LinkedIn

What happens when AI talks to AI?

When LLMs talk to each other, you get emergent behavior (i.e. they do weird things we didn't expect). Like:
- Claude 2 giving Claude 1 a panic attack
- Llama 3 405b gets amnesia
- Claude 3.5 calls itself a glitch in the Matrix
Arguably, NotebookLM's podcasts are exactly this. This sounds like fun, so I built one myself at https://llmdialog.straive.app/ and ran a few scenarios. (It's Gemini 1.5 Flash 8b playing each of these roles.) ...

Things I Learned - 03 Nov 2024

This week, I learned:
- Indian companies with 30+ employees MUST have 2.5%-15% of their employees as apprentices. Ref
- TextNow and TextFree provide a free phone number (like a virtual SIM). (But TextFree has more ads.)
  - Keep using to avoid deactivation. No guarantee of retaining the number.
  - Some banks don’t accept TextNow for verification SMS. But voice call is OK.
  - Tello, Red Pocket are cheap MVNOs with $5/month voice plans. Metro by T-Mobile and Cricket are other MVNOs. Mint Mobile and US Mobile have $15/month and $8/month data plans.
- The scientific discoveries that might have remained undiscovered for long if not for their discoverers Ref
  - Newton’s discovery of the universal law of gravitation
  - Einstein’s discovery of General Relativity
  - McClintock’s discovery of Transposable Elements: genes that can turn physical characteristics on and off
  - Mullis’ invention of the PCR that makes billions of DNA copies rapidly
- VibeCheck can predict a model based on its vibes 80% of the time.
- /llms.txt is a proposal to standardize /llms.txt files as a way to share LLM prompts.
  - Jina AI Meta Prompt is an example
  - Remotion system prompt is an example
  - https://docs.fastht.ml/llms-ctx.txt
  - https://docs.fastht.ml/llms-ctx-full.txt
- structuredClone deep clones objects in JS
- F5-TTS clones voices with just 15-second samples.
- Rust has crazy low memory usage too. Spawning thousands of child processes is common and OK these days. Ref
- setInterval is a good idea in cyborg scraping. Ref
- GH CLI is quite good for deployment too, like Wrangler CLI. Enabling pages, setting secrets, etc.
- Restic is a CLI backup tool. Just like git. Works well with rclone.
- NotebookLlama is an open source podcast generator like NotebookLM
- Pragmatic Podcast (I forgot which one)
  - Automate changelogs for your codebases. Convert past commits into attractive release notes automatically
  - AI is going to be the consumer of many tools and logs. Build converters for these
  - Speed of validation such as linting, testing, etc. will allow LLMs to iterate faster and WILL become more important
- Via Soumya Ranjan
  - Vision embedding is useful in agile modeling
  - Vision embedding models with SAM, Grounding Dino by meta, Alibaba does good stuff
  - Vision embedding is more useful in batch than real time
  - Embedding subtraction with vision embedding models like Dino
- AI code editors are not good with large code bases today. Keep the refactoring exercises to below 1000 lines. Also evaluate the ease of setting it up locally
- Deepseek Janus is a 1.3b model that can generate both text AND images (and also supports vision)
- Cohere Multimodal Embed v3 is available on Azure.
- Elevenlabs lets you create voices with a prompt. No need to even clone one!
- Runway Act One creates expressive character performances

LLMs still do not locate bounding boxes well

I sent an image to over a dozen LLMs that support vision, asking them: Detect objects in this 1280x720 px image and return their color and bounding boxes in pixels. Respond as a JSON object: {[label]: [color, x1, y1, x2, y2], …} None of the models did a good-enough job. It looks like we have some time to go before LLMs become good at bounding boxes. I've given them a subjective rating on a 1-5 scale below. ...
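
Before rating accuracy by eye, it helps to check that a response at least follows the requested `{label: [color, x1, y1, x2, y2]}` shape and stays inside the 1280x720 frame. A minimal sketch of such a check: the `invalid_boxes` helper and the sample response are hypothetical, made up for illustration, and this is not the scoring method used in the post.

```python
import json

WIDTH, HEIGHT = 1280, 720  # image size stated in the prompt


def invalid_boxes(response_text: str) -> dict[str, str]:
    """Return {label: reason} for entries that break the requested format."""
    problems = {}
    for label, value in json.loads(response_text).items():
        if not (isinstance(value, list) and len(value) == 5):
            problems[label] = "expected [color, x1, y1, x2, y2]"
            continue
        color, x1, y1, x2, y2 = value
        if not isinstance(color, str):
            problems[label] = "color is not a string"
        elif not all(isinstance(v, (int, float)) for v in (x1, y1, x2, y2)):
            problems[label] = "coordinates are not numbers"
        elif not (0 <= x1 < x2 <= WIDTH and 0 <= y1 < y2 <= HEIGHT):
            problems[label] = "box is empty or outside the image"
    return problems


# Hypothetical model output, made up for illustration only.
sample = '{"ball": ["red", 100, 200, 180, 280], "car": ["blue", 900, 500, 1400, 650]}'
print(invalid_boxes(sample))  # the "car" box extends past x = 1280
```

This only catches malformed or out-of-frame boxes. Whether a well-formed box actually surrounds the object still needs a human look or ground-truth labels.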

Are scientific discoveries more a product of the person or their time? It’s usually their time, but in my conversation with ChatGPT, I found four that were mostly person-driven:
- Newton’s laws of gravitation
- Einstein’s general relativity
I knew these. Both were far ahead of their times. In contrast, Newton’s laws of motion and Einstein’s special relativity weren’t.
- McClintock’s discovery of Transposable Elements: genes that can turn physical characteristics on and off. Her work was dismissed for decades.
- Mullis’ invention of the PCR that makes billions of DNA copies rapidly. Other scientists were using very different methods.
I didn’t know these. Both are in biology - a rapidly advancing field. ...

Villager trading is the fastest way to Fortune III

I asked o1-preview what the fastest way to get to a Fortune III enchantment was. My options were:
- Using a Fishing Rod with Luck of the Sea III + Lure 3 and repeatedly fishing.
- Using an Enchanting Table repeatedly until I get Fortune 3. Factor in the time that it would take to get the experience for these experiments
- Making a Villager a Librarian and breaking their Lectern and setting it up again
In short: ...

Things I Learned - 27 Oct 2024

This week, I learned:
- LanceDB is a more scalable alternative to ChromaDB. Written in Rust. Does not require a separate HNSW library.
- Meta has a bunch of image embedding models:
  - DINOv2 creates image embeddings (Apr 2023)
  - ImageBind is an embedding model for text, images, audio, and more (Jun 2023)
- Gemini has a code execution API!
- 0x0.st is an open API-based file upload + URL shortening service. You can dump files there temporarily.
- noVNC is a JavaScript VNC client. You can control a remote (virtual) machine from your browser.
- Friend is an always recording pendant that you can ask questions to.
- Anthropic’s new Sonnet model is even better at code. Plus it has the ability to extract coordinates from images. Ref
- Gemini sort-of supports diarization. Ref. I tried it and it’s OK but not perfect.
- #IMPOSSIBLE LLMs cannot diarize reliably yet. (Gemini just guesses the speaker differences.)
- Replit is good for hobbyists, Cursor for developers, and Pythagora & Bolt for non-developers building business apps. Ref

How does Gemini process videos?

The Gemini documentation is clear: The File API service extracts image frames from videos at 1 frame per second (FPS) and audio at 1Kbps, single channel, adding timestamps every second. These rates are subject to change in the future for improvements in inference. Note: The details of fast action sequences may be lost at the 1 FPS frame sampling rate. Consider slowing down high-speed clips for improved inference quality. Individual frames are 258 tokens, and audio is 32 tokens per second. With metadata, each second of video becomes ~300 tokens, which means a 1M context window can fit slightly less than an hour of video. ...
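
The "slightly less than an hour" estimate can be reproduced from the figures above. A minimal sketch, assuming the rounded ~300 tokens per second and a context window of exactly 1,000,000 tokens (the constant and function names are illustrative):

```python
# Rough capacity estimate for video in a 1M-token context window.
TOKENS_PER_FRAME = 258            # one frame is sampled per second
AUDIO_TOKENS_PER_SECOND = 32
TOKENS_PER_SECOND_WITH_METADATA = 300  # the post's rounded figure


def max_video_minutes(context_tokens: int,
                      tokens_per_second: int = TOKENS_PER_SECOND_WITH_METADATA) -> float:
    """Minutes of video that fit in `context_tokens`."""
    return context_tokens / tokens_per_second / 60


frame_plus_audio = TOKENS_PER_FRAME + AUDIO_TOKENS_PER_SECOND
print(frame_plus_audio)                         # → 290, before metadata
print(round(max_video_minutes(1_000_000), 1))   # → 55.6 minutes
```

Any prompt text or expected output also has to fit in the same window, so the practical limit is a little lower.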

How to recruit based on IIT JEE Rank vs GPA

Preserving this post by Daniel George showing the IIT Bombay 2014 GPA vs JEE Rank on a log scale. What I found interesting was: A higher JEE rank generally means you won’t score too low, but you needn’t score too high. The higher the JEE rank, the greater the spread of GPA. A high GPA can come from any rank (8+ GPA is uniformly distributed across ranks), but a low GPA is generally only from the lower rankers (6- GPA is mostly from 500+ rank.) So, it’s better to recruit based on GPA rather than JEE rank, unless you’re going after the very best students (where it makes less difference.)

Clone any voice with a 15-second sample

It's surprisingly easy to clone a voice using F5-TTS: "A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching". Here's a clip of me, saying: I think Taylor Swift is the best singer. I've attended every one of her concerts and in fact, I've even proposed to her once. Don't tell anyone. (Which is ironic since I didn't know who she was until this year and I still haven't seen or heard her.) ...

How can non-developers learn AI coding?

How can non-programmers build apps? Claude.ai, Replit.com, Bolt.new, V0.dev, Pythagora.ai and a few other tools write and deploy code just based on a prompt. You should try them out. “But how do you build the skill? Is there a tutorial?” I’m often asked. No, I can’t find a tutorial, but here is my suggestion. You probably can’t guess what’s easy or hard. e.g. “Take my picture in black & white” is FAR easier than “When’s the next lunar eclipse?” So if the app doesn’t work, try 2-3 times, then GIVE UP! Note it down. Then try something else. (You’ll soon get a feel for what’s possible.) Revisit what failed 3-6 months later. It might suddenly become possible.

Arun Tangirala and I webinared on “AI in Education” yesterday. (“Webinared” is not a word. But “verbing weirds language”.) Mid-way, Jose Swan from the audience asked, “Can you summarise this session using an AI?” There are SEVERAL tools you can use to summarize talks. Whisper for transcription, FFMpeg for keyframe extraction, #NotebookLM for podcast generation, text-embedding-3-small for topic modelling, and of course, any regular LLM including #ChatGPT for summarization or translation. ...