Things I Learned - 17 Nov 2024

This week, I learned: Anthropic has single-page docs for LLMs: a condensed version and a full version. Malcolm Gladwell on the importance of self-correction: Belonging to multiple social worlds is a good way to defend against no longer being good at what you used to be. Diverse values and social groups help. Self-handicapping explains a lot about the world. You study late for a maths test so that you can blame failure on lack of trying, not lack of aptitude. Ecosystems (e.g. sports teams) mitigate self-handicapping. You don’t have to be good at athletics to get the benefits. A slow runner gets the same discipline, pumping up, etc. that a fast runner does. Monocultures are good for accomplishing a known mission. Diversity is good for pivoting during uncertainty. So, localize monocultures. Diversity helps only if there are sufficient numbers, or if they have enough power to change the organization’s thinking. Use a standardized password strategy, e.g. use the month, like GramNov2024 (via Namit). Gemini has an OpenAI-compatible API. Gemini Docs. Ethan Mollick says Claude is solving MBA case studies well. x.com. LLMs pay a lot of attention to the first 6 tokens. Ref. This is an interesting article on “UI in the age of Gen AI”. Ref. Google open-sourced AlphaFold 3. Repo. Cloudflare R2 has the same API as S3 but is cheaper. Prefect.io is a good alternative to Airflow / cron. It can be used for synchronisation tasks, e.g. Drive to server. But it has no auth, UI params or config. Gemini transcription does not give accurate timestamps. Whisper does. But the quality of transcription is similar. Pass a complex data structure to Claude.ai and have it create an app to visualize it. It does well. Simon Willison. Tech Council Ventures and Sunicon VC invest in early-stage startups, and also provide them technology support (via Naveen).

Things I Learned - 10 Nov 2024

This week, I learned: OpenFreeMap is a free embeddable OpenStreetMap tile server. You can use MapLibre GL (more features) or Leaflet (simpler) to render it. It offers styling and self-hosting. Zapier Actions are an easy way to set up custom actions like GMail / Google Calendar APIs for GPTs, since GPTs’ callback URLs keep changing. But they fail often, and don’t work on mobile. At least for me. LLM Vision Use Cases in manufacturing and earth sciences (via Shivku) Automated geoscience image descriptions Ref Interpret Wind Turbine photos and charts, construction monitoring, equipment maintenance & charts Ref Forecast weather based on cloud photos! Ref Analyze thermal image of solar panels, electroluminescence images for warranty claims, ROI estimates from Google Sunroof rooftop images Ref Corrosion detection in electricity towers, turbines, storage tanks, penstock. Interpret non-destructive test images Ref Google counts auto-completion when saying “25% of all the code is written by AI at Google”. “It’s a helpful productivity tool but it’s not doing any engineering at all. It’s probably about as good, maybe slightly worse, than Copilot.” YCombinator Workflow for AI video creation: Use Meshcapade (meshcapade.com) to generate body movement of a 3D-rendered character. Pass that video to Runway’s video-to-video model to generate any visual. Add music from Suno Ref Someone sorted the X and Y columns independently for regression. Ref Android keyboard learning only sends model changes back to server and not local keywords. Model changes are aggregated! Ref Here is a prompt for audio transcription using Gemini. Ref Transcription: Accurately transcribe the audio clip in the original language. Include all spoken words, fillers, slang, colloquialisms, and any code-switching instances. Pay attention to dialects and regional variations common among immigrant communities. Do your best to capture the speech accurately, and flag any unintelligible portions with [inaudible]. 
Translation: Translate the transcription into English. Preserve the original meaning, context, idiomatic expressions, and cultural references. Ensure that nuances and subtleties are accurately conveyed. Capture Vocal Nuances: Note vocal cues such as tone, pitch, pacing, emphasis, and emotional expressions that may influence the message. These cues are critical for understanding intent and potential impact. Here are some approaches to large-scale classification of medical codes. ChatGPT Fine-Tuning LLMs on Medical Data: Enhance LLMs by training them on medical datasets, such as clinical notes and discharge summaries, to improve their understanding of medical terminology and context. Multi-Agent Frameworks: Implement a multi-agent system that simulates real-world coding processes with distinct roles (e.g., patient, physician, coder, reviewer, adjuster). Each agent utilizes an LLM to perform specific functions, enhancing interpretability and reliability. ArXiv Retrieve-Rank Systems: Develop a two-stage system where the LLM first retrieves potential ICD-10 codes and then ranks them based on relevance, improving precision in code assignment. ArXiv Embedding-Based Approaches: Use LLMs to generate embeddings for ICD-10 codes and medical texts, facilitating the matching of texts to appropriate codes through similarity measures. GitHub Hierarchical Classification: Leverage the hierarchical structure of ICD-10 codes by first classifying texts into broader categories before assigning specific codes, reducing complexity and improving accuracy. ArXiv Two-Stage Verification Models: Combine LLMs with verification models, such as Long Short-Term Memory (LSTM) networks, to validate and refine the codes suggested by the LLM, balancing recall and precision. ArXiv Also, a mixture of models approach might work. Feed any existing NLP model / rules as a second opinion. GraphRAG is better if data is naturally graph-structured. 
Else, it’s slow and fills up the context window with even vaguely related stuff. Vigneshbabu, AMAT. ChatGPT for Windows desktop supports real-time voice and a global shortcut (Alt Space). uithub converts GitHub repos to Markdown. Just replace “g” in “github.com/…” with “u”. Example WebContainers are a thing and Bolt.new uses them! Docling by IBM converts PDF, DOCX, etc. to Markdown. Like PyMuPDF4LLM but better. Check out Loom and Cleanshot are the recommended tools for screen recording and screenshotting. But Loom is paid and Cleanshot is Mac only. The Rubik’s cube has a Hamiltonian cycle through every one of its 43 quintillion states. Ref OmniParser is great at parsing screenshots and identifying bounding boxes. Recraft.ai is currently SOTA in text to image. It’s fairly impressive and could be a good alternative to Figma. Zed.dev is an AI code editor by the creators of Atom. It’s written in Rust and is blazing fast. It has native AI integration. Artificial Analysis has a bunch of new leaderboards and arenas. Open AI TTS leads the TTS Leaderboard. ElevenLabs is a bit behind. Recraft V3 > Flux 1.1 leads Text to Image Leaderboard Hertz-Dev is an open source realtime voice chat model. But it doesn’t fit in Google Colab T4’s RAM Chain of Thought reduces performance where thinking makes humans worse. Ref. Specifically: Artificial grammar learning Facial recognition Classifying data that has exceptions Creating a LLM-as-a-Judge That Drives Business Results by Hamel Husain. Get THE domain expert (or approver) as the tester. Create a dataset that is DIVERSE. Covers EACH combination of: Features Scenarios: e.g. multiple matches, no match, ambiguous request, invalid/incomplete input, unsupported feature, system error Persona: e.g. 
new user, expert user, non-native speaker, busy professional, technophobe, elderly user Generate data using existing data + synthetic data for each SPECIFIC combination of the above Evaluate based only on PASS/FAIL with a CRITIQUE detailed enough for a new employee. Include: Nuances: Something a failed response did well or a passed response didn’t quite do well Improvements: Suggest how model can improve Build an SPA to make it easy for the domain expert to review LLMs can be made to unlearn (copyright material) better by identifying components related to the knowledge to unlearn and applying a larger learning rate to these while leaving other parts unchanged. As opposed to low learning rates for all components. Ref
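The "covers EACH combination" advice for the LLM-as-a-judge dataset can be sketched as a simple cross product. This is a minimal illustration, not Hamel Husain's code: the feature names, the shortened scenario and persona lists, and the `build_test_grid` helper are all hypothetical.

```python
from itertools import product

# Hypothetical dimensions, loosely following the examples in the notes above.
FEATURES = ["order tracking", "refund request"]
SCENARIOS = ["multiple matches", "no match", "ambiguous request"]
PERSONAS = ["new user", "expert user", "non-native speaker"]


def build_test_grid(features, scenarios, personas):
    """Return one test-case stub per (feature, scenario, persona) combination."""
    return [
        {
            "feature": f,
            "scenario": s,
            "persona": p,
            "verdict": None,   # the domain expert fills in PASS or FAIL
            "critique": "",    # ...and a critique detailed enough for a new employee
        }
        for f, s, p in product(features, scenarios, personas)
    ]


grid = build_test_grid(FEATURES, SCENARIOS, PERSONAS)
print(len(grid))
print(grid[0])
```

Each stub would then be filled with real or synthetic inputs for that specific combination before the domain expert reviews it.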

Things I Learned - 03 Nov 2024

This week, I learned: Indian companies with 30+ employees MUST have 2.5%-15% of their employees as apprentices. Ref. TextNow and TextFree provide a free phone number (like a virtual SIM). (But TextFree has more ads.) Keep using it to avoid deactivation. There is no guarantee of retaining the number. Some banks don’t accept TextNow for verification SMS. But a voice call is OK. Tello and Red Pocket are cheap MVNOs with $5/month voice plans. Metro by T-Mobile and Cricket are other MVNOs. MintMobile and US Mobile have $15/month and $8/month data plans. The scientific discoveries that might have remained undiscovered for long if not for their discoverers. Ref. Newton’s discovery of the universal law of gravitation. Einstein’s discovery of General Relativity. McClintock’s discovery of Transposable Elements: genes that can turn physical characteristics on and off. Mullis’ invention of PCR, which makes billions of DNA copies rapidly. VibeCheck can predict a model based on its vibes 80% of the time. /llms.txt is a proposal to standardize /llms.txt files as a way for sites to share LLM-friendly content and prompts. Jina AI Meta Prompt is an example. Remotion system prompt is an example. https://docs.fastht.ml/llms-ctx.txt https://docs.fastht.ml/llms-ctx-full.txt structuredClone deep clones objects in JS. F5-TTS clones voices with just 15-second samples. Rust has crazy low memory usage too. Spawning thousands of child processes is common and OK these days. Ref. setInterval is a good idea in cyborg scraping. Ref. GH CLI is quite good for deployment too, like Wrangler CLI: enabling pages, setting secrets, etc. Restic is a CLI backup tool. Just like git. Works well with rclone. NotebookLlama is an open source podcast generator like NotebookLM. Pragmatic Podcast (I forgot which one): Automate changelogs for your codebases. Convert past commits into attractive release notes automatically. AI is going to be the consumer of many tools and logs. Build converters for these. Speed of validation such as linting, testing, etc. 
will allow LLMs to iterate faster and WILL become more important Via Soumya Ranjan Vision embedding is useful in agile modeling Vision embedding models with SAM, Grounding Dino by meta, Alibaba does good stuff Vision embedding is more useful in batch than real time Embedding subtraction with vision embedding models like Dino AI code editors are not good with large code bases today. Keep the refactoring exercises to below 1000 lines. Also evaluate the ease of setting it up locally Deepseek Janus is a 1.3b model that can generate both text AND images (and also supports vision) Cohere Multimodal Embed v3 is available on Azure. Elevenlabs lets you create voices with a prompt. No need to even clone one! Runway Act One creates expressive character performances

Things I Learned - 27 Oct 2024

This week, I learned: LanceDB is a more scalable alternative to ChromaDB. Written in Rust. Does not require a separate HNSW library. Meta has a bunch of image embedding models: DINOv2 creates image embeddings (Apr 2023). ImageBind is an embedding model for text, images, audio, and more (Jun 2023). Gemini has a code execution API! 0x0.st is an open API-based file upload + URL shortening service. You can dump files there temporarily. noVNC is a JavaScript VNC client. You can control a remote (virtual) machine from your browser. Friend is an always-recording pendant that you can ask questions. Anthropic’s new Sonnet model is even better at code. Plus it has the ability to extract coordinates from images. Ref. Gemini sort-of supports diarization. Ref. I tried it and it’s OK but not perfect. #IMPOSSIBLE LLMs cannot diarize reliably yet. (Gemini just guesses the speaker differences.) Replit is good for hobbyists, Cursor for developers, and Pythagora & Bolt for non-developers building business apps. Ref

Things I Learned - 20 Oct 2024

This week, I learned: SQLite optimizations for multi-threaded web applications. Ref. PRAGMA journal_mode = WAL. Improves performance for frequent writes. It allows concurrent reads and writes. PRAGMA synchronous = NORMAL. Improves performance. We might lose a few transactions but won’t corrupt the database. PRAGMA mmap_size = 128000000. Lets SQLite access up to ~128 MB of the database file via memory-mapped I/O, so processes can share pages. PRAGMA journal_size_limit = 64000000. Caps the size of the WAL file kept on disk, preventing unlimited growth. BEGIN IMMEDIATE instead of BEGIN. Takes the write lock when the transaction starts rather than at its first write, so a transaction waits or fails up front instead of midway. Improves concurrency. AI seems to be slowing down apprenticeship since experts would rather use an AI than train an apprentice. Example: Robotic surgery. Ref. How AI can improve education performance and engagement. Ref. Student: Create study plan based on course and schedule. Student: Focus on what you need to learn more. Student: Align with your study style and pace. Teacher: Grading MCQs. Teacher: Writing conceptual guides. “New collar workers” was coined by Ginni Rometty. Embed tutor or document in video and ask for clarification! This is a new embedded interface. #TODO Playing Bad Apple in Minecraft. Ultra cool! OpenAI has a prompt generator. Currently it uses a meta-prompt but may later move to DSPy or Gradient Descent. Ref. Great demo of using the Realtime API to read the latest Hacker News. Ref. LLMs have reached the point where they can show a world, like Counter-Strike, in near real time. Ref
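The SQLite settings listed at the start of this week's notes can be applied from Python's built-in `sqlite3` module. This is a minimal sketch under assumptions: the `notes` table and the `open_tuned` and `add_note` helpers are made up for illustration, and the values are simply the ones quoted in the notes.

```python
import os
import sqlite3
import tempfile


def open_tuned(path):
    """Open a SQLite database and apply the PRAGMAs from the notes."""
    # isolation_level=None puts the connection in autocommit mode, so the
    # code below issues BEGIN IMMEDIATE / COMMIT explicitly.
    conn = sqlite3.connect(path, isolation_level=None)
    conn.execute("PRAGMA journal_mode = WAL")
    conn.execute("PRAGMA synchronous = NORMAL")
    conn.execute("PRAGMA mmap_size = 128000000")
    conn.execute("PRAGMA journal_size_limit = 64000000")
    return conn


def add_note(conn, text):
    """Insert one row in a transaction that takes the write lock up front."""
    conn.execute("BEGIN IMMEDIATE")
    try:
        conn.execute("INSERT INTO notes (body) VALUES (?)", (text,))
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise


# Hypothetical demo database in a temporary directory.
db_path = os.path.join(tempfile.mkdtemp(), "demo.db")
conn = open_tuned(db_path)
conn.execute("CREATE TABLE IF NOT EXISTS notes (id INTEGER PRIMARY KEY, body TEXT)")
add_note(conn, "hello")
journal_mode = conn.execute("PRAGMA journal_mode").fetchone()[0]
row_count = conn.execute("SELECT COUNT(*) FROM notes").fetchone()[0]
print(journal_mode, row_count)
```

`journal_mode = WAL` is stored in the database file, while the other PRAGMAs are per-connection, so a real app would run `open_tuned` for every new connection.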

Things I Learned - 13 Oct 2024

This week, I learned: DuckDB supports function chaining. DuckDB lets you create functions (macros). HTML for People is a nice introduction to HTML. FlightRadar24 lets you watch airplanes live. sq is like jq but for SQL. Deno 2 is fully backward compatible with Node! (via) O1 is good at solving problems where the solution is easy to verify and generating options helps get closer to the solution. Reverb ASR does diarization as well as transcription. It seems to be the state of the art right now. Gemini Flash and Gemini Flash 8b can be fine-tuned at zero cost. Inference is at the same price! Ref. Flux 1.1 Pro is released. I tried my Calvin & Hobbes test on it. Not great. Imagen 3 is better, ChatGPT is the best. Ref. Revisiting text to speech models. Nothing much has changed since July 2024. OpenAI TTS: $15/1M chars Ref. Deepgram Aura: $15/1M chars Ref. Azure AI Speech: $15/1M chars Ref. Google TTS Neural2: $16/1M chars Ref. AWS Polly Neural TTS: $16/1M chars Ref. Cartesia Pro: $50/1M chars Ref. ElevenLabs Scale: $300/1M chars Ref. GitHub Copilot Workspace lets you code with AI from your mobile and deploy in one shot. If you need an Ubuntu Docker container with Python, install it via uv rather than compiling from source. (via) VTracer is an open source library (and tool) to convert raster images to SVGs. (via) If you want to create a console.llm() function, a browser extension is the best way, because some pages have a Content-Security-Policy that blocks eval, form submission, fetch from other domains, and script execution. PyPI lets you publish from GitHub Actions without a token. Also from GitLab.com CI/CD and Google Cloud. ActiveState, which made ActivePython, ActivePerl, etc., made these products paid for commercial use around 2013 after a series of acquisitions. Marimo supports: Publishing any notebook to static.marimo.app as a static app. Creating a SINGLE link that embeds the ENTIRE notebook in the URL! 
Runnable via uvx marimo edit. Parables on the Power of Planning in AI: Giving models about 30 seconds of thinking time consistently improves results - as much as increasing parameter size by a factor of 1,000 to 100,000! This works particularly well for verifiable results (code, math, etc.) Technique: Ask an LLM hundreds of times at low temperature and pick the most common answer. (Google’s Minerva used this on the MATH dataset.) Better Technique: Ask an LLM hundreds of times. Pick the best solution based on an evaluation metric (reward model). Better Technique: Apply a reward model at EACH step of the process. OpenAI’s “Let’s Verify Step by Step”. Late chunking is an interesting approach to adding context to embeddings. (I don’t understand it, but it’s cheap and effective.) DeepInfra offers embedding models as APIs at about 0.5 to 1 cent per MTok in an OpenAI-compatible API. It also supports text-to-image models like flux.dev and speech recognition models like Whisper. Jake Heller: “One of the things we learned is (an LLM app) after it passes frankly even 100 tests, the odds that it will do, on any random distribution of user inputs, the next 100,000 100% accurately is very high.” OpenAI’s O1 is like Daniel Kahneman’s System 2 thinking - as against other LLMs’ System 1 thinking. Continue.dev is another AI coding editor. It supports OpenRouter. So now I have heard good things about: GitHub Copilot, Cursor, Cody, Continue.dev (supports OpenRouter), Aider (supports OpenRouter). Maybe: Codeium. Not: Amazon Q Developer.
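The first two techniques from "Parables on the Power of Planning in AI" (pick the most common answer, or pick the answer a reward model scores highest) can be sketched in a few lines. This is a toy illustration: `noisy_model` and `reward` are hypothetical stand-ins for an LLM call and a reward model, not anything from the talk.

```python
import random
from collections import Counter


def majority_vote(sample_answer, n):
    """Call sample_answer() n times and return the most common answer."""
    counts = Counter(sample_answer() for _ in range(n))
    return counts.most_common(1)[0][0]


def best_of_n(sample_answer, score, n):
    """Call sample_answer() n times and return the answer with the highest score."""
    return max((sample_answer() for _ in range(n)), key=score)


# Hypothetical stand-in for an LLM: answers "42" about 60% of the time.
rng = random.Random(0)


def noisy_model():
    return "42" if rng.random() < 0.6 else rng.choice(["41", "43", "44"])


# Hypothetical stand-in for a reward model or verifier.
def reward(answer):
    return 1.0 if answer == "42" else 0.0


voted = majority_vote(noisy_model, 200)
picked = best_of_n(noisy_model, reward, 20)
print(voted, picked)
```

Majority voting needs no scorer but only helps when the right answer is the single most frequent one. Best-of-N can pick out a rare correct answer, provided the scorer is reliable.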

Things I Learned - 06 Oct 2024

This week, I learned: ffmpeg on WASM works but is unstable and hard to use. You can’t use it from a CDN without CORS issues, since it loads ffmpeg-core via a worker. It often runs into buffer allocation issues. Exotel and Plivo provide voice & SMS services in India (like Twilio). Plivo is more customer friendly. Uber’s H3, Google’s S2, and GeoHash are geospatial indexing systems. H3 offers uniform cell sizes and better distance measurement. S2 offers higher precision (factoring in Earth’s curvature) for exact location matches. GeoHash is the simplest. There’s a movement towards embeddable databases on the cloud. MotherDuck is hosted DuckDB. Turso is hosted SQLite (with local sync, multi-tenant). StarBase DB is SQLite with an API on top of Cloudflare Durable Objects. Software 2.0 by Andrej Karpathy. This is fundamentally altering the programming paradigm by which we iterate on our software, as the teams split in two: the 2.0 programmers (data labelers) edit and grow the datasets, while a few 1.0 programmers maintain and iterate on the surrounding training code infrastructure, analytics, visualizations and labeling interfaces. Adaptive UI ideas: Adaptive Fields: Show only required fields based on what the user has filled in so far. Smart Inputs: Dropdowns and auto-complete based on user’s context. Smart Themes: Change font size, contrast, theme by guessing the user’s age and preferences. Dynamic Menus: Show what they might need to do next. Like Nokia’s right button, but using LLMs. Smart Tooltips: Check what the user’s doing (delays, confusions, previous clicks, current actions) and show relevant tips. Personalized Layout: Show only the relevant sections of the app. E.g. based on what they’re doing. Smart Charts: Create the right chart that solves the user’s question. 
Adaptive Back-end: Dynamic APIs: Create endpoints on the fly based on user needs. Dynamic Indexing: Create & update indices on the fly based on user needs. Dynamic Schema: Create & update schema on the fly based on user needs. Dynamic Migration: Migrate to a new database or OS or language as required. Dynamic Queries: Create SQL/NoSQL queries to solve the user problem. Dynamic RBAC: Figure out who needs permissions and why. Add OR REMOVE access as required. Dynamic Logging: Log what’s required. Explain why it’s logged and what’s happening. Fix code that raised the error. Dynamic Caching: Cache what’s likely to be required. Evict what may not be required. Figure out cache keys. Aider LLM Leaderboards show which LLMs code better. As of now, o1-preview > claude-3.5 sonnet on code editing. claude-3-opus > claude-3.5-sonnet on code refactoring. deepseek-coder-v gpt-4o-mini sucks. Jaro-Winkler Distance is a string matching algorithm that weights the start of a string higher. Passing the feed of the following to NotebookLM is a good way to get caught up with news and summaries. A blog / WhatsApp group (e.g. The Generative AI Group, Sithamalli, etc.) A Google Group / mailing list (e.g. genainews, datameet). YouTube channels (e.g. Veritasium, GitHub). Hacker News top stories. Research papers. Emails (skipping marketing emails). OpenAI Evals and Distillation have a clever design. They just convert filtered history to .JSONL files that can be an input to either. Speak is a language learning app based on OpenAI’s Realtime API. OpenAI’s Realtime API can be used in a text-to-text chat mode without needing to send the entire context. If the pricing works out right, this can be far cheaper than sending the entire conversation context. Ref. Matching addresses with just embeddings works well. Combine it with simple hard rules. Ref. OpenAI’s prompt caching works for images too – both linked and embedded. Quotes on Graph RAG from a Generative AI WhatsApp Group. 
“Damn so literally nobody uses Graph RAG yet. Good to know.” ~Sumba “A big four consulting firm uses GraphRAG to retrieve related documents and excerpts from governance and compliance docs.” ~Vinayak Hegde (Microsoft) “Graph RAG is expensive and unnecessary in most of the cases.” ~Utkarsh Saxena ChatGPT’s advanced mode includes: “…you can use various regional accents and dialects.” Ref Source But the API can “laugh, whisper, and adhere to tone direction.” Ref Hume API (INR 6/min) is far cheaper than OpenAI’s real-time chat (6c/min input + 24c/min output) Devika is an open-source clone of Devin. DuckDB runs inside Pyodide Hungarian Jews have genetic diseases that increase their IQ. Gaucher’s disease, Torsion dystonia. People don’t like hard stuff like maths or science, so richer societies have fewer scientists Ethan Mollick feels Claude 3.5 Sonnet is better at style and critiquing blog posts than OpenAI’s o1 (which is better at reasoning.) News is going to be crazily disrupted again with voice mode. I can just listen to the topic I want In Singapore Airlines, You can’t wear your seatbelt loose You have to keep the laptop in the pocket in front, not on your lap, during takeoff You can’t charge during takeoff They verify if you ask for a veg meal and place a sticker on your seat Coders are more likely to edit LLM code. Non-coders don’t have that bad habit. Vaishnavi and Ranjeet edited code Indal and Koustav didn’t Coders are likely to get more out of an LLM because they know what it can do. But some non-coders will get more out of an LLM because they don’t know what it can’t do. E.g. Indal trying for a confetti animation, which is hard but do-able “You have to put in a lot of work to become productive at AI coding.” Simon Willison
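The Jaro-Winkler note from earlier this week ("weights the start of a string higher") is easier to see in code. This is a plain-Python sketch of the textbook definition, written for illustration rather than taken from any library. It returns a similarity between 0 and 1; the corresponding distance is 1 minus that value. The 0.1 prefix weight and 4-character prefix cap are the conventional defaults.

```python
def jaro(s1, s2):
    """Jaro similarity in [0, 1]; 1.0 means the strings are identical."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    # Characters count as matching only if they are within this many positions.
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    matched1 = [False] * len(s1)
    matched2 = [False] * len(s2)
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (
        matches / len(s1)
        + matches / len(s2)
        + (matches - transpositions) / matches
    ) / 3


def jaro_winkler(s1, s2, prefix_weight=0.1, max_prefix=4):
    """Jaro similarity boosted by the length of the common prefix."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * prefix_weight * (1 - base)


print(round(jaro("MARTHA", "MARHTA"), 4))
print(round(jaro_winkler("MARTHA", "MARHTA"), 4))
```

Because of the prefix boost, two strings that agree at the start score higher than two strings with the same edits at the end, which suits names and addresses where the first characters tend to be the most reliable.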

Things I Learned - 29 Sep 2024

This week, I learned: Pyodide can access the DOM and JavaScript in the browser. Jupyter Lite lets you run Jupyter notebooks in the browser. AVIFS (animated AVIF) compresses about 10X better than GIF. I tried creating one via EZGIF AVIF Maker and the .avifs file created was 15X smaller! ffmpeg -i input.gif -c:v libaom-av1 -crf 30 -b:v 0 -cpu-used 4 -tiles -an output.avif Claude 3.5 thinks .opus is the best format to compress audio. It used ffmpeg -i audio.wav -c:a libopus -b:a 16k -application voip -vbr on -compression_level 10 audio.opus API coding best practices. Source via Simon Willison: Always add screenshots to the README. They never break. Always add every example. Humans think in examples. Avoid defaults and be explicit unless 99% of the usage is with the default. Make the feedback loops incredibly fast. Make deprecations easy for users to deal with. Keep objects immutable. PyMuPDF4LLM can convert PDFs to Markdown. It handles tables, too. 04 Oct 2024. PDF-Extract-Kit does PDF layout, formula, table, and OCR extraction using various models. 04 Oct 2024. llmsherpa extracts PDF layout and tables, but does not do OCR. When evaluating feasibility of technology with LLMs, always ask for multiple options and pick from those. Simon Willison. Gemini supports audio natively. Google Vertex AI has an OpenAI-compatible API but it works only for some models. Anthropic and Gemini are not compatible. When you paste HTML into Excel, it automatically changes the font of the cell to match the content in the HTML! Aptos is the new default font in Office - replacing Calibri. Anthropic’s Introducing Contextual Retrieval says: Use BM25 in addition to embeddings to match rare terms (e.g. identifiers). Add a context to each chunk’s metadata (generate it with a cheap LLM) and pass it to the summarizing LLM. Reranking helps with cost AND accuracy. Use Cohere or Voyage. Sentient lets you control the browser via Python in natural language.
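The "BM25 in addition to embeddings" advice needs some way to merge two ranked lists. The notes don't say which method to use, so the sketch below assumes reciprocal rank fusion, one common choice. The document ids and the two hit lists are invented, and `k=60` is just the conventional default.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document ids into one fused ranking."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes more for documents it ranks higher.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results for a query containing a rare identifier.
bm25_hits = ["doc_ts999", "doc_errors", "doc_intro"]      # keyword search
embedding_hits = ["doc_errors", "doc_faq", "doc_ts999"]   # vector search
fused = reciprocal_rank_fusion([bm25_hits, embedding_hits])
print(fused)
```

Documents that both retrievers agree on rise to the top, while a document that only BM25 ranks highly (an exact identifier match, say) still stays near the top instead of being lost. A reranker would then reorder the top few fused results.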

Things I Learned - 22 Sep 2024

This week, I learned: E2E is a cheap GPU hosting provider for India. About Rs 100/hr for a V100 16GB. NVIDIA Jetson is like a Raspberry Pi with a GPU! But it’s expensive. Sarvam.ai offers Indic text to speech. Jupyter Lite lets you run Jupyter notebooks in the browser. Piston lets you run Python code via a REST API. <link rel="modulepreload"> lets you load and compile modules early! Ollama 0.2 can handle concurrent requests with only a little additional memory. (So can vLLM and DeepSpeed.) Prompt engineering for code generators: Claude Artifacts Prompt; Val.Townie system prompt (good example of how to create); Cursor editing; Cursor debugging; Cursor conversation. XML tags seem best to structure prompts across LLMs (Claude, OpenAI, Gemini). Instructor prompts by Ethan Mollick help teach better. Non-negative matrix factorization apparently aligns to intuition more than K-Means and hence would be a great fit for most cosine-similarity matrices (via Jaidev). Segmind’s Hallo lets you animate a face to an audio clip. VoidEditor aims to be an open source Cursor alternative. Video of ChatGPT o1 + mini reproducing the methodology of a paper by writing the code - in 6 iterations. Here’s the repo. Prompts: You are a Python and Astrophysics expert who is tasked with helping me on my research project. Please read the following methods section of this research paper and re-create the Python code described. Thank you, this code looks really nice. I don’t have any actual data or noise cube ready at the moment, but could you please generate some test data that can be used in the code you just wrote: {CODE} Hi. thank you for writing the code! Unfortunately, it seems that I get an error when I try to run it. I’ve attached the error message below, can you please refine the code so that the error is resolved? {ERROR} Thank you, but when attempting to run the code that you provided, I received the following error: {ERROR} Hello, thank you for the code. 
but now I get the following error pasted below: {ERROR} Thank you, I think we are getting close to a final solution. I still get an error, which I’ve pasted below: {ERROR} Groq, SambaNova and Cerebras are fast inference providers. All appear to be free. The skills required to vet the AI’s response are the same skills used to vet a Pull Request. It’s a good way to teach code review. Source: My personal guide for developing software with AI. Prompt engineering tip: Tell LLMs another AI wrote the code. Else they will agree with you!

Things I Learned - 15 Sep 2024

This week, I learned: Hume provides a voice-to-voice model (EVI 2) that handles emotions at 7 cents/minute. OpenArt workflows has image generation workflows. Pixtral seems quite good at OCR. LLM coding: Makes you more ambitious. Lets you code without stress. (Just pass it the error and have it fix it. Or find another approach.) Is unlimited. You can run dozens of agents in parallel. Simon Willison’s crowdsourced list of prompt engineering hacks. “Invest in things that don’t change.” Jeff Bezos. Like faster delivery, SQL, web platform. Medical cost in Singapore (for insurance coverage) - via Kumar. Root canal at clinic: $1,300. Crown replacement at clinic: $1,300. Periodontist (gums) at hospital: $2,500. OAuth from First Principles is a SIMPLE explanation of OAuth. Conclusion: “You probably shouldn’t implement your own OAuth client.” Alphaxiv is Arxiv.org but with author comments and chat. The Impact of AI on Computer Science Education: Eric Klopfer divided his undergrad CS class into three groups and gave them a Fortran task. One used ChatGPT. Another used Meta’s Code Llama LLM. The third used only Google. The ChatGPT group was faster than the Code Llama group, which was faster than the Google group. When tested on the approach, the ChatGPT group remembered nothing. Half the Code Llama group passed. The Google group passed fully. Server-side implementation of an OAuth2 client is too complex. Best to delegate this to Auth0. Via Pratap Vardhan: At Khan Academy, every developer working on Khanmigo has Cursor. Everyone who’s contributed to a Khan Academy GitHub repo has GitHub Copilot. I stopped using Google + StackOverflow 2 years ago. I use ChatGPT, Copilot, etc. For humans, I ask Reddit. Excited by async agents. Things that do my job while I sleep. Zapier notifications. Monitor what happens. Put it into a flow diagram and alert me. Every month, did my broker trade? Did my bank transaction fail? Did I pay my electricity bill? Every time you delegate, use an agent instead. Read my RSS feeds. 
Read my browser history and suggest interests. Plan a session in Bain, BCG, etc. on Artifacts. Explore sparse embeddings. More effective. ColPali, ColBERT

Things I Learned - 08 Sep 2024

This week, I learned: When running a Hello world app: FastAPI takes ~26K RAM, 3% CPU. NodeJS + Express takes ~62K RAM, 2% CPU. Deno + Express takes ~62K RAM, 1% CPU. Deno + Fresh takes ~54K RAM, 0.4% CPU. I was testing out different video LLMs: Luma Labs lets you create videos from text. Runway ML lets you create video from an image + text. Viggle lets you add images to a video or move a character in a certain way. Veed.io is a video editor that offers AI video editing features. DeepMotion generates 3D animations from video. Wonder Dynamics may be similar to DeepMotion. I tested out a few audio LLMs: Suno is fast, has a better UI, lots of examples. Udio is slow, has a poor UI, creates richer music. Reflection 70b is one of the top models now, and is open source! It works by making the LLM reflect on its answer inside <reflection>...</reflection> tags. The best diarization model today is whisperX. Run on Colab T4 GPU with: Scale’s SEAL Leaderboards seem fairly good. coedit-xxl is Grammarly’s google/flan-t5-xxl model fine-tuned on CoEdIT, a text editing dataset. It’s mainly for single-line editing, though, and far from a full-document or full-email zero-shot editor.
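Since Reflection 70b puts its self-checks inside `<reflection>...</reflection>` tags, an app would probably want to separate those from the answer it shows to users. This is a small sketch with an invented sample output; the model's real output format may differ, and `split_reflections` is a hypothetical helper.

```python
import re

REFLECTION_RE = re.compile(r"<reflection>(.*?)</reflection>", re.DOTALL)


def split_reflections(text):
    """Return (list of reflection texts, text with the reflection tags removed)."""
    reflections = [m.strip() for m in REFLECTION_RE.findall(text)]
    cleaned = REFLECTION_RE.sub("", text)
    cleaned = re.sub(r"\s+", " ", cleaned).strip()  # tidy leftover whitespace
    return reflections, cleaned


# Hypothetical model output, for illustration only.
raw = (
    "The answer is 12. <reflection>I multiplied 3 by 4, but the question "
    "asked for 3 plus 4.</reflection> The answer is 7."
)
notes, answer = split_reflections(raw)
print(notes)
print(answer)
```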

Things I Learned - 01 Sep 2024

This week, I learned: LLMs are so good that they can simulate Doom in real time. GameNGen. Val.town’s code generation system prompt uses https://maxm-imggenurl.web.val.run/the-description-of-your-image to dynamically generate images. Practice for each thought: “What would make me change my mind? How likely is that?” Cursor uses speculative edits and a variety of other techniques to speed up code editing. ChatGPT does a better job at cartoon generation than even Flux.1

Things I Learned - 25 Aug 2024

This week, I learned: Karya.in is creating high quality datasets. Suhel mentioned them An 8-year old uses Cursor.ai to code Hermes 3 has special tokens like <SCRATCHPAD>, <RESTATEMENT>, <THOUGHT_*>, <PYDANTIC_SCHEMAS>, <SCHEMA_*>, <REASONING>, <INNER_MONOLOGUE>, <PLAN>, <EXECUTION>, <REFLECTION>, <THINKING>, <SOLUTION>, <EXPLANATION>, <UNIT_TEST>, etc. This extends the capability dramatically. Lumentis creates docs from transcripts and text LLMs write worse code in JSON than Markdown Copilot’s system prompt calls a search_enterprise(query: str) tool and a hint(M365Copilot_language: str) tool as assistants. Anthropic Prompt Caching is 90% cheaper to use and 25% costlier to create. So if there’s a 27% chance it’ll be re-used, cache it.
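The break-even figure in the prompt caching note can be checked with a little arithmetic. Take a normal prompt as costing 1.0, a cache write as 1.25 (25% costlier) and a cache read as 0.10 (90% cheaper), and let p be the chance of exactly one re-use. This is a simplified model that ignores cache expiry and multiple re-uses; the function names are made up.

```python
def cost_without_cache(p):
    """Expected cost: one normal call plus a normal re-use with probability p."""
    return 1.0 + p * 1.0


def cost_with_cache(p, write_premium=0.25, read_discount=0.90):
    """Expected cost: one cache write plus a discounted re-use with probability p."""
    return (1.0 + write_premium) + p * (1.0 - read_discount)


def break_even_probability(write_premium=0.25, read_discount=0.90):
    """Solve 1 + p = (1 + write_premium) + p * (1 - read_discount) for p."""
    return write_premium / read_discount


p_star = break_even_probability()
print(round(p_star, 3))
print(cost_without_cache(0.5), cost_with_cache(0.5))
```

The break-even works out to 0.25 / 0.90, or roughly 28%, close to the 27% quoted above. Any prompt more likely than that to be re-used is worth caching, and further re-uses only make the case stronger.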

Things I Learned - 18 Aug 2024

This week, I learned: Code agent frameworks to explore: Cognition Factory Codegen Some interesting multi-modal generation models / tools to explore: Flux for open-weights image generation Runway Gen 3 for video generation Suno for music generation DocxTemplater is SlideSense but open-core and handles DOCX as well! handle = await window.showDirectoryPicker() lets you access the browser File system API.

Things I Learned - 11 Aug 2024

This week, I learned: Embedding models can be fine-tuned. Example: #TODO Agentic RAG (Ravi Theja, LlamaIndex) RAG via top-k retrieval fails with summarization => need to read all chunks comparison: compare product X vs Y => need to split and re-combine structured analytics. e.g. most expensive employees => Text2SQL first multi-part questions. e.g. Tell me about speed of model X AND cost of model Y and recommend => need to split and re-combine RAG failures: It’s single shot. No query planning. No tools. No correction. No memory. Agents that help in RAG Route to the right tool E.g. retrieve via vector top-k search or vector summary search or keyword search or combination? One-shot query planning E.g. Break query into multiple specific queries. RAG those. Then combine. #TRY - maybe in DocSearch Tool use E.g. Schema retrieval, Text2SQL, Calendar, Chat, APIs, Search, etc. Agent orchestration ReAct: An agent reasoning loop. Reason + Act. {Thought, Action, Action Input, Observation}*. Orchestrate tools with a prompt Multi-agent task solver: Llama agents Instead of a single agent loop, use different agents. Also allows parallelization Allow services to register. (MS TaskWeaver stores tool descriptions in YAML) LlamaHub Tools has ideas for agents Notes on LLM Fine-Tuning ROUGE-2, BLEU and similar metrics are NOT good. Create your own benchmarks Non-PEFT fine tuning needs 6X GPU RAM. Optimizer states, Gradient, Activations are the overhead. PEFT is about tuning a subset of parameters. LORA adds additional weights without updating the model. It’s a low rank matrix multiplication. You can change these adapters at runtime. Saves space. Fast to train Quantization: Stick to bitsandbytes or AWQ (may be a bit better) QLORA = Quantization + LORA Predibase has open-sourced Lora Adapters in “Lora Land”. Existing adapters are pretty good. ghcr.io/predibase/lorax:main Docker image works on Docker compose to run locally. 
devices: on Docker Compose lets you specify NVIDIA GPU devices Locust is an HTTP load testing lib in Python Techniques for inference optimization Dynamic adapters: Loads right LORAX adapters WHEN a request comes in Multi-adapter batching: Process all inputs in parallel on the same GPU, but different users are post-processed using different adapters Notes from a 4-hour flight: What We’ve Learned From A Year of Building with LLMs Strategy IS IT TOO HARD/EXPENSIVE? Log it. LLMs are getting cheaper and better. WILL OPENAI BUILD IT? If so, wait for it instead of building. HAS A STARTUP BUILT IT? If so, use it instead. It’s a generic use case there’s no point re-inventing. FOCUSED USE CASES over generic. Build trust by starting small. Tools for LLM Ops (feedback): LangSmith, Log10, LangFuse, W&B Weave, HoneyHive #TRY Human in the Loop is about humans evaluating model outputs. That’s different from AI in the loop, human in the center, where AI accelerates human output (like Github Copilot) Operations CHECK EMBEDDINGS DRIFT over time. Users might be inputting different things than before. LOG AND REVIEW everything. Instructor coaxes structured output from LLM APIs. #TRY IMPLICIT FEEDBACK collection is easy. Just let users edit stuff. #TRY Tactical Try n-shot prompting (n=5-12) before bigger models. #TRY Always structure for output: Markdown, XML/HTML tags. Combine RAG with Keyword search. It reduces user frustration in edge cases. Prefer multiple small prompts to one big prompt. Do X. Then Y. Then Z. Jitter prompts for diversity beyond temperature. LLM-as-judge works better when comparing outputs (not rating 1 output). Keep length similar (LLMs prefer wordiness). Swap order and compare. Allow for ties. Ask for reason FIRST. 
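The LLM-as-judge tips above (compare pairs, swap order, allow ties) could be sketched roughly like this. The `Judge` callable is a stand-in I made up: in practice it would wrap an LLM call that asks for a reason first and then answers "first", "second" or "tie". The two toy judges exist only to make the sketch runnable.

```python
from typing import Callable

# Hypothetical judge signature: (question, first_answer, second_answer)
# returns "first", "second" or "tie".
Judge = Callable[[str, str, str], str]


def pairwise_verdict(judge: Judge, question: str, a: str, b: str) -> str:
    """Return "A", "B" or "tie" after judging the pair in both orders."""
    forward = judge(question, a, b)    # answer A is shown first
    backward = judge(question, b, a)   # answer B is shown first
    forward_winner = {"first": "A", "second": "B"}.get(forward, "tie")
    backward_winner = {"first": "B", "second": "A"}.get(backward, "tie")
    # Only trust a win that survives the order swap; otherwise call it a tie.
    if forward_winner == backward_winner:
        return forward_winner
    return "tie"


def first_position_judge(question: str, first: str, second: str) -> str:
    """Toy judge with pure position bias: always prefers what it sees first."""
    return "first"


def keyword_judge(question: str, first: str, second: str) -> str:
    """Toy judge that prefers the answer containing "42"."""
    hit_first, hit_second = "42" in first, "42" in second
    if hit_first == hit_second:
        return "tie"
    return "first" if hit_first else "second"


if __name__ == "__main__":
    print(pairwise_verdict(first_position_judge, "q", "x", "y"))
    print(pairwise_verdict(keyword_judge, "q", "It is 42", "It is 7"))
```

The point of the swap is that a judge with position bias contradicts itself across the two orders, so its preference collapses to a tie instead of being counted as a win.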
Hermes: A Text-to-SQL solution at Swiggy “Hermes performed significantly better for charters with well-defined metadata and a relatively smaller number of tables.” “We collect feedback on the accuracy of the returned query from stakeholders directly within the Slack bot.” How I use AI and “Replacing my right hand with AI” EMBED in every app/workflow. E.g. Auto-fix spellings. Auto-review code. Auto-ask LLM on errors and apply patch! Auto-search for answer, assess, continue. PERSIST. Stick with the LLM to the end. Don’t fix it yourself. It’s faster. #TRY INTERVENE FAST. If an LLM can’t solve it by itself in 2 tries, it needs in-depth help. APP-IFY one-off tasks. Disposable tools. “Write web-app to convert JSON to tab-delimited.” “Extract fields as a table.” “Diff JSON.” #TRY BEST language/frameworks preferred. CUDA in Python. Rust. C. Raspberry Pi. Arduino. Bluetooth. Modern ESM/JS. #TRY TEACH examples. “Here’s the LLM Foundry API.” “Here’s how to use gramex.data.” DUMP entire code. Models can handle it. Refactoring to SQLAlchemy 2, Pandas 2. API Documentation. Test case generation. #TRY ASK for features & packages. Docker without root access. GPU access inside docker. Windows CLI-only C++ compiler. TEST CASE writing. #TRY SPEC IN DETAIL. Use these libraries. Write like this: code example. SPEC USAGE in detail. “I will just pipe it into sqlite”, or “I will just run ffmpeg -i filename [YOUR OPTIONS]”. Describe the UI, API input/output, data structure, and internal data structure. HELP on usage. “ffmpeg to get audio.mp3”. My benchmark for large language models LLM(text) is a useful function to have in JS and Python too. 
Useful as a simple pip install llmfoundry Allow images, files in LLM() Current list of #IMPOSSIBLE (or hard) things for LLMs Translate technical documents to Dutch – because they don’t understand the technical terms well Translate large documents (JSON to XML, English to Chinese, Python to Rust, Wrong to right spelling) – because the output tokens are limited micro-agent generates test cases first when asked to build an app. Then it iterates until the test cases pass. Alternative interfaces to YouTube: Piped.video, CloudTube, Invidious, NewPipe, FreeTube Deepseek Context Caching reduces price to 1.4 cents/MTok for portions of chat messages that are repeated. That’s a 10X reduction for long conversations!

Things I Learned - 04 Aug 2024

This week, I learned: Assisted generation uses a faster LLM to generate text and a better (tokenizer-compatible) LLM to validate it. This makes generation faster. E.g. Gemma 2 2b with Gemma 2 27b Power Toys has an Advanced Paste that uses OpenAI to paste as Markdown or JSON! Interesting Turing-complete languages: find + mkdir, maybe sed and awk Minecraft’s Redstone Circuits Conway’s Game of Life Cellular Automata Rule 110 Magic: The Gathering SQL Excel Rev.ai does a good job of diarization. Cost: 2 cents per minute. Update: 6 Jun 2025. Cost: 0.33c/min Ref
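To make the assisted generation idea above concrete, here is a toy sketch of the draft-then-verify loop. It is not how any particular library implements it: the "models" are made-up deterministic functions from a token list to the next token, and the target is called once per position, whereas a real implementation would score all drafted positions in a single forward pass (which is where the speed-up comes from).

```python
from typing import Callable, List

# Hypothetical stand-in for a greedy LLM: token list in, next token out.
NextToken = Callable[[List[str]], str]


def assisted_generate(draft: NextToken, target: NextToken,
                      prompt: List[str], max_new_tokens: int,
                      lookahead: int = 4) -> List[str]:
    """Greedy draft-and-verify generation with toy models."""
    tokens = list(prompt)
    limit = len(prompt) + max_new_tokens
    while len(tokens) < limit:
        # 1. The fast draft model proposes a few tokens ahead.
        proposal: List[str] = []
        for _ in range(min(lookahead, limit - len(tokens))):
            proposal.append(draft(tokens + proposal))
        # 2. The better target model checks the proposal token by token.
        for proposed in proposal:
            expected = target(tokens)
            tokens.append(expected)      # always keep the target's token
            if expected != proposed:
                break                    # mismatch: drop the rest of the draft
    return tokens


PHRASE = ["the", "cat", "sat", "on", "the", "mat"]


def target_model(tokens: List[str]) -> str:
    """Toy 'better' model: recites PHRASE in a loop."""
    return PHRASE[len(tokens) % len(PHRASE)]


def draft_model(tokens: List[str]) -> str:
    """Toy 'faster' model: agrees with the target except it says dog for cat."""
    guess = target_model(tokens)
    return "dog" if guess == "cat" else guess


if __name__ == "__main__":
    print(assisted_generate(draft_model, target_model, [], 6))
```

In this greedy setup the output matches what the target model would produce on its own. The draft only decides how many tokens get accepted per verification round.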

Things I Learned - 28 Jul 2024

This week, I learned: Speech editing in audio files is a thing. Speech Editing Toolkit and Descript GPT 4o Mini is almost as good as GPT 4o in the LMSYS leaderboard. Llama 3.1 405B and Mistral Large 2 are yet to be evaluated. If LLMs can generate any text, and text can describe the real world, we can rapidly generate “artifacts” that generate: 3D Printable Models: STL (Stereolithography): Defines the surface geometry of 3D objects using triangular facets. OBJ (Wavefront OBJ): Describes 3D geometry including vertices, textures, and normals. X3D: An XML-based file format for representing 3D computer graphics. Vector Graphics: SVG (Scalable Vector Graphics): Defines vector-based graphics in XML format, useful for illustrations, diagrams, and user interface elements. CAD Drawings: DXF (Drawing Exchange Format): Represents CAD data, including shapes, lines, and curves, used in engineering and architecture. Circuit Designs: KiCAD: An open-source software suite for Electronic Design Automation (EDA), which uses various file formats like PCBNew and EESchema to represent circuit designs. Blueprints and Architectural Designs: GML (Geography Markup Language): Encodes geographical features and spatial information. CityGML: A specific GML application schema for modeling and exchanging 3D city models. Molecular Structures: PDB (Protein Data Bank): Describes the three-dimensional structures of molecules. CML (Chemical Markup Language): An XML-based standard for representing molecular data. Robotics and Automation: URDF (Unified Robot Description Format): Defines the physical configuration of a robot, including joints, links, and sensors. COLLADA (Collaborative Design Activity): An XML-based schema to describe digital assets for 3D applications, often used in robotics. Geospatial Data: KML (Keyhole Markup Language): Used for geographic data visualization, primarily in Google Earth. GeoJSON: A format for encoding a variety of geographic data structures using JSON. 
Mathematical Markup: MathML (Mathematical Markup Language): Describes mathematical notation and captures both its structure and content. Music and Sound: MusicXML: Encodes sheet music in a structured format that can be easily shared between different music notation software. Documents and Text: DocBook: A semantic markup language for technical documentation. Markdown: A lightweight markup language with plain text formatting syntax. Biological Data: SBML (Systems Biology Markup Language): Represents computational models of biological processes. PhyloXML: An XML format for representing phylogenetic trees. Game Development: FBX (Filmbox): A file format for 3D animation that can hold information about the geometry, textures, and animations. VRML (Virtual Reality Modeling Language): Describes interactive 3D objects and worlds. Data Visualization: ChartML: Encodes charts and graphs in a structured format. D3.js (Data-Driven Documents): Uses HTML, SVG, and CSS to bring data to life with interactive visualizations. Building Information Modeling (BIM): IFC (Industry Foundation Classes): Describes building and construction data. Textiles and Fabrics: LoomML: Represents the design and structure of woven fabrics. Augmented Reality and Virtual Reality: ARML (Augmented Reality Markup Language): Defines how augmented reality applications should behave and what content they should display. VRML (Virtual Reality Modeling Language): For describing interactive 3D objects and worlds. Medical Imaging and Health Data: DICOM (Digital Imaging and Communications in Medicine): Encodes medical imaging data. HL7 (Health Level 7): A set of standards for the exchange of information between medical applications. Simulation Data: FMI (Functional Mock-up Interface): Represents and exchanges dynamic simulation models. SBML (Systems Biology Markup Language): For computational models of biological processes. 
Sound and Audio: MML (Music Markup Language): For encoding music notation and performance information. SoundFont: A file format for defining musical instrument sounds. Animation and Visual Effects: BVH (Biovision Hierarchy): Encodes motion capture data. Alembic: A computer graphics interchange framework primarily for exchanging animation and visual effects data. Textile Patterns: WIF (Weaving Information File): Describes weaving patterns and structures. Knitting Markup Language: Encodes knitting patterns in a structured format. Scientific Data: CDF (Common Data Format): Used for storing scientific data. NetCDF (Network Common Data Form): Supports the creation, access, and sharing of array-oriented scientific data. Photography and Imaging: XMP (Extensible Metadata Platform): Used for embedding metadata in digital images and other media files. Construction and Engineering: LandXML: For civil engineering and land surveying data. gbXML (Green Building XML): Facilitates the transfer of building data for analysis of energy and environmental performance. Packaging and Retail: BPL (Barcode Product Labeling): Encodes information for product packaging and labeling. GS1 XML: Used for electronic business messaging, including product identification and tracking. Typography and Font Design: UFO (Unified Font Object): A format for storing font data. SFNT (Spline Font): Encodes scalable font information. Product Data Management: PLMXML (Product Lifecycle Management XML): Used for sharing product data across PLM systems. GPT 4o Mini can be fine-tuned! Awesome PaaS lists self-hosted deployment platforms. Piku - similar to Dokku – is promising.

Things I Learned - 21 Jul 2024

This week, I learned: GPT For Work has a set of useful spreadsheet LLM functions Xata offers a free PostgreSQL tier with REST API Mamba now uses mambaforge as the default installation, i.e. conda-forge is the default and only channel! Update: 6 Jun 2025. Mambaforge is sunset as of 29 Jul 2024. Conda-forge now uses Miniforge as the standard installer Ref conda-forge.org. Users should switch to Miniforge instead. nginx supports a load-balancing method least_conn which is far better than the default round-robin. #IMPOSSIBLE LLMs cannot provide a bounding box of objects in images. (Maybe Florence 2 can). Update: Mar 2025. Gemini has good timestamps and bounding boxes Models gently grow in capability. It helps to maintain an impossibility list that steadily gets invalidated. Ref Github Copilot internals walks through how Copilot constructs its prompts
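A tiny sketch of why least_conn can beat round-robin, as noted above. This is plain Python illustrating the selection rule, not nginx code or configuration: the function names and server names are made up, and it ignores server weights, which nginx also takes into account.

```python
from typing import Dict, List


def pick_round_robin(servers: List[str], request_index: int) -> str:
    """Round-robin: rotate through servers regardless of their load."""
    return servers[request_index % len(servers)]


def pick_least_conn(active: Dict[str, int]) -> str:
    """Least-connections: pick the server with the fewest active connections.

    Ties go to the server listed first.
    """
    return min(active, key=lambda name: active[name])


if __name__ == "__main__":
    # app1 is stuck with slow requests; app2 is nearly idle.
    active = {"app1": 5, "app2": 1, "app3": 3}
    print(pick_round_robin(list(active), request_index=3))  # ignores load
    print(pick_least_conn(active))                          # follows load
```

When request durations vary a lot, round-robin keeps sending traffic to a server that is already busy with slow requests, while the least-connections rule routes around it.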

Things I Learned - 14 Jul 2024

This week, I learned: Carlton’s TDS session Always create a new venv via VS Code when starting a training session. Helps reproduce issues (though I could use Colab instead) Create an empty .ipynb notebook and double-click it. That’s another way (though slower) to open a Jupyter notebook Shane Parrish Knowledge Project podcast. Three generations of wealth There is a big difference between liking animals and being a vet. Between liking education and being a teacher. Even if no one reads your writing, you benefit from the writing. Emotional crises like 9/11 or Covid are far easier for markets to recover from Hidden Brain podcast. Why trying too hard can backfire on you Sometimes conscious thinking makes our automated responses worse. Sports, music, dance are great examples Instead, SURRENDER to something outside of you. Like playing with kids. Exercise also sends blood away from brain. Drugs. ChatGPT. It’s called wu wei in Chinese philosophy A quick check on the pricing of text to speech models OpenAI TTS: $15/1M chars Ref Deepgram Aura: $15/1M chars Ref Elevenlabs Scale: $165/1M chars Ref Google TTS Neural2: $16/1M chars Ref Azure AI Speech: $15/1M chars Ref AWS Polly Neural TTS: $16/1M chars Ref

Things I Learned - 07 Jul 2024

This week, I learned: Predibase uses LORAX to run multiple fine-tunings of a base model in a single GPU via adapters. Ref