Image generation gets better at comics

I heard a lot about the new image generation models last week. So, I tested to see what’s improved. I gave the prompt below to various image generation models – old and new. A Calvin and Hobbes strip. Calvin is boxing Hobbes, with a dialog bubble from Calvin, saying “Bring it on!” Stable Diffusion XL Lightning Stable Diffusion XL Base Dall-E API ...

Things I Learned - 25 Aug 2024

This week, I learned: Karya.in is creating high-quality datasets. Suhel mentioned them. An 8-year-old uses Cursor.ai to code. Hermes 3 has special tokens like <SCRATCHPAD>, <RESTATEMENT>, <THOUGHT_*>, <PYDANTIC_SCHEMAS>, <SCHEMA_*>, <REASONING>, <INNER_MONOLOGUE>, <PLAN>, <EXECUTION>, <REFLECTION>, <THINKING>, <SOLUTION>, <EXPLANATION>, <UNIT_TEST>, etc. This extends the capability dramatically. Lumentis creates docs from transcripts and text. LLMs write worse code in JSON than Markdown. Copilot’s system prompt calls a search_enterprise(query: str) tool and a hint(M365Copilot_language: str) tool as assistants. Anthropic Prompt Caching is 90% cheaper to use and 25% costlier to create. So if there’s more than about a 28% chance (0.25 / 0.9) it’ll be re-used, cache it.
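The break-even figure for prompt caching can be checked with a little arithmetic. This is a minimal sketch, assuming a single possible re-use and measuring cost in units of one normal prompt read; the function names are made up for illustration and are not part of any Anthropic SDK.

```python
def cache_break_even(write_premium: float = 0.25, read_discount: float = 0.90) -> float:
    """Probability of one re-use at which caching costs the same as not caching.

    Without cache: 1 + p                      (first call, plus a full-price repeat)
    With cache:    (1 + write_premium) + p * (1 - read_discount)
    Setting both equal and solving for p gives write_premium / read_discount.
    """
    return write_premium / read_discount


def expected_costs(p: float, write_premium: float = 0.25, read_discount: float = 0.90):
    """Expected (no_cache, with_cache) cost for a re-use probability p."""
    no_cache = 1 + p
    with_cache = (1 + write_premium) + p * (1 - read_discount)
    return no_cache, with_cache


if __name__ == "__main__":
    print(f"Break-even re-use probability: {cache_break_even():.1%}")
    for p in (0.1, 0.5, 0.9):
        no_cache, with_cache = expected_costs(p)
        print(f"p={p:.1f}  no cache={no_cache:.2f}  with cache={with_cache:.2f}")
```

With more than one expected re-use, the threshold drops further, since the one-time write premium is spread over several cheap reads.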

Weird emergent properties on Llama 3 405B

In this episode of ThursdAI, Alex Volkov (of Weights & Biases) speaks with Jeffrey Quesnelle (of Nous Research) on what they found fine-tuning Llama 3 405B. This segment is fascinating. Llama 3 405B thought it was an amnesiac because there was no system prompt! In trying to make models align with the system prompt strongly, these are the kinds of unexpected behaviors we encounter. It’s also an indication of how strongly we can have current LLMs adopt a personality simply by beginning the system prompt with “You are …” ...

Things I Learned - 18 Aug 2024

This week, I learned: Code agent frameworks to explore: Cognition, Factory, Codegen. Some interesting multi-modal generation models / tools to explore: Flux for open-weights image generation, Runway Gen 3 for video generation, Suno for music generation. DocxTemplater is SlideSense but open-core and handles DOCX as well! handle = await window.showDirectoryPicker() lets you access the browser File System API.

The LLM Psychologist

Andrej Karpathy mentioned the term LLM psychologist first in Feb 2023. I’ve been thinking about this for a while, now. I’ve always been fascinated by psychologists in fiction. I grew up with Hari Seldon in Foundation, wanting to be a psycho-historian. (I spent several teenage years building my mind-reading abilities.) I wanted to be Susan Calvin, the only robopsychologist. ...

Visiting client offices is usually a painful exercise, given travel and security. But there are some small things that make your day. Like the Mentos at the reception. Or the unsecured WiFi. Or the delightful view of the city from a skyscraper. Today, it was the noble admin person who placed the power sockets ON TOP OF the desks, so I don’t have to bend below the desk or dig into a hole to get connected. ...

Things I Learned - 11 Aug 2024

This week, I learned: Embedding models can be fine-tuned. Example: #TODO. Agentic RAG (Ravi Theja, LlamaIndex): RAG via top-k retrieval fails with: summarization => need to read all chunks; comparison, e.g. compare product X vs Y => need to split and re-combine; structured analytics, e.g. most expensive employees => Text2SQL first; multi-part questions, e.g. Tell me about speed of model X AND cost of model Y and recommend => need to split and re-combine. RAG failures: It’s single shot. No query planning. No tools. No correction. No memory. Agents that help in RAG: Route to the right tool, e.g. retrieve via vector top-k search or vector summary search or keyword search or combination? One-shot query planning, e.g. break query into multiple specific queries. RAG those. Then combine. #TRY - maybe in DocSearch. Tool use, e.g. Schema retrieval, Text2SQL, Calendar, Chat, APIs, Search, etc. Agent orchestration: ReAct: An agent reasoning loop. Reason + Act. {Thought, Action, Action Input, Observation}*. Orchestrate tools with a prompt. Multi-agent task solver: Llama agents. Instead of a single agent loop, use different agents. Also allows parallelization. Allow services to register. (MS TaskWeaver stores tool descriptions in YAML.) LlamaHub Tools has ideas for agents. Notes on LLM Fine-Tuning: ROUGE-2, BLEU and such metrics are NOT good. Create your own benchmarks. Non-PEFT fine tuning needs 6X GPU RAM. Optimizer states, Gradient, Activations are the overhead. PEFT is about tuning a subset of parameters. LORA adds additional weights without updating the model. It’s a low rank matrix multiplication. You can change these adapters in runtime. Saves space. Fast to train. Quantization: Stick to bitsandbytes or AWQ (may be a bit better). QLORA = Quantization + LORA. Predibase has open-sourced Lora Adapters in “Lora Land”. Existing adapters are pretty good. ghcr.io/predibase/lorax:main Docker image works on Docker compose to run locally.
devices: on Docker Compose lets you specify NVIDIA GPU devices. Locust is an HTTP load testing lib in Python. Techniques for inference optimization: Dynamic adapters: Loads right LORAX adapters WHEN a request comes in. Multi-adapter batching: Process all inputs in parallel on the same GPU, but different users are post-processed using different adapters. Notes from a 4-hour flight: What We’ve Learned From A Year of Building with LLMs. Strategy: IS IT TOO HARD/EXPENSIVE? Log it. LLMs are getting cheaper and better. WILL OPENAI BUILD IT? If so, wait for it instead of building. HAS A STARTUP BUILT IT? If so, use it instead. It’s a generic use case there’s no point re-inventing. FOCUSED USE CASES over generic. Build trust by starting small. Tools for LLM Ops (feedback): LangSmith, Log10, LangFuse, W&B Weave, HoneyHive. #TRY. Human in the Loop is about humans evaluating model outputs. That’s different from AI in the loop, human in the center, where AI accelerates human output (like Github Copilot). Operations: CHECK EMBEDDINGS DRIFT over time. Users might be inputting different things than before. LOG AND REVIEW everything. Instructor coaxes structured output from LLM APIs. #TRY. IMPLICIT FEEDBACK collection is easy. Just let users edit stuff. #TRY. Tactical: Try n-shot prompting (n=5-12) before bigger models. #TRY. Always structure for output: Markdown, XML/HTML tags. Combine RAG with Keyword search. It reduces user frustration in edge cases. Prefer multiple small prompts to one big prompt. Do X. Then Y. Then Z. Jitter prompts for diversity beyond temperature. LLM-as-judge works better when comparing outputs (not rating 1 output). Keep length similar (LLMs prefer wordiness). Swap order and compare. Allow for ties. Ask for reason FIRST.
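The LLM-as-judge advice above (compare two outputs, swap the order, allow ties) can be sketched as follows. This is an illustration only: `Judge` is a hypothetical stand-in for whatever function calls your LLM, and the two toy judges exist just to show the effect of position bias.

```python
from typing import Callable

# A judge sees two answers in order and returns "first", "second", or "tie".
# In practice this would wrap an LLM call; here it is a hypothetical interface.
Judge = Callable[[str, str], str]


def compare(judge: Judge, answer_a: str, answer_b: str) -> str:
    """Return "A", "B", or "tie" after judging both orderings."""
    forward = judge(answer_a, answer_b)   # A is shown first
    backward = judge(answer_b, answer_a)  # B is shown first
    if forward == "first" and backward == "second":
        return "A"  # A wins regardless of position
    if forward == "second" and backward == "first":
        return "B"  # B wins regardless of position
    return "tie"    # verdict depends on position, or the judge said tie


def position_biased_judge(first: str, second: str) -> str:
    """Toy judge that always prefers whatever it sees first."""
    return "first"


def keyword_judge(first: str, second: str) -> str:
    """Toy judge that prefers the answer containing '42'."""
    if ("42" in first) == ("42" in second):
        return "tie"
    return "first" if "42" in first else "second"


if __name__ == "__main__":
    print(compare(position_biased_judge, "answer one", "answer two"))
    print(compare(keyword_judge, "It is 42.", "It is 41."))
```

A purely position-biased judge collapses to a tie under this scheme, which is the point of swapping the order.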
Hermes: A Text-to-SQL solution at Swiggy. “Hermes performed significantly better for charters with well-defined metadata and a relatively smaller number of tables.” “We collect feedback on the accuracy of the returned query from stakeholders directly within the Slack bot.” How I use AI and “Replacing my right hand with AI”: EMBED in every app/workflow. E.g. Auto-fix spellings. Auto-review code. Auto-ask LLM on errors and apply patch! Auto-search for answer, assess, continue. PERSIST. Stick with the LLM to the end. Don’t fix it yourself. It’s faster. #TRY. INTERVENE FAST. If an LLM can’t solve it by itself in 2 tries, it needs in-depth help. APP-IFY one-off tasks. Disposable tools. “Write web-app to convert JSON to tab-delimited.” “Extract fields as a table.” “Diff JSON.” #TRY. BEST language/frameworks preferred. CUDA in Python. Rust. C. Raspberry Pi. Arduino. Bluetooth. Modern ESM/JS. #TRY. TEACH examples. “Here’s the LLM Foundry API.” “Here’s how to use gramex.data.” DUMP entire code. Models can handle it. Refactoring to SQLAlchemy 2, Pandas 2. API Documentation. Test case generation. #TRY. ASK for features & packages. Docker without root access. GPU access inside docker. Windows CLI-only C++ compiler. TEST CASE writing. #TRY. SPEC IN DETAIL. Use these libraries. Write like this: code example. SPEC USAGE in detail. “I will just pipe it into sqlite”, or “I will just run ffmpeg -i filename [YOUR OPTIONS]”. Describe the UI, API input/output, data structure, and internal data structure. HELP on usage. “ffmpeg to get audio.mp3”. My benchmark for large language models: LLM(text) is a useful function to have in JS and Python too.
Useful as a simple pip install llmfoundry. Allow images, files in LLM(). Current list of #IMPOSSIBLE (or hard) things for LLMs: Translate technical documents to Dutch – because they don’t understand the technical terms well. Translate large documents (JSON to XML, English to Chinese, Python to Rust, Wrong to right spelling) – because the output tokens are limited. micro-agent generates test cases first when asked to build an app. Then it iterates until the test cases pass. Alternative interfaces to YouTube: Piped.video, CloudTube, Invidious, NewPipe, FreeTube. Deepseek Context Caching reduces price to 1.4 cents/MTok for portions of chat messages that are repeated. That’s a 10X reduction for long conversations!

Fascinating to see how the LLM cost-quality frontier moves. Recent fights were mostly on cost. Yesterday, #OpenAI halved the GPT-4o cost. At $2.5/MTok (and with GPT-4o-mini at 15 cents/MTok), the best and cheapest models are back with OpenAI, IMHO. Sigh, time to move all our stuff back from #Anthropic. For now… https://gramener.com/llmpricing/

Things I Learned - 04 Aug 2024

This week, I learned: Assisted generation uses a faster LLM to generate text and a better (tokenizer-compatible) LLM to validate it. This makes it faster. E.g. Gemma 2 2b with Gemma 2 27b. Power Toys has an Advanced Paste that uses OpenAI to paste as Markdown or JSON! Interesting Turing-complete languages: find + mkdir, maybe sed and awk; Minecraft’s Redstone Circuits; Conway’s Game of Life; Cellular Automata Rule 110; Magic: The Gathering; SQL; Excel. Rev.ai does a good job of diarization. Cost: 2 cents per minute. Update: 6 Jun 2025. Cost: 0.33c/min Ref
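To make the assisted-generation idea concrete, here is a toy sketch. It assumes greedy decoding and uses two made-up "models" (plain Python functions over word tokens), not real LLMs or any library API. The draft proposes k tokens, the target checks them, and the output is always exactly what the target alone would have produced.

```python
from typing import Callable, List, Tuple

EOS = "<eos>"
Model = Callable[[List[str]], str]  # maps a token prefix to the next token

SENTENCE = "the quick brown fox jumps over the lazy dog".split()


def target_model(prefix: List[str]) -> str:
    """Toy 'better' model: recites SENTENCE, then ends."""
    return SENTENCE[len(prefix)] if len(prefix) < len(SENTENCE) else EOS


def draft_model(prefix: List[str]) -> str:
    """Toy 'faster' model: agrees with the target except on one word."""
    guess = target_model(prefix)
    return "cat" if guess == "fox" else guess


def greedy(model: Model, max_tokens: int = 20) -> List[str]:
    """Plain token-by-token decoding, for comparison."""
    tokens: List[str] = []
    while len(tokens) < max_tokens:
        tokens.append(model(tokens))
        if tokens[-1] == EOS:
            break
    return tokens


def assisted_generate(draft: Model, target: Model, k: int = 4,
                      max_tokens: int = 20) -> Tuple[List[str], int]:
    """Return (tokens, number of verification rounds)."""
    tokens: List[str] = []
    rounds = 0
    while len(tokens) < max_tokens and (not tokens or tokens[-1] != EOS):
        # 1. The draft model cheaply proposes k tokens.
        proposal: List[str] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))
        # 2. The target model verifies them. A real system scores all k
        #    positions in one batched forward pass; this toy calls it per token.
        rounds += 1
        for guess in proposal:
            correct = target(tokens)
            tokens.append(correct)  # always keep the target's token
            if correct != guess or correct == EOS:
                break               # first mismatch: discard the rest
    return tokens[:max_tokens], rounds


if __name__ == "__main__":
    tokens, rounds = assisted_generate(draft_model, target_model)
    print(" ".join(tokens), f"({rounds} verification rounds)")
```

The speed-up comes from the number of target passes: the toy run needs a few verification rounds instead of one target call per token, while still matching the target's own greedy output.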

Things I Learned - 28 Jul 2024

This week, I learned: Speech editing in audio files is a thing. Speech Editing Toolkit and Descript. GPT 4o Mini is almost as good as GPT 4o in the LMSYS leaderboard. Llama 3.1 405B model and Mistral Large 2 are yet to be evaluated. If LLMs can generate any text, and text can describe the real world, we can rapidly generate “artifacts” that generate: 3D Printable Models: STL (Stereolithography): Defines the surface geometry of 3D objects using triangular facets. OBJ (Wavefront OBJ): Describes 3D geometry including vertices, textures, and normals. X3D: An XML-based file format for representing 3D computer graphics. Vector Graphics: SVG (Scalable Vector Graphics): Defines vector-based graphics in XML format, useful for illustrations, diagrams, and user interface elements. CAD Drawings: DXF (Drawing Exchange Format): Represents CAD data, including shapes, lines, and curves, used in engineering and architecture. Circuit Designs: KiCAD: An open-source software suite for Electronic Design Automation (EDA), which uses various file formats like PCBNew and EESchema to represent circuit designs. Blueprints and Architectural Designs: GML (Geography Markup Language): Encodes geographical features and spatial information. CityGML: A specific GML application schema for modeling and exchanging 3D city models. Molecular Structures: PDB (Protein Data Bank): Describes the three-dimensional structures of molecules. CML (Chemical Markup Language): An XML-based standard for representing molecular data. Robotics and Automation: URDF (Unified Robot Description Format): Defines the physical configuration of a robot, including joints, links, and sensors. COLLADA (Collaborative Design Activity): An XML-based schema to describe digital assets for 3D applications, often used in robotics. Geospatial Data: KML (Keyhole Markup Language): Used for geographic data visualization, primarily in Google Earth. GeoJSON: A format for encoding a variety of geographic data structures using JSON.
Mathematical Markup: MathML (Mathematical Markup Language): Describes mathematical notation and captures both its structure and content. Music and Sound: MusicXML: Encodes sheet music in a structured format that can be easily shared between different music notation software. Documents and Text: DocBook: A semantic markup language for technical documentation. Markdown: A lightweight markup language with plain text formatting syntax. Biological Data: SBML (Systems Biology Markup Language): Represents computational models of biological processes. PhyloXML: An XML format for representing phylogenetic trees. Game Development: FBX (Filmbox): A file format for 3D animation that can hold information about the geometry, textures, and animations. VRML (Virtual Reality Modeling Language): Describes interactive 3D objects and worlds. Data Visualization: ChartML: Encodes charts and graphs in a structured format. D3.js (Data-Driven Documents): Uses HTML, SVG, and CSS to bring data to life with interactive visualizations. Building Information Modeling (BIM): IFC (Industry Foundation Classes): Describes building and construction data. Textiles and Fabrics: LoomML: Represents the design and structure of woven fabrics. Augmented Reality and Virtual Reality: ARML (Augmented Reality Markup Language): Defines how augmented reality applications should behave and what content they should display. VRML (Virtual Reality Modeling Language): For describing interactive 3D objects and worlds. Medical Imaging and Health Data: DICOM (Digital Imaging and Communications in Medicine): Encodes medical imaging data. HL7 (Health Level 7): A set of standards for the exchange of information between medical applications. Simulation Data: FMI (Functional Mock-up Interface): Represents and exchanges dynamic simulation models. SBML (Systems Biology Markup Language): For computational models of biological processes. 
Sound and Audio: MML (Music Markup Language): For encoding music notation and performance information. SoundFont: A file format for defining musical instrument sounds. Animation and Visual Effects: BVH (Biovision Hierarchy): Encodes motion capture data. Alembic: A computer graphics interchange framework primarily for exchanging animation and visual effects data. Textile Patterns: WIF (Weaving Information File): Describes weaving patterns and structures. Knitting Markup Language: Encodes knitting patterns in a structured format. Scientific Data: CDF (Common Data Format): Used for storing scientific data. NetCDF (Network Common Data Form): Supports the creation, access, and sharing of array-oriented scientific data. Photography and Imaging: XMP (Extensible Metadata Platform): Used for embedding metadata in digital images and other media files. Construction and Engineering: LandXML: For civil engineering and land surveying data. gbXML (Green Building XML): Facilitates the transfer of building data for analysis of energy and environmental performance. Packaging and Retail: BPL (Barcode Product Labeling): Encodes information for product packaging and labeling. GS1 XML: Used for electronic business messaging, including product identification and tracking. Typography and Font Design: UFO (Unified Font Object): A format for storing font data. SFNT (Spline Font): Encodes scalable font information. Product Data Management: PLMXML (Product Lifecycle Management XML): Used for sharing product data across PLM systems. GPT 4o Mini can be fine-tuned! Awesome PaaS lists self-hosted deployment platforms. Piku - similar to Dokku – is promising.

Loved this Rocky Aur Rani Kii Prem Kahaani scene where Ranveer asks, “Can we call the Chinese Chinese?” We can’t even say “behendi”? Aunty, I’m from Delhi. How can I not say behendi, behendi!? What times have we come to? You can’t call fat people fat, you can’t call black people black, you can’t call old people old. I’m scared to even open my mouth! You tell me, can we call the Chinese Chinese? ...

Things I Learned - 21 Jul 2024

This week, I learned: GPT For Work has a set of useful spreadsheet LLM functions Xata offers a free PostgreSQL tier with REST API Mamba now uses mambaforge as the default installation, i.e. conda-forge is the default and only channel! Update: 6 Jun 2025. Mambaforge is sunset as of 29 Jul 2024. Conda-forge now uses Miniforge as the standard installer Ref conda-forge.org. Users should switch to Miniforge instead. nginx supports a load-balancing method least_conn which is far better than the default round-robin. #IMPOSSIBLE LLMs cannot provide a bounding box of objects in images. (Maybe Florence 2 can). Update: Mar 2025. Gemini has good timestamps and bounding boxes Models gently grow in capability. It helps to maintain an impossibility list that steadily gets invalidated. Ref Github Copilot internals walks through how Copilot constructs its prompts

Things I Learned - 14 Jul 2024

This week, I learned: Carlton’s TDS session: Always create a new venv via VS Code when starting a training session. Helps reproduce issues (though I could use Colab instead). Create an empty .ipynb notebook and double-click it. That’s another way (though slower) to open a Jupyter notebook. Shane Parrish Knowledge Project podcast: Three generations of wealth. There is a big difference between liking animals and being a vet. Between liking education and being a teacher. Even if no one reads your writing, you benefit from the writing. Emotional crises like 9/11 or Covid are far easier for markets to recover from. Hidden Brain podcast: Why trying too hard can backfire on you. Sometimes conscious thinking gets in the way of our automated responses. Sports, music, and dance are great examples. Instead, SURRENDER to something outside of you. Like playing with kids. Exercise also sends blood away from brain. Drugs. ChatGPT. It’s called wu wei in Chinese philosophy. A quick check on the pricing of text to speech models: OpenAI TTS: $15/1M chars (Ref); Deepgram Aura: $15/1M chars (Ref); Elevenlabs Scale: $165/1M chars (Ref); Google TTS Neural2: $16/1M chars (Ref); Azure AI Speech: $15/1M chars (Ref); AWS Polly Neural TTS: $16/1M chars (Ref).

I'll leave tomorrow's problems to tomorrow's me

What a delightful idea. I’ll leave tomorrow’s problems to tomorrow’s me. – Saitama, One Punch Man Saitama is now one of my favorite heroes. Right up there with Atticus Finch and Juror #8. Very few people can articulate such a wonderful philosophy as effectively. The closest was Calvin. Of course, it’s not a perfect system. But they do say, “Sometimes, the best way to get something is to stop trying to get it.”

Things I Learned - 07 Jul 2024

This week, I learned: Predibase uses LORAX to run multiple fine-tunings of a base model in a single GPU via adapters. Ref
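The idea of serving several fine-tunes from one frozen base model via adapters can be illustrated with a tiny low-rank example. This is a pure-Python sketch with made-up weights and adapter names, not how LoRAX is implemented: each adapter is a pair of small matrices (B, A), and the output is the base output plus B(Ax).

```python
from typing import Dict, List, Optional, Tuple

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


# Frozen base weights shared by every request (2 outputs x 3 inputs).
BASE_W: Matrix = [[1.0, 0.0, 2.0],
                  [0.0, 1.0, 0.0]]

# Hypothetical rank-1 adapters: B is 2x1 and A is 1x3, so B @ A is a 2x3
# update stored with 5 numbers instead of 6. The saving grows with size.
ADAPTERS: Dict[str, Tuple[Matrix, Matrix]] = {
    "support-bot": ([[1.0], [0.0]], [[0.0, 1.0, 0.0]]),
    "sql-writer": ([[0.0], [2.0]], [[1.0, 0.0, 1.0]]),
}


def forward(x: Vector, adapter: Optional[str] = None) -> Vector:
    """Base output, plus the low-rank correction of the chosen adapter."""
    y = matvec(BASE_W, x)              # the base model is never modified
    if adapter is not None:
        b, a = ADAPTERS[adapter]       # picked per request, at runtime
        delta = matvec(b, matvec(a, x))
        y = [yi + di for yi, di in zip(y, delta)]
    return y


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0]
    print(forward(x))                  # base model only
    print(forward(x, "support-bot"))   # same base, different adapter
    print(forward(x, "sql-writer"))
```

Because the base weights stay untouched, switching adapters is just a dictionary lookup per request, which is what makes packing many fine-tunes onto one GPU plausible.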

Things I Learned - 30 Jun 2024

This week, I learned: Amara’s law: “We tend to overestimate the effect of a technology in the short run and underestimate the effect in the long run.” LLM Patterns include Evals, RAG, Fine-tuning, Caching, Guardrails, Defensive UX, Collect feedback. Notably: Defensive UX: Microsoft, Google, and Apple have guidelines for Human-AI interactions. Collect feedback: Explicit and implicit. Rouge and Context Precision are metrics to evaluate LLM responses that serve as a starting point – but not sufficient, usually. Any word with the letters izehsglbo can be spelt on a calculator. That includes Hobbes (538804)! Via Calculator spelling. Tor Browser + DuckDuckGo is good for torrent searches. Maybe the Dark Web IS the original Internet. The ad-free hacker web.

Hobbes on a calculator

I just learned that any word made of just these letters beighlosz can be spelt on a calculator. That includes Hobbes! 538804 upside-down looks like this: I’m surprised I never knew that. The longest, by far, appears to be hillbillies – 53177187714
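The two numbers above can be reproduced with a few lines of Python. This is a small sketch: the letter-to-digit table is the usual upside-down reading (with g mapped to 6, which is an assumption since some people use 9), and the function name is made up.

```python
from typing import Optional

# Digit that looks like each letter when the calculator is turned upside-down.
UPSIDE_DOWN = {
    "o": "0", "i": "1", "z": "2", "e": "3", "h": "4",
    "s": "5", "g": "6", "l": "7", "b": "8",
}


def calculator_number(word: str) -> Optional[str]:
    """Digits to type so the word appears upside-down, or None if impossible."""
    letters = word.lower()
    if any(letter not in UPSIDE_DOWN for letter in letters):
        return None
    # Flipping the display reverses the reading order, so type the word backwards.
    return "".join(UPSIDE_DOWN[letter] for letter in reversed(letters))


if __name__ == "__main__":
    for word in ("Hobbes", "hillbillies", "Calvin"):
        print(word, "->", calculator_number(word))
```

Running a word list through `calculator_number` and sorting by length is one way to hunt for the longest spellable word.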

Things I Learned - 23 Jun 2024

This week, I learned: Luma Labs Dream Machine generates videos. It’s free and is of reasonable quality. Update: 6 Jun 2025. Costs $10/month. LLM DataHub has LLM training datasets, regularly updated. From Dan Becker on running a workshop: Answer questions at the end, not in parallel in a chat, to avoid distraction. Have fewer words in slides when presenting. It’s less distracting. Morgan Housel on the Shane Parrish podcast: Risk is what stops you from achieving YOUR goals. What’s risky for me may not be risky for you. The lesson from compounding is that you want to optimize for duration, not return. That’s what does the heavy lifting. Survival, consistency, long term - these matter. The performance does NOT matter.

The psychology of peer reviews

We asked the ~500 students in my Tools in Data Science course in Jan 2024 to create data visualizations. They then evaluated each other’s work. Each person’s work was evaluated by 3 peers. The evaluation was on 3 criteria: Insight, Visual Clarity, and Accuracy (with clear details on how to evaluate). I was curious to see what we can learn about student personas from their evaluations. ...

Embeddings in DuckDB

This article on Using DuckDB for Embeddings and Vector Search by Sören Brunk shows a number of DuckDB features I wasn’t aware of. DuckDB can read directly from Huggingface datasets. DuckDB can read just the parts of a .parquet file it needs, even over HTTP. DuckDB lets you write custom functions in Python. DuckDB now has a vector similarity search extension. I’ve recently become a DuckDB fan and continue to be impressed.