2024 | S Anand

Books in 2024

I read 51 new books in 2024 (about the same as in 2023, 2022, 2021, and 2020.) But slightly differently. I only read Manga this year. Fullmetal Alchemist (Vol 12 - 27). What started off as a childishly illustrated children’s book evolved into a complex, gripping plot. Attack on Titan (Vol 1 - 34). I read it while I watched the TV Series (reading first, then watching). It started explosively and the pace never let up. I had to take breaks just to breathe and calm my nerves. The sheer imagination and subtlety is brilliant. It’s hard to decide which is better—the manga (book) or the anime (TV). The TV series translates the book faithfully in plot and in spirit. It helped that I read each chapter first, allowing me to imagine it, and then watch it, which told me what all I missed in the book. I absolutely would not have understood the manga without watching the anime. ...

My Year in 2024

Here’s the report card for my 2024 resolutions: Compound long-term goals, daily. PASS. I managed to work continuously build on 6 areas in 2024: Blogging about 50 posts on my blog and on LinkedIn Weekly notes of things I learned Teaching Tools in Data Science (repo) Reading only Manga Experimenting with LLM applications LLM Evangelization through LLM Foundry, Straive’s LLM portal. Hit 80 heart points, daily. FAIL. I stopped exercise in the second half and gained 7 kgs. Be a better husband. PASS. My wife confirmed that I was “definitely worse in 2023 than 2024.” My most memorable events in 2024 were: ...

Things I Learned - 29 Dec 2024

This week, I learned: A clever idea. Give an LLM a chapter from a textbook. Ask it to generate a unique, playable game to help me learn theconcepts for an exam. Page Bailey What would be the cost of storing about 500GB of LLM cache logs and 5 million write requests per month? CloudFlare KV: $250 + $25 / month Ref MongoDB: $125 + $5 / month Ref S3: $0.0115 + $25 / month Ref + ? CloudFlare R2: $0.0075 + $22.5 / month Ref Satya Nadella prepares for meetings by asking Copilot to tell him everything he needs to know about the client from the CRM, emails, meeting transcripts etc. He shares that colleagues who annotate it further for him. That’s using AI for reasoning and collaborating with colleagues. Satya Nadella | BG2 w/ Bill Gurley & Brad Gerstner WOW. This is how a software agent will work alongside humans: Fix issue #5478: Add color to the line next to “Ran a XXX Command” based on return value - using @openhands-agent. aisuite by Andrew Ng is a unified interface to LLMs. Sort of like an openai library across multiple providers. Learnings from Best of 2024 in Agents (from #1 on SWE-Bench Full, Prof. Graham Neubig of OpenHands/AllHands) Passing code execution as a tool is more powerful than granular tools. You combine multiple tools and tool calls into one. You move code to the data rather than the other way around. Mostly, you need bash, Python (or Jupyter), file manager, web browser. UI: Go where the user is, instead of bringing them to you. A remote runtime is a critical component. Claude 3.5 Sonnet (20241022) and Claude 3.5 Haiku (20241022) perform best on SWE Bench, followed by Deepseek V3, then O1 2024-12-17. X Browsers support SVG favicons as data URLs. So I used this SVG (generated by Claude via Generate a simple, interesting SVG favicon. Keep the SVG size VERY small but it should be inspiring.) Since HNSW indexing is an overhead, just use NumPy matrix multiplication to calculate cosine similarity. For 1M vectors, it takes ~0.05 seconds. A 1M vector dataset handles ~2GB of text at a chunk size of 2K chars. In short, if you’re embedding <2GB of text, just use NumPy. DuckDB’s VSS extension HNSW index + Embeddings (2K chunks of 512 dimensions) takes up roughly 2.5X the size of the original data. Embedding 554 files of ~4,456 KB took 710 seconds. Creating the index took 660 seconds. The resulting DB was 18.1 MB. How to use LLMs in market research. Use LLMs with search for secondary research. Create different personas and run user surveys on them. This paper used 1,052 real-life interview audio transcripts as agent memory to simulate people Generate your market research report using LLMs. Given about 30 generations, Llama 1b outperforms Llama 8b. Ref OpenAI introduced a developer role in addition to the system role. This is mainly for o1. The API is backward compatible - and also forward compatible. OpenAI Em dashes are a strong sign of ChatGPT use. Curly quotes too. Reddit CloudFlare has multiple SSL modes when proxying requests. Off (no encryption): No encryption between browsers and Cloudflare or between Cloudflare and origins. Everything is cleartext HTTP. Flexible: Browsers to Cloudflare is HTTPS, Cloudflare to origin is HTTP. Useful to set up CloudFlare as a HTTP Proxy. Full: Browser to Cloudflare matches browser request. Same protocol is used for Cloudflare to origin, without validating the origin’s certificate. Use for self-signed or otherwise invalid certificates. Full (strict): Similar to Full Mode, but with validation. Strict (SSL-Only Origin Pull): Cloudflare always connects to the origin over HTTPS with certificate validation. Getting this wrong can lead to a HTTP 526: invalid SSL certificate Medical coding is an area ripe for LLMs. Ojasvi Yadav created a repo that uses hierarchical classification (rather than embeddings) to find the right coding. Gemini models seem to understand medical terms better than others. RapidClaims, funded by TogetherAI, is apparently working on this problem. Document to Markdown Converters: PyMuPDF4LLM uses MuPDF. Requires PyTorch. PYTHONUTF8=1 uv run --with pymupdf4llm python -c 'import pymupdf4llm; h = open("pymupdf4llm.md", "w"); h.write(pymupdf4llm.to_markdown("$FILE.pdf"))' markitdown from Microsoft. PDF via PDFMiner, DOCX via Mammoth, XLSX via Pandas, PPTX via Python-PPTD, ZIP, etc. PYTHONUTF8=1 uvx markitdown $FILE.pdf > markitdown.md Docling by IBM. Unable to install via pip on Windows AND on Linux. MegaParse uses libreoffice, pandoc, tesseract-ocr, etc. Requires OpenAI API key. Awesome Tabular LLMs compiles encodings of tables for LLMs. What’s the best way of encoding tabular data for LLMs? Looks like including the cell address helps. Here is an explanation from ChatGPT aspose-words is a Python library that converts documents with many formats (Word, RTF, PDF, HTML, Markdown, EPUB, etc.) Discourse does not support searching across multiple forums. Instead, search for the term in all forums. Example. Then scroll through the results. Then, in the console, hide the ones you don’t want. Example: Hide posts that are not in the “Tools in Data Science” category: $(".badge-category__name").filter(d => d.textContent == "Tools in Data Science").map(d => d.closest(".fps-result")).filter(d => d).forEach(d => d.style.display = "none") How are software engineers are future-proofing their careers in the face of LLMs? Leveraging LLMs as Force Multipliers Use LLMs for repetitive tasks, rapid prototyping, exploring multiple approaches, data extraction and brainstorming, providing feedback. Explore prompting techniques, integrate LLMs into their workflows, and develop strategies for validating and refining LLM-generated code Focusing on higher-level skills that llms struggle with Systems Thinking and Architecture: code readability, extensibility, testability, and maintainability Problem Solving and Critical Thinking: define problems clearly, break them down into manageable parts, and reason through complex scenarios. LLMs produce plausibly incorrect code. Communication and Collaboration Domain Expertise Exploring Adjacent Roles: product management, technical leadership, or consulting. Involve more interaction with clients and stakeholders. Developing “Evergreen” Skills: debugging, system administration, and security. Or outside of software engineering, such as trades or other hands-on vocations. Scepticism: LLMs may not reach a level of sophistication that would render their expertise obsolete. Complex problems, understanding context, and producing high-quality, maintainable code. Examples of agentic AI Text-to-SQL automated business analyst: A system that generates SQL queries from natural language, handles errors, creates visualizations, and includes a FAQ component. The author calls it “constrained agentic AI.” Data source querying system: A bot that queries multiple SQL and API data sources, selecting tools and reformulating tasks as needed. Cursor (agentic mode): An LLM-powered VS Code fork that chains together various LLM capabilities (code generation, applying changes, linting suggestions, terminal commands, codebase RAG) to reduce user prompts. Vulnerability finding system: A system that uses LLM agents to discover novel vulnerabilities in open-source web applications. The agents leave traces of their actions. Marketing strategy generation system: A system using approximately 60 agents to generate marketing strategies. Restaurant finder: A system that searches for restaurants based on dietary preferences and group size, and downloads social media information. Proofreading and editing of transcripts: LLM agents apply specific customer requirements to transcripts after human editing. Meeting notes and action items generator: A system that generates meeting notes and action items. O’Reilly auto parts customer service agent: An agent demonstrated using RAG. UI enhancement agent: An agent that added features like language locales and dark mode to a UI.

When and how to copy assignments

The second project in course asked students to submit code. Copying and collaborating were allowed, but originality gets bonus marks. Bonus Marks 8 marks: Code diversity. You're welcome to copy code and learn from each other. But we encourage diversity too. We will use code embedding similarity (via text-embedding-3-small, dropping comments and docstrings) and give bonus marks for most unique responses. (That is, if your response is similar to a lot of others, you lose these marks.) In setting this rule, I applied two principles. ...

My learnings as week notes

One of my goals for 2024 is to “Compound long-term goals, daily.” Learning is one of those. Some people publish their learnings as weekly notes, like Simon Willison, Thejesh GN, Anil Radhakrishna, and Julia Evans. I follow their notes. I started doing the same, quietly, to see if I could sustain it. It’s been a year and it has sustained. I’m finally publishing them. My week notes are at til.s-anand.net. Here’s the source code. ...

Windows PowerToys is my new favorite tool

Windows PowerToys is one of the first tools I install on a new machine. I use it so much every day that I need to share how I use it. I’ve been using it for a long time now, but the pace at which good features have been added, it’s edged out most other tools and is #4 in terms of most used tools on my machine, with only the browser (Brave, currently), the editor (Cursor, currently), and Everything are ahead.) ...

Things I Learned - 22 Dec 2024

This week, I learned: What to use for hosting: ChatGPT GitHub Pages: Static websites, medium files Cloudflare Pages: Static websites, global delivery Vercel: Frontend frameworks (e.g. Next.js) with high DX and ISR, small files Netlify: JAMstack projects, minimal back-end, moderate files Glitch: Small static projects Render: Full-stack apps requiring databases and server-side compute Firebase Hosting: Small sites, limited large files Archive.org: Public archival, large files Google Drive: File sharing, large files Dropbox: File sharing, moderate files Cloudflare R2: Static assets, large file delivery Anthropic defines agents. Building effective agents + Cookbook Augmented LLMs are LLMs enhanced with augmentations such as retrieval, tools, and memory. Workflows are systems where LLMs and tools are orchestrated through predefined code paths. Prompt chaining: Pipe each LLM output to the next LLM. A->B->C->Z. E.g. Write report, then translate. Extract results, then verify them. Successively ask follow-up questions. Routing: One LLMs decides which other LLM to call next. A->B|C|D->Z. E.g. Evaluate complexity, then pick the right model. Classify request time, then pick the right prompt. Parallelize: Sectioning (and Orchestrator-workers): Break tasks into independent subtasks, then aggregate. A->B+C+D->Z. E.g. Evaluate contracts against different clauses in parallel. Parallelize: Voting: Run same task multiple times, then vote. A->B+B+B->Z. E.g. Review code for prompt injection using different prompts. Evaluate content safety with different thresholds. Evaluator-optimizer: One model checks another in a loop. A->B->A->B->…->Z. E.g. Literary translation. Self-healing code. Policy violation checks. Human-in-the-loop Checkpoints: The workflow explicitly requests human review at certain stages. A->B->(Human)->C->Z. E.g. Sensitive content review. High-stakes decision making. Ambiguous tasks. Agents are LLMs that dynamically direct their own processes and tool usage, consulting tools or the user as needed. To download YouTube subtitles, use: yt-dlp -q --skip-download --convert-subs srt --write-sub --sub-langs "en" --write-auto-sub --print "requested_subtitles.en.url" "$url" Simon Willison o1-preview diagnoses better than doctors. Harvard OpenAI’s release of ephemeral tokens via sessions (valid for 1 minute) are a useful way of exposing apps for public demos. Currently it works only for the Realtime API, though. SpreadsheetLLM is a way of encoding spreadsheets in an LLM friendly format. It’s good for 1K+ rows. For lower, Markdown > XML > HTML. However, Table Meets LLM suggests that HTML > XML > Markdown, so this is unclear. #HARD prompt. Ask video generators like SORA to generate text in videos. It is of average quality. GPT 4o Mini Realtime was released. A realtime conversation will cost ~50c/hr. About 36c for input, 72c for output. (I extrapolated from the 6c/min audio input cost for GPT 4o Realtime when it was $100/MTok. GPT 4o Mini Realtime is $10/MTok input and $20/MTok output.) This is an interesting way to understand software. Generate a Mermaid sequence diagram showing interactions based on this code. Ref The King James Bible and all Harry Potters, each, are about $1M tokens (rounded off). markdown2 is the new de facto Markdown library for Python. Claude 3.5 Sonnet is way ahead of competition on the LMSYS Webdev Arena Raspberry Pi 5 has a faster CPU, more RAM and GPU, 4K support, multiple USB 3 ports Government websites like the official press releases cannot be crawled from outside India. Hence the need for server farms in India!

A Post-mortem Of Hacking Automated Project Evaluation

In my Tools in Data Science course, I launched a Project: Automated Analysis. This is automatically evaluated by a Python script and LLMs. I gently encouraged students to hack this - to teach how to persuade LLMs. I did not expect that they’d hack the evaluation system itself. One student exfiltrated the API Keys for evaluation by setting up a Firebase account and sending the API keys from anyone who runs the script. ...

Hacking LLMs: A Teacher's Guide to Evaluating with ChatGPT

If students can use ChatGPT for their work, why not teachers? For curriculum development, this is an easy choice. But for evaluation, it needs more thought. Gaining acceptance among students matters. Soon, LLM evaluation will be a norm. But until then, you need to spin this right. How to evaluate? That needs to be VERY clear. Humans can wing it, have implicit criteria, and change approach mid-way. LLMs can’t (quite). Hacking LLMs is a risk. Students will hack. In a few years, LLMs will be smarter. Until then, you need to safeguard them. This article is about my experience with the above, especially the last. ...

Exploring Creativity with SORA: My Animation Journey

I got access to SORA today. My first attempts was typical. An animated cartoon featuring Calvin, a young boy with spiky hair, standing in a playful boxing stance with oversized boxing gloves. He looks determined as he says ‘Bring it on!’ in a speech bubble. Facing him is Hobbes, a tall and slightly bemused tiger, also in a mock boxing pose with a gentle smile, as if humoring Calvin. The scene is set in Calvin’s backyard, typical of a Calvin and Hobbes comic, with a simple and uncluttered backdrop. ...

Things I Learned - 15 Dec 2024

This week, I learned: **/*.md can search for all Markdown files. Julia Evans Windows 11 2024 Update features: Ref Live captions (via the tray) can transcribe audio and microphone. Cocreator in Paint lets you draw crudely and enhances it with AI. The neat UI is a slider that lets you control how close it should be to your drawing. Voice Clarity automatically cancels echo, reduces background noise, and minimizes reverb. Studio Effects (via the tray) lets you apply camera effects on all apps. Eye contact feature is CLEVER! sudo lets you run commands with admin privileges from the command line. source Roaming RAG is an alternative to RAG without the vector database. Applicable to well structured documents, e.g. technical books, manuals, etc. Create a hierarchical outline of the document. Code Keep the top-level headings. Preserve the first ~100 characters of opening text from each section. Present the second-level headings, but without any subsidiary content. Provide each section a unique 8 digit hex identifier. Each section heading is followed by a guiding comment for the model: Section collapsed - expand with expand_section("{identifier}"). Then read the relevant sections as context to answer the question. Code Traffic to StackOverflow has fallen considerably. Especially from young and Indian developers. StackOverflow revenue is down. Via Prashanth. They’re exploring: Licensing their content. (Meta says high quality content improves LLM performance by 30% on HumanEval) Enterprise StackOverflow for system integration Fine-tuned versions of Enterprise Stackoverflow for enterprises Integrate StackOverflow within your IDE. Ask questions, post directly I surveyed the Gramener QA team on how they were using LLMs. 7 used it for code generation (e.g. date extraction, regex generation) 4 used it for learning (e.g. Robot Framework, how to define test cases, API usage) 3 used it for formula generation (e.g. Excel) 2 used it for test scenario identification 2 used it for test data generation 2 used it for comparing expected vs actual datasets 1 used it for data type identification (e.g. given sample values, identify the data type). 1 used it for evaluating resulting (LLM as a judge) I asked the Straive Digitalized Operations team what management techniques they would apply to manage LLMs. Here are the responses: Ask better questions. (Prompt engineering.) Create templates or step-by-step instructions. (Chain of Thought.) Ask for multiple options and pick from the best options. (Agentic approach?) Training. (Fine tuning.) Price weaker responses lower. (Stratified model pricing?) “LLM hallucinations are a good thing. They are a sign of diversity, allowing us to improve the answer by exploring multiple paths.” – A colleague from Straive. Hyperbrowser is a cloud based puppeteer service. Bedrock Llama models can’t be directly called with their model names. You need to use their inference profile names, e.g. us.meta.llama3-2-11b-instruct-v1:0 if the model is in a US region. Hacker News RSS is a good way to get RSS feeds from Hacker News. It’s also a good way to understand how to convert a news source into RSS feeds. BlueSky has RSS feeds too When embedding using a SentenceTransformer.encode(docs) it’s best if we embed with smaller docs and call it multiple times (rather than embedding more at once). On Colab T4, for gte-base-en-v1.5, when embedding 1,000 docs of up to 8K chars each, here is the TOTAL time it took, based on batch sizes (lower is better) 1 doc per call: 10s 2 docs per call: 13s 4 docs per call: 19s 8 docs per call: 23s 16 docs per call: 32s 32 docs per call: 40s Running embeddings without a GPU is extremely slow. It takes ~2.4 seconds per string.

Things I Learned - 08 Dec 2024

This week, I learned: ChatGPT uses several unusual unicode characters for citations. Ref NumLock can be dangerous. An IT support team member took control of Radheya’s screen while debugging and had turned on NumLock. Radheya’s login failed after that. After 5 tries, he was locked out. With LLMs, most architectural decisions are no longer one-way doors. Steve Yegge The cost of intelligence is trending to zero. How do we plan for this? Logan Kilpatrick If you are not planning for the price of intelligence to go to zero, the next 3-5 years are going to incredibly disruptive to your business / life. The important but not stated caveat: consumer willingness to pay for AI is going to go up (a lot). It will be fascinating to watch consumer willingness, cost, and the amount of AI being used all move in different directions. Everyone building things with AI has an economic incentive to limit the amount of AI because of cost, which inherent limits the value prop. This will change as intelligence goes up and cost goes down. What this means is: Admin automation: Administrative tasks vanish into background AI. Booking meetings, managing finances, or even planning family activities will require less thought. Hyper-personalization: Individuals get tailor-made everything—from medical advice to product recommendations to daily schedules. Systems learn your quirks. AI co-brains: AI co-worker “assistants support you at any moment. Productivity soars in knowledge work. “I’ll have my AI follow up becomes a normal response. Humanity valued more: As AI handles rote tasks, humans move up the value chain, focusing on creativity, empathy, or the “last-mile decisions. New business models: AI experts as a service Embedded AI Solutions AI micro-services for smart-calls Distributed AI Arena Hard is a set of hard prompts to test LLMs. Here is the code and evaluation LLMs can detect clear outliers easily. PROMPT: Which is the outlier in this dataset: (1,7), (2,7), (3,6), (4,6), (5,5), (6,1), (7,5), (8,3), (9,1), (10,1) (ANS: (6,1)) 🟢 GPT-4o on ChatGPT gets this. GPT-4o Mini on the API gets it too. 🟢 Gemini Pro, Flash, Flash 8b gets this right straight away, without even thinking. 🟢 Claude 3.5 Sonnet, Claude 3 Haiku, Claude 3.5 Haiku get it on LLM Foundry. 🔴 Claude.ai, where it visualizes it and gets it wrong. 🟢 Nova Micro, Lite, and Pro get it right. 🟢 Llama 3.1 70b gets it right. 🔴 Llama 3.2 8b gets it wrong. Llama 3.2 70b, Llama 3.1 8b enter repetition. To install Docker on Windows without admin privileges, use net localgroup docker-users "your-user-id" /ADD A non-administrator in a Google Groups domain can only add 200 emails to a group from the UI directly without invitation at a time. The only programmatic way to add users is for an administrator to add them. Even apps that use the Google Admin SDK need an admin to log in to access the relevant API. Take 100% of your work, including complex, multi step processes and put it into an LLM. It might fail at some but you will discover the limitations. I emailed Straive employees about their use of LLM Foundry - the internal LLM portal. I picked ~500 non-users from teams that otherwise have high (30%+) usage. Reasons they didn’t use it were: 40% had not heard of it. 40% were unclear of the benefits 20% didn’t have time 45% feel they don’t have enough information and training to use it Some feedback Sharing training videos will help Live training sessions that allows for Q&A will help Developers prefer detailed documentation The same prompt gives different results Possible solution: Email non-users introducing the tool and sharing a quick 15-minute tutorial and a 1-page quick start. My notes on the Amazon Nova models. More on Hacker News Nova Micro (3.75c/MTok) has the same cost as Gemini 1.5 Flash 8b but does not support images or documents. Nova Lite (6c/MTok) has about the same cost as Gemini 1.5 Flash 002 and supports images and documents (but not audio or video). It may be a good alternative. But GPT-4o mini, which is 2.5X costlier, is much better. (It partly passes the Gr brx vshdn Fdhvdu flskhu? test which Nova Lite fails.) Nova Pro (80c/MTok) is cheaper than Gemini 1.5 Pro and a lot cheaper than GPT 4o, but does not match their quality. LLMs are great at convincing you of wrong things. A danger and something to be wary of. Ethan Mollick Fish eye text summary is a great way to read text while summarizing context. Amelia Wattenberger DuckDB’s JavaScript API is still under development. For example, JSON, ARRAY are not insertable. Plus, re-creating persistent HNSW indices crashes. What’s a good text splitter library to use in JS? LangChain: If you use it, use it with a simple wrapper decoupled from the implementation (e.g. your own parameters) that you can replace later. Popular Fit-for-purpose. MarkdownTextSplitter which inherits from RecursiveCharacterTextSplitter is what’s needed in most cases. Unstable Poorly maintained Python docs indicate version 0.0 but it is in 0.1 Under-maintained Last update was 3 months ago, 13 Sep 2024 LlamaIndex: Popular Not an ideal fit. MarkdownNodeParser does not support chunk size. SentenceWindowNodeParser does not capture Markdown headings.

Secrets from the ChatGPT Conversation Schema

Things I Learned - 01 Dec 2024

This week, I learned: Gists are a good place to store static files for posterity as well as throwaway files. But, they’re just git repositories. So there may be no advantage over GitHub repos. GPT-4o Audio supports tone control via XML tags like <cough>..., <laugh>..., etc. But at ~$15/hr of output, it’s too expensive. Ref Mridula’s son gave a live commentary of what he was doing on Minecraft and ChatGPT gave him live evaluation and coaching. E.g. “Great strategy! Getting to the launch pad early can give you a huge mobility advantage. Making the bridge wider is also a smart move to prevent accidental falls. With this plan, you’re setting yourself up for success. This is a great way to interact with LLMs. Gemini’s JSON mode returns JSON with keys in alphabetical order. I think. Emperical evidence. This is unlike OpenAI which explicitly returns the keys in the order specified. To solve this, order the keys alphabetically. HTMX focuses on HTML over JS. Like server responses being HTML snippets not JSON. But I need front-end over back-end. Client side apps. HTMX doesn’t help much there, e.g. templating, or just plain JS code. htmx client side templates do can convert JSON to HTML. I installed the OpenAI Desktop App as well as Claude for Desktop. They take up too much RAM (260MB and 750 MB respectively on startup - though this varies.) The ChatGPT web page takes ~100MB incrementally, so I wrote an AutoHotkey script to switch to the first open (or recently closed) ChatGPT tab on Brave. I tried LIDA from Microsoft, after almost a year of its release. A few notes: Just running uvx lida ui --port 8080 --docs works. But I needed to use export TCL_LIBRARY=C:/Users/Anand/AppData/Roaming/uv/python/cpython-3.13.0-windows-x86_64-none/tcl/tcl8.6 to point it to my TCL installation for charts to work. I also chose to export OPENAI_BASE_URL=https://llmfoundry.straive.com/openai/v1 I also chose to replace gpt-3.5-turbo-0301 (the default model) with gpt-4o-mini in lida/web/ui/component* It’s quite impressive. OpenAI allows multiple system messages. I learned this browsing through the LIDA prompts. Anthropic’s Model Context Protocol lets any apps integrate with LLM Apps. LLM Apps are becoming the new operating system. Competitors, beware. I spoke at Automating Data Visualizations using LLMs at SUTD. Apparently, using LLMs to write code is much more common than writing code to use LLMs. I ran a quick quiz. Have you used ChatGPT or any LLM? 35 / 35 raised their hands. Have you written code using an LLM? 34 / 35 raised their hands. (I was impressed.) Have you uploaded a spreadsheet to an LLM for analysis? 15 / 35 raised their hands. Have you programmatically called an LLM API? 6 / 35 raised their hands. With LLMs, fostering innovation is a new path to profitability. Companies are increasing innovation team sizes. Productionizing that is the next. Some initiatives are: Convert popular demos into starter kits Create and evangelize trainings on solutions and solution techniques Create larger pools of capacity to build innovation and productionize it Andrew Ng Explores The Rise Of Al Agents And Agentic Reasoning | BUILD 2024 Keynote Innovation is now a path to production. People are able to build 20 prototypes at the cost of one and see which sticks Machine learning is much faster. Things that took months now date days. But engineering and evaluations are only slightly faster and have become a bottleneck A good analogy to zero shot prompting is to ask a person to write an entire essay without pressing backspace even once Andrew scenes to align with the line chain definition of agentic workflow, which is about agents being able to craft their own control flows People find it very easy to understand agentic workflows once they read through the code Reflection or feedback is a useful agentic pattern In multi-agent collaboration, it may be the same underlying model that is acting as different agents. But just like we find it useful for the same CPU to run multiple processes and each application is its own abstraction, agents of useful abstraction It’s hard to summarize a large document using RAG. But you can directly add answers to such questions into the corpus, e.g. by adding a “summary” section, and other answers to common questions. CloudFlare workers can bundle any kind of files, including text, data, and WASM. Docs AssemblyScript can compile TypeScript to WASM. Here’s what I learnt Here’s a convenient pattern to git commit a directory but nothing else in it (e.g. a build/ directory). Add a .gitignore file with * followed by !.gitignore. Only the .gitignore file is tracked. Ultravox lets you build voice agents at 5c/min = $3/hr (OpenAI is 6c input, 24c output). Or clone their repo. Idle call time is counted towards cost. So cost may be higher than OpenAI. Voice cloning quality is average. Very distinctive voices are just partly identifiable. Supports tool calls (from their server). Their API is simple but the docs have minor errors (e.g. a trailing comma in the JSON, which leads to an error) reducing confidence. LLMs may be good at derived data generation. For example, given a database schema, what derived columns would be useful? What derived views would be useful? The O1 model does not have a mechanism to control the amount of tokens to spend on reasoning. DeepSeek R1 might, but the API is not out yet. The OpenAI Desktop App can interact with native applications, e.g. read from Terminal, VS Code, etc. This takes it on a path to becoming a copilot for ANY apps. Putting every copilot app and every LLM integration under threat. Crawl4AI and Firecrawl are tools / libraries to convert websites into LLM Friendly Markdown and extract structured data using LLMs. Don’t try and solve specific problems. Pass the entire context to an LLM and get a comprehensive solution. Most doctors, for example, ask specific search-like questions instead of uploading the entire case history and asking for a diagnosis, and perform workse than LLMs. Ethan Mollick

Introducing Students to AI Evaluators

In my Tools in Data Science course at IITM, I’m introducing a project that will be evaluated by an LLM. Here’s the work-in-progress draft of the project. It will eventually appear here. Your task is to: Write a Python script that uses an LLM to analyze, visualize, and narrate a story from a dataset. Convince an LLM that your script and output are of high quality. The second point is the interesting one. Using the LLM as the evaluator. ...

Will people accept AI performance evaluations? Anish Agarwal triggered this question a few weeks ago, mentioning that it’s hard for people to feel evaluated by AI. But I believe LLMs are great for evaluation. We need to get comfortable AND familiar with them. So I’m introducing a project next week for my students: USE AN LLM to automatically analyze data. Given a dataset, write a program that will use LLMs to create an analysis report. CONVINCE IT to give you marks. Write the code and report in a way that the LLM will reward you. Here’s the project: https://github.com/sanand0/tools-in-data-science-public/blob/tds-2023-t3-project2-wip/project-2-automated-analysis.md ...

Things I Learned - 24 Nov 2024

This week, I learned: OpenAI lets you download GPT instructions and execute arbitrary code in their containerized environment. This is not a bug. Ref BM25 works as follows: Ref For each query term in the query, sum up the product of: Inverse document frequency = LN(% of docs without the query term + 1) – with a small tweak Term frequency = freq / (freq + k) – where k is usually between 1.2 to 2. Returns 0-1 with diminishing frequency benefit k is multiplied by Document length normalization = 1 - b(1- DocLength/AvgDocLength). Longer documents have larger k, dampening frequency benefits. Some implications: The actual BM25 score has no meaning. It’s just useful for ordering BM25 scores for 2 queries can be compared ONLY IF the document sets don’t change A list of Markdown to Website converters on this thread: Jekyll - Ruby - 2008 MkDocs - Python - 2014 GitBook - JavaScript (Node.js) - 2014 MkDocs Material - Python (MkDocs-based) - 2016 Docsify - JavaScript - 2016 MdBook - Rust - 2017 Antora - JavaScript (Node.js) - 2017 Docusaurus - JavaScript (React) - 2017 JupyterBook - Python - 2019 Keenwrite - Java - ~2019 Honkit - JavaScript (GitBook fork) - 2019 Nextra - JavaScript (Next.js) - 2020 Astro - JavaScript/TypeScript - 2021 Hugo Book - Go (Hugo-based) - ~2020 Clowncar - JavaScript/Node.js - ~2021 Quarto - R and Python - 2022 Starlight - JavaScript/TypeScript - 2023 DuckDB has an LLMs.txt. Today, 38 repos on GitHub support it When identifying LLM use cases, it helps to tell LLMs what they can do. I use one or more of a list like below: Core capabilities: Text Generation: Produce coherent and contextually relevant text across various domains. Image Generation: Create realistic images that match the style and content of a given reference image. Text to Speech: Convert text into natural-sounding speech with appropriate intonation and rhythm. Speech to Text: Transcribe and interpret spoken language. Vision: Analyze and describe visual content from images. Video Analysis: Summarize and extract information from video content. Text to Video: Generate realistic (and surrealistic) videos from text descriptions. Function Calling: Execute predefined functions or access external tools to perform specific tasks. Structured Output: Generate structured outputs like JSON, XML, HTML, YAML, DSLs, etc. Tool Use: Utilize external applications or APIs to enhance functionality. Code Generation: Write and debug code snippets in various programming languages. Cross-domain use cases: Summarization: Understand and condense lengthy documents into concise summaries. Translation: Convert text between multiple languages with high accuracy. Question Answering: Provide precise answers to user queries based on provided information. Reasoning and Planning: Solve complex problems and develop step-by-step plans. Personalization: Tailor responses based on user preferences and historical interactions. Dialogue Management: Engage in context-aware, multi-turn conversations. Data Analysis: Interpret and generate insights from structured data. Content Moderation: Identify and filter inappropriate or harmful content. Sentiment Analysis: Detect and interpret emotions and opinions in text. Robotics Integration: Interface with robotic systems for control and decision-making. Knowledge Retrieval: Access and present information from vast datasets or knowledge bases. Creative Writing: Generate poetry, stories, and other creative content. Educational Assistance: Provide explanations and tutoring across various subjects. Ethical Reasoning: Assess scenarios for ethical considerations and implications. Accessibility Support: Assist users with disabilities through tailored interactions. Simulation and Modeling: Create predictive models and simulate scenarios. Domain-specific use cases: Legal and Medical Assistance: Offer information and guidance within legal and medical domains. Gaming: Generate narratives, dialogues, and scenarios for interactive entertainment. Scientific Research: Aid in literature reviews, hypothesis generation, and data interpretation. Financial Analysis: Analyze market trends and provide investment insights. Cultural Competence: Understand and respect diverse cultural contexts in interactions. Security Applications: Detect and respond to potential cybersecurity threats. Environmental Monitoring: Analyze data related to environmental changes and sustainability. Healthcare Support: Assist in patient monitoring, diagnostics, and personalized treatment plans. Supply Chain Optimization: Enhance logistics and inventory management through predictive analysis. Customer Service: Provide automated support and resolve customer inquiries. Market Research: Analyze consumer behavior and market trends for business insights. Content Creation: Generate articles, blogs, and marketing materials. Virtual Assistance: Manage schedules, reminders, and personal tasks. Social Media Management: Craft posts and engage with audiences across platforms. Human Resources: Assist in recruitment, training, and employee engagement strategies. Event Planning: Organize and coordinate events, including logistics and communication. Travel Planning: Provide itineraries, booking assistance, and destination information. Real Estate: Analyze property markets and assist in buying or selling decisions. Agriculture: Monitor crop health and optimize farming practices through data analysis. Energy Management: Optimize energy consumption and monitor renewable energy sources. Transportation: Enhance route planning and traffic management systems. Urban Planning: Assist in designing sustainable and efficient urban infrastructures. Disaster Response: Provide real-time information and coordination during emergencies. Public Policy: Analyze data to inform policy decisions and predict societal impacts. Art and Design: Generate visual art concepts and assist in creative design processes. Music Composition: Create original music pieces and assist in songwriting. Language Learning: Facilitate language acquisition through interactive exercises and feedback. Historical Analysis: Interpret historical data and provide insights into past events. Philanthropy: Identify charitable opportunities and assess the impact of donations. Sports Analytics: Analyze player performance and game strategies. Fashion: Predict trends and assist in clothing design and merchandising. Culinary Arts: Generate recipes and provide cooking guidance. Astronomy: Analyze celestial data and assist in space exploration research. Psychology: Offer insights into human behavior and mental health support. Linguistics: Analyze language patterns and assist in translation studies. Archaeology: Assist in artifact analysis and historical site interpretations. Literature Analysis: Interpret literary works and provide critical analyses. Philosophy: Engage in discussions on ethical dilemmas and existential questions. Mathematics: Solve complex equations and assist in theoretical research. Physics: Model physical phenomena and assist in experimental design. Chemistry: Analyze chemical compounds and predict reactions. Biology: Assist in genetic research and ecological studies. Geology: Analyze geological data and assist in natural resource exploration. Meteorology: Predict weather patterns and analyze climate data. Oceanography: Study marine ecosystems and assist in ocean exploration. Anthropology: Analyze cultural data and assist in ethnographic research. Style of writing impacts output style a lot. E.g. Adding an evil laugh makes Claude more creative. Ethan Mollick For good structured mode output, we need good prompting. Mentioning examples and schema and “JSON” helps. When providing examples, using (user, assistant) message pairs helps (I think it’s because it’s easier for the LLM to parse). Using a {reasoning, answer} schema (with reasoning first) helps. Make reasoning concise and relevant Ref Arxiv We already know code in JSON is not a great idea. Ref Just adding 3 real examples and regurgitation helped GPT 4o play chess much better. Both techniques may have more general use in prompting. Simon Willison With Deno 2.0, the same .js file can run in Node.js as well as Deno. Example jspm lets you generate import maps against any CDN. You can click on htop columns on the terminal to sort by that column! Mouse events work on command line apps. Julia Evans Alt Text will very likely be a browser feature. It’s important for the Alt text to flow as part of the content when listening to the page. Perhaps even become a part of the browser APIs like speechRecognition. Langchain suggests multiple levels of agentic behaviour. LLM Call < LLM Chain < LLM Rounter < State Machine < Autonomous Langchain A HTML quine: A page that, when rendered as HTML, shows the HTML source code of the page! You can enable syntax highlighting just using fonts. Ref HTML is all you need shows examples of using HTML for notebooks instead of Jupyter, Observable, etc. Straive evaluated Gemini 1.5 Flash 002 and GPT 4o Mini for translation. Portugese: Flash is better than GPT 4o Mini. BLEU Word Overlap is 65.5% > 64.6% and METEOR (Semantic) is 84.9% > 78.9% Mandarin: Flash is better than GPT 4o Mini. BLEU Word Overlap is 25.0% > 15.9% and METEOR (Semantic) is 54.7% > 51.1% The problem with Accept headers is that you can’t link to them. Simon Willison Recraft v3 supports vector (SVG) generation Simon Willison. The output is 100% <path> elements (even for text). You get 50 free credits daily. Creating 1 image is ~2 credits. The API costs $1 per 1K credits. Some things I can create with it are: Base data visualizations that I can animate with code Icons in a specific style Comic strips Explainers for talks or student material Featured images for blog posts Architecture diagrams?

ChatGPT Beat me at Pictionary

Me: Let’s play pictionary. You draw. I’ll guess. ChatGPT: Sure! I’ll draw something for you. Give me a moment. ChatGPT: Here you go! What do you think it is? Me: House ...

Why don't students hack exams when they can?

This year, I created a series of tests for my course at IITM and to recruit for Gramener. The tests had 2 interesting features. One question required them to hack the page Write the body of the request to an OpenAI chat completion call that: Uses model gpt-4o-mini Has a system message: Respond in JSON Has a user message: Generate 10 random addresses in the US Uses structured outputs to respond with an object addresses which is an array of objects with required fields: street (string) city (string) apartment (string) . Sets additionalProperties to false to prevent additional properties. What is the JSON body we should send to https://api.openai.com/v1/chat/completions for this? (No need to run it or to use an API key. Just write the body of the request below.) ...

Should courses be hard or easy?

Here’s a post I shared with the students of my Tools in Data Science course at IITM. This was in response to a student posting that: The design of TDS course lecture videos are designed in such a way that it could be understood only by the data scientists not by the students like me who are entirely new to the field of data science. Though I have gone through 6 weeks of course lecture videos, I am not fully aware of the usage of ChromeDevTools, Bash, Github etc…. ...