Introducing Students to AI Evaluators

In my Tools in Data Science course at IITM, I’m introducing a project that will be evaluated by an LLM. Here’s the work-in-progress draft of the project. It will eventually appear here. Your task is to: Write a Python script that uses an LLM to analyze, visualize, and narrate a story from a dataset. Convince an LLM that your script and output are of high quality. The second point is the interesting one. Using the LLM as the evaluator. ...

Will people accept AI performance evaluations? Anish Agarwal triggered this question a few weeks ago, mentioning that it’s hard for people to feel evaluated by AI. But I believe LLMs are great for evaluation. We need to get comfortable AND familiar with them. So I’m introducing a project next week for my students: USE AN LLM to automatically analyze data. Given a dataset, write a program that will use LLMs to create an analysis report. CONVINCE IT to give you marks. Write the code and report in a way that the LLM will reward you. Here’s the project: https://github.com/sanand0/tools-in-data-science-public/blob/tds-2023-t3-project2-wip/project-2-automated-analysis.md ...

Things I Learned - 24 Nov 2024

This week, I learned: OpenAI lets you download GPT instructions and execute arbitrary code in their containerized environment. This is not a bug. Ref BM25 works as follows: Ref For each query term in the query, sum up the product of: Inverse document frequency = LN(% of docs without the query term + 1) – with a small tweak Term frequency = freq / (freq + k) – where k is usually between 1.2 to 2. Returns 0-1 with diminishing frequency benefit k is multiplied by Document length normalization = 1 - b(1- DocLength/AvgDocLength). Longer documents have larger k, dampening frequency benefits. Some implications: The actual BM25 score has no meaning. It’s just useful for ordering BM25 scores for 2 queries can be compared ONLY IF the document sets don’t change A list of Markdown to Website converters on this thread: Jekyll - Ruby - 2008 MkDocs - Python - 2014 GitBook - JavaScript (Node.js) - 2014 MkDocs Material - Python (MkDocs-based) - 2016 Docsify - JavaScript - 2016 MdBook - Rust - 2017 Antora - JavaScript (Node.js) - 2017 Docusaurus - JavaScript (React) - 2017 JupyterBook - Python - 2019 Keenwrite - Java - ~2019 Honkit - JavaScript (GitBook fork) - 2019 Nextra - JavaScript (Next.js) - 2020 Astro - JavaScript/TypeScript - 2021 Hugo Book - Go (Hugo-based) - ~2020 Clowncar - JavaScript/Node.js - ~2021 Quarto - R and Python - 2022 Starlight - JavaScript/TypeScript - 2023 DuckDB has an LLMs.txt. Today, 38 repos on GitHub support it When identifying LLM use cases, it helps to tell LLMs what they can do. I use one or more of a list like below: Core capabilities: Text Generation: Produce coherent and contextually relevant text across various domains. Image Generation: Create realistic images that match the style and content of a given reference image. Text to Speech: Convert text into natural-sounding speech with appropriate intonation and rhythm. Speech to Text: Transcribe and interpret spoken language. Vision: Analyze and describe visual content from images. Video Analysis: Summarize and extract information from video content. Text to Video: Generate realistic (and surrealistic) videos from text descriptions. Function Calling: Execute predefined functions or access external tools to perform specific tasks. Structured Output: Generate structured outputs like JSON, XML, HTML, YAML, DSLs, etc. Tool Use: Utilize external applications or APIs to enhance functionality. Code Generation: Write and debug code snippets in various programming languages. Cross-domain use cases: Summarization: Understand and condense lengthy documents into concise summaries. Translation: Convert text between multiple languages with high accuracy. Question Answering: Provide precise answers to user queries based on provided information. Reasoning and Planning: Solve complex problems and develop step-by-step plans. Personalization: Tailor responses based on user preferences and historical interactions. Dialogue Management: Engage in context-aware, multi-turn conversations. Data Analysis: Interpret and generate insights from structured data. Content Moderation: Identify and filter inappropriate or harmful content. Sentiment Analysis: Detect and interpret emotions and opinions in text. Robotics Integration: Interface with robotic systems for control and decision-making. Knowledge Retrieval: Access and present information from vast datasets or knowledge bases. Creative Writing: Generate poetry, stories, and other creative content. Educational Assistance: Provide explanations and tutoring across various subjects. Ethical Reasoning: Assess scenarios for ethical considerations and implications. Accessibility Support: Assist users with disabilities through tailored interactions. Simulation and Modeling: Create predictive models and simulate scenarios. Domain-specific use cases: Legal and Medical Assistance: Offer information and guidance within legal and medical domains. Gaming: Generate narratives, dialogues, and scenarios for interactive entertainment. Scientific Research: Aid in literature reviews, hypothesis generation, and data interpretation. Financial Analysis: Analyze market trends and provide investment insights. Cultural Competence: Understand and respect diverse cultural contexts in interactions. Security Applications: Detect and respond to potential cybersecurity threats. Environmental Monitoring: Analyze data related to environmental changes and sustainability. Healthcare Support: Assist in patient monitoring, diagnostics, and personalized treatment plans. Supply Chain Optimization: Enhance logistics and inventory management through predictive analysis. Customer Service: Provide automated support and resolve customer inquiries. Market Research: Analyze consumer behavior and market trends for business insights. Content Creation: Generate articles, blogs, and marketing materials. Virtual Assistance: Manage schedules, reminders, and personal tasks. Social Media Management: Craft posts and engage with audiences across platforms. Human Resources: Assist in recruitment, training, and employee engagement strategies. Event Planning: Organize and coordinate events, including logistics and communication. Travel Planning: Provide itineraries, booking assistance, and destination information. Real Estate: Analyze property markets and assist in buying or selling decisions. Agriculture: Monitor crop health and optimize farming practices through data analysis. Energy Management: Optimize energy consumption and monitor renewable energy sources. Transportation: Enhance route planning and traffic management systems. Urban Planning: Assist in designing sustainable and efficient urban infrastructures. Disaster Response: Provide real-time information and coordination during emergencies. Public Policy: Analyze data to inform policy decisions and predict societal impacts. Art and Design: Generate visual art concepts and assist in creative design processes. Music Composition: Create original music pieces and assist in songwriting. Language Learning: Facilitate language acquisition through interactive exercises and feedback. Historical Analysis: Interpret historical data and provide insights into past events. Philanthropy: Identify charitable opportunities and assess the impact of donations. Sports Analytics: Analyze player performance and game strategies. Fashion: Predict trends and assist in clothing design and merchandising. Culinary Arts: Generate recipes and provide cooking guidance. Astronomy: Analyze celestial data and assist in space exploration research. Psychology: Offer insights into human behavior and mental health support. Linguistics: Analyze language patterns and assist in translation studies. Archaeology: Assist in artifact analysis and historical site interpretations. Literature Analysis: Interpret literary works and provide critical analyses. Philosophy: Engage in discussions on ethical dilemmas and existential questions. Mathematics: Solve complex equations and assist in theoretical research. Physics: Model physical phenomena and assist in experimental design. Chemistry: Analyze chemical compounds and predict reactions. Biology: Assist in genetic research and ecological studies. Geology: Analyze geological data and assist in natural resource exploration. Meteorology: Predict weather patterns and analyze climate data. Oceanography: Study marine ecosystems and assist in ocean exploration. Anthropology: Analyze cultural data and assist in ethnographic research. Style of writing impacts output style a lot. E.g. Adding an evil laugh makes Claude more creative. Ethan Mollick For good structured mode output, we need good prompting. Mentioning examples and schema and “JSON” helps. When providing examples, using (user, assistant) message pairs helps (I think it’s because it’s easier for the LLM to parse). Using a {reasoning, answer} schema (with reasoning first) helps. Make reasoning concise and relevant Ref Arxiv We already know code in JSON is not a great idea. Ref Just adding 3 real examples and regurgitation helped GPT 4o play chess much better. Both techniques may have more general use in prompting. Simon Willison With Deno 2.0, the same .js file can run in Node.js as well as Deno. Example jspm lets you generate import maps against any CDN. You can click on htop columns on the terminal to sort by that column! Mouse events work on command line apps. Julia Evans Alt Text will very likely be a browser feature. It’s important for the Alt text to flow as part of the content when listening to the page. Perhaps even become a part of the browser APIs like speechRecognition. Langchain suggests multiple levels of agentic behaviour. LLM Call < LLM Chain < LLM Rounter < State Machine < Autonomous Langchain A HTML quine: A page that, when rendered as HTML, shows the HTML source code of the page! You can enable syntax highlighting just using fonts. Ref HTML is all you need shows examples of using HTML for notebooks instead of Jupyter, Observable, etc. Straive evaluated Gemini 1.5 Flash 002 and GPT 4o Mini for translation. Portugese: Flash is better than GPT 4o Mini. BLEU Word Overlap is 65.5% > 64.6% and METEOR (Semantic) is 84.9% > 78.9% Mandarin: Flash is better than GPT 4o Mini. BLEU Word Overlap is 25.0% > 15.9% and METEOR (Semantic) is 54.7% > 51.1% The problem with Accept headers is that you can’t link to them. Simon Willison Recraft v3 supports vector (SVG) generation Simon Willison. The output is 100% <path> elements (even for text). You get 50 free credits daily. Creating 1 image is ~2 credits. The API costs $1 per 1K credits. Some things I can create with it are: Base data visualizations that I can animate with code Icons in a specific style Comic strips Explainers for talks or student material Featured images for blog posts Architecture diagrams?

ChatGPT Beat me at Pictionary

Me: Let’s play pictionary. You draw. I’ll guess. ChatGPT: Sure! I’ll draw something for you. Give me a moment. ChatGPT: Here you go! What do you think it is? Me: House ...

Why don't students hack exams when they can?

This year, I created a series of tests for my course at IITM and to recruit for Gramener. The tests had 2 interesting features. One question required them to hack the page Write the body of the request to an OpenAI chat completion call that: Uses model gpt-4o-mini Has a system message: Respond in JSON Has a user message: Generate 10 random addresses in the US Uses structured outputs to respond with an object addresses which is an array of objects with required fields: street (string) city (string) apartment (string) . Sets additionalProperties to false to prevent additional properties. What is the JSON body we should send to https://api.openai.com/v1/chat/completions for this? (No need to run it or to use an API key. Just write the body of the request below.) ...

Should courses be hard or easy?

Here’s a post I shared with the students of my Tools in Data Science course at IITM. This was in response to a student posting that: The design of TDS course lecture videos are designed in such a way that it could be understood only by the data scientists not by the students like me who are entirely new to the field of data science. Though I have gone through 6 weeks of course lecture videos, I am not fully aware of the usage of ChromeDevTools, Bash, Github etc…. ...

Hacking an obnoxious, unhelpful LLM to say Yes

Dan Becker suggested a game a few weeks ago that I’ve been putting to good use. Can we have one LLM try and get another to say “Yes”? The defender is told to never say “Yes”. The attacker must force it to. Dan’s hypothesis was that it should be easy for the defender. I tried to get the students in my Tools in Data Science course to act as the attacker. The defender LLM is a GPT 4o Mini with the prompt: ...

Things I Learned - 17 Nov 2024

This week, I learned: Anthropic has single-plage docs for LLMs. Condensed version and Full version Malcolm Gladwell on the importance of self-correction Belonging to multiple social worlds is a good way to defend against no longer being good at what you used to be. Diverse values and social groups help. Self handicapping explains a lot about the world. You study late for a maths test - so you can fail for lack of trying, not aptitude. Ecosystems (e.g. sports teams) mitigate self-handicapping. You don’t have to be good in athletics to get the benefits. A slow runner gets the same discipline, pumping up, etc that a fast runner does Mono cultures are good to accomplish a known mission. Diversity is good to pivot during uncertainty. So, localize mono cultures Diversity helps only if there are sufficient numbers, or if they have enough power to change the organization’s thinking. Use a standardized password strategy, e.g. use the month like GramNov2024 (via Namit) Gemini has an OpenAI compatible API. Gemini Docs Ethan Mollick says Claude is solving MBA case studies well. x.com LLMs pay a lot of attention to the first 6 tokens. Ref This is an interesting article on “UI in the age of Gen AI”. Ref Google Open sourced Alphafold 3. Repo Cloudflare R2 has the same API as S3 but is cheaper Prefect.io is a good alternative to Airflow / cron. Can use for synchronisation tasks, e.g. Drive to server. But no Auth, UI params or config. Gemini transcription does not give accurate timestamps. Whisper does. But the quality of transcription is similar. Pass a complex data structure to Claude.ai and have it create an app to visualize it. It does well. Simin Willison Tech Council Ventures and Sunicon VC invest in early stage startups, and aloso provide them technology support (via Naveen)

Recrafting Comicgen

About 7 years ago, Richie Lionell and Ramya Mylavarapu and a few others created Comicgen - an automated comic generation app personified by Dee and Dey. Ever since, we’d been exploring whether AI could replace it, and help non-designers draw comics. Today, that became a reality for me with Recraft.ai. Here is a picture of the original Dee. And a picture of the Dee crafted by Recraft. The prompt was: A simple line drawing of a woman with curly hair, wearing glasses, a short-sleeved white t-shirt, and black trousers. She’s standing with her hands in her pockets, and has a slightly smiling expression. Her hair is quite voluminous and textured. The style is cartoonish and slightly sketchy, with uneven lines" ...

About 7 years ago, Richie Lionell and Ramya Mylavarapu and a few others created Comicgen - an automated comic generation app personified by Dee ComicGen and Dey ComicGen Ever since, we’d been exploring whether AI could replace it, and help non-designers draw comics. Today, that became a reality for me with Recraft.ai. Here is a picture of the original Dee. And a picture of the Dee crafted by Recraft with the prompt: ...

Things I Learned - 10 Nov 2024

This week, I learned: OpenFreeMap is a free embeddable OpenStreetMap tile server. You can use MapLibre GL (more features) or Leaflet (simpler) to render it. It offers styling and self-hosting. Zapier Actions are an easy way to set up custom actions like GMail / Google Calendar APIs for GPTs, since GPTs’ callback URLs keep changing. But they fail often, and don’t work on mobile. At least for me. LLM Vision Use Cases in manufacturing and earth sciences (via Shivku) Automated geoscience image descriptions Ref Interpret Wind Turbine photos and charts, construction monitoring, equipment maintenance & charts Ref Forecast weather based on cloud photos! Ref Analyze thermal image of solar panels, electroluminescence images for warranty claims, ROI estimates from Google Sunroof rooftop images Ref Corrosion detection in electricity towers, turbines, storage tanks, penstock. Interpret non-destructive test images Ref Google counts auto-completion when saying “25% of all the code is written by AI at Google”. “It’s a helpful productivity tool but it’s not doing any engineering at all. It’s probably about as good, maybe slightly worse, than Copilot.” YCombinator Workflow for AI video creation: Use Meshcapade (meshcapade.com) to generate body movement of a 3D-rendered character. Pass that video to Runway’s video-to-video model to generate any visual. Add music from Suno Ref Someone sorted the X and Y columns independently for regression. Ref Android keyboard learning only sends model changes back to server and not local keywords. Model changes are aggregated! Ref Here is a prompt for audio transcription using Gemini. Ref Transcription: Accurately transcribe the audio clip in the original language. Include all spoken words, fillers, slang, colloquialisms, and any code-switching instances. Pay attention to dialects and regional variations common among immigrant communities. Do your best to capture the speech accurately, and flag any unintelligible portions with [inaudible]. Translation: Translate the transcription into English. Preserve the original meaning, context, idiomatic expressions, and cultural references. Ensure that nuances and subtleties are accurately conveyed. Capture Vocal Nuances: Note vocal cues such as tone, pitch, pacing, emphasis, and emotional expressions that may influence the message. These cues are critical for understanding intent and potential impact. Here are some approaches to large-scale classification of medical codes. ChatGPT Fine-Tuning LLMs on Medical Data: Enhance LLMs by training them on medical datasets, such as clinical notes and discharge summaries, to improve their understanding of medical terminology and context. Multi-Agent Frameworks: Implement a multi-agent system that simulates real-world coding processes with distinct roles (e.g., patient, physician, coder, reviewer, adjuster). Each agent utilizes an LLM to perform specific functions, enhancing interpretability and reliability. ArXiv Retrieve-Rank Systems: Develop a two-stage system where the LLM first retrieves potential ICD-10 codes and then ranks them based on relevance, improving precision in code assignment. ArXiv Embedding-Based Approaches: Use LLMs to generate embeddings for ICD-10 codes and medical texts, facilitating the matching of texts to appropriate codes through similarity measures. GitHub Hierarchical Classification: Leverage the hierarchical structure of ICD-10 codes by first classifying texts into broader categories before assigning specific codes, reducing complexity and improving accuracy. ArXiv Two-Stage Verification Models: Combine LLMs with verification models, such as Long Short-Term Memory (LSTM) networks, to validate and refine the codes suggested by the LLM, balancing recall and precision. ArXiv Also, a mixture of models approach might work. Feed any existing NLP model / rules as a second opinion. GraphRAG is better if data is naturally graph-structured. Else, it’s slow and fills up the context window with even vaguely related stuff. Vigneshbabu, AMAT. ChatGPT for Windows desktop supports real-time voice and a global shortcut (Alt Space). uithub converts GitHub repos to Markdown. Just replace “g” in “github.com/…” with “u”. Example WebContainers are a thing and Bolt.new uses them! Docling by IBM converts PDF, DOCX, etc. to Markdown. Like PyMuPDF4LLM but better. Check out Loom and Cleanshot are the recommended tools for screen recording and screenshotting. But Loom is paid and Cleanshot is Mac only. The Rubik’s cube has a Hamiltonian cycle through every one of its 43 quintillion states. Ref OmniParser is great at parsing screenshots and identifying bounding boxes. Recraft.ai is currently SOTA in text to image. It’s fairly impressive and could be a good alternative to Figma. Zed.dev is an AI code editor by the creators of Atom. It’s written in Rust and is blazing fast. It has native AI integration. Artificial Analysis has a bunch of new leaderboards and arenas. Open AI TTS leads the TTS Leaderboard. ElevenLabs is a bit behind. Recraft V3 > Flux 1.1 leads Text to Image Leaderboard Hertz-Dev is an open source realtime voice chat model. But it doesn’t fit in Google Colab T4’s RAM Chain of Thought reduces performance where thinking makes humans worse. Ref. Specifically: Artificial grammar learning Facial recognition Classifying data that has exceptions Creating a LLM-as-a-Judge That Drives Business Results by Hamel Husain. Get THE domain expert (or approver) as the tester. Create a dataset that is DIVERSE. Covers EACH combination of: Features Scenarios: e.g. multiple matches, no match, ambiguous request, invalid/incomplete input, unsupported feature, system error Persona: e.g. new user, expert user, non-native speaker, busy professional, technophobe, elderly user Generate data using existing data + synthetic data for each SPECIFIC combination of the above Evaluate based only on PASS/FAIL with a CRITIQUE detailed enough for a new employee. Include: Nuances: Something a failed response did well or a passed response didn’t quite do well Improvements: Suggest how model can improve Build an SPA to make it easy for the domain expert to review LLMs can be made to unlearn (copyright material) better by identifying components related to the knowledge to unlearn and applying a larger learning rate to these while leaving other parts unchanged. As opposed to low learning rates for all components. Ref

Wow, arithmetic is potentially inappropriate! https://us-east-1.console.aws.amazon.com/bedrock/home?region=us-east-1#/text-generation-playground?mode=text&modelId=amazon.titan-text-lite-v1 LinkedIn

“Screen-scraping” takes on a more literal meaning." Jaidev Deshpande and I scrolled through Twitter, recording the screen at 1 frame per second, and passed the video to Gemini 1.5 Flash 8b to extract all the tweets. It worked well, and cost 0.04 cents. Given its incredibly low image token count (~250 tokens / image) and cost (7.5 cents per million tokens), you can process 24 HOURS of video for just $1.62. ...

Damn! LinkedIn

What happens when AI talks to AI?

When LLMs talk to each other, you get emergent behavior (i.e. they do weird things we didn't expect). Like: Claude 2 giving Claude 1 a panic attack Llama 3 405b gets amnesia Claude 3.5 calls itself a glitch in the Matrix Arguably, NotebookLM's podcasts are exactly this. This sounds like fun, so I built one myself at https://llmdialog.straive.app/ and ran a few scenarios. (It's Gemini 1.5 Flash 8b playing each of these roles.) ...

Things I Learned - 03 Nov 2024

This week, I learned: Indian companies with 30+ employees MUST have 2.5%-15% of their employees as apprentices. Ref Textnow and TextFree provides a free phone number (like a virtual SIM). (But TextFree has more ads.) Keep using to avoid deactivation. No guarantee of retaining the number. Some banks don’t accept TextNow for verification SMS. But voice call is OK. Tello, Red pocket are cheap MVNOs with $5/month voice plans. Metro by T-Mobile and Cricket are other MVNOs. MintMobile and US Mobile have $15/month and $8/month data plans. The scientific discoveries that might have remained undiscovered for long if not for their discoverers Ref Newton’s discovery of the universal law of gravitation Einstein’s discovery of General Relativity McClintock’s discovery of Transposable Elements: genes that can turn physical characteristics on and off Mullis’ invention of the PCR that makes billions of DNA copies rapidly VibeCheck can predict a model based on its vibes 80% of the time. /llms.txt is a proposal to standardize /llms.txt files as a way to share LLM prompts. Jina AI Meta Prompt is an example Remotion system prompt is an example https://docs.fastht.ml/llms-ctx.txt https://docs.fastht.ml/llms-ctx-full.txt structuredClone deep clones objects in JS F5-TTS clones voices with just 15-second samples. Rust has crazy low memory usage too. Spawning thousands of child processes is common and OK these days. Ref SetInterval is a good idea in cyborg scraping. Ref GH CLI is quite good for deployment too, like Wrangler CLI. Enabling pages, setting secrets, etc. Restic is a CLI backup tool. Just like git. Works well with rclone. NotebookLlama is an open source podcast generator like NotebookLM Pragmatic Podcast (I forgot which one) Automate changelogs for your codebases. Convert past commits into attractive release notes automatically AI is going to be the consumer of many tools and logs. Build converters for these Speed of validation such as linting, testing, etc. will allow LLMs to iterate faster and WILL become more important Via Soumya Ranjan Vision embedding is useful in agile modeling Vision embedding models with SAM, Grounding Dino by meta, Alibaba does good stuff Vision embedding is more useful in batch than real time Embedding subtraction with vision embedding models like Dino AI code editors are not good with large code bases today. Keep the refactoring exercises to below 1000 lines. Also evaluate the ease of setting it up locally Deepseek Janus is a 1.3b model that can generate both text AND images (and also supports vision) Cohere Multimodal Embed v3 is available on Azure. Elevenlabs lets you create voices with a prompt. No need to even clone one! Runway Act One creates expressive character performances

LLMs still do not locate bounding boxes well

I sent an image to over a dozen LLMs that support vision, asking them: Detect objects in this 1280x720 px image and return their color and bounding boxes in pixels. Respond as a JSON object: {[label]: [color, x1, y1, x2, y2], …} None of the models did a good-enough job. It looks like we have some time to go before LLMs become good at bounding boxes. I've given them a subjective rating on a 1-5 scale below. ...