August 11, 2024

This week, I learned:

- Embedding models can be fine-tuned. Example: #TODO
- Agentic RAG (Ravi Theja, LlamaIndex)
  - RAG via top-k retrieval fails with
    - summarization => need to read all chunks
    - comparison: compare product X vs Y => need to split and re-combine
    - structured analytics. e.g. most expensive employees => Text2SQL first
    - multi-part questions. e.g. Tell me about speed of model X AND cost of model Y and recommend => need to split and re-combine
  - RAG failures: It’s single shot. No query planning. No tools. No correction. No memory.
  - Agents that help in RAG
    - Route to the right tool. E.g. retrieve via vector top-k search or vector summary search or keyword search or a combination?
    - One-shot query planning. E.g. break the query into multiple specific queries. RAG those. Then combine. #TRY - maybe in DocSearch
    - Tool use. E.g. schema retrieval, Text2SQL, Calendar, Chat, APIs, Search, etc.
  - Agent orchestration
    - ReAct: An agent reasoning loop. Reason + Act. {Thought, Action, Action Input, Observation}*. Orchestrate tools with a prompt
    - Multi-agent task solver: Llama agents
      - Instead of a single agent loop, use different agents. Also allows parallelization
      - Allow services to register. (MS TaskWeaver stores tool descriptions in YAML)
  - LlamaHub Tools has ideas for agents
- Notes on LLM Fine-Tuning
  - ROUGE-2, BLEU and similar metrics are NOT good. Create your own benchmarks
  - Non-PEFT fine-tuning needs 6X GPU RAM. Optimizer states, gradients and activations are the overhead.
  - PEFT is about tuning a subset of parameters.
  - LoRA adds additional weights without updating the base model. It’s a low-rank matrix multiplication. You can change these adapters at runtime. Saves space. Fast to train
  - Quantization: Stick to bitsandbytes or AWQ (may be a bit better)
  - QLoRA = Quantization + LoRA
  - Predibase has open-sourced LoRA adapters in “LoRA Land”. Existing adapters are pretty good.
  - The ghcr.io/predibase/lorax:main Docker image works with Docker Compose to run locally.
    - `devices:` in Docker Compose lets you specify NVIDIA GPU devices
  - Locust is an HTTP load testing lib in Python
  - Techniques for inference optimization
    - Dynamic adapters: Loads the right LoRAX adapters WHEN a request comes in
    - Multi-adapter batching: Process all inputs in parallel on the same GPU, but different users are post-processed using different adapters
- Notes from a 4-hour flight: What We’ve Learned From A Year of Building with LLMs
  - Strategy
    - IS IT TOO HARD/EXPENSIVE? Log it. LLMs are getting cheaper and better.
    - WILL OPENAI BUILD IT? If so, wait for it instead of building.
    - HAS A STARTUP BUILT IT? If so, use it instead. It’s a generic use case; there’s no point re-inventing it.
    - FOCUSED USE CASES over generic. Build trust by starting small.
    - Tools for LLM Ops (feedback): LangSmith, Log10, LangFuse, W&B Weave, HoneyHive #TRY
    - Human in the Loop is about humans evaluating model outputs. That’s different from AI in the loop, human in the center, where AI accelerates human output (like GitHub Copilot)
  - Operations
    - CHECK EMBEDDINGS DRIFT over time. Users might be entering different things than before.
    - LOG AND REVIEW everything.
    - Instructor coaxes structured output from LLM APIs. #TRY
    - IMPLICIT FEEDBACK collection is easy. Just let users edit stuff. #TRY
  - Tactical
    - Try n-shot prompting (n=5-12) before bigger models. #TRY
    - Always structure the output: Markdown, XML/HTML tags.
    - Combine RAG with keyword search. It reduces user frustration in edge cases.
    - Prefer multiple small prompts to one big prompt. Do X. Then Y. Then Z.
    - Jitter prompts for diversity beyond temperature.
    - LLM-as-judge works better when comparing outputs (not rating 1 output). Keep length similar (LLMs prefer wordiness). Swap order and compare. Allow for ties. Ask for the reason FIRST.
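The LLM-as-judge tips above (compare two outputs, swap the order, allow ties, ask for the reason first) could be sketched roughly as below. This is only an illustration: the prompt wording, the `WINNER:` line format, and the `compare` and `parse_winner` helpers are all made up for this sketch, and `llm` stands in for any function that takes a prompt string and returns the model's reply. It does not handle the "keep length similar" tip.

```python
from typing import Callable

# Hypothetical prompt wording. The reasoning is requested BEFORE the verdict.
PROMPT = """Compare two answers to the same question.
Question: {question}
Answer A: {a}
Answer B: {b}
First give your reasoning in one short paragraph.
Then finish with a final line: WINNER: A, WINNER: B, or WINNER: TIE."""


def parse_winner(reply: str) -> str:
    """Read the verdict from the last 'WINNER:' line; default to a tie."""
    for line in reversed(reply.strip().splitlines()):
        if line.upper().startswith("WINNER:"):
            verdict = line.split(":", 1)[1].strip().upper()
            if verdict in {"A", "B", "TIE"}:
                return verdict
    return "TIE"


def compare(question: str, first: str, second: str, llm: Callable[[str], str]) -> str:
    """Return 'first', 'second' or 'tie' after judging in both orders."""
    # Round 1 shows `first` as Answer A. Round 2 swaps the positions.
    round1 = parse_winner(llm(PROMPT.format(question=question, a=first, b=second)))
    round2 = parse_winner(llm(PROMPT.format(question=question, a=second, b=first)))
    # Map the position labels back to the actual answers.
    vote1 = {"A": "first", "B": "second", "TIE": "tie"}[round1]
    vote2 = {"A": "second", "B": "first", "TIE": "tie"}[round2]
    # Only a verdict that survives the swap counts; anything else is a tie.
    return vote1 if vote1 == vote2 else "tie"
```

A judge that always picks whichever answer is shown first ends up with a tie here, which is the point of swapping the order.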
- Hermes: A Text-to-SQL solution at Swiggy
  - “Hermes performed significantly better for charters with well-defined metadata and a relatively smaller number of tables.”
  - “We collect feedback on the accuracy of the returned query from stakeholders directly within the Slack bot.”
- How I use AI and “Replacing my right hand with AI”
  - EMBED in every app/workflow. E.g. Auto-fix spellings. Auto-review code. Auto-ask LLM on errors and apply patch! Auto-search for answer, assess, continue.
  - PERSIST. Stick with the LLM to the end. Don’t fix it yourself. It’s faster. #TRY
  - INTERVENE FAST. If an LLM can’t solve it by itself in 2 tries, it needs in-depth help.
  - APP-IFY one-off tasks. Disposable tools. “Write web-app to convert JSON to tab-delimited.” “Extract fields as a table.” “Diff JSON.” #TRY
  - BEST language/frameworks preferred. CUDA in Python. Rust. C. Raspberry Pi. Arduino. Bluetooth. Modern ESM/JS. #TRY
  - TEACH examples. “Here’s the LLM Foundry API.” “Here’s how to use gramex.data.”
  - DUMP entire code. Models can handle it. Refactoring to SQLAlchemy 2, Pandas 2. API Documentation. Test case generation. #TRY
  - ASK for features & packages. Docker without root access. GPU access inside Docker. Windows CLI-only C++ compiler.
  - TEST CASE writing. #TRY
  - SPEC IN DETAIL. Use these libraries. Write like this: code example.
  - SPEC USAGE in detail. “I will just pipe it into sqlite”, or “I will just run ffmpeg -i filename [YOUR OPTIONS]”. Describe the UI, API input/output, data structure, and internal data structure.
  - HELP on usage. “ffmpeg to get audio.mp3”.
- My benchmark for large language models
- `LLM(text)` is a useful function to have in JS and Python too.
  - Useful as a simple `pip install llmfoundry`
  - Allow images, files in `LLM()`
- Current list of #IMPOSSIBLE (or hard) things for LLMs
  - Translate technical documents to Dutch, because they don’t understand the technical terms well
  - Translate large documents (JSON to XML, English to Chinese, Python to Rust, wrong to right spelling), because the output tokens are limited
- micro-agent generates test cases first when asked to build an app. Then it iterates until the test cases pass.
- Alternative interfaces to YouTube: Piped.video, CloudTube, Invidious, NewPipe, FreeTube
- DeepSeek Context Caching reduces the price to 1.4 cents/MTok for portions of chat messages that are repeated. That’s a 10X reduction for long conversations!
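A back-of-the-envelope sketch of what the DeepSeek caching price means for a long chat, under loud assumptions: the uncached input price of 14 cents/MTok is only inferred from the "10X reduction" above, every token resent from earlier turns is treated as a cache hit, output tokens are ignored, and `input_cost_cents` is a made-up helper, not part of any API.

```python
CACHED_CENTS_PER_MTOK = 1.4     # price for repeated (cached) input, from the note above
UNCACHED_CENTS_PER_MTOK = 14.0  # ASSUMPTION: implied by the "10X reduction"


def input_cost_cents(turn_tokens: list[int], cached: bool = True) -> float:
    """Input cost of a chat where every request resends the full history.

    turn_tokens[i] is the number of new tokens added to the context at turn i.
    Idealized: with cached=True, all previously sent tokens are cache hits.
    """
    repeated_rate = CACHED_CENTS_PER_MTOK if cached else UNCACHED_CENTS_PER_MTOK
    total = 0.0
    history = 0  # tokens already sent in earlier requests
    for new in turn_tokens:
        # Repeated history pays the (possibly cached) rate; new tokens pay full price.
        total += (history * repeated_rate + new * UNCACHED_CENTS_PER_MTOK) / 1_000_000
        history += new
    return total


turns = [1000] * 100  # hypothetical: 100 turns adding 1,000 tokens each
without_cache = input_cost_cents(turns, cached=False)
with_cache = input_cost_cents(turns, cached=True)
print(f"{without_cache:.2f} cents vs {with_cache:.2f} cents "
      f"({without_cache / with_cache:.1f}X cheaper)")
```

For these made-up numbers the input cost drops from about 70.7 cents to about 8.3 cents. The saving approaches 10X as the conversation grows but never quite reaches it, because new tokens in each turn still pay the full rate.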