This week, I learned:
- Embedding models can be fine-tuned. Example: #TODO
- Agentic RAG (Ravi Theja, LlamaIndex)
- RAG via top-k retrieval fails with
- summarization => need to read all chunks
- comparison: compare product X vs Y => need to split and re-combine
- structured analytics. e.g. most expensive employees => Text2SQL first
- multi-part questions. e.g. Tell me about speed of model X AND cost of model Y and recommend => need to split and re-combine
- RAG failures: It’s single shot. No query planning. No tools. No correction. No memory.
- Agents that help in RAG
- Route to the right tool
- E.g. retrieve via vector top-k search or vector summary search or keyword search or combination?
- One-shot query planning
- E.g. Break query into multiple specific queries. RAG those. Then combine. #TRY - maybe in DocSearch
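A minimal sketch of the split, retrieve, and combine idea above. Everything here is an assumption for illustration: the toy `DOCS` corpus, the `plan`, `retrieve` and `answer` helpers, splitting on an uppercase `AND` (a real planner would ask an LLM to produce the sub-queries), and word overlap standing in for vector search.

```python
import re

# Toy corpus standing in for indexed chunks (hypothetical content).
DOCS = [
    "Speed of model X: 120 tokens per second.",
    "Cost of model Y: 2 dollars per million tokens.",
    "Cost of model X: 9 dollars per million tokens.",
]

def tokens(text):
    """Lowercased word set used for the overlap score."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def plan(query):
    """Split a multi-part question on an uppercase AND into specific sub-queries."""
    return [part.strip() for part in re.split(r"\s+AND\s+", query) if part.strip()]

def retrieve(sub_query, k=1):
    """Rank documents by word overlap with the sub-query (stand-in for vector top-k)."""
    q = tokens(sub_query)
    ranked = sorted(DOCS, key=lambda doc: len(q & tokens(doc)), reverse=True)
    return ranked[:k]

def answer(query):
    """Plan once, retrieve per sub-query, then combine the evidence per sub-query."""
    return {sub: retrieve(sub) for sub in plan(query)}

evidence = answer("Tell me about speed of model X AND cost of model Y")
```

The combined `evidence` dict would then go to the LLM in a single final prompt, one section per sub-query.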
- Tool use
- E.g. Schema retrieval, Text2SQL, Calendar, Chat, APIs, Search, etc.
- Agent orchestration
- ReAct: An agent reasoning loop. Reason + Act. {Thought, Action, Action Input, Observation}*.
- Multi-agent task solver: Llama agents
- Instead of a single agent loop, use different agents. Also allows parallelization
- Allow services to register. (MS TaskWeaver stores tool descriptions in YAML)
- LlamaHub Tools has ideas for agents
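The ReAct loop mentioned above can be sketched roughly as follows. This is not LlamaIndex's implementation: `scripted_policy` is a hypothetical stand-in for the LLM call, and `calculator` is a toy tool, so the block only shows the {Thought, Action, Action Input, Observation} cycle.

```python
def calculator(expression):
    """Toy tool: add two integers written as 'a+b'."""
    a, b = expression.split("+")
    return str(int(a) + int(b))

TOOLS = {"calculator": calculator}

def scripted_policy(question, trace):
    """Stand-in for the LLM: choose the next step from the trace so far."""
    if not trace:
        return {"thought": "I need to add the numbers.",
                "action": "calculator", "action_input": "2+3"}
    return {"thought": "I have the result.",
            "action": "finish", "action_input": trace[-1]["observation"]}

def react(question, policy, tools, max_steps=5):
    """Repeat Thought -> Action -> Observation until the policy finishes."""
    trace = []
    for _ in range(max_steps):
        step = policy(question, trace)
        if step["action"] == "finish":
            return step["action_input"], trace
        # Run the chosen tool and record what it returned.
        step["observation"] = tools[step["action"]](step["action_input"])
        trace.append(step)
    return None, trace  # gave up after max_steps

result, trace = react("What is 2+3?", scripted_policy, TOOLS)
```

The `max_steps` cap is a design choice for the sketch: without it, a policy that never emits `finish` would loop forever.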
- Notes on LLM Fine-Tuning
- ROUGE-2, BLEU and similar metrics are NOT good. Create your own benchmarks
- Non-PEFT fine tuning needs 6X GPU RAM. Optimizer states, Gradient, Activations are the overhead. PEFT is about tuning a subset of parameters.
- LoRA adds extra weights without updating the base model's weights. It’s a low-rank matrix multiplication. You can swap these adapters at runtime. Saves space. Fast to train
- Quantization: Stick to bitsandbytes or AWQ (may be a bit better)
- QLoRA = Quantization + LoRA
- Predibase has open-sourced LoRA adapters in “LoRA Land”. Existing adapters are pretty good.
- ghcr.io/predibase/lorax:main Docker image works on Docker compose to run locally.
- `devices:` on Docker Compose lets you specify NVIDIA GPU devices
- Locust is an HTTP load testing lib in Python
- Techniques for inference optimization
- Dynamic adapters: LoRAX loads the right LoRA adapter WHEN a request comes in
- Multi-adapter batching: Process all inputs in parallel on the same GPU, but different users are post-processed using different adapters
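A toy sketch of the LoRA and multi-adapter ideas above, in pure Python with tiny matrices. It is not how LoRAX works internally. The adapter names, the 2x2 sizes and the `forward` helper are assumptions. The point is that the frozen base weight `W` is shared, and each request adds its own low-rank update `B @ A`.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Element-wise sum of two same-shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

# Frozen base weight (2x2), shared by every request and never updated.
W = [[1.0, 0.0], [0.0, 1.0]]

# Hypothetical rank-1 adapters: B is 2x1 and A is 1x2, instead of a full 2x2 update.
ADAPTERS = {
    "user_a": {"B": [[1.0], [0.0]], "A": [[0.0, 2.0]]},
    "user_b": {"B": [[0.0], [1.0]], "A": [[3.0, 0.0]]},
}

def forward(x, adapter_id):
    """Compute W x + B (A x) for a column vector x using the requested adapter."""
    adapter = ADAPTERS[adapter_id]
    base = matmul(W, x)
    delta = matmul(adapter["B"], matmul(adapter["A"], x))
    return add(base, delta)

def lora_params(d, k, r):
    """Trainable parameters for a rank-r adapter on a d x k weight."""
    return r * (d + k)

# One batch, same base weights, a different adapter per request.
x = [[1.0], [1.0]]
batch = [(x, "user_a"), (x, "user_b")]
outputs = [forward(inp, adapter_id) for inp, adapter_id in batch]
```

For a 4096 x 4096 weight, a rank-8 adapter trains `lora_params(4096, 4096, 8)` = 65,536 parameters versus 16,777,216 for the full matrix, which is why adapters are cheap to store and swap.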
- Notes from a 4-hour flight:
- What We’ve Learned From A Year of Building with LLMs
- Strategy
- IS IT TOO HARD/EXPENSIVE? Log it. LLMs are getting cheaper and better.
- WILL OPENAI BUILD IT? If so, wait for it instead of building.
- HAS A STARTUP BUILT IT? If so, use it instead. It’s a generic use case; there’s no point re-inventing it.
- FOCUSED USE CASES over generic. Build trust by starting small.
- Tools for LLM Ops (feedback): LangSmith, Log10, LangFuse, W&B Weave, HoneyHive #TRY
- Human in the Loop is about humans evaluating model outputs. That’s different from AI in the loop, human in the center, where AI accelerates human output (like GitHub Copilot)
- Operations
- CHECK EMBEDDINGS DRIFT over time. Users might be entering different things than before.
- LOG AND REVIEW everything.
- Instructor coaxes structured output from LLM APIs. #TRY
- IMPLICIT FEEDBACK collection is easy. Just let users edit stuff. #TRY
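One simple way to check the embedding drift noted above: compare the centroid of a reference window of input embeddings with the centroid of a recent window. This is a sketch under assumptions, not a method from the article. The centroid comparison, the `drift` helper and the toy 2-dimensional vectors are illustrative, and any alert threshold would have to be tuned on real data.

```python
import math

def centroid(vectors):
    """Mean vector of a list of equal-length embeddings."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine_distance(a, b):
    """1 - cosine similarity; 0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def drift(reference, recent):
    """Distance between the centroids of two windows of embeddings."""
    return cosine_distance(centroid(reference), centroid(recent))

# Toy embeddings: last quarter's inputs vs this week's inputs.
reference = [[1.0, 0.0], [0.9, 0.1]]
recent = [[0.0, 1.0], [0.1, 0.9]]

no_drift = drift(reference, reference)
big_drift = drift(reference, recent)
```

Logging this number per week is enough to plot a trend. A centroid can hide drift that changes the spread but not the mean, so it is a first check rather than a complete one.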
- Tactical
- Try n-shot prompting (n=5-12) before bigger models. #TRY
- Always structure for output: Markdown, XML/HTML tags.
- Combine RAG with Keyword search. It reduces user frustration in edge cases.
- Prefer multiple small prompts to one big prompt. Do X. Then Y. Then Z.
- Jitter prompts for diversity beyond temperature.
- LLM-as-judge works better when comparing outputs (not rating 1 output). Keep length similar (LLMs prefer wordiness). Swap order and compare. Allow for ties. Ask for reason FIRST.
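The swap-and-compare advice for LLM-as-judge can be sketched like this. The `judge` callable is a hypothetical stand-in for an LLM call that returns "first", "second" or "tie". The two stub judges below only exist to show how swapping the order exposes position bias.

```python
def judge_pair(judge, question, a, b):
    """Ask twice with the order swapped; name a winner only if both runs agree."""
    forward = judge(question, a, b)   # a shown first
    backward = judge(question, b, a)  # b shown first
    if forward == "first" and backward == "second":
        return "a"
    if forward == "second" and backward == "first":
        return "b"
    return "tie"  # disagreement or explicit tie

def position_biased_judge(question, first, second):
    """Stub judge that always prefers whatever it sees first."""
    return "first"

def longer_wins_judge(question, first, second):
    """Stub judge that prefers the wordier answer (the bias the note warns about)."""
    if len(first) == len(second):
        return "tie"
    return "first" if len(first) > len(second) else "second"

biased_verdict = judge_pair(position_biased_judge, "q", "answer one", "answer two")
length_verdict = judge_pair(longer_wins_judge, "q", "short", "a much longer answer")
```

A position-biased judge collapses to a tie under the swap, while the length-biased one still picks the longer answer, which is why keeping lengths similar is a separate precaution. A real judge prompt would also ask for the reason before the verdict.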
- Hermes: A Text-to-SQL solution at Swiggy
- “Hermes performed significantly better for charters with well-defined metadata and a relatively smaller number of tables.”
- “We collect feedback on the accuracy of the returned query from stakeholders directly within the Slack bot.”
- How I use AI and “Replacing my right hand with AI”
- EMBED in every app/workflow. E.g. Auto-fix spellings. Auto-review code. Auto-ask LLM on errors and apply patch! Auto-search for answer, assess, continue.
- PERSIST. Stick with the LLM to the end. Don’t fix it yourself. It’s faster. #TRY
- INTERVENE FAST. If an LLM can’t solve it by itself in 2 tries, it needs in-depth help.
- APP-IFY one-off tasks. Disposable tools. “Write web-app to convert JSON to tab-delimited.” “Extract fields as a table.” “Diff JSON.” #TRY
- BEST language/frameworks preferred. CUDA in Python. Rust. C. Raspberry Pi. Arduino. Bluetooth. Modern ESM/JS. #TRY
- TEACH examples. “Here’s the LLM Foundry API.” “Here’s how to use gramex.data.”
- DUMP entire code. Models can handle it. Refactoring to SQLAlchemy 2, Pandas 2. API Documentation. Test case generation. #TRY
- ASK for features & packages. Docker without root access. GPU access inside docker. Windows CLI-only C++ compiler.
- TEST CASE writing. #TRY
- SPEC IN DETAIL. Use these libraries. Write like this: code example.
- SPEC USAGE in detail.
- “I will just pipe it into sqlite”, or “I will just run `ffmpeg -i filename [YOUR OPTIONS]`”.
- Describe the UI, API input/output, data structure, and internal data structure.
- HELP on usage. “ffmpeg to get audio.mp3”.
- My benchmark for large language models
- LLM(text) is a useful function to have in JS and Python too. Useful as a simple `pip install llmfoundry`
- Allow images, files in LLM()
- Current list of #IMPOSSIBLE (or hard) things for LLMs
- Translate technical documents to Dutch – because they don’t understand the technical terms well
- Translate large documents (JSON to XML, English to Chinese, Python to Rust, Wrong to right spelling) – because the output tokens are limited
- micro-agent generates test cases first when asked to build an app. Then it iterates until the test cases pass.
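A rough sketch of the tests-first loop described above. It is not micro-agent's actual code: `generate` is a hypothetical stand-in for the LLM that returns a candidate implementation, and the two slugify checks are made-up test cases.

```python
def build_until_green(generate, tests, max_iters=5):
    """Regenerate a candidate until every test passes or attempts run out."""
    feedback = None
    for attempt in range(1, max_iters + 1):
        candidate = generate(feedback)
        failures = [name for name, check in tests.items() if not check(candidate)]
        if not failures:
            return candidate, attempt
        feedback = failures  # failing test names go back to the generator
    return None, max_iters

# Tests are written first (hypothetical spec for a slugify function).
TESTS = {
    "lowercases": lambda fn: fn("AB") == "ab",
    "hyphenates": lambda fn: fn("a b") == "a-b",
}

def stub_generate(feedback):
    """Stand-in for the LLM: first draft misses a case, second draft fixes it."""
    if feedback is None:
        return lambda s: s.lower()
    return lambda s: s.lower().replace(" ", "-")

slugify, attempts = build_until_green(stub_generate, TESTS)
```

Passing only the failing test names back as `feedback` keeps the sketch small. A real loop would presumably include the failure output as well.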
- Alternative interfaces to YouTube: Piped.video, CloudTube, Invidious, NewPipe, FreeTube
- DeepSeek Context Caching reduces price to 1.4 cents/MTok for portions of chat messages that are repeated. That’s a 10X reduction for long conversations!
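A back-of-the-envelope check of the caching saving above. The 1.4 cents/MTok cached rate is from the note; the 14 cents/MTok uncached input rate is an assumption implied by the "10X". The conversation shape is hypothetical: a fixed repeated prefix plus some new tokens per turn, with the first turn always a cache miss.

```python
CACHED_CENTS_PER_MTOK = 1.4     # from the note
UNCACHED_CENTS_PER_MTOK = 14.0  # assumption: implied by the 10X reduction

def conversation_cost_cents(prefix_tokens, new_tokens, turns, cached):
    """Input cost when each turn resends a fixed prefix plus some new tokens."""
    total = 0.0
    for turn in range(turns):
        hit = cached and turn > 0  # the first turn has nothing cached yet
        prefix_rate = CACHED_CENTS_PER_MTOK if hit else UNCACHED_CENTS_PER_MTOK
        total += (prefix_tokens * prefix_rate
                  + new_tokens * UNCACHED_CENTS_PER_MTOK) / 1_000_000
    return total

# Hypothetical chat: 100k-token repeated prefix, 1k new tokens per turn, 10 turns.
without_cache = conversation_cost_cents(100_000, 1_000, 10, cached=False)
with_cache = conversation_cost_cents(100_000, 1_000, 10, cached=True)
```

In this example the bill drops from about 14.1 cents to about 2.8 cents, roughly 5X. The saving approaches the full 10X only as the repeated prefix dominates and the number of turns grows.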