How fast are LLMs in production?

At Straive, we use an LLM Router. Since ChatGPT and similar tools are blocked for most people, this is the main way to access LLMs. One thing we measure is model speed, i.e. output tokens per second. Fast models deliver a much smoother experience for users. This methodology differs from ArtificialAnalysis.ai's. I’m not looking purely at generation time but at the total time (including making the connection and the initial wait) for all successful requests. So, if a provider is having a slow day or is slowing down responses, these numbers will differ. ...
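To make the difference between the two measurements concrete, here is a minimal sketch that computes both from a request log. The `Request` record and its fields are a hypothetical log schema, not the router's actual format. Pooling all tokens over all seconds (rather than averaging per-request rates) is also an assumption; the post does not say how requests are aggregated.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """One logged LLM call (hypothetical schema; timestamps in seconds)."""
    sent_at: float         # when the client started the request (before connecting)
    first_token_at: float  # when the first output token arrived
    done_at: float         # when the last output token arrived
    output_tokens: int
    ok: bool               # True if the request succeeded


def total_time_tps(requests: list[Request]) -> float:
    """Tokens per second over the full request: connection + initial wait + generation."""
    done = [r for r in requests if r.ok]  # only successful requests count
    tokens = sum(r.output_tokens for r in done)
    seconds = sum(r.done_at - r.sent_at for r in done)
    return tokens / seconds if seconds > 0 else 0.0


def generation_only_tps(requests: list[Request]) -> float:
    """Tokens per second counting only the time after the first token arrives."""
    done = [r for r in requests if r.ok]
    tokens = sum(r.output_tokens for r in done)
    seconds = sum(r.done_at - r.first_token_at for r in done)
    return tokens / seconds if seconds > 0 else 0.0


# Usage: two successful requests with a 1-second wait each, plus one failure.
log = [
    Request(sent_at=0.0, first_token_at=1.0, done_at=5.0, output_tokens=200, ok=True),
    Request(sent_at=0.0, first_token_at=1.0, done_at=5.0, output_tokens=100, ok=True),
    Request(sent_at=0.0, first_token_at=0.0, done_at=30.0, output_tokens=0, ok=False),
]
print(total_time_tps(log))       # 300 tokens / 10 s
print(generation_only_tps(log))  # 300 tokens / 8 s
```

Because the total-time figure includes the wait before the first token, it is always lower than or equal to the generation-only figure. A slow connection or a long queue at the provider shows up in it directly.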

Things I Learned - 01 Sep 2024

This week, I learned:
- Generative AI models are now good enough to simulate Doom in real time (GameNGen).
- Val.town’s code generation system prompt uses https://maxm-imggenurl.web.val.run/the-description-of-your-image to generate images dynamically.
- Practice for each thought: “What would make me change my mind? How likely is that?”
- Cursor uses speculative edits and a variety of other techniques to speed up code editing.
- ChatGPT does a better job at cartoon generation than even Flux.1.