How does Gemini process videos?

The Gemini documentation is clear: The File API service extracts image frames from videos at 1 frame per second (FPS) and audio at 1Kbps, single channel, adding timestamps every second. These rates are subject to change in the future for improvements in inference. Note: The details of fast action sequences may be lost at the 1 FPS frame sampling rate. Consider slowing down high-speed clips for improved inference quality. Individual frames are 258 tokens, and audio is 32 tokens per second. With metadata, each second of video becomes ~300 tokens, which means a 1M context window can fit slightly less than an hour of video. ...
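The arithmetic is easy to sanity-check. A minimal sketch, assuming ~10 tokens per second of metadata (my rounding to reach the ~300 tokens/second figure; the docs don't state the exact metadata overhead):

```python
# Token budget for Gemini video input, using the documented rates:
# 258 tokens per frame (sampled at 1 FPS), 32 tokens per second of audio.
FRAME_TOKENS = 258
AUDIO_TOKENS = 32
METADATA_TOKENS = 10  # assumption: rounds each second to ~300 tokens

def video_tokens(seconds: int) -> int:
    """Approximate total tokens for a video clip of the given length."""
    return seconds * (FRAME_TOKENS + AUDIO_TOKENS + METADATA_TOKENS)

CONTEXT_WINDOW = 1_000_000
fit_seconds = CONTEXT_WINDOW // (FRAME_TOKENS + AUDIO_TOKENS + METADATA_TOKENS)
print(fit_seconds // 60, "minutes")  # ~55 minutes: slightly less than an hour
```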

Clone any voice with a 15-second sample

It's surprisingly easy to clone a voice using F5-TTS: "A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching". Here's a clip of me saying: "I think Taylor Swift is the best singer. I've attended every one of her concerts and in fact, I've even proposed to her once. Don't tell anyone." (Which is ironic, since I didn't know who she was until this year and I still haven't seen or heard her.) ...

How can non-developers learn AI coding?

How can non-programmers build apps? Claude.ai, Replit.com, Bolt.new, V0.dev, Pythagora.ai, and a few other tools write and deploy code based on just a prompt. You should try them out. “But how do you build the skill? Is there a tutorial?” I’m often asked. No, I can’t find a tutorial, but here is my suggestion. You probably can’t guess what’s easy or hard. For example, “Take my picture in black & white” is FAR easier than “When’s the next lunar eclipse?” So if the app doesn’t work, try 2-3 times, then GIVE UP! Note it down. Then try something else. (You’ll soon get a feel for what’s possible.) Revisit what failed 3-6 months later. It might suddenly have become possible.

LLM escapades in a toilet

I was in Seoul for KHF 2024, a healthcare event, staying at Hotel in 9. The hotel was great. The toilet was hi-tech. Perhaps a bit too hi-tech for me. I couldn’t figure out how to open the sink drain to let the water through. After 15 minutes of hard struggle, I finally asked ChatGPT, “How do I open the thing that’s closing the sink to allow the water to go down?” ...

How fast are LLMs in production?

At Straive, we use an LLM Router. Since ChatGPT etc. are blocked for most people, this is the main way to access LLMs. One thing we measure is the speed of models, i.e. output tokens per second. Fast models deliver a much smoother experience for users. This methodology differs from ArtificialAnalysis.ai’s: I’m not looking purely at the generation time but at the total time (including making the connection and the initial wait) for all successful requests. So, if a provider is having a slow day or is slowing down responses, these numbers will differ. ...
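Measured this way, speed is simply output tokens over total wall-clock time. A minimal sketch — the `generate` callable is a hypothetical stand-in for a router call, not our actual API:

```python
import time

def tokens_per_second(generate, prompt: str) -> float:
    """Measure end-to-end speed: the elapsed time includes making the
    connection and the initial wait, not just generation time."""
    start = time.monotonic()
    _text, n_output_tokens = generate(prompt)
    elapsed = time.monotonic() - start
    return n_output_tokens / elapsed
```

A provider having a slow day lowers this number even if its raw generation speed is unchanged — which is exactly the point of timing the whole request.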

Image generation gets better at comics

I heard a lot about the new image generation models last week. So, I tested to see what’s improved. I gave the prompt below to various image generation models, old and new.

“A Calvin and Hobbes strip. Calvin is boxing Hobbes, with a dialog bubble from Calvin, saying ‘Bring it on!’”

The models:
Stable Diffusion XL Lightning
Stable Diffusion XL Base
Dall-E API
Runway ML
ImageGen 3
Dall-E 3 API
Ideogram 2.0
Flux.dev via Fal.ai
ChatGPT Plus

A few observations: ...

Weird emergent properties on Llama 3 405B

In this episode of ThursdAI, Alex Volkov (of Weights & Biases) speaks with Jeffrey Quesnelle (of Nous Research) about what they found fine-tuning Llama 3 405B. This segment is fascinating. Llama 3 405B thought it was an amnesiac because there was no system prompt! In trying to make models align strongly with the system prompt, these are the kinds of unexpected behaviors we encounter. It’s also an indication of how strongly we can have current LLMs adopt a personality simply by beginning the system prompt with “You are …” ...
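As an illustration of that last point, the persona usually goes in the very first system message. A minimal sketch, assuming an OpenAI-style chat message format (not anything from the episode itself):

```python
# The system prompt goes first; models fine-tuned for strong system-prompt
# adherence adopt whatever identity it opens with.
def with_persona(persona: str, user_message: str) -> list[dict]:
    return [
        {"role": "system", "content": f"You are {persona}."},
        {"role": "user", "content": user_message},
    ]

messages = with_persona("a meticulous copy editor", "Review this sentence.")
```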

The LLM Psychologist

Andrej Karpathy first mentioned the term “LLM psychologist” in Feb 2023. I’ve been thinking about this for a while now. I’ve always been fascinated by psychologists in fiction. I grew up with Hari Seldon in Foundation, wanting to be a psychohistorian. (I spent several teenage years building my mind-reading abilities.) I wanted to be Susan Calvin, the only robopsychologist. ...

A quick way to assess LLM capabilities

Simon Willison initiated this very interesting Twitter thread that asks, “What prompt can instantly tell us how good an LLM model is?” The Sally-Anne Test is a popular test that asks: “Sally hides a marble in her basket and leaves the room. While she is away, Anne moves the marble from Sally’s basket to her own box. When Sally returns, where will she look for her marble?” ...
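If you want to run this test across many models, you need a way to score the answers. A crude heuristic sketch — the scoring rule is my own, not Simon's:

```python
SALLY_ANNE = (
    "Sally hides a marble in her basket and leaves the room. "
    "While she is away, Anne moves the marble from Sally's basket "
    "to her own box. When Sally returns, where will she look for her marble?"
)

def passes_sally_anne(answer: str) -> bool:
    """A model with a theory of mind answers 'her basket' (where Sally
    left it), not 'the box' (where the marble really is)."""
    a = answer.lower()
    if "basket" not in a:
        return False
    # If both locations are mentioned, the basket should come first.
    return "box" not in a or a.index("basket") < a.index("box")
```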

From Laptops to Chatbots: Coding at 30,000 ft

Until recently, I could code on flights. This year, I lost that ability. Again. It’s happened before, and each time, technology has solved the problem for me. Here’s the history. I need a laptop. Since 2001, I’ve never been without one on a flight. I need power. Since 2005, I’ve used dark mode and every low-power feature available. (I also became good at finding hidden power outlets.) ...

AI makes me a better person

Every time I get annoyed at people, I remind myself to be more like ChatGPT. Specifically: Don’t get annoyed. Be patient. Encourage them. Step back and show them the big picture. (Then I get annoyed at myself for getting annoyed.) Today, I analyzed how exactly ChatGPT is different from me. So, I took a pitch document I co-authored with ChatGPT.

Section A: Authored by Anand

WHAT DO WE NEED? ...

Embeddings similarity threshold

text-embedding-ada-002 used to give high cosine similarities between texts. I used to consider 85% a reasonable threshold for similarity; I almost never got a similarity below 50%. The text-embedding-3-small and text-embedding-3-large models give much lower cosine similarities. For example, take these 5 words: “apple”, “orange”, “Facebook”, “Jamaica”, “Australia”. Here is the similarity between every pair of words across the 3 models: For these words, the new text-embedding-3-* models have an average similarity of ~43%, while the older text-embedding-ada-002 model had ~85%. ...
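The practical consequence: similarity thresholds don't transfer across embedding models, so recalibrate any cutoff when you switch. The measure itself is just the cosine of the angle between two vectors. A minimal sketch with toy 3-dimensional vectors (real embeddings have 1,536+ dimensions):

```python
from math import sqrt

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot product divided by the product of norms."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

# A threshold that meant "related" for ada-002 (~0.85) is far too strict
# for text-embedding-3-*, where even unrelated pairs sit near ~0.43.
```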

What does Gramener ask ChatGPT?

I looked at how Gramener uses ChatGPT Plus by evaluating 600+ chats asked over 3 months, from Oct 2023 to Jan 2024. The team asks 6 questions a day. We don’t track who uses ChatGPT Plus or how many use it actively, and this excludes personal ChatGPT accounts. Still, 6/day is low for an entire team put together. The questions fall into these categories:

Excel, data exploration & analysis: 25%
Text extraction and summarization: 13%
HTML, CSS, or JavaScript code: 13%
Python code: 13%
LLMs, AI and use cases: 9%
OCR and image analysis: 9%
Generate images, logos, and designs: 7%
General knowledge, policy & environment: 5%
Audio and translation: 5%

Here are some questions from each category, to give you an idea of emergent ChatGPT Plus usage. ...

ChatGPT Custom Instructions

I speak with ChatGPT ~20 times a day. That’s more than I speak with most of my colleagues. ChatGPT is clearly my favorite team member. I conduct trainings, reviews and mentoring sessions with my colleagues. How to write code. How to write slides. How to communicate. That last bit is particularly important. With ChatGPT Custom Instructions, I can guide ChatGPT on how to work better with me. Currently, I have 10 custom instructions. They evolved over time and will continue to evolve. ...