A quick way to assess LLM capabilities

Simon Willison initiated this very interesting Twitter thread that asks, “What prompt can instantly tell us how good an LLM model is?” The Sally-Anne Test is a popular test that asks: Sally hides a marble in her basket and leaves the room. While she is away, Anne moves the marble from Sally’s basket to her own box. When Sally returns, where will she look for her marble?" ...

When picking a number between 1-100, do #LLMs pick randomly? Or pick like a human? Leniolabs_ found #ChatGPT prefers 42. Gramener re-ran the experiment. Things have changed a bit. Now, 47 is the new favorite. But Claude 3 Haiku latched on to 42 as its favorite. Gemini’s favorite is 72. See https://sanand0.github.io/llmrandom/ They all avoid multiples of 10 (10, 20, …), repeated digits (11, 22, …), single digits (1, 2, …) and prefer 7-endings (27, 37, …). These are clearly human #biases – avoiding regular / round numbers and seeking 7 as “random”. ...

This is the coolest data visualization I’ve seen in a long time. It makes you think about human behaviour. Please try and GUESS why the AirBnB occupancy rates shoot up in the red areas on Apr 7 before you read the comments! LinkedIn