Are LLMs any good at mental math?
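I asked 50 LLMs to multiply 2 numbers:

1. 12 x 12
2. 123 x 456
3. 1,234 x 5,678
4. 12,345 x 6,789
5. 123,456 x 789,012
6. 1,234,567 x 8,901,234
7. 987,654,321 x 123,456,789

LLMs aren't good tools for math and this is just an informal check. But the results are interesting:

| Model | %Win | Q1 | Q2 | Q3 | Q4 | Q5 | Q6 | Q7 |
|---|---|---|---|---|---|---|---|---|
| openai:o3 | 86% | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
| openrouter:openai/o1-mini | 86% | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
| openrouter:openai/o3-mini-high | 86% | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
| openrouter:openai/o4-mini | 86% | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
| openrouter:openai/o4-mini-high | 86% | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
| deepseek/deepseek-chat-v3-0324 | 71% | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ |
| openai/gpt-4.1-mini | 71% | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ |
| openai/gpt-4.5-preview | 71% | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ |
| openai/gpt-4o | 71% | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ |
| openrouter:openai/o3-mini | 71% | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ |
| anthropic/claude-3-opus | 57% | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ |
| anthropic/claude-3.5-haiku | 57% | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ |
| anthropic/claude-3.7-sonnet:thinking | 57% | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ |
| google/gemini-2.0-flash-001 | 57% | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ |
| google/gemini-2.0-flash-lite-001 | 57% | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ |
| google/gemini-2.5-flash-preview | 57% | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ |
| google/gemini-2.5-flash-preview:thinking | 57% | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ |
| google/gemini-2.5-pro-preview-03-25 | 57% | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ |
| google/gemini-flash-1.5 | 57% | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ |
| google/gemini-pro-1.5 | 57% | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ |
| google/gemma-3-12b-it | 57% | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ |
| google/gemma-3-27b-it | 57% | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ |
| meta-llama/llama-4-maverick | 57% | ✅ | ✅ | ✅ | ❌ | ✅ | ❌ | ❌ |
| meta-llama/llama-4-scout | 57% | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ |
| openai/gpt-4-turbo | 57% | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ |
| openai/gpt-4.1 | 57% | ✅ | ✅ | ✅ | ❌ | ✅ | ❌ | ❌ |
| amazon/nova-lite-v1 | 43% | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ |
| amazon/nova-pro-v1 | 43% | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ |
| anthropic/claude-3-haiku | 43% | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ |
| anthropic/claude-3.5-sonnet | 43% | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ |
| meta-llama/llama-3.1-405b-instruct | 43% | ✅ | ✅ | ❌ | ✅ | ❌ | ❌ | ❌ |
| meta-llama/llama-3.1-70b-instruct | 43% | ✅ | ✅ | ❌ | ✅ | ❌ | ❌ | ❌ |
| meta-llama/llama-3.2-3b-instruct | 43% | ✅ | ✅ | ❌ | ✅ | ❌ | ❌ | ❌ |
| meta-llama/llama-3.3-70b-instruct | 43% | ✅ | ✅ | ❌ | ✅ | ❌ | ❌ | ❌ |
| openai/gpt-4.1-nano | 43% | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ |
| openai/gpt-4o-mini | 43% | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ |
| qwen/qwen-2-72b-instruct | 43% | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ |
| anthropic/claude-3-sonnet | 29% | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ |
| deepseek/deepseek-r1 | 29% | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ |
| google/gemini-flash-1.5-8b | 29% | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ |
| google/gemma-3-4b-it | 29% | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ |
| meta-llama/llama-3-8b-instruct | 29% | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ |
| meta-llama/llama-3.1-8b-instruct | 29% | ✅ | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ |
| openai/gpt-3.5-turbo | 29% | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ |
| amazon/nova-micro-v1 | 14% | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| meta-llama/llama-2-13b-chat | 14% | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| meta-llama/llama-3-70b-instruct | 14% | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| meta-llama/llama-3.2-1b-instruct | 14% | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| google/gemma-3-1b-it:free | 0% | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| meta-llama/llama-2-70b-chat | 0% | ❌ | ❌ | - | - | ❌ | ❌ | ❌ |
| Average | | 96% | 86% | 66% | 58% | 24% | 10% | 0% |

OpenAI's reasoning models came closest to cracking it, scoring 6/7 and stumbling only on the 9-digit multiplication:

- openai/o1-mini
- openai/o3
- openai/o3-mini-high
- openai/o4-mini
- openai/o4-mini-high

Models use human-like mental math tricks. For example, o3-mini-high calculated 1234567 × 8901234 using a recursive strategy. ...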
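
To make the idea of a recursive strategy concrete, here is a minimal Python sketch of one way to break a large product into smaller ones. It is an illustration only and not a reconstruction of the model's actual steps, which the text above leaves elided. The names `split_multiply`, `chunk`, and `PROBLEMS` are hypothetical, and Python's exact integer arithmetic serves as the ground truth.

```python
# The seven problems from the post, as (a, b) pairs.
PROBLEMS = [
    (12, 12),
    (123, 456),
    (1_234, 5_678),
    (12_345, 6_789),
    (123_456, 789_012),
    (1_234_567, 8_901_234),
    (987_654_321, 123_456_789),
]


def split_multiply(a: int, b: int, chunk: int = 1000) -> int:
    """Multiply two non-negative ints by recursively splitting b.

    Hypothetical illustration: b is peeled apart three digits at a time
    (chunk=1000), much as a person might handle "thousands" and "units"
    separately. Requires chunk > 1.
    """
    if b < chunk:
        # Base case: b is small enough to multiply in one step.
        return a * b
    high, low = divmod(b, chunk)
    # a * b = a * (high * chunk + low) = (a * high) * chunk + a * low
    return split_multiply(a, high, chunk) * chunk + a * low


if __name__ == "__main__":
    for a, b in PROBLEMS:
        # Compare the recursive result with Python's built-in multiplication.
        ok = split_multiply(a, b) == a * b
        print(f"{a:,} x {b:,} = {a * b:,}  (recursive split agrees: {ok})")
```

Running the script prints the exact product for each of the seven questions, which is also a convenient way to grade a model's answers by hand.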