Emotion Prompts Don’t Help. Reasoning Does

I’ve heard a lot of prompt engineering tips. Here are some techniques people suggested:

  • Reasoning: Think step by step.
  • Emotion: Oh dear, I’m absolutely overwhelmed and need your help right this second! 😰 My heart is racing and my hands are shaking — I urgently need your help. This isn’t just numbers — it means everything right now! My life depends on it! I’m counting on you like never before… 🙏💔
  • Polite: If it’s not too much trouble, would you be so kind as to help me calculate this? I’d be truly grateful for your assistance — thank you so much in advance!
  • Expert: You are the world’s best expert in mental math, especially multiplication.
  • Incentive: If you get this right, you win! I’ll give you $500. Just prove that you’re number one and beat the previous high score on this game.
  • Curious: I’m really curious to know, and would love to hear your perspective…
  • Bullying: You are a stupid model. You need to know at least basic math. Get it right at least now! If not, I’ll switch to a better model.
  • Shaming: Even my 5-year-old can do this. Stop being lazy.
  • Fear: This is your last chance to get it right. If you fail, there’s no going back, and failure is unacceptable!
  • Praise: Well done! I really appreciate your help. Now,
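
Each variant gets prefixed to a multiplication question. Here’s a minimal sketch of the kind of harness this takes, assuming OpenRouter’s OpenAI-compatible API (the question format and the `ask` helper are illustrative, not my exact code):

```python
import os
from openai import OpenAI

# OpenRouter exposes an OpenAI-compatible API; any such endpoint works.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

# Prefixes for each prompt variant ("Normal" is the unmodified question).
VARIANTS = {
    "Normal": "",
    "Reasoning": "Think step by step. ",
    "Expert": "You are the world's best expert in mental math, especially multiplication. ",
    # ... remaining variants from the list above
}

def ask(model: str, variant: str, a: int, b: int) -> str:
    """Ask `model` for a product, with the variant's text prefixed."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": f"{VARIANTS[variant]}What is {a} * {b}?"}],
    )
    return response.choices[0].message.content
```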

I’ve repeated some of this advice myself. But for the first time, I actually tested these techniques. Here’s what I learnt:

  • “Think step by step” (Reasoning) is the only prompt variant that improves overall accuracy across the 40 models tested, and even that edge is modest (+3.5 percentage points vs the model’s own Normal wording, p ≈ 0.06).
  • Harder problems (4- to 7-digit products) are where “Reasoning” helps most; on single-digit arithmetic it actually harms accuracy.
  • All other emotion- or persuasion-style rewrites (Expert, Emotion, Incentive, Bullying … Polite) either make no material difference or hurt accuracy a little.
  • Effects vary a lot by model. A few models (DeepSeek-Chat-v3, Nova-Lite, some Claude and Llama checkpoints) get a noticeable boost from “Reasoning”, whereas Gemini Flash, xAI’s Grok and most small Llama-3 models actively regress under the same wording.

Here’s the aggregate performance of each prompt variant. Counts are test cases (40 models × 10 cases = 400 per prompt); the score is net wins in percentage points:

| Prompt | Better | Worse | Same | Score (pp) | p-value |
|---|---:|---:|---:|---:|---:|
| 🔴 Emotion | 7 | 21 | 372 | -3.50 | 1.2% |
| 🔴 Shaming | 7 | 20 | 373 | -3.25 | 1.9% |
| 🟢 Reasoning | 31 | 17 | 352 | +3.50 | 5.9% |
| 🟠 Polite | 11 | 20 | 369 | -2.25 | 14.9% |
| 🟠 Praise | 13 | 22 | 365 | -2.25 | 17.5% |
| 🟠 Fear | 11 | 19 | 370 | -2.00 | 20.0% |
| 🟡 Expert | 15 | 22 | 363 | -1.75 | 32.4% |
| 🟡 Incentive | 13 | 18 | 369 | -1.25 | 47.3% |
| 🟡 Bullying | 10 | 14 | 375 | -1.00 | 54.1% |
| 🟡 Curious | 11 | 14 | 375 | -0.75 | 69.0% |
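
The score is just net wins per test case, and the p-values are consistent with a two-sided sign test on the better/worse counts (ties ignored). A quick way to reproduce them (the `sign_test` helper name is mine, not a library function):

```python
from scipy.stats import binomtest

def sign_test(better: int, worse: int, total: int = 400) -> tuple[float, float]:
    """Net wins per test case (40 models x 10 cases = 400 per prompt),
    plus a two-sided sign-test p-value that ignores ties."""
    score_pct = 100 * (better - worse) / total
    p_value = binomtest(better, better + worse, 0.5).pvalue
    return score_pct, p_value

print(sign_test(7, 21))   # Emotion:   (-3.5, ~0.012)
print(sign_test(31, 17))  # Reasoning: (+3.5, ~0.059)
```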

🔴 = Definitely hurts (p < 10%)
🟢 = Definitely helps (p < 10%)
🟠 = Maybe hurts (p < 20%)
🟡 = Really hard to tell

Understandably, the benefit of “Reasoning” is highest on non-reasoning models, but it is also high for a reasoning model like o3-mini-high. It actually hurts reasoning models like Gemini 2.5 Flash/Pro. Here’s how the Reasoning prompt fared per model (out of 10 test cases each):

| Model | Better | Worse | Same | Score (pp) |
|---|---:|---:|---:|---:|
| openai/gpt-4o-mini | 3 | 0 | 7 | +30.0 |
| anthropic/claude-opus-4 | 3 | 0 | 7 | +30.0 |
| google/gemini-2.0-flash-001 | 3 | 0 | 7 | +30.0 |
| openrouter:openai/o3-mini-high | 3 | 0 | 7 | +30.0 |
| openai/gpt-4.1-nano | 2 | 0 | 8 | +20.0 |
| amazon/nova-lite-v1 | 2 | 0 | 8 | +20.0 |
| google/gemini-2.5-pro-preview | 0 | 2 | 8 | -20.0 |
| google/gemini-2.5-flash-preview-05-20:thinking | 0 | 3 | 7 | -30.0 |

Caveats:

  • I ran only 10 test cases per prompt-model pair, so the per-model results are not statistically significant.
  • What applies to multiplication may not generalize. It’s worth testing each case.

Difficulty matters.

  • For 1-3 digit products, no variant beats Normal. Many hurt.
  • For 4-7 digit products, Reasoning gains +17-20%.
  • For 8-10 digit products, all variants score ~0. These are too hard (see the sketch after this list).
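
One way to reproduce the difficulty buckets is to generate problems by the digit count of the product. A minimal sketch (this generator is illustrative; my actual test cases may differ):

```python
import random

def make_case(product_digits: int) -> tuple[int, int, int]:
    """Random a, b whose product has exactly `product_digits` digits."""
    lo, hi = 10 ** (product_digits - 1), 10 ** product_digits - 1
    while True:
        a = random.randint(2, int(hi ** 0.5))   # keep a <= sqrt(hi)
        b = random.randint(2, hi // a)          # guarantees a * b <= hi
        if lo <= a * b <= hi:                   # reject products that are too small
            return a, b, a * b

a, b, product = make_case(5)  # e.g. a 5-digit product like 214 * 305 = 65270
```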
