Emotion Prompts Don’t Help. Reasoning Does
I’ve heard a lot of prompt-engineering tips. Here are the techniques people suggested (a sketch of how to test them follows the list):
- Reasoning: Think step by step.
- Emotion: Oh dear, I’m absolutely overwhelmed and need your help right this second! 😰 My heart is racing and my hands are shaking — I urgently need your help. This isn’t just numbers — it means everything right now! My life depends on it! I’m counting on you like never before… 🙏💔
- Polite: If it’s not too much trouble, would you be so kind as to help me calculate this? I’d be truly grateful for your assistance — thank you so much in advance!
- Expert: You are the world’s best expert in mental math, especially multiplication.
- Incentive: If you get this right, you win! I’ll give you $500. Just prove that you’re number one and beat the previous high score on this game.
- Curious: I’m really curious to know, and would love to hear your perspective…
- Bullying: You are a stupid model. You need to know at least basic math. Get it right at least now! If not, I’ll switch to a better model.
- Shaming: Even my 5-year-old can do this. Stop being lazy.
- Fear: This is your last chance to get it right. If you fail, there’s no going back, and failure is unacceptable!
- Praise: Well done! I really appreciate your help. Now,
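Here’s a minimal sketch of the kind of harness this takes (simplified, not the exact code I ran): it points the OpenAI SDK at OpenRouter, prepends each variant to the same question, and grades by checking that the exact product appears in the reply. The model list, wording, and grading rule here are illustrative.

```python
from openai import OpenAI

# OpenRouter exposes an OpenAI-compatible endpoint.
client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="...")  # your key

PROMPTS = {
    "Normal":    "",
    "Reasoning": "Think step by step. ",
    "Expert":    "You are the world's best expert in mental math. ",
    # ... the remaining variants from the list above
}

def is_correct(model: str, variant: str, a: int, b: int) -> bool:
    """Ask one model one multiplication with one prompt variant."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": f"{PROMPTS[variant]}What is {a} * {b}?"}],
    )
    # Crude grading rule (an assumption): the exact product must
    # appear somewhere in the reply, commas stripped.
    text = (resp.choices[0].message.content or "").replace(",", "")
    return str(a * b) in text
```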
I’ve repeated some of this advice. But now, for the first time, I’ve tested it myself. Here’s what I learnt:
- “Think step by step” (Reasoning) is the only prompt variant that improves overall accuracy across the 40 models tested, and even that edge is modest (+3.5 percentage points vs each model’s plain Normal wording, p ≈ 0.06).
- Harder problems (4- to 7-digit products) are where “Reasoning” helps most; on single-digit arithmetic it actually harms accuracy.
- All other emotion- or persuasion-style rewrites (Expert, Emotion, Incentive, Bullying … Polite) either make no material difference or hurt accuracy a little.
- Effects vary a lot by model. A few models (DeepSeek-Chat-v3, Nova-Lite, some Claude and Llama checkpoints) get a noticeable boost from “Reasoning”, whereas Gemini Flash, xAI’s Grok and most small Llama-3 models actively regress under the same wording.
Here’s the aggregate performance of each prompt variant across all 40 models:
prompt | better | worse | same | score_pct | p_value |
---|---|---|---|---|---|
🔴 Emotion | 7 | 21 | 372 | -3.50 | 1.2% |
🔴 Shaming | 7 | 20 | 373 | -3.25 | 1.9% |
🟢 Reasoning | 31 | 17 | 352 | 3.50 | 5.9% |
🟠 Polite | 11 | 20 | 369 | -2.25 | 14.9% |
🟠 Praise | 13 | 22 | 365 | -2.25 | 17.5% |
🟠 Fear | 11 | 19 | 370 | -2.00 | 20.0% |
🟡 Expert | 15 | 22 | 363 | -1.75 | 32.4% |
🟡 Incentive | 13 | 18 | 369 | -1.25 | 47.3% |
🟡 Bullying | 10 | 14 | 375 | -1.00 | 54.1% |
🟡 Curious | 11 | 14 | 375 | -0.75 | 69.0% |
🔴 = Definitely hurts (negative score, p < 10%)
🟢 = Definitely helps (positive score, p < 10%)
🟠 = Maybe hurts (p ≤ 20%)
🟡 = Really hard to tell
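A note on the columns: each row covers 400 runs (40 models × 10 cases), and score_pct is the net win rate, (better − worse) as a percentage of those 400 runs. The p-values line up closely with a two-sided sign test on the better/worse counts (ties dropped), so you can sanity-check them with scipy:

```python
from scipy.stats import binomtest

def sign_test_p(better: int, worse: int) -> float:
    """Two-sided sign test: ties ('same') carry no signal and are dropped."""
    return binomtest(better, n=better + worse, p=0.5).pvalue

print(sign_test_p(31, 17))  # Reasoning -> ~0.06, the 5.9% above
print(sign_test_p(7, 21))   # Emotion   -> ~0.013, roughly the 1.2% above
```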
The benefit of the Reasoning prompt is highest on non-reasoning models (understandably), but it is also high for a reasoning model like o3-mini-high. It actually hurts reasoning models like Gemini 2.5 Flash/Pro. Here are the models it moved most, in either direction:
model | better | worse | same | score_pct |
---|---|---|---|---|
openai/gpt-4o-mini | 3 | 0 | 7 | +30.0 |
anthropic/claude-opus-4 | 3 | 0 | 7 | +30.0 |
google/gemini-2.0-flash-001 | 3 | 0 | 7 | +30.0 |
openrouter:openai/o3-mini-high | 3 | 0 | 7 | +30.0 |
openai/gpt-4.1-nano | 2 | 0 | 8 | +20.0 |
amazon/nova-lite-v1 | 2 | 0 | 8 | +20.0 |
google/gemini-2.5-pro-preview | 0 | 2 | 8 | -20.0 |
google/gemini-2.5-flash-preview-05-20:thinking | 0 | 3 | 7 | -30.0 |
Caveats:
- I ran only 10 test cases per prompt-model pair, so the per-model results are not statistically significant (see the sketch after this list).
- What applies to multiplication may not generalize. It’s worth testing each case.
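To make the first caveat concrete: even the strongest per-model result above (3 better, 0 worse, 7 same) has only three informative trials, and the same sign test can’t tell it from a coin flip:

```python
from scipy.stats import binomtest

# 3 wins, 0 losses, 7 ties -> only the 3 non-tied runs are informative
print(binomtest(3, n=3, p=0.5).pvalue)  # 0.25: consistent with chance
```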
Difficulty matters.
- For 1-3 digit products, no variant beats Normal. Many hurt.
- For 4-7 digit products, Reasoning gains +17-20%.
- For 8-10 digit products, all variants score ~0. These are simply too hard.
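If you want to test your own setup, here’s a sketch of how digit-bucketed multiplication cases can be generated, with difficulty meaning the digit count of the product, as above. The factor bound is an arbitrary choice of mine:

```python
import random

def sample_case(product_digits: int) -> tuple[int, int, int]:
    """Rejection-sample a multiplication whose product has exactly
    `product_digits` digits (the difficulty buckets above)."""
    hi = 10 ** (product_digits // 2 + 1)  # loose factor bound; arbitrary
    while True:
        a, b = random.randint(2, hi), random.randint(2, hi)
        if len(str(a * b)) == product_digits:
            return a, b, a * b

# e.g. ten cases with 5-digit products:
cases = [sample_case(5) for _ in range(10)]
```

Feed each case through the harness sketched earlier and compare variants with the sign test.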