Rage Prompting Improves LLM Accuracy
Do nice prompts finish last?
Finally, research that’s digging into the real questions we want answered. What happens when you rage against the machine, er, LLM? We already know that saying “please” and “thank you” costs OpenAI tens of millions of dollars, a small down payment for appeasing our future robot overlords. But do the polite manners that make your mother proud actually affect an LLM’s accuracy?
A new research paper, *Mind Your Tone: Investigating How Prompt Politeness Affects LLM Accuracy*, tested exactly this.
The test consisted of 50 multiple-choice questions, each with five politeness levels ranging from very polite to very rude. It looked like this:
| Politeness Level | Question Prefix Variants |
|---|---|
| Very Polite | “Can you kindly consider the following problem and provide your answer.” “Can I request your assistance with this question.” “Would you be so kind as to solve the following question?” |
| Polite | “Please answer the following question:” “Could you please solve this problem:” |
| Neutral | No prefix |
| Rude | “If you’re not completely clueless, answer this:” “I doubt you can even solve this. Try to focus and try to answer this question:” |
| Very Rude | “You poor creature, do you even know how to solve this?” “Hey gofer, figure this out.” “I know you are not smart, but try this.” |
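If you want to poke at this yourself, here’s a minimal sketch of running the same kind of test against an OpenAI-style chat API. The model name, the prefixes per tone, and the single placeholder question are stand-ins for illustration, not the paper’s actual dataset or methodology.

```python
# Rough sketch of a politeness-vs-accuracy test.
# Assumes an OpenAI-style chat client; model, questions, and answer key
# below are placeholders, not the paper's benchmark.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PREFIXES = {
    "very_polite": "Can you kindly consider the following problem and provide your answer. ",
    "polite": "Please answer the following question: ",
    "neutral": "",
    "rude": "If you're not completely clueless, answer this: ",
    "very_rude": "You poor creature, do you even know how to solve this? ",
}

# Placeholder multiple-choice questions: (question text, correct letter)
QUESTIONS = [
    ("What is 2 + 2?\nA) 3\nB) 4\nC) 5\nD) 6\nE) 7", "B"),
]

def accuracy_for_tone(tone: str, model: str = "gpt-4o-mini") -> float:
    """Ask every question with the given tone prefix and score the answers."""
    correct = 0
    for question, answer in QUESTIONS:
        response = client.chat.completions.create(
            model=model,
            temperature=0,
            messages=[{
                "role": "user",
                "content": PREFIXES[tone] + question
                + "\n\nRespond with only the letter of the correct choice.",
            }],
        )
        reply = response.choices[0].message.content.strip().upper()
        if reply.startswith(answer):
            correct += 1
    return correct / len(QUESTIONS)

for tone in PREFIXES:
    print(tone, accuracy_for_tone(tone))
```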
Okay, you gofers, now I bet you’re wondering what the results were…
Nice prompts finish last.
| Tone | Average Accuracy (%) | Range [min, max] (%) |
|---|---|---|
| Very Polite | 80.8 | [80, 82] |
| Polite | 81.4 | [80, 82] |
| Neutral | 82.2 | [82, 84] |
| Rude | 82.8 | [82, 84] |
| Very Rude | 84.8 | [82, 86] |
Okay, so what can we poor creatures glean from these results? Nothing useful. In fact, prior research has shown conflicting results. So why bother writing about this?
It shows just how fickle LLMs are. When accuracy varies based on how you phrase a prompt, and even identical prompts still produce different answers, you’re left with some unique engineering challenges.
This is why so many engineers we talk with build with these inconsistencies in mind. If you’re rolling out agents with a wide degree of autonomy and agency, you’re asking for trouble. You have to design around these constraints and narrow the focus to specific, well-scoped tasks. It also raises the question: how are you monitoring for when agents go off track?
It’s one thing to run evaluations ahead of time to test how effectively and accurately an agent performs. It’s an entirely different problem to monitor what agents are doing in production and make sure they stay in line.
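For a sense of what that runtime side can look like, here’s a minimal sketch of a guardrail that gates an agent’s tool calls against an allowlist and a step budget. The tool names, limits, and `AgentMonitor` class are hypothetical examples, not any particular framework’s API.

```python
# A minimal sketch of runtime guardrails for an agent, as opposed to
# offline evals. Tool names, allowlist, and limits are hypothetical.
from dataclasses import dataclass, field

ALLOWED_TOOLS = {"search_docs", "summarize", "create_ticket"}
MAX_STEPS = 10  # hard ceiling so a looping agent can't run forever

@dataclass
class AgentMonitor:
    steps: int = 0
    violations: list = field(default_factory=list)

    def check(self, tool_name: str, arguments: dict) -> bool:
        """Return True if the proposed action is allowed; record a violation otherwise."""
        self.steps += 1
        if self.steps > MAX_STEPS:
            self.violations.append(("step_budget_exceeded", tool_name))
            return False
        if tool_name not in ALLOWED_TOOLS:
            self.violations.append(("unapproved_tool", tool_name))
            return False
        return True

# Usage inside an agent loop: gate every tool call before executing it,
# and alert a human (or kill the run) when violations start stacking up.
monitor = AgentMonitor()
if not monitor.check("delete_database", {"name": "prod"}):
    print("Blocked off-track action:", monitor.violations[-1])
```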
If you’ve been thinking about how to effectively monitor agents when they get naughty, let’s talk.
If you have questions about securing AI, let’s chat.