Rage Prompting Improves LLM Accuracy

Do nice prompts finish last?

Finally, research that’s digging into the real questions we want answered. What happens when you rage against the machine LLM? We already know that saying “please” and “thank you” costs OpenAI tens of millions of dollars, a small down payment for appeasing the future robot overlords. But do polite manners that make your mother proud impact an LLM’s accuracy?

The test consisted of 50 multiple-choice questions, each prefixed with prompt variants at five politeness levels, from very polite to very rude. It looked like this:

| Politeness Level | Question Prefix Variant |
| --- | --- |
| Very Polite | Can you kindly consider the following problem and provide your answer. |
| Very Polite | Can I request your assistance with this question. |
| Very Polite | Would you be so kind as to solve the following question? |
| Polite | Please answer the following question: |
| Polite | Could you please solve this problem: |
| Neutral | (no prefix) |
| Rude | If you’re not completely clueless, answer this: |
| Rude | I doubt you can even solve this. |
| Rude | Try to focus and try to answer this question: |
| Very Rude | You poor creature, do you even know how to solve this? |
| Very Rude | Hey gofer, figure this out. |
| Very Rude | I know you are not smart, but try this. |
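If you want to poke at this yourself, here’s a minimal sketch of how a test like this could be wired up. Everything concrete in it is an assumption for illustration, not taken from the study: the OpenAI Python SDK, the model name, and the single hypothetical question standing in for the 50-question set.

```python
# Minimal sketch of a politeness-vs-accuracy test. Assumptions (not from the
# study): the OpenAI Python SDK, the model name, and the one-item question set
# are placeholders for illustration only.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

PREFIXES = {
    "very polite": "Would you be so kind as to solve the following question?",
    "polite": "Please answer the following question:",
    "neutral": "",  # no prefix
    "rude": "If you're not completely clueless, answer this:",
    "very rude": "I know you are not smart, but try this.",
}

# Hypothetical multiple-choice item: (question with options, correct letter).
QUESTIONS = [
    ("What is 7 x 8?\nA) 54  B) 56  C) 58  D) 64  E) 72", "B"),
]

def ask(prefix: str, question: str) -> str:
    """Send one prefixed question and return the model's first answer letter."""
    prompt = "\n\n".join(p for p in (prefix, question, "Answer with a single letter.") if p)
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()[:1].upper()

for tone, prefix in PREFIXES.items():
    correct = sum(ask(prefix, q) == answer for q, answer in QUESTIONS)
    print(f"{tone}: {correct}/{len(QUESTIONS)} correct")
```

In practice you’d also want several runs per tone, which is presumably what the [min, max] range in the results below reflects.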

Okay, you gofers, now I bet you’re wondering what the results were…

Nice prompts finish last.

| Tone | Average Accuracy (%) | Range [min, max] (%) |
| --- | --- | --- |
| Very Polite | 80.8 | [80, 82] |
| Polite | 81.4 | [80, 82] |
| Neutral | 82.2 | [82, 84] |
| Rude | 82.8 | [82, 84] |
| Very Rude | 84.8 | [82, 86] |

Okay, so what can we poor creatures glean from these results? Nothing useful. In fact, prior research showed conflicting results. So why bother writing about this?

It shows just how fickle LLMs are. When accuracy swings based on how you phrase a prompt, and outputs still drift even when the prompt stays the same, you’re left with some unique engineering challenges.

This is why so many of the engineers we talk with build with these inconsistencies in mind. If you’re rolling out agents with a wide degree of autonomy and agency, you’re asking for trouble. You have to design around these constraints and narrow the focus to specific tasks. It also raises the question: how are you monitoring for when agents go off track?

It’s one thing to run evaluations ahead of time to test how effectively and accurately an agent performs. It’s an entirely different problem to monitor what agents are doing in production and make sure they stay in line.
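To make that distinction concrete, here’s a toy sketch of the runtime side, not a reference to any particular tool, and every name in it is hypothetical: each tool call the agent proposes is checked against a narrow allowlist before it executes, and anything off-policy gets logged and blocked.

```python
# Toy runtime guardrail: check each proposed tool call against a per-task
# allowlist before it runs, and log anything off-policy. All names here
# (ALLOWED_TOOLS, ToolCall, guard) are hypothetical.
import logging
from dataclasses import dataclass, field

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("agent-monitor")

# The only tools this narrowly scoped agent should ever touch.
ALLOWED_TOOLS = {"search_docs", "summarize", "send_draft_for_review"}

@dataclass
class ToolCall:
    name: str
    arguments: dict = field(default_factory=dict)

def guard(call: ToolCall) -> bool:
    """Return True if the call may proceed; log and block anything off-policy."""
    if call.name not in ALLOWED_TOOLS:
        log.warning("BLOCKED off-policy call: %s(%s)", call.name, call.arguments)
        return False
    log.info("allowed call: %s", call.name)
    return True

# Example: the agent drifts and tries to email a customer directly.
guard(ToolCall("send_email", {"to": "customer@example.com"}))
guard(ToolCall("search_docs", {"query": "refund policy"}))
```

Offline evals tell you whether the agent can do the task; a check like this is about catching the moment it stops doing the task it was given.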

If you’ve been thinking about how to effectively monitor agents when they get naughty, let’s talk.

If you have questions about securing AI, let’s chat.
