LLM prompt evals

For those of us who write prompts for LLMs, knowing how effective those prompts are is important. When generating Prolog code, the quality of the prompt is often the difference between success and failure.

In LLM lingo, measuring this is known as evals, short for evaluations. For those of us used to unit testing in programming, the two are so similar that I often just mentally equate them.

During the 12 days of OpenAI, this question was asked:

What are we as developers not doing as much as you think we should? What do you wish we did differently, or more or less of?

Michelle Pokrass of OpenAI replied:

One big one is evals! I see tons of developers not using evals at all and relying on vibes for rolling out changes to prod. Would highly recommend creating some simple evals using our evals product (or open source offerings) so you can update with confidence when we release new models.

On Twitter, Amanda Askell @AnthropicAI notes:

The boring yet crucial secret behind good system prompts is test-driven development. You don’t write down a system prompt and find ways to test it. You write down tests and find a system prompt that passes them.
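To make the test-first idea concrete, here is a minimal sketch of an eval written like a unit test. Everything in it is an assumption for illustration: `fake_model` is a hypothetical stand-in for a real LLM call, and the check, the cases, and the candidate prompts are made up. It is not the API of any of the eval products mentioned in this post.

```python
import re
from typing import Callable, List, Tuple

# An eval case pairs a user input with a check on the model's output.
EvalCase = Tuple[str, Callable[[str], bool]]


def fake_model(system_prompt: str, user_input: str) -> str:
    """Hypothetical stand-in for a real LLM call; swap in your own client."""
    clause = "parent(tom, bob)."
    # Pretend the model only drops the chatty preamble when told "code only".
    if "code only" in system_prompt.lower():
        return clause
    return "Sure! Here is the fact: " + clause


def is_single_prolog_fact(output: str) -> bool:
    """Crude check: the whole output looks like one fact, e.g. name(args)."""
    return re.fullmatch(r"[a-z]\w*\(.*\)\.", output.strip()) is not None


# Write the tests first ...
CASES: List[EvalCase] = [
    ("State that tom is a parent of bob.", is_single_prolog_fact),
    ("State the same fact again.", is_single_prolog_fact),
]


def pass_rate(system_prompt: str, cases: List[EvalCase],
              model: Callable[[str, str], str] = fake_model) -> float:
    """Fraction of eval cases whose check accepts the model output."""
    passed = sum(1 for user_input, check in cases
                 if check(model(system_prompt, user_input)))
    return passed / len(cases)


# ... then look for a system prompt that passes them.
CANDIDATES = [
    "You are a helpful Prolog assistant.",
    "You write Prolog. Reply with code only.",
]
best_prompt = max(CANDIDATES, key=lambda p: pass_rate(p, CASES))
```

With the stub above, the first candidate fails every case and the second passes every case, so `best_prompt` is the "code only" prompt. With a real model the checks would be run several times per case, since outputs vary from call to call.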

What many do not know is that the LLM model creators now offer tools to help end users evaluate their prompts, and these tools are starting to gain traction.

OpenAI playground:
https://platform.openai.com/docs/guides/evals
Note: This is a new feature in the OpenAI playground; it is not the evals project we have seen for years on OpenAI's GitHub (evals).

Anthropic console:

Microsoft .NET framework on Azure:

Disclosure: I have not used any of these automated evaluation tools, but I have done many simpler evaluations manually by trying different prompts. These tools should just make that easier.


For more details on the method of asking another (ideally larger or more powerful) model to analyze a review, rather than comparing the model output to human-created output, I recommend this lesson from Colin Jarvis.

Lesson 6: Metaprompting with o1
part of the DeepLearning.AI course: Reasoning with o1 - DeepLearning.AI
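As a rough illustration of the model-graded approach mentioned above, here is a minimal sketch. It is not taken from the lesson. The grading template, the PASS/FAIL convention, and `stub_judge` (a hypothetical stand-in for a call to a larger model) are all assumptions made for this example.

```python
from typing import Callable

# Assumed grading template; a real one would spell out the grading criteria.
JUDGE_TEMPLATE = (
    "You are grading an answer.\n"
    "Question: {question}\n"
    "Answer: {answer}\n"
    "Reply with exactly PASS or FAIL."
)


def build_judge_prompt(question: str, answer: str) -> str:
    """Fill the grading template with the question and the answer to grade."""
    return JUDGE_TEMPLATE.format(question=question, answer=answer)


def parse_verdict(judge_reply: str) -> bool:
    """Treat any reply that starts with PASS (case-insensitive) as a pass."""
    return judge_reply.strip().upper().startswith("PASS")


def grade(question: str, answer: str, judge: Callable[[str], str]) -> bool:
    """Ask the judge model to grade one answer and return its verdict."""
    return parse_verdict(judge(build_judge_prompt(question, answer)))


def stub_judge(prompt: str) -> str:
    """Hypothetical stand-in for the larger model; replace with a real call."""
    return "PASS" if "append(" in prompt else "FAIL"


QUESTION = "Write a Prolog goal that concatenates [1] and [2]."
good = grade(QUESTION, "append([1], [2], L).", stub_judge)
bad = grade(QUESTION, "I cannot help with that.", stub_judge)
```

Passing the judge in as a plain function keeps the grading logic testable without network access; the stub can later be replaced by a function that sends the prompt to whichever model you use as the grader.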
