LLM prompt evals

EricGT · January 6, 2025, 9:35pm

For those of us who create prompts for LLMs, understanding how effective our prompts are is important. When generating Prolog code, the quality of the prompt can often be the difference between success and failure.

In LLM lingo, measuring this is known as evals, short for evaluations. For those of us used to unit testing in programming, the two are so similar that I often just mentally equate them.

During the 12 days of OpenAI, this question was asked:

What are we as developers not doing as much as you think we should? What do you wish we did differently, or more or less of?

Michelle Pokrass of OpenAI replied:

One big one is evals! I see tons of developers not using evals at all and relying on vibes for rolling out changes to prod. Would highly recommend creating some simple evals using our evals product (or open source offerings) so you can update with confidence when we release new models.

On Twitter, Amanda Askell @AnthropicAI notes:

The boring yet crucial secret behind good system prompts is test-driven development. You don’t write down a system prompt and find ways to test it. You write down tests and find a system prompt that passes them.
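To make the test-first idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: `stub_model` is a hypothetical stand-in for a real LLM call, and the checks are deliberately crude string tests rather than anything a vendor eval tool provides. The point is only the order of work: the cases are written first, then candidate system prompts are scored against them.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    task: str                     # input given to the model
    check: Callable[[str], bool]  # True if the output is acceptable


# The tests come first, before any system prompt is chosen.
CASES = [
    EvalCase("Define append/3.", lambda out: "append(" in out),
    EvalCase("Define append/3.", lambda out: out.strip().endswith(".")),
    EvalCase("Define append/3.", lambda out: "```" not in out),
]


def stub_model(system_prompt: str, task: str) -> str:
    """Hypothetical stand-in for an LLM call; swap in a real client here."""
    code = "append([], L, L).\nappend([H|T], L, [H|R]) :- append(T, L, R)."
    if "no markdown" in system_prompt.lower():
        return code
    # Without the instruction, the stub wraps its answer in a code fence.
    return "```prolog\n" + code + "\n```"


def pass_rate(system_prompt: str, cases, model=stub_model) -> float:
    """Fraction of cases whose check accepts the model output."""
    passed = sum(case.check(model(system_prompt, case.task)) for case in cases)
    return passed / len(cases)


CANDIDATES = [
    "You write Prolog.",
    "You write Prolog. Output only code, no Markdown fences.",
]

scores = {prompt: pass_rate(prompt, CASES) for prompt in CANDIDATES}
best = max(scores, key=scores.get)
print(best, scores[best])
```

With a real model the outputs vary between runs, so each case would normally be run several times and the pass rate averaged.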

What many do not know, and what is now starting to gain traction with the LLM model creators, is that there are tools to help end users evaluate their prompts.

OpenAI playground:
https://platform.openai.com/docs/guides/evals
Note: This is new and lives in the OpenAI playground; it is not the evals project that has been on the OpenAI GitHub (evals) for years.

Anthropic console:

Microsoft .Net framework on Azure:

Disclosure: I have not used any of these automated evaluations, but I have done many simpler evaluations manually by trying different prompts. This will just make it easier.

For more details on the method of asking another (ideally larger or more powerful) model to analyze a review, rather than comparing the model output to human-created output, I recommend this lesson from Colin Jarvis.

Lesson 6: Metaprompting with o1
part of the DeepLearning.AI course: Reasoning with o1 - DeepLearning.AI
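The general shape of a model-graded eval can be sketched as follows. This is not the method from the lesson, only a generic illustration under stated assumptions: `stub_grader` is a hypothetical stand-in for a call to a stronger model, and the grading template and the PASS/FAIL convention are my own choices.

```python
GRADER_TEMPLATE = """You are grading a model's answer.
Task: {task}
Answer: {answer}
Reply with exactly PASS or FAIL on the first line, then one sentence of reasoning."""


def build_grader_prompt(task: str, answer: str) -> str:
    """Fill the grading template with the task and the answer under review."""
    return GRADER_TEMPLATE.format(task=task, answer=answer)


def parse_verdict(reply: str) -> bool:
    """Treat the reply as a pass only if its first line is PASS."""
    lines = reply.strip().splitlines()
    first = lines[0].strip().upper() if lines else ""
    return first == "PASS"


def stub_grader(prompt: str) -> str:
    """Hypothetical stand-in for a stronger grading model."""
    answer = prompt.split("Answer:", 1)[1]
    if ":-" in answer:
        return "PASS\nThe answer contains a Prolog rule."
    return "FAIL\nNo Prolog rule found."


def grade(task: str, answer: str, grader=stub_grader) -> bool:
    """Ask the grader about one answer and return its verdict as a bool."""
    return parse_verdict(grader(build_grader_prompt(task, answer)))


good = "member(X, [X|_]).\nmember(X, [_|T]) :- member(X, T)."
print(grade("Define member/2.", good))
print(grade("Define member/2.", "I cannot help with that."))
```

Forcing the verdict onto the first line keeps parsing trivial; a real grader's replies should still be spot-checked by hand, since the grading model can be wrong too.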
