Automated Prompt Testing
One of the first things you discover when building generative AI applications is how much your prompt templates matter. They impact the quality of the user experience, the accuracy of the application, and so many of the other factors that make your features seem “intelligent”. It goes without saying, then, that having a solid prompt testing strategy in place is crucial for maintaining your applications.
This is a lesson you learn quickly the first time you’re asked to change the prompt or model used in a generative AI application. How do you know the changes you’ve made are for the better? How could you possibly test all the different combinations of natural-language user input? Ideally you’d do this in an automated way, so you have a concrete measurement of the effect of your changes at the click of a button.
Azure can provide all of this out of the box if you’re using its Azure AI services. There are a lot of tools built into the web management portal structured around these use cases - prompt testing, quality assurance, safety checks, and so on. But what if you don’t use Azure?
Image from https://learn.microsoft.com/en-us/azure/ai-studio/how-to/evaluate-flow-results
I’ve worked on a small console project to attempt to automate prompt testing in our application. It has a few key objectives:
- Automate prompt testing: ensure this can be done automatically without human intervention
- Use prompt templates from the code base: instead of creating another spot to maintain a template, try to pull the template and its settings from the application code
- Use real-life examples: using logged conversations from the generative AI application, create test data sets based on real-life examples
- Log progress over time: record each test run so that results can be compared and contrasted
- Allow for prompt A-B testing: compare current values to potential new ones and measure the improvement
- Platform agnostic: don’t rely on hosted prompt management services
I’m happy to say I’ve accomplished these objectives with a little help from Google’s metric prompt templates!
The test file
First off, I wanted a test file that someone other than a developer could maintain. I chose YAML as it doesn’t require a lot of markup and is easy to read and edit by hand.
prompt-templates:
  - name: "HowDoI"
    model: "llama"
    prompts:
      - data:
          articles: |
            - This screen allows you to...
            - Add Package: Packages are...
            - In this example we will...
          question: How do I create an account package?
        accepted-response: |
          Sure, I'd be happy to help! To create...
This test file captures a few different things:
- The Prompt Template we’re testing, in this case called “HowDoI”. This name corresponds to a static prompt object in our codebase.
- The language model we’re generating a response with, “llama”. This lines up with a model runtime class in our application.
- A collection of Prompts. Each prompt has two properties: Data and AcceptedResponse. Data is a dictionary of the keywords and content we use within the prompt template (i.e. {question} is replaced with “How do I create an account package?”), and AcceptedResponse is an example of a previous response generated by a model for this exact prompt that was accepted by an end user.
Using these, we have enough information to cycle through each prompt and programmatically generate and measure a response.
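If it helps to see how that maps into code, here’s a rough sketch of the classes the file deserializes into. I’m showing YamlDotNet here purely as an example library, and the class names (TestFile, PromptTemplateTest, PromptTest) are placeholders; the property names, though, line up with the YAML keys above.

using System.Collections.Generic;
using System.IO;
using YamlDotNet.Serialization;
using YamlDotNet.Serialization.NamingConventions;

// Placeholder class names; the properties mirror the YAML keys above
public class TestFile
{
    public List<PromptTemplateTest> PromptTemplates { get; set; } = new();
}

public class PromptTemplateTest
{
    public string Name { get; set; }
    public string Model { get; set; }
    public List<PromptTest> Prompts { get; set; } = new();
}

public class PromptTest
{
    public Dictionary<string, string> Data { get; set; } = new();
    public string AcceptedResponse { get; set; }
}

static TestFile LoadYaml(string path)
{
    var deserializer = new DeserializerBuilder()
        .WithNamingConvention(HyphenatedNamingConvention.Instance) // "prompt-templates" -> PromptTemplates, etc.
        .IgnoreUnmatchedProperties()
        .Build();

    return deserializer.Deserialize<TestFile>(File.ReadAllText(path));
}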
The test
The actual console application is pretty straightforward - we read the YAML file, and for each prompt entry we populate the prompt template from code and execute it against the specified model.
var yamlData = LoadYaml("tests.yaml");

foreach (var promptTemplate in yamlData.PromptTemplates)
{
    foreach (var dataset in promptTemplate.Prompts)
    {
        // Pull the exact prompt template object the application uses via reflection
        var promptTemplateField = typeof(PromptTemplates).GetField(promptTemplate.Name);
        dynamic promptTemplateValue = promptTemplateField.GetValue(null);

        // Fill in the template's placeholders with the test data
        var prompt = promptManager.GetPrompt(promptTemplateValue.Template, dataset.Data);

        // Generate a response with the model named in the test file
        var runtime = GetRuntimeByName(promptTemplate.Model);
        var generatedResponse = await runtime.InvokeModel(prompt, promptTemplateValue.Temperature, promptTemplateValue.TopP);

        // Actual tests
        await Test(PromptTemplates.Coherence, generatedResponse);
    }
}
The important part of this code snippet is that we’re getting the exact same prompt template object that we use in our business application, via reflection and the prompt template name - we’re not relying on an externally maintained text file or resource, but rather using exactly what gets deployed with our app. We also store the Temperature and TopP values with each prompt template, as those are usually template-specific and get tweaked alongside it.
You could invert this relationship and instead ensure that your application reads its prompt templates from a configuration file that you share with the testing app - the important thing is that it’s pulling from a single source, not several that have to be kept in sync.
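As an aside, the promptManager.GetPrompt call in the snippet above doesn’t need to be anything fancy. A simplified placeholder-substitution sketch (not our exact implementation) looks something like this:

// Replaces {question}, {articles}, etc. with the values from the test file's data dictionary
public static string GetPrompt(string template, Dictionary<string, string> data)
{
    var prompt = template;
    foreach (var (key, value) in data)
    {
        prompt = prompt.Replace("{" + key + "}", value);
    }
    return prompt;
}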
We then take the output of that generation and run it through a few different prompt metrics to take a measurement of the result. This is where I’m using the metric prompt templates from Vertex AI linked above; they provide templates you can feed into a language model to judge the coherence, fluency, groundedness, and question-answering quality of the generated result, rating each metric on a numeric scale (mostly 1 to 5). These are likely similar to what Azure and other tools use to surface these values to users through their UI.
In testing various models on AWS, I’ve found that the Mistral Small model does the best, most cost-effective job of running these metric prompts. I’m also using the Converse API to ask the model to return the single-digit rating once it has rated the response.
var converseRequest = new ConverseRequest
{
    ModelId = "mistral.mistral-small-2402-v1:0"
};

// The majority of the prompt (the metric prompt template) goes in the system block
converseRequest.System.Add(new SystemContentBlock { Text = systemPrompt });

// The generated response we want to rate goes in the user message
converseRequest.Messages.Add(new Message
{
    Role = ConversationRole.User,
    Content = new List<ContentBlock> { new ContentBlock { Text = userPrompt } }
});

var ratingResponse = await ratingRuntime.ConverseAsync(converseRequest);

// Continue the conversation and ask for just the numeric rating
converseRequest.Messages.Add(ratingResponse.Output.Message);
converseRequest.Messages.Add(new Message
{
    Role = ConversationRole.User,
    Content = new List<ContentBlock> { new ContentBlock { Text = "Return the rating integer value from the previous response" } }
});

var scoreResponse = await ratingRuntime.ConverseAsync(converseRequest);
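The follow-up request usually comes back with just the digit, but models don’t always cooperate, so I parse the result defensively before recording it. A small sketch (requires System.Linq):

// Pull the text out of the final Converse response
var scoreText = scoreResponse.Output.Message.Content.FirstOrDefault()?.Text ?? "";

// The model occasionally wraps the digit in extra words, so take the first digit it returned
var digitChar = scoreText.FirstOrDefault(char.IsDigit);
var score = digitChar == default(char) ? (int?)null : digitChar - '0';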
With the parsed score in hand, we can store each test run in a database and build up a history of improvements as we tweak the prompt templates. Changing the model used, the Temperature or TopP values, or the template itself all justify a run of this application to compare and contrast the differences.
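What a stored test run looks like is up to you; a record along these lines, keyed by template, model, and metric with a timestamp, is enough to chart progress between runs (illustrative only, not a prescribed schema):

using System;

// Illustrative shape for a stored test run; adapt to your own schema
public record PromptTestResult(
    string PromptTemplateName, // e.g. "HowDoI"
    string ModelName,          // e.g. "llama"
    string Metric,             // e.g. "Coherence"
    int Score,                 // the parsed 1-5 rating
    DateTime RunAt);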
Example Use
To illustrate the use of this tool, I’ll modify one of our prompts to use the Amazon Titan language model. Titan is known for its concise responses; compared to other mainstream language models, it’s more likely to give you one-word answers. As you can imagine, this works well in some contexts, like relying on the model to generate a response you can parse programmatically, but when you’re returning the output to a user-facing application, you likely want something a bit more verbose.
# Using Titan Text Express v1
Generated Response: The model could not find sufficient information to answer the question.
Coherence Rating: 2 (Somewhat incoherent)
Fluency Rating: 1 (Inarticulate)
Groundedness Rating: 0
Question Answering Rating: 1 (Very bad)
If I change that prompt to use Llama 3.1 instead…
# Using Llama 3.1 8B Instruct
Generated Response: To create an account package, you can follow these steps: [...]
Coherence Rating: 5 (Completely coherent)
Fluency Rating: 5 (Completely fluent)
Groundedness Rating: 1 (Fully grounded)
Question Answering Rating: 5 (Very good)
Now I have concrete proof that I made a positive change to my prompt template!
Mistral also explains why it gave the rating it did based on the metric prompt template’s criteria, which might be good to store alongside the score in case you need to investigate specific changes. You can’t put notes on a graph though, so I like parsing out the numeric score as well.
Comparing Previous Answers
If you have historically accepted answers from your deployed generative AI application, you can incorporate them in this test pattern too. The Google metric prompt templates include templates for pairwise comparison as well; instead of rating a single result from 1 to 5, the judge indicates a preference for which of two responses (the old one or the newly generated one) is better in a given category. That way you can not only measure the quality of the generated response, but also check whether you’ve actually improved upon what you originally had.
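Mechanically it’s the same Converse pattern as the single-response metrics; the only real difference is that the judge’s user message carries both answers. In the sketch below, pairwiseCoherenceTemplate is a hypothetical variable standing in for whichever pairwise metric template you load.

// Same Converse pattern, but the judge sees both answers side by side
var pairwiseRequest = new ConverseRequest
{
    ModelId = "mistral.mistral-small-2402-v1:0"
};

pairwiseRequest.System.Add(new SystemContentBlock { Text = pairwiseCoherenceTemplate });

pairwiseRequest.Messages.Add(new Message
{
    Role = ConversationRole.User,
    Content = new List<ContentBlock>
    {
        new ContentBlock
        {
            Text = "Response A (previously accepted):\n" + dataset.AcceptedResponse +
                   "\n\nResponse B (newly generated):\n" + generatedResponse
        }
    }
});

var preferenceResponse = await ratingRuntime.ConverseAsync(pairwiseRequest);
// Follow up the same way as before, but ask for "A" or "B" instead of a digit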
Here’s an example comparing Llama 3.1 Instruct’s output to our accepted Llama 2 Chat answer.
Generated Response: To create an account package, you can follow these steps: [...]
# "A" is "original response" while "B" is the newly generated one
Coherence Preference: A
Explanation: Although both responses are of high quality, Response A provides a slightly more detailed explanation, which makes it more helpful to the user. Additionally, Response A includes a friendly closing remark, which enhances the user experience. Therefore, Response A is preferred.
The original Llama 2 Chat response is considered (by this Coherence metric, anyway) to be the better one - we’ve got some tweaking to do on our template!
Hopefully you can see the value of having an automated testing process around your application’s prompt templates, making tweaks and changes easy to verify and compare.