
Test Driving AI Applications

By Paul on Feb 13, 2024


TLDR: When we implement automated tests early and at the core of AI products, we can reach 99% reliability and beyond. That enables safe, ongoing changes to prompts, verified LLM interactions, and development roughly 20x faster than manual testing allows.

At Artium, we build software products for our clients. In 2024, that means building products that leverage text exchanges with large language models (LLMs). LLM responses are unpredictable, of course, and that unpredictability can lead to unexpected behavior, which, in turn, leads to unhappy customers.

Welcome to the Future, TDD

Traditional software operates on predictability and determinism. Our favorite practice, test-driven development (TDD), is a perfect companion in that old world. In this new world, however, TDD needs to adapt.

We’ve started our transformation with prompt testing. Every time we introduce a new prompt within a product, and every time we modify an existing one, we test before we deploy, iterating until the prompt consistently meets our quality standards before it ships.
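To make that concrete, here is a minimal sketch of one pattern we find useful: run the same prompt many times and assert on the pass rate rather than a single result. It assumes the OpenAI Python client; the prompt, grader, model name, and threshold below are illustrative, not Apex’s actual configuration.

```python
# prompt_test.py — a minimal prompt-testing sketch (illustrative, not Apex's code).
# Assumes the OpenAI Python client; the grader and threshold are hypothetical.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = "You are a concise assistant. Always answer in one sentence."

def run_prompt(user_message: str) -> str:
    """Send one exchange to the model and return its reply."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_message},
        ],
    )
    return response.choices[0].message.content or ""

def is_acceptable(reply: str) -> bool:
    """Hypothetical grader: here, a simple structural check.
    Real graders might use regexes, embeddings, or a judge model."""
    return len(reply) > 0 and reply.count(".") <= 1

def test_prompt_pass_rate(runs: int = 100, threshold: float = 0.99) -> None:
    """Because LLM output varies run to run, assert on the pass *rate*
    across many runs instead of on any single response."""
    passes = sum(is_acceptable(run_prompt("What is TDD?")) for _ in range(runs))
    rate = passes / runs
    assert rate >= threshold, f"pass rate {rate:.0%} below {threshold:.0%}"
```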

Dialog Testing for Apex

Our Augmented Intelligence R&D team at Artium is working on a product called Apex that chats with entrepreneurs and product managers to decompose high-level product vision into actionable plans to implement software.

Apex includes a prompt that has grown in complexity from a simple set of instructions to a multi-step, role-playing prompt structured to elicit highly specific types of engagement from the LLM. Now the AI not only responds with text, but it also guides the user through a process to a desired outcome. We noticed that in some cases the AI does not reach the end of the process but instead leads the user in circles or jumps to the final step without gathering user input. 

With manual tests, our observations were inconclusive, anecdotal, and hard to measure. We were stuck at “it works sometimes, probably a lot of the time,” a state recently described as the 70% problem barrier.

Now we simulate user-AI conversations with Dialog Tests, run them as part of our CI pipeline, and produce test reports verifying that our AI guides users to the intended outcome at least 99% of the time. Here is a diagram of a Dialog Test:

[Diagram: the anatomy of a Dialog Test]

In the video below, I show a Dialog Test in action:

[Video: Dialog Test demo showing single and repeated runs, failure, success, and logs]
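Under the hood, a Dialog Test can be sketched as a loop: a simulated user converses with the assistant until the assistant either reaches the final step or exhausts a turn budget, and the suite asserts on the success rate across many runs. The sketch below is illustrative, not Apex’s actual code; `chat_with_apex`, `simulate_user_reply`, and the `[PLAN_COMPLETE]` sentinel are all hypothetical.

```python
# dialog_test.py — a Dialog Test sketch (hypothetical helpers, not Apex's code).
from apex_client import chat_with_apex          # hypothetical: the app's chat entry point
from simulated_user import simulate_user_reply  # hypothetical: scripted or LLM-simulated user

MAX_TURNS = 20                   # guards against the "leads the user in circles" failure
DONE_MARKER = "[PLAN_COMPLETE]"  # assumed sentinel the prompt tells the model to emit

def run_dialog() -> bool:
    """Drive one full conversation; True if the assistant finishes the guided process."""
    transcript = []
    user_message = "I want an app that helps teams plan sprints."
    for _ in range(MAX_TURNS):
        reply = chat_with_apex(transcript, user_message)
        transcript += [("user", user_message), ("assistant", reply)]
        if DONE_MARKER in reply:
            return True                           # reached the intended outcome
        user_message = simulate_user_reply(transcript)
    return False                                  # circled without finishing

def test_dialog_reaches_outcome(runs: int = 100, threshold: float = 0.99) -> None:
    """Repeat the dialog and assert on the success *rate*, mirroring
    the 99%-of-the-time report described above."""
    successes = sum(run_dialog() for _ in range(runs))
    rate = successes / runs
    assert rate >= threshold, f"success rate {rate:.0%} below {threshold:.0%}"
```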

Moving Forward (without Determinism)

In addition to expanding our ability to test prompts before they’re released into production, we also want to refine our testing strategies for the other core elements of AI products: AI function calling and LLM API interactions.
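For function calling, one useful test shape is to assert that the model actually selects the expected tool with well-formed arguments. A minimal sketch, assuming the OpenAI tools API; the weather tool and the assertions are invented for illustration.

```python
# function_call_test.py — a sketch of testing an AI function call (illustrative).
# Assumes the OpenAI Python client and tools API; the weather tool is made up.
import json
from openai import OpenAI

client = OpenAI()

TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

def test_model_calls_weather_tool() -> None:
    """Verify the model chooses the right function and passes valid JSON arguments."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": "What's the weather in Paris?"}],
        tools=TOOLS,
    )
    tool_calls = response.choices[0].message.tool_calls
    assert tool_calls, "expected the model to call a tool"
    call = tool_calls[0]
    assert call.function.name == "get_weather"
    args = json.loads(call.function.arguments)  # arguments arrive as a JSON string
    assert args.get("city", "").lower().startswith("paris")
```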

Through all our testing efforts, we hope to help our friends and clients overcome that 70% success barrier. Our experience suggests that TDD for AI can get there at least 20x faster than manual testing.

If you’re looking for help testing a prompt or anything else within your AI-driven application, get in touch with us here.