3 Reasons Why TDD & LLMs Go Together Like Peanut Butter and Jelly

For the past year at Artium we’ve been busy kicking the tires on all different forms of leveraging generative AI in the software development life cycle to boost speed, quality, and the creative experience. What we’ve arrived at is a collection of principles and practices called L.E.A.P., or “LLM Enhanced Agile Process”. LEAP starts with our eXtreme Programming DNA, which is itself perpetually iterating, and folds in the use of the latest transformer-powered code, text, and image generation models.

We’ve found that using generative AI bluntly, especially for code generation, can have diminishing returns. Ask it for a complete program and you’ll often get something that’s kinda-sorta correct but contains errors. More insidious is when the code runs cleanly but doesn’t actually solve the problem, or solves it but misses important edge cases. It’s like an extremely over-confident junior developer giving you a response with cheerful certainty.

To combat this tendency, we’ve been leveraging one of our favorite tools from the XP tool-belt: Test Driven Development.

TDD is already amazing for coaxing human developers into writing tighter, well-factored code in a self-documenting manner. The fact that the code is verified against its tests is almost a secondary benefit!

Tests are extremely good at communicating context to other developers. In fact, in the popular BDD school of test-driven development, the context keyword is used to group related tests (or specs, in BDD terms). And it turns out that tests are also incredibly good at providing specifications (or specs) and managing context for your favorite code completion LLM.

Reason #1 - Breaking the Problem into Small Steps

I see a lot of articles that run LLMs through code generation, and one of the most common mistakes I see is asking the LLM to do too much. Here’s an example from a great article (that you should read!) where the author asks for holistic solutions to somewhat difficult problems: AI and the end of Programming. Unsurprisingly, most of the time the LLM can only get some of the program correct, though the fact that it returns a compilable result that gets anywhere is still pretty impressive!

These are stochastic auto-completes, so the more context you try to jam into the window, the less likely they are to perform well. Even LLMs with very large context windows, like Anthropic’s Claude, struggle to return a good result if over-stuffed with context (see the recent Stanford paper “Lost in the Middle” for more).

Additionally, the longer the completion a model returns, the more likely it is to get lost. Have you ever noticed that the hallucinations often start in the latter half of a completion? In smaller models you can even watch longer completions go from proper English to nonsense words to random characters as the chain of statistically likely subsequent tokens gets longer and longer.

One of the great features of TDD, both for humans and for LLMs, is that it forces you to break the problem into very small steps. Write a test, then prompt the LLM (or other developer, back in my day :D) to write the simplest code to solve it. This allows you to write programs one small step at a time, which means passing less context into the window and receiving a shorter completion, both of which increase the likelihood of an accurate result immensely.
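To make the size of a single step concrete, here is a minimal sketch in Python. The `slugify` function and its one test are hypothetical, invented purely for illustration: the test is the whole request, and the implementation is about the size of completion you would hope to get back.

```python
import unittest


# The simplest code that satisfies the one test below. Nothing about
# punctuation or unicode yet: those would be later tests and later steps.
def slugify(title: str) -> str:
    return title.lower().replace(" ", "-")


class SlugifyTest(unittest.TestCase):
    # One small, named behavior per test keeps both the prompt and the
    # expected completion short.
    def test_lowercases_and_hyphenates_spaces(self):
        self.assertEqual(slugify("Hello World"), "hello-world")


def run_step() -> bool:
    """Run the single test for this step and report whether it passed."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(SlugifyTest)
    result = unittest.TestResult()
    suite.run(result)
    return result.wasSuccessful()
```

The next step would be another small test (say, stripping punctuation) followed by another small completion, rather than one request for a "complete" slug library.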

Reason #2 - Specification by Spec

We’ve found managing context is one of the most important things to get a good and accurate result from LLMs. That means being careful not to “context stuff”, as mentioned above, and careful to specify exactly the result you’re looking for.

Passing a test or spec to the LLM, preceded by some context about available APIs to leverage, is an amazing way to specify desired output from the LLM. A good unit test will effectively document the desired behavior AND the desired API of the code.

Tests and specs are specific & descriptive by design.

Specification of your system’s internal APIs is important for maintaining good design in your codebase. At this moment it is still very much software engineers providing the intent and broad design of the system. LLMs can provide some guidance on the broader architecture of the system, but designing for changeability, maintainability, and performance remains firmly in the realm of humans.

Reason #3 - Immediate Error Detection & Fast Feedback Loop

The final reason tests are such a great way to prompt the LLM is that they’re executable! You get immediate feedback on whether or not the LLM returned a correct result without needing to pick through the code with a fine-toothed comb.

In many ways this is one of the reasons TDD is so powerful for human developers – it removes the need to hold every aspect of the logic of your system in your head at once. You can instead write that logic down in a way that actually ensures consistency of your system.

When is that even more important? When you have an over-confident junior developer (ahem, or LLM) cheerfully giving you a wrong answer.

Wrong answers are no longer much of an issue or annoyance when you can get immediate feedback on their correctness.

Even better, much of the time you can feed the test failure back to the LLM and get a correct result the second time around. It’s like a free retry with a better chance of getting it right.
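The retry loop can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a real integration: `complete` stands for whatever function sends a prompt to your model and returns code, `fake_complete` is a scripted stand-in that gets it wrong once and right after seeing the failure, and `run_spec` hard-codes a single hypothetical spec for an `add` function.

```python
from typing import Callable, Optional, Tuple


def run_spec(candidate_source: str) -> Optional[str]:
    """Run the candidate against the spec; return a failure message, or None if green."""
    namespace: dict = {}
    try:
        # exec is acceptable in a sketch; real generated code belongs in a sandbox.
        exec(candidate_source, namespace)
        actual = namespace["add"](2, 3)
        assert actual == 5, f"add(2, 3) returned {actual!r}, expected 5"
    except Exception as error:
        return f"{type(error).__name__}: {error}"
    return None


def generate_until_green(
    prompt: str, complete: Callable[[str], str], max_attempts: int = 2
) -> Tuple[str, int]:
    """Ask for code, run the spec, and feed any failure back into the next prompt."""
    feedback = ""
    for attempt in range(1, max_attempts + 1):
        candidate = complete(prompt + feedback)
        failure = run_spec(candidate)
        if failure is None:
            return candidate, attempt
        feedback = f"\n\nYour previous attempt failed with:\n{failure}\nPlease fix it."
    raise RuntimeError(f"spec still failing after {max_attempts} attempts")


def fake_complete(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call: wrong first, right after feedback."""
    if "previous attempt failed" in prompt:
        return "def add(a, b):\n    return a + b\n"
    return "def add(a, b):\n    return a - b\n"


code, attempts = generate_until_green("Implement add(a, b).", fake_complete)
```

With the scripted completer above, the first candidate fails the spec, the failure message is appended to the prompt, and the second candidate passes, so `attempts` ends up as 2. A real model gives no such guarantee, which is why the loop is capped by `max_attempts`.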

Conclusion

I hope this is helpful to folks looking to dive into an LLM Enhanced Agile Process workflow. When done well, it feels like being a kid again in terms of power and joy in programming.

At Artium we’re all about the intersection of building great software that affects the world AND the creative experience for the builders. Give this a shot; I bet you’ll find it’s a pretty dang fun and powerful way to code.
