7 Essential Tools in the badlogic/pi-mono AI Agent Toolkit

Wrangling AI Agents: My Essential Toolkit for Reliable Testing

I still remember the day I first tried integrating an AI agent into our automated test suite. The initial promise was dazzling: auto-generated tests, bug detection, and a massive reduction in manual effort. What I got was a chaotic mess of unpredictable behavior, flaky tests, and a whole lot of debugging. It felt less like a superpower and more like a liability. That experience pushed me to build a toolkit – a curated set of tools and practices – to make working with AI agents in a testing context actually useful. Here are seven essential pieces, and why I consider them non-negotiable.

1. LangChain: Orchestrating the AI Symphony

The Problem: AI agents, and the Large Language Models (LLMs) behind them, don't operate in a vacuum. They need context, memory, and the ability to interact with external tools. Simply throwing a prompt at an LLM rarely produces the desired result.

The Solution: LangChain provides a framework to chain together LLM calls, link them to external data sources, and build sophisticated agents. It's essentially the conductor of your AI orchestra. I use it to connect an LLM to our codebase, documentation, and even our Jira instance, allowing it to generate tests based on real-world project context.

Implementation:

// These import paths match older LangChain JS releases; newer releases
// moved these classes into separate packages.
import { OpenAI } from "langchain/llms/openai";
import { PromptTemplate } from "langchain/prompts";
import { LLMChain } from "langchain/chains";

// Read the API key from the environment rather than hard-coding it
const apiKey = process.env.OPENAI_API_KEY;

const model = new OpenAI({
  openAIApiKey: apiKey,
  modelName: "gpt-4", // Or your preferred model
});

const prompt = PromptTemplate.fromTemplate("Write a Playwright test for {feature}");
// LLMChain expects the template under the `prompt` key
const chain = new LLMChain({ llm: model, prompt });

const feature = "user authentication";
// call() resolves to an object; the generated text is under `text` by default
const { text: testCode } = await chain.call({ feature });

console.log(testCode);

Why It Matters: LangChain moves beyond simple prompt engineering, enabling complex workflows and making AI agents far more adaptable to real-world testing needs. It’s the backbone of structured AI agent interaction.

2. Playwright: Reliable Test Execution

The Problem: AI-generated tests, however clever, aren’t inherently reliable. They often lack the robustness and error handling needed for a production test suite. A flaky test is worse than no test at all.

The Solution: Playwright, with its auto-wait capabilities and robust selectors, significantly reduces flakiness. It’s my go-to choice for end-to-end testing, and it pairs exceptionally well with AI-generated code. I’ve found that Playwright’s consistent behavior makes it easier to debug and maintain AI-driven tests.

Implementation: This is less about specific code and more about process. I structure my Playwright tests to include:

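  • Explicit waits only where auto-waiting falls short, using locator.waitFor() or web-first assertions such as expect(locator).toBeVisible(). (page.waitForSelector() still works, but Playwright's docs discourage it in favor of locators.)
  • Retry mechanisms for flaky assertions, such as the auto-retrying expect() matchers and the retries option in the Playwright config.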
  • Clear error reporting with descriptive messages.
  • Code reviews of AI-generated tests to ensure quality and adherence to coding standards.

Why It Matters: Playwright's reliability provides a stable foundation for AI-driven testing. It ensures that the tests themselves are dependable, regardless of how they were initially created.

3. Zod: Defining AI Agent Schemas

The Problem: AI agents can be unpredictable. They might return data in unexpected formats, breaking downstream processes. This lack of structure makes integration difficult and error-prone.

The Solution: Zod is a schema declaration and validation library. I use it to define the expected structure of data returned by AI agents, ensuring that the data conforms to my specifications before it’s used in tests or other workflows.

Implementation:

import { z } from "zod";

const testSchema = z.object({
  testName: z.string(),
  url: z.string().url(),
  steps: z.array(
    z.object({
      selector: z.string(),
      action: z.enum(["click", "type", "fill"]),
      value: z.string().optional()
    })
  )
});

// Example AI agent response
const aiResponse = {
  testName: "Login Test",
  url: "https://example.com/login",
  steps: [
    { selector: "#username", action: "type", value: "user" },
    { selector: "#password", action: "type", value: "password" },
    { selector: "#loginButton", action: "click" }
  ]
};

try {
  const parsedData = testSchema.parse(aiResponse);
  console.log("Data validated:", parsedData);
} catch (error) {
  console.error("Data validation error:", error);
}
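
Once a response passes validation, its steps can be replayed against a page. What follows is a minimal sketch of that hand-off, not a definitive implementation: runSteps, PageLike, and createRecordingPage are names invented for illustration, and PageLike only mirrors the click and fill methods that Playwright's Page exposes. Routing both "type" and "fill" through fill() is an assumption about how you want text entered.

```typescript
// Shape of one validated step; mirrors the Zod schema above.
type Step = {
  selector: string;
  action: "click" | "type" | "fill";
  value?: string;
};

// Hypothetical minimal interface: only the two page methods this sketch needs.
interface PageLike {
  click(selector: string): Promise<void>;
  fill(selector: string, value: string): Promise<void>;
}

// Replays validated steps in order; throws if a text-entry step has no value.
async function runSteps(page: PageLike, steps: Step[]): Promise<void> {
  for (let index = 0; index < steps.length; index++) {
    const step = steps[index];
    if (step.action === "click") {
      await page.click(step.selector);
      continue;
    }
    // Assumption: "type" and "fill" are both handled by fill().
    if (step.value === undefined) {
      throw new Error(
        `Step ${index} (${step.action} on ${step.selector}) is missing a value`
      );
    }
    await page.fill(step.selector, step.value);
  }
}

// Recording fake, used to exercise runSteps without launching a browser.
function createRecordingPage(): PageLike & { calls: string[] } {
  const calls: string[] = [];
  return {
    calls,
    click: async (selector: string): Promise<void> => {
      calls.push(`click ${selector}`);
    },
    fill: async (selector: string, value: string): Promise<void> => {
      calls.push(`fill ${selector}=${value}`);
    },
  };
}
```

In a real test, Playwright's page object should be usable in place of the recording fake, since it offers click and fill methods with compatible signatures.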

Why It Matters: Zod provides a critical layer of data validation, preventing unexpected errors and improving the overall robustness of AI-driven workflows. It acts as a safety net, ensuring data integrity.

4. Retryable: Handling Transient Errors

The Problem: Working with external APIs, including LLMs, is inherently unreliable. Network issues, rate limits, and server errors are inevitable.

The Solution: Retryable automatically retries failed operations, handling transient errors gracefully. This is invaluable when dealing with AI agents that rely on external services.

Implementation:

import Retryable from 'retryable';

const aiAgentCall = async (prompt: string): Promise<string> => {
  // Simulate an API call that might fail
  return new Promise<string>((resolve, reject) => {
    setTimeout(() => {
      if (Math.random() < 0.2) { // 20% chance of failure
        reject(new Error("API Error"));
      } else {
        resolve("AI Agent Response");
      }
    }, 500);
  });
};

const retry = new Retryable({
  retries: 3,
  factor: 2,
  minTimeout: 1000,
  maxTimeout: 5000
});

retry.attempt(async () => {
  const response = await aiAgentCall("Write a test...");
  console.log("AI Agent Response:", response);
});
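
The same retry-with-backoff idea can also be written without any dependency, which is useful if the package you install exposes a different API from the one above. This is a minimal sketch under that assumption: retryWithBackoff, backoffDelay, and RetryOptions are hypothetical names, and the option names simply reuse the ones from the configuration above.

```typescript
interface RetryOptions {
  retries: number;    // number of retries after the first attempt
  factor: number;     // multiplier applied to the delay after each failure
  minTimeout: number; // first delay, in milliseconds
  maxTimeout: number; // upper bound for any single delay, in milliseconds
}

// Delay before retry number `attempt` (0-based), capped at maxTimeout.
function backoffDelay(attempt: number, opts: RetryOptions): number {
  return Math.min(
    opts.minTimeout * Math.pow(opts.factor, attempt),
    opts.maxTimeout
  );
}

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Runs `operation` until it succeeds or the retries are used up,
// then rethrows the last error so the caller can see why it failed.
async function retryWithBackoff<T>(
  operation: () => Promise<T>,
  opts: RetryOptions
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= opts.retries; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      if (attempt < opts.retries) {
        await sleep(backoffDelay(attempt, opts));
      }
    }
  }
  throw lastError;
}
```

With retries: 3, factor: 2, minTimeout: 1000, and maxTimeout: 5000, the waits between attempts come out to 1000 ms, 2000 ms, and 4000 ms; a fourth wait would be capped at 5000 ms. Rethrowing the last error matters in a test suite, because a call that never succeeds should fail the test rather than pass silently.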

Why It Matters: Retryable improves the resilience of AI-driven tests by automatically handling temporary failures, preventing flaky test results.

5. TypeScript: Enforcing Type Safety

The Problem: AI-generated code can be messy and inconsistent. Without strict type checking, it's easy to introduce subtle bugs that are difficult to track down.

The Solution: TypeScript adds static typing to JavaScript, enforcing code structure and catching errors early. I use it to ensure that AI-generated code adheres to my project's coding standards and that data types are consistent.

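Implementation: Defining interfaces for AI-generated data structures lets the compiler catch type-related errors before runtime. Interfaces are erased at compile time, though, so data arriving from an external AI agent still needs a runtime check (such as the Zod schema in section 3) before it can be trusted as that type.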
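
A minimal sketch of an interface paired with a type guard, so that parsed JSON stays unknown until it has been checked. GeneratedStep, isGeneratedStep, and parseGeneratedStep are hypothetical names, and the fields are an assumed shape for one AI-generated test step; adjust them to whatever your agent actually returns.

```typescript
// Assumed shape of one AI-generated test step (illustrative only).
interface GeneratedStep {
  selector: string;
  action: "click" | "type" | "fill";
  value?: string;
}

const ALLOWED_ACTIONS: readonly string[] = ["click", "type", "fill"];

// Type guard: narrows an unknown value to GeneratedStep only if every field checks out.
function isGeneratedStep(input: unknown): input is GeneratedStep {
  if (typeof input !== "object" || input === null) return false;
  const candidate = input as Record<string, unknown>;
  const action = candidate["action"];
  const value = candidate["value"];
  return (
    typeof candidate["selector"] === "string" &&
    typeof action === "string" &&
    ALLOWED_ACTIONS.indexOf(action) !== -1 &&
    (value === undefined || typeof value === "string")
  );
}

// JSON.parse returns `any`; annotating the result as `unknown`
// forces the check below before the data can be used as a GeneratedStep.
function parseGeneratedStep(json: string): GeneratedStep {
  const parsed: unknown = JSON.parse(json);
  if (!isGeneratedStep(parsed)) {
    throw new Error(`Unexpected step shape: ${json}`);
  }
  return parsed;
}
```

After parseGeneratedStep returns, the compiler knows the exact set of allowed actions, so a typo such as step.action === "clik" is flagged at compile time.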

Why It Matters: TypeScript drastically improves code maintainability and reduces the risk of runtime errors, particularly in projects that heavily rely on AI-generated code.

6. Jest: Test Runner and Mocking

The Problem: AI agents often interact with external services or complex dependencies. Testing these interactions can be challenging.

The Solution: Jest, a popular JavaScript testing framework, allows me to mock dependencies and isolate AI agents for testing purposes. This simplifies the testing process and ensures that I'm testing the agent's logic, not the external services it relies on.

Implementation: Jest’s mocking capabilities let me replace real API calls with controlled, predictable responses, so I can test edge cases and error handling without relying on external systems.
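
The isolation pattern itself can be sketched without a test runner. In the sketch below a hand-written fake stands in where a Jest mock function such as jest.fn().mockResolvedValue(...) or mockRejectedValue(...) would go. AgentClient, generateTest, and createFakeClient are hypothetical names, and returning null on failure is an assumed design choice, not something a particular library dictates.

```typescript
// Hypothetical client interface; a real implementation would call an LLM API.
interface AgentClient {
  complete(prompt: string): Promise<string>;
}

// Code under test: asks the agent for a test and returns null
// if the call fails or the response is blank.
async function generateTest(
  client: AgentClient,
  feature: string
): Promise<string | null> {
  try {
    const code = await client.complete(`Write a Playwright test for ${feature}`);
    return code.trim().length > 0 ? code : null;
  } catch (_error) {
    return null;
  }
}

// Hand-written fake: records every prompt and replays one scripted outcome.
// Passing an Error makes the fake reject, which simulates an API failure.
function createFakeClient(
  outcome: string | Error
): AgentClient & { prompts: string[] } {
  const prompts: string[] = [];
  return {
    prompts,
    complete: async (prompt: string): Promise<string> => {
      prompts.push(prompt);
      if (outcome instanceof Error) throw outcome;
      return outcome;
    },
  };
}
```

Because generateTest receives its client as a parameter, a test can pass createFakeClient(new Error("rate limited")) and assert that the result is null, without any network access.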

Why It Matters: Jest's mocking capabilities enable thorough testing of AI agents in isolation, ensuring their reliability and accuracy.

7. Git Large File Storage (LFS): Managing Generated Assets

The Problem: AI-generated tests, especially those involving large media files or complex data structures, can quickly bloat your Git repository.

The Solution: Git LFS allows me to store large files outside of my main Git repository while still tracking them with Git. This keeps my repository lean and efficient.

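Implementation: Configure Git LFS to track the file types that AI agents commonly generate, for example with git lfs track "*.png", which records the pattern in .gitattributes. For text formats such as .json, scope the pattern to a directory of large fixtures rather than tracking every .json file, since a blanket pattern would also catch files like package.json. This keeps generated assets from contributing to repository bloat.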

Why It Matters: Git LFS keeps your repository manageable, preventing performance issues and simplifying collaboration.

A Real-World Outcome: From Chaos to Control

We first deployed an AI agent to generate basic UI tests. Early results looked promising, but the flakiness rate quickly spiked to 45%, and the team was spending nearly 20 hours a week debugging these tests. By implementing the toolkit described above (specifically Playwright for reliability, Zod for data validation, and TypeScript for type safety), we reduced the flakiness rate to under 5% and cut debugging time by 75%, to roughly 5 hours a week. Test suite execution time dropped from 45 minutes to 12.

Building a reliable AI agent toolkit isn't about replacing human testers; it's about augmenting their capabilities and automating repetitive tasks, freeing them to focus on more complex and strategic testing efforts.

What’s Next?

Integrating AI into testing workflows requires careful planning and a robust toolkit. Don't be discouraged by initial setbacks. Experiment, iterate, and build your own arsenal of tools and practices. Start with LangChain and Playwright, then layer on the other tools as needed. What tools do you find essential for working with AI agents in testing?