Building a more predictable way to test LLMs: Introducing @node-llm/testing
By Shaiju Edakulangara (@eshaiju)
One of the biggest frustrations I've had while building NodeLLM is how hard it is to write a reliable test for an LLM.
Either you spend a fortune on live API calls during development, or you spend hours writing manual mocks that don't really reflect how the model actually behaves. And then there's the constant worry about accidentally committing an API key to your test fixtures.
To solve this for myself, I've been working on a dedicated testing package, now published as @node-llm/testing.
How I’m approaching the problem
I’ve settled on a three-part approach that has made my own development workflow a lot smoother:
1. Recording what actually happens (VCR)
I wanted a way to record a real interaction once and then just replay it forever. The VCR pattern in this package does exactly that. The first time you run a test, it talks to the provider and saves the response. Every time after that, it just reads from a local "cassette" file.
```ts
import { describeVCR, withVCR } from "@node-llm/testing";
import { expect, it } from "vitest";

describeVCR("Sentiment Analysis", () => {
  it("calculates positive sentiment correctly", withVCR(async () => {
    const result = await mySentimentAgent.run("I love NodeLLM!");
    expect(result.sentiment).toBe("positive");
    // Saved to: test/cassettes/sentiment-analysis/calculates-positive-sentiment-correctly.json
  }));
});
```
It's helped me in three ways:
- Speed: Tests run in milliseconds after the first recording.
- Fidelity: It captures the full response, including tool calls and token usage, so the "replay" is highly accurate.
- Safety: It’s designed to fail fast in CI if a recording is missing, so I don't accidentally leak costs or hit rate limits in my build pipeline.
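To make the fidelity point concrete, here's roughly the kind of data a cassette has to capture for a replay to be trustworthy. This is purely illustrative; the field names are examples, not the package's actual cassette schema:

```ts
// Illustrative only: the rough shape of a recorded interaction.
// These field names are examples, not @node-llm/testing's real format.
const cassette = {
  request: {
    model: "gpt-4o-mini",
    messages: [{ role: "user", content: "I love NodeLLM!" }],
  },
  response: {
    content: '{"sentiment":"positive"}',
    toolCalls: [], // tool-call sequences are recorded too
    usage: { promptTokens: 12, completionTokens: 8 },
  },
};
```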
2. Mocking logical edge cases (Mocker)
When I need to test how my code handles a specific error (like a 429 rate limit) or a very specific tool-calling sequence, I use the Mocker. It’s a simple, fluent API that lets me define exactly what should happen without any network overhead.
```ts
import { mockLLM } from "@node-llm/testing";

const mocker = mockLLM({ strict: true });

// Simulate an API error to test your retry logic
mocker.chat("Hello").respond({ error: new Error("Rate limit exceeded") });

// Or simulate a tool call
mocker.chat("Check weather").callsTool("get_weather", { location: "London" });
```
3. Deterministic Time (Time Travel)
AI applications often depend on time. Whether you're testing message history expiration, rate-limiting windows, or just want your logs to be consistent, you need to control the clock. I added a Time utility (inspired by Ruby's Timecop) that wraps Vitest's timers in a cleaner API.
```ts
import { Time } from "@node-llm/testing";

await Time.frozen("2025-01-01", async () => {
  const result = await agent.run("What happened today?");
  // System time is frozen at Jan 1st for this block
});

// Or manual control
Time.freeze("2025-12-31T23:59:59");
Time.advance(2000); // 2 seconds later...
expect(new Date().getFullYear()).toBe(2026);
Time.restore();
```
A few details that made a difference for me
While building this, I realised that simple JSON stringification wasn't enough for AI data. I added a few things that I found essential:
- Handling complex data: LLM results often involve things like Map, Set, or Date objects. I wrote a custom serialiser to make sure these types are preserved accurately when saved to disk (there's a rough sketch of the idea after this list).
- Automatic Scrubbing: I was tired of manually redacting my cassettes. This package now automatically finds and redacts things like OpenAI keys or sensitive headers. You can even add your own:
```ts
withVCR({
  sensitiveKeys: ["user_id", "session_token"],
  sensitivePatterns: [/secret-[a-z0-9]+/g]
}, async () => { ... });
```
- Prompt Snapshots & History: I wanted an easy way to verify that my system prompts hadn't drifted. The Mocker now maintains a full call history, allowing you to snapshot the exact request payload:
```ts
// Inspect the history
const lastCall = mocker.getLastCall();
expect(lastCall.method).toBe("chat");

// Snapshots the full message history and tool definitions sent to the LLM
expect(lastCall.prompt).toMatchSnapshot();
```
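On the serialisation point above: the core idea is just to tag the non-JSON types on the way out and revive them on the way in. A minimal sketch of my own, not the package's actual implementation:

```ts
// Minimal sketch of round-tripping Map/Set/Date through JSON.
// Illustrative only; the real serialiser handles more cases.
function serialize(value: unknown): string {
  return JSON.stringify(value, function (this: any, key: string, v: any) {
    const raw = this[key]; // pre-toJSON value (Dates arrive in `v` as strings)
    if (raw instanceof Date) return { __type: "Date", iso: raw.toISOString() };
    if (raw instanceof Map) return { __type: "Map", entries: [...raw.entries()] };
    if (raw instanceof Set) return { __type: "Set", values: [...raw.values()] };
    return v;
  });
}

function deserialize(json: string): unknown {
  return JSON.parse(json, (_key, v) => {
    if (v && typeof v === "object") {
      if (v.__type === "Date") return new Date(v.iso);
      if (v.__type === "Map") return new Map(v.entries);
      if (v.__type === "Set") return new Set(v.values);
    }
    return v;
  });
}
```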
Take it for a spin
This is a set of tools that has made my life easier as a solo builder.
If you've been finding LLM testing as frustrating as I have, feel free to give it a try:
```bash
npm install @node-llm/testing
```
It's open source, and you can check out the documentation, browse the source code, or contribute over on GitHub.
Building with NodeLLM? Join the conversation on GitHub.