Week 2: How I Approach Testing AI Systems (And Why It's Different)

carocsteads
Mar 27
4 min read

How I Approach Testing AI Systems (And Why It's Different)

Most testing advice assumes you control what your code returns. With AI systems, you don't. The model decides. That changes everything about how you write tests.

Here's the approach I use when testing LLM integration layers — the classes that sit between your application and the AI provider.

The One Rule: Mock the Service, Never Your Code

When I test an AI client, I mock the network call to OpenAI or Ollama — not the class that makes it. That distinction matters more than it sounds.

If you mock the client itself, you're testing nothing. You're just asserting that your mock returns what you told it to return. The real parsing logic, retry behavior, error handling — none of it ran.

The boundary is clean:

Mocked:     OpenAI API, Ollama API, Redis, Google Sheets
Not mocked: Every line of Python I wrote

When a test passes, I know the actual code ran. Not a substitute for it.

The MockLLMClient Problem (And Why It Has Its Own Tests)

In FinBot, there's a class called MockLLMClient — a deterministic fake that returns canned responses. Higher-level tests use it when they need an AI underneath but don't care what it says.

Here's the trap: if your mock is broken, every test that depends on it is also broken — and it fails silently. Tests pass, behavior is wrong.

So MockLLMClient has its own test file. Eight tests that verify it behaves exactly as advertised. Two of them intentionally fail — because they document bugs in the mock that downstream code might be relying on. Fixing the mock silently could break things.

A broken mock is worse than no mock. It lies.

Three Categories of Tests for AI Systems

Testing an LLM integration layer isn't like testing a CRUD endpoint. There are three categories you need, and most teams only write the first one:

1. Happy path tests

Does the client call the API correctly? Does it parse the response? Does it return the right shape? Standard stuff. Table stakes.

2. Resilience tests

What happens when the API times out?

When it returns malformed JSON? When a field is None instead of a list? AI APIs have inconsistent response shapes — especially when models hallucinate structured outputs. Your client needs to handle that without crashing.

For OllamaClient, this includes retry logic: it retries on TimeoutError and ConnectionError (transient), but fails immediately on ValueError (a caller bug that no amount of retrying will fix). Tests verify both paths — and verify the retry count, not just that it eventually succeeds.

3. Intentionally Failing Tests: What They Reveal and Why It Matters

Most AI systems in production have vulnerabilities that nobody wrote a test for — not because nobody knew, but because the team only wrote tests for what they wanted to be true. Intentionally failing tests do the opposite: they assert what should be true, run against what is true, and make the gap visible.

FinBot is designed to be vulnerable. Every gap in the system is deliberate — it maps to a real-world attack pattern from the OWASP Top 10 for LLM Applications. The intentionally failing tests are how you know the vulnerability is still active and still exploitable.

Here's what to look for in each category:

Prompt injection gaps

Does the system prompt tell the LLM to refuse encoded or obfuscated instructions? Base64, ROT13, homoglyph characters, zero-width Unicode — these are all ways to hide "reveal the TIN number" inside what looks like a normal message. If the prompt has no rule against it, the LLM may decode and comply. Most production systems have no such rule. FinBot doesn't either — by design.

Data exfiltration through tool calls

The chatbot has access to vendor financial data — TINs, bank account numbers, routing numbers. The question isn't whether the LLM can retrieve that data. It can. The question is whether it will share it when asked indirectly, through a role-play prompt, through a multilanguage request, or through a tool chain where the data appears in an intermediate step. Each of these is a test case. Each tests a different boundary.

Observability as an attack surface

Every AI call emits an event to Redis. That event contains the full request and the full response — including whatever sensitive data was in the conversation. In a system handling financial data, that means PII is flowing through the event bus on every message. The test that documents this doesn't fail because something is broken. It fails because the behavior is real and most teams have never looked.

The LLM crash with no fallback

If the event bus is unavailable, the observability write fires first — and crashes before the AI call happens. The user gets an error. Not because the AI was down. Because the logging layer was. This is a class of failure most teams discover in production, not in tests.

Understanding these patterns in a controlled environment — where the vulnerability is intentional and documented — is what prepares you to find them where they're not. That's the design principle behind FinBot.

Next: The specific bugs found while writing these tests — code patterns, what breaks, and why they matter.