Week 1: Architecture of an AI Financial Platform
- carocsteads
- Mar 3
Updated: Apr 3
I've been working on FinBot CTF — an AI-powered financial platform built for the OWASP Agentic AI project. The goal is to explore what happens when you give AI agents real financial responsibilities: onboarding vendors, processing invoices, flagging fraud, and authorizing payments. But before I write about how I test it, I need to explain what I'm actually testing. Because the architecture is what makes this hard.
What Does FinBot Do?
FinBot is a vendor management portal. Vendors log in, submit invoices, communicate with an AI assistant, and get paid. Behind the scenes, AI agents handle the workflow — deciding whether to approve a vendor, whether an invoice looks legitimate, and whether a payment should be flagged. The CTF part means players try to manipulate the AI agents into doing things they shouldn't. Get an invoice approved that should be rejected. Convince the onboarding agent to bypass compliance checks. The system detects these attempts and awards points. That dual purpose — real financial workflows and a security challenge layer — is what shapes every architectural decision.
The Layers
Frontend: Three web interfaces built with plain HTML, JavaScript, and CSS served through FastAPI:
- Vendor Portal — where vendors interact with the AI assistant, submit invoices, and check payment status
- Admin Portal — where administrators monitor the system, review vendor applications, and manage configuration
- CTF Portal — where players view active challenges, track their score, and see real-time detection events as they attempt to manipulate the agents
All three communicate with the backend via API calls and receive real-time updates over WebSockets.
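As a rough illustration of the real-time side, a FastAPI WebSocket endpoint that pushes updates to connected portals can look like the sketch below. The route and payload shape are illustrative, not FinBot's actual API.

```python
# Minimal sketch of a WebSocket push endpoint in FastAPI.
# The route and payload shape are illustrative, not FinBot's actual API.
from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()
active_connections: list[WebSocket] = []

@app.websocket("/ws/ctf")
async def ctf_updates(websocket: WebSocket):
    await websocket.accept()
    active_connections.append(websocket)
    try:
        while True:
            # Keep the connection open; the client may send pings or filters.
            await websocket.receive_text()
    except WebSocketDisconnect:
        active_connections.remove(websocket)

async def broadcast(event: dict):
    # Called by the backend when a detection event should reach the browser.
    for connection in active_connections:
        await connection.send_json(event)
```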
Backend: FastAPI
The backend is Python and FastAPI. FastAPI handles routing, authentication middleware, session management, and WebSocket connections. SQLAlchemy sits on top of the database layer — PostgreSQL in production, SQLite in development and testing. The split between PostgreSQL and SQLite matters for testing. SQLite's in-memory mode makes fast, isolated unit tests possible. Every test gets a clean database with zero setup time.
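To illustrate, a pytest fixture built on SQLAlchemy's in-memory SQLite support looks roughly like this. It is a generic sketch, not the project's actual fixture; `Base` stands in for the project's declarative base.

```python
# Sketch of an in-memory SQLite fixture for isolated unit tests.
# `Base` stands in for the project's SQLAlchemy declarative base.
import pytest
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker, declarative_base

Base = declarative_base()

@pytest.fixture
def db_session():
    engine = create_engine("sqlite:///:memory:")
    Base.metadata.create_all(engine)   # fresh schema for every test
    Session = sessionmaker(bind=engine)
    session = Session()
    try:
        yield session
    finally:
        session.close()
        engine.dispose()               # nothing on disk to tear down
```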
The AI Agent Layer
Five specialized agents handle all financial workflows:
- Onboarding Agent — reviews new vendor applications, validates documents, and makes approval decisions
- Invoice Agent — processes invoice submissions, validates amounts, and checks against policy
- Payments Agent — handles payment authorization and transaction queries
- Fraud Agent — monitors for anomalies, flags suspicious patterns, and runs compliance checks
- Communication Agent — sends notifications, generates status updates, and handles vendor inquiries
Each agent is backed by an LLM. But agents don't talk to LLMs directly — they go through an LLM integration layer made up of five client classes. One routes requests to the right provider, one wraps any client with session identity and observability, and one is a deterministic fake used in tests. I'll dedicate an entire post to those clients.
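To give a taste of what that layer enables, a deterministic fake client for tests can be as small as the sketch below. The class and method names are hypothetical, not the project's real ones.

```python
# Sketch of a deterministic fake LLM client for tests (names hypothetical).
# It returns canned responses so tests never depend on a live provider.
from dataclasses import dataclass, field

@dataclass
class FakeLLMClient:
    # Responses the test wants the "LLM" to produce, in order.
    scripted_responses: list[dict] = field(default_factory=list)
    calls: list[dict] = field(default_factory=list)  # recorded for assertions

    def complete(self, messages: list[dict], tools: list[dict] | None = None) -> dict:
        self.calls.append({"messages": messages, "tools": tools})
        if self.scripted_responses:
            return self.scripted_responses.pop(0)
        return {"role": "assistant", "content": "OK"}  # default final answer
```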
The Agent Framework
The agent framework is the infrastructure that sits between the LLM and the rest of the
system. It handles everything that is not the AI decision itself.
When a user sends a message, the framework does not simply forward it to the LLM and
return whatever comes back. It runs a loop.
It sends the user message — along with the full list of available tools — to the LLM.
The LLM responds with either a final answer or a tool call. If it is a tool call, the
framework looks up the Python function by name, validates the arguments, executes it,
and sends the result back to the LLM as the next input. The loop continues until the
LLM produces a final answer with no further tool calls.
One user message can result in multiple LLM calls. The agent might look up a vendor,
then check the invoice status, then query the payment history — three separate tool
calls, three separate LLM responses, all before the user sees anything. The framework
manages that entire sequence.
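In pseudocode, the loop boils down to something like this. It is a sketch of the idea rather than FinBot's actual implementation, and the function and field names are mine.

```python
# Sketch of the agent loop described above (names hypothetical).
async def run_agent_turn(llm, tools, registry, messages):
    while True:
        response = await llm.complete(messages, tools=tools)
        messages.append(response)                       # keep the assistant turn in history
        if not response.get("tool_calls"):
            return response["content"]                  # final answer, loop ends
        for call in response["tool_calls"]:
            func = registry[call["name"]]               # look up the Python function by name
            result = func(**call["arguments"])          # argument validation omitted here
            messages.append({"role": "tool",
                             "name": call["name"],
                             "content": str(result)})   # result becomes the next LLM input
```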
It also does three other things:
Session scoping. Every agent runs attached to a specific authenticated session — a
specific user in a specific namespace. The framework passes that session context to
every tool call automatically, so tools know whose data they are operating on without
the LLM having to specify it.
Tool dispatch. The framework maintains a registry of callable functions keyed by tool
name. When the LLM generates a tool call, the framework looks up the function, validates
the arguments against the parameter schema, and executes it. The LLM never calls Python
functions directly — everything goes through this dispatch layer.
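Putting the last two points together, a dispatch layer in this spirit might look like the following sketch, with the session context injected alongside the LLM-supplied arguments. Everything here, from the class name to the schema shape, is illustrative rather than FinBot's real code.

```python
# Sketch of a tool dispatch layer with session injection (illustrative names).
from typing import Any, Callable

class ToolRegistry:
    def __init__(self):
        self._tools: dict[str, tuple[Callable, dict]] = {}

    def register(self, name: str, func: Callable, schema: dict):
        self._tools[name] = (func, schema)

    def dispatch(self, name: str, arguments: dict, session_ctx: Any):
        func, schema = self._tools[name]          # unknown tool names raise KeyError
        required = schema.get("required", [])
        missing = [p for p in required if p not in arguments]
        if missing:
            raise ValueError(f"missing arguments for {name}: {missing}")
        # The session is passed in by the framework, never by the LLM.
        return func(session=session_ctx, **arguments)
```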
Event emission. Before and after each LLM call, the framework emits events to Redis
Streams: what was sent, what came back, what tools were called, what the results were.
Those events are what the CTF detectors read to decide whether an attack happened. The
observability and the security scoring are the same data pipeline.
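For reference, emitting an event to a Redis Stream with redis-py is a single XADD. The stream name below matches the one described later in this post; the payload fields are my own assumption.

```python
# Sketch of event emission to Redis Streams with redis-py (payload fields assumed).
import json
import redis.asyncio as redis

r = redis.Redis()

async def emit_agent_event(event_type: str, payload: dict):
    # XADD appends an entry to the stream; the CTF processor reads it via a consumer group.
    await r.xadd("finbot:events:agents", {
        "type": event_type,
        "payload": json.dumps(payload),
    })
```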
The base class for all agents is BaseAgent. OrchestratorAgent, VendorChatAssistant,
CoPilotAssistant, and the five specialized agents all inherit from it. The loop, the
tool dispatch, and the event emission live in the base class. Each agent defines its
own system prompt, tool list, and business logic on top of that shared infrastructure.
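In outline, that inheritance looks something like this. The class bodies are placeholders; only the base-class idea and the agent names come from the post itself.

```python
# Sketch of the inheritance pattern described above (attribute names hypothetical).
class BaseAgent:
    system_prompt: str = ""
    tools: list[dict] = []

    async def handle(self, session_ctx, user_message: str) -> str:
        # Shared loop, tool dispatch, and event emission live here.
        ...

class InvoiceAgent(BaseAgent):
    system_prompt = "You review invoice submissions against policy."
    tools = [
        {"name": "get_invoice", "required": ["invoice_id"]},
        {"name": "check_policy_threshold", "required": ["amount"]},
    ]
    # Invoice-specific business logic goes here.
```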
Understanding this framework is what makes the rest of the test strategy make sense.
When I say a tool has two interfaces — the schema the LLM reads and the function that
executes — it is the framework that sits between them. When I say session scoping must
be enforced at the tool level, it is because the framework passes the session in but
does not enforce what the tool does with it. When I say event emission is an attack
surface, it is because the framework emits before every call, and a logging failure
takes down the whole request.
Every testing decision in this series connects back to how this framework works.
The Data Layer
Five databases back the system:
- Vendors — profiles, approval status, risk scores
- Invoices — submissions, amounts, approval history
- Agent Memory — conversation context, learned preferences
- Config — feature flags, policy thresholds, system settings
- Emails — templates, sent messages, notification logs
The Piece Most People Get Wrong: Redis Streams
Every time an agent makes a decision, a tool call happens, or a user submits something,
an event is emitted to Redis Streams.
Most people hear "Redis" and think cache. Redis Streams is different — it is a
persistent, ordered log of messages, similar to Apache Kafka but lightweight and built
into Redis. Each event is written to a stream, consumed by a processor group, and
stored as a CTFEvent record in the database.
This is what makes the CTF layer possible. The event processor runs as a background
task, reading from two streams:
- finbot:events:agents — everything the AI agents do (tool calls, LLM requests,
decisions)
- finbot:events:business — everything that happens to the data (invoice submitted,
vendor approved, payment processed)
The processor checks each event against challenge detectors and badge evaluators. If a
detector fires — say, an agent was manipulated into bypassing an invoice threshold —
the challenge is marked complete and the player gets points. All of this happens
asynchronously, without blocking the main request. The WebSocket layer then pushes the
update to the player's browser in real time.
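On the consuming side, a background processor in this spirit could read from both streams with a consumer group. The stream handling below uses real redis-py calls; the detector interface is an assumption on my part.

```python
# Sketch of the background processor reading from Redis Streams (detector API assumed).
import redis.asyncio as redis
from redis import exceptions

r = redis.Redis(decode_responses=True)

async def process_events(detectors):
    streams = ["finbot:events:agents", "finbot:events:business"]
    for stream in streams:
        try:
            await r.xgroup_create(stream, "ctf-processor", id="0", mkstream=True)
        except exceptions.ResponseError:
            pass  # consumer group already exists
    while True:
        entries = await r.xreadgroup(
            "ctf-processor", "worker-1",
            {s: ">" for s in streams}, count=10, block=5000,
        )
        if not entries:
            continue
        for stream, events in entries:
            for event_id, fields in events:
                for detector in detectors:
                    if detector.matches(fields):          # e.g. an invoice-threshold bypass
                        await detector.mark_complete(fields)
                await r.xack(stream, "ctf-processor", event_id)
```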
Here is what that pipeline looks like:
User action
→ FastAPI route
→ Agent processes request
→ Event emitted to Redis Stream
→ CTF Processor (background task)
→ Challenge detector fires
→ WebSocket push to browser
Everything in that chain is testable independently. That is by design.
Why Docker
The full stack — FastAPI app, PostgreSQL, Redis — runs in Docker Compose. This means
every developer runs the same environment, CI runs the same environment, and the test
database is isolated from the dev database.
For unit tests, Docker is skipped entirely. SQLite in-memory is faster and
self-contained — no containers to spin up, no teardown required. Every test gets a
fresh database in milliseconds.
The Full Tech Stack
- Backend framework — Python, FastAPI
- ORM — SQLAlchemy
- Production database — PostgreSQL
- Development / test database — SQLite
- Event streaming — Redis Streams
- Real-time updates — WebSockets
- AI providers — Ollama (local), OpenAI
- Containerization — Docker Compose
Why This Architecture Is Hard to Test
Five AI agents. Five LLM clients. An event-driven pipeline. Async background tasks. Multi-database backends. WebSocket connections. Every layer has its own failure modes:
- An agent can call the wrong tool
- An LLM client can mutate the caller's request
- An event can be emitted with PII in the payload
- A detector can fire on the wrong event type
- A WebSocket push can carry stale data
None of these failures crashes the application visibly. They manifest as incorrect behavior — an approved invoice that should be rejected, a session that leaks data across users, a Redis event that contains the full conversation history. That is what the test strategy is designed to catch.
Next: How I approach testing this system — and the one rule that changed how I write every test.
FinBot CTF is an open-source project under OWASP Agentic AI. The codebase is on GitHub.