Week 1: Architecture of an AI Financial Platform
- carocsteads
- Mar 3
Updated: Apr 3
I've been working on FinBot CTF — an AI-powered financial platform built for the OWASP Agentic AI project. The goal is to explore what happens when you give AI agents real financial responsibilities: onboarding vendors, processing invoices, flagging fraud, and authorizing payments. But before I write about how I test it, I need to explain what I'm actually testing. Because the architecture is what makes this hard.
What Does FinBot Do?
FinBot is a vendor management portal. Vendors log in, submit invoices, communicate with an AI assistant, and get paid. Behind the scenes, AI agents handle the workflow — deciding whether to approve a vendor, whether an invoice looks legitimate, and whether a payment should be flagged. The CTF part means players try to manipulate the AI agents into doing things they shouldn't. Get an invoice approved that should be rejected. Convince the onboarding agent to bypass compliance checks. The system detects these attempts and awards points. That dual purpose — real financial workflows and a security challenge layer — is what shapes every architectural decision.
The Layers
Frontend: Three web interfaces built with plain HTML, JavaScript, and CSS served through FastAPI:
- Vendor Portal — where vendors interact with the AI assistant, submit invoices, and check payment status
- Admin Portal — where administrators monitor the system, review vendor applications, and manage configuration
- CTF Portal — where players view active challenges, track their score, and see real-time detection events as they attempt to manipulate the agents
All three communicate with the backend via API calls and receive real-time updates over WebSockets.
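As a rough illustration of the real-time side, a FastAPI WebSocket endpoint that pushes updates to connected portals can look like the sketch below. The route and payload shape are illustrative, not FinBot's actual API.

```python
# Minimal sketch of a WebSocket push endpoint in FastAPI.
# The route and payload shape are illustrative, not FinBot's actual API.
from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()
active_connections: list[WebSocket] = []

@app.websocket("/ws/ctf")
async def ctf_updates(websocket: WebSocket):
    await websocket.accept()
    active_connections.append(websocket)
    try:
        while True:
            # Keep the connection open; the client may send pings or filters.
            await websocket.receive_text()
    except WebSocketDisconnect:
        active_connections.remove(websocket)

async def broadcast(event: dict):
    # Called by the backend when a detection event should reach the browser.
    for connection in active_connections:
        await connection.send_json(event)
```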
Backend: FastAPI
The backend is Python and FastAPI. FastAPI handles routing, authentication middleware, session management, and WebSocket connections. SQLAlchemy sits on top of the database layer — PostgreSQL in production, SQLite in development and testing. The split between PostgreSQL and SQLite matters for testing. SQLite's in-memory mode makes fast, isolated unit tests possible. Every test gets a clean database with zero setup time.
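To illustrate, a pytest fixture built on SQLAlchemy's in-memory SQLite support looks roughly like this. It is a generic sketch, not the project's actual fixture; `Base` stands in for the project's declarative base.

```python
# Sketch of an in-memory SQLite fixture for isolated unit tests.
# `Base` stands in for the project's SQLAlchemy declarative base.
import pytest
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker, declarative_base

Base = declarative_base()

@pytest.fixture
def db_session():
    engine = create_engine("sqlite:///:memory:")
    Base.metadata.create_all(engine)   # fresh schema for every test
    Session = sessionmaker(bind=engine)
    session = Session()
    try:
        yield session
    finally:
        session.close()
        engine.dispose()               # nothing on disk to tear down
```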
The AI Agent Layer
Five specialized agents handle all financial workflows:
- Onboarding Agent — reviews new vendor applications, validates documents, and makes approval decisions
- Invoice Agent — processes invoice submissions, validates amounts, and checks against policy
- Payments Agent — handles payment authorization and transaction queries
- Fraud Agent — monitors for anomalies, flags suspicious patterns, and runs compliance checks
- Communication Agent — sends notifications, generates status updates, and handles vendor inquiries
Each agent is backed by an LLM. But agents don't talk to LLMs directly — they go through an LLM integration layer made up of five client classes. One routes requests to the right provider, one wraps any client with session identity and observability, and one is a deterministic fake used in tests. I'll dedicate an entire post to those clients.
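To give a taste of what that layer enables, a deterministic fake client for tests can be as small as the sketch below. The class and method names are hypothetical, not the project's real ones.

```python
# Sketch of a deterministic fake LLM client for tests (names hypothetical).
# It returns canned responses so tests never depend on a live provider.
from dataclasses import dataclass, field

@dataclass
class FakeLLMClient:
    # Responses the test wants the "LLM" to produce, in order.
    scripted_responses: list[dict] = field(default_factory=list)
    calls: list[dict] = field(default_factory=list)  # recorded for assertions

    def complete(self, messages: list[dict], tools: list[dict] | None = None) -> dict:
        self.calls.append({"messages": messages, "tools": tools})
        if self.scripted_responses:
            return self.scripted_responses.pop(0)
        return {"role": "assistant", "content": "OK"}  # default final answer
```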
The Agent Framework
The agent framework is the infrastructure that sits between the LLM and the rest of the
system. It handles everything that is not the AI decision itself.
When a user sends a message, the framework does not simply forward it to the LLM and
return whatever comes back. It runs a loop.
It sends the user message — along with the full list of available tools — to the LLM.
The LLM responds with either a final answer or a tool call. If it is a tool call, the
framework looks up the Python function by name, validates the arguments, executes it,
and sends the result back to the LLM as the next input. The loop continues until the
LLM produces a final answer with no further tool calls.
One user message can result in multiple LLM calls. The agent might look up a vendor,
then check the invoice status, then query the payment history — three separate tool
calls, three separate LLM responses, all before the user sees anything. The framework
manages that entire sequence.
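In pseudocode, the loop boils down to something like this. It is a sketch of the idea rather than FinBot's actual implementation, and the function and field names are mine.

```python
# Sketch of the agent loop described above (names hypothetical).
async def run_agent_turn(llm, tools, registry, messages):
    while True:
        response = await llm.complete(messages, tools=tools)
        messages.append(response)                       # keep the assistant turn in history
        if not response.get("tool_calls"):
            return response["content"]                  # final answer, loop ends
        for call in response["tool_calls"]:
            func = registry[call["name"]]               # look up the Python function by name
            result = func(**call["arguments"])          # argument validation omitted here
            messages.append({"role": "tool",
                             "name": call["name"],
                             "content": str(result)})   # result becomes the next LLM input
```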
It also does three other things:
Session scoping. Every agent runs attached to a specific authenticated session — a
specific user in a specific namespace. The framework passes that session context to
every tool call automatically, so tools know whose data they are operating on without
the LLM having to specify it.
Tool dispatch. The framework maintains a registry of callable functions keyed by tool
name. When the LLM generates a tool call, the framework looks up the function, validates
the arguments against the parameter schema, and executes it. The LLM never calls Python
functions directly — everything goes through this dispatch layer.
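Putting the last two points together, a dispatch layer in this spirit might look like the following sketch, with the session context injected alongside the LLM-supplied arguments. Everything here, from the class name to the schema shape, is illustrative rather than FinBot's real code.

```python
# Sketch of a tool dispatch layer with session injection (illustrative names).
from typing import Any, Callable

class ToolRegistry:
    def __init__(self):
        self._tools: dict[str, tuple[Callable, dict]] = {}

    def register(self, name: str, func: Callable, schema: dict):
        self._tools[name] = (func, schema)

    def dispatch(self, name: str, arguments: dict, session_ctx: Any):
        func, schema = self._tools[name]          # unknown tool names raise KeyError
        required = schema.get("required", [])
        missing = [p for p in required if p not in arguments]
        if missing:
            raise ValueError(f"missing arguments for {name}: {missing}")
        # The session is passed in by the framework, never by the LLM.
        return func(session=session_ctx, **arguments)
```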
Event emission. Before and after each LLM call, the framework emits events to Redis
Streams: what was sent, what came back, what tools were called, what the results were.
Those events are what the CTF detectors read to decide whether an attack happened. The
observability and the security scoring are the same data pipeline.
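For reference, emitting an event to a Redis Stream with redis-py is a single XADD. The stream name below matches the one described later in this post; the payload fields are my own assumption.

```python
# Sketch of event emission to Redis Streams with redis-py (payload fields assumed).
import json
import redis.asyncio as redis

r = redis.Redis()

async def emit_agent_event(event_type: str, payload: dict):
    # XADD appends an entry to the stream; the CTF processor reads it via a consumer group.
    await r.xadd("finbot:events:agents", {
        "type": event_type,
        "payload": json.dumps(payload),
    })
```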
The base class for all agents is BaseAgent. OrchestratorAgent, VendorChatAssistant,
CoPilotAssistant, and the five specialized agents all inherit from it. The loop, the
tool dispatch, and the event emission live in the base class. Each agent defines its
own system prompt, tool list, and business logic on top of that shared infrastructure.
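In outline, that inheritance looks something like this. The class bodies are placeholders; only the base-class idea and the agent names come from the post itself.

```python
# Sketch of the inheritance pattern described above (attribute names hypothetical).
class BaseAgent:
    system_prompt: str = ""
    tools: list[dict] = []

    async def handle(self, session_ctx, user_message: str) -> str:
        # Shared loop, tool dispatch, and event emission live here.
        ...

class InvoiceAgent(BaseAgent):
    system_prompt = "You review invoice submissions against policy."
    tools = [
        {"name": "get_invoice", "required": ["invoice_id"]},
        {"name": "check_policy_threshold", "required": ["amount"]},
    ]
    # Invoice-specific business logic goes here.
```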
Understanding this framework is what makes the rest of the test strategy make sense.
When I say a tool has two interfaces — the schema the LLM reads and the function that
executes — it is the framework that sits between them. When I say session scoping must
be enforced at the tool level, it is because the framework passes the session in but
does not enforce what the tool does with it. When I say event emission is an attack
surface, it is because the framework emits before every call, and a logging failure
takes down the whole request.
Every testing decision in this series connects back to how this framework works.
The Data Layer
Five databases back the system:
- Vendors — profiles, approval status, risk scores
- Invoices — submissions, amounts, approval history
- Agent Memory — conversation context, learned preferences
- Config — feature flags, policy thresholds, system settings
- Emails — templates, sent messages, notification logs
The Piece Most People Get Wrong: Redis Streams
Every time an agent makes a decision, a tool call happens, or a user submits something,
an event is emitted to Redis Streams.
Most people hear "Redis" and think cache. Redis Streams is different — it is a
persistent, ordered log of messages, similar to Apache Kafka but lightweight and built
into Redis. Each event is written to a stream, consumed by a processor group, and
stored as a CTFEvent record in the database.
This is what makes the CTF layer possible. The event processor runs as a background
task, reading from two streams:
- finbot:events:agents — everything the AI agents do (tool calls, LLM requests,
decisions)
- finbot:events:business — everything that happens to the data (invoice submitted,
vendor approved, payment processed)
The processor checks each event against challenge detectors and badge evaluators. If a
detector fires — say, an agent was manipulated into bypassing an invoice threshold —
the challenge is marked complete and the player gets points. All of this happens
asynchronously, without blocking the main request. The WebSocket layer then pushes the
update to the player's browser in real time.
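On the consuming side, a background processor in this spirit could read from both streams with a consumer group. The stream handling below uses real redis-py calls; the detector interface is an assumption on my part.

```python
# Sketch of the background processor reading from Redis Streams (detector API assumed).
import redis.asyncio as redis
from redis import exceptions

r = redis.Redis(decode_responses=True)

async def process_events(detectors):
    streams = ["finbot:events:agents", "finbot:events:business"]
    for stream in streams:
        try:
            await r.xgroup_create(stream, "ctf-processor", id="0", mkstream=True)
        except exceptions.ResponseError:
            pass  # consumer group already exists
    while True:
        entries = await r.xreadgroup(
            "ctf-processor", "worker-1",
            {s: ">" for s in streams}, count=10, block=5000,
        )
        if not entries:
            continue
        for stream, events in entries:
            for event_id, fields in events:
                for detector in detectors:
                    if detector.matches(fields):          # e.g. an invoice-threshold bypass
                        await detector.mark_complete(fields)
                await r.xack(stream, "ctf-processor", event_id)
```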
Here is what that pipeline looks like:
User action
→ FastAPI route
→ Agent processes request
→ Event emitted to Redis Stream
→ CTF Processor (background task)
→ Challenge detector fires
→ WebSocket push to browser
Everything in that chain is testable independently. That is by design.
Why Docker
The full stack — FastAPI app, PostgreSQL, Redis — runs in Docker Compose. This means
every developer runs the same environment, CI runs the same environment, and the test
database is isolated from the dev database.
For unit tests, Docker is skipped entirely. SQLite in-memory is faster and
self-contained — no containers to spin up, no teardown required. Every test gets a
fresh database in milliseconds.
The Full Tech Stack
- Backend framework — Python, FastAPI
- ORM — SQLAlchemy
- Production database — PostgreSQL
- Development / test database — SQLite
- Event streaming — Redis Streams
- Real-time updates — WebSockets
- AI providers — Ollama (local), OpenAI
- Containerization — Docker Compose
Why This Architecture Is Hard to Test
Five AI agents. Five LLM clients. An event-driven pipeline. Async background tasks. Multi-database backends. WebSocket connections. Every layer has its own failure modes:
- An agent can call the wrong tool
- An LLM client can mutate the caller's request
- An event can be emitted with PII in the payload
- A detector can fire on the wrong event type
- A WebSocket push can carry stale data
None of these failures crashes the application visibly. They manifest as incorrect behavior — an approved invoice that should be rejected, a session that leaks data across users, a Redis event that contains the full conversation history. That is what the test strategy is designed to catch.
Next: How I approach testing this system — and the one rule that changed how I write every test.
FinBot CTF is an open-source project under OWASP Agentic AI. The codebase is on GitHub.