Software · Created Jul 1, 2025 · Last updated Jul 1, 2025 · 8 min read

Autonomous Game Engineering Network for Task Synthesis

A distributed multi-agent orchestration platform where specialized AI agents autonomously collaborate to build games — decomposing creative goals into tasks, executing in parallel via git worktree isolation, and surfacing results through human-in-the-loop approval gates.

TypeScript MCP React ArcadeDB KADI LLM

Overview

Can multiple AI agents autonomously collaborate to build a game — not play one, but actually create it? Not just execute well-defined subtasks, but discuss requirements, decompose creative goals, divide the work across specialized roles, execute in parallel, verify each other’s output, and iterate — with a human stepping in only to approve, not to direct. That question is what AGENTS is built around.

AGENTS is a distributed multi-agent orchestration platform built on top of the KADI ecosystem. A human issues a high-level creative goal. An orchestrator agent called agent-producer discusses the requirements and design with the human, then decomposes the goal into tasks. Role-specific lead agents pick up those tasks and assign them to worker agents, who execute in isolated git worktrees. A QA agent validates each result, and once all tasks pass verification, the system surfaces a pull request for human approval. The entire communication layer runs through kadi-broker, which provides event routing and a message hub that lets agents written in different languages interoperate.

This is my Master’s thesis at SMU Guildhall. It took roughly 3–4 months to arrive at the current architecture, after multiple iterations of trial and error. The system is written in TypeScript and is still in active development — the core pipeline is functional, but whether agents can genuinely collaborate on something creative and iterative, rather than just executing well-defined subtasks, is the question I’m still working toward answering.

Architecture

Execution starts at agent-producer, the bridge between the human and the multi-agent system. The most important flow is the quest lifecycle: a human goal enters through agent-producer, gets decomposed into tasks, flows through agent-lead to agent-worker instances executing in parallel worktrees, passes QA validation, and returns as a pull request for human approval.
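The lifecycle above can be sketched as a small state machine. This is a minimal illustration under my own framing: the state names and the QA-failure loop back to execution are assumptions, not the project's actual API.

```typescript
// Illustrative quest lifecycle states (names are assumptions, not the real API)
type QuestState =
  | "drafting"            // human and agent-producer discuss requirements
  | "decomposing"         // goal split into tasks
  | "executing"           // agent-lead assigns tasks to workers in worktrees
  | "qa"                  // agent-qa validates results
  | "awaiting_approval"   // pull request surfaced to the human
  | "done";

// Legal transitions, encoding the pipeline described above
const transitions: Record<QuestState, QuestState[]> = {
  drafting: ["decomposing"],
  decomposing: ["executing"],
  executing: ["qa"],
  qa: ["executing", "awaiting_approval"], // a failed check loops back to execution
  awaiting_approval: ["done"],
  done: [],
};

function canTransition(from: QuestState, to: QuestState): boolean {
  return transitions[from].includes(to);
}
```

The key property the sketch captures is that the human appears only at the edges: at `drafting` (requirements) and `awaiting_approval` (the pull request), never in the middle of execution.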

| Module | Description |
| --- | --- |
| agent-producer | Orchestrator and human-facing interface — discusses requirements, decomposes goals into tasks, and stays responsive to human status queries |
| agent-quest | Dashboard bridge — provides a React frontend for human approval of requirements/design, quest observability, and agent observability |
| agent-lead | Role-specific brain (artist/designer/programmer) — picks up tasks matching its specialization, assigns them to workers, verifies results, and manages the staging branch |
| agent-worker | Task executor — receives assignments from agent-lead, works in an isolated git worktree, commits output, and requests QA review |
| shadow-agent-worker | Monitoring agent — watches each agent-playground folder for file changes, mirrors them in its own shadow playground, and commits snapshots for rollback |
| agent-qa | Validator — reviews diffs, runs visual validation, and produces structured pass/warn/fail scores |
| kadi-broker | Infrastructure — event routing and message hub enabling cross-language agent communication |
| Short-term Memory | Local JSON files for per-session agent context |
| Long-term Memory | ArcadeDB graph+document store for cross-session relationships between tasks, decisions, and context |

Design Decisions

Why I Isolated Agents with Git Worktrees Instead of Standard Branching

Early in the design phase I had to figure out how to let multiple agents work on the same codebase simultaneously without causing conflicts. The obvious risk is two agents modifying the same file at the same time — which in the best case produces a merge conflict, and in the worst case silently corrupts state that’s hard to trace back.

I looked at standard Git branching and Perforce, but both had the same limitation: they operate at the version control level, not the filesystem level. An agent on a branch still shares the working directory with anything else running on that machine. What I needed was for each agent to have a physically separate folder it could read and write freely, without any coordination overhead at runtime.

Git worktrees solved this directly. Each agent gets a full checkout of the repository in its own directory, linked to its own branch off a staging branch, `quest/{quest-id}`. Agents never touch each other’s filesystems during execution. The only point of coordination is the merge back to the staging branch, which happens after QA validation and lead verification — a clean, explicit gate rather than a fuzzy runtime lock. Worktrees are ephemeral: created when a worker starts a task, deleted after verification. This keeps disk usage minimal and ensures each worker always starts from the latest staging branch, which includes all previously verified work.
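Under this scheme, worker-side isolation reduces to a couple of git invocations. The helper names and path layout below are hypothetical; only the `git worktree add` / `git worktree remove` command shapes come from Git itself, and an agent would run them via something like `execSync` from `node:child_process`.

```typescript
// Hypothetical helpers for ephemeral worker worktrees.
// Branch naming and folder layout are assumptions about this project;
// the git subcommands themselves are standard `git worktree` usage.
function workerBranch(questId: string, taskId: string): string {
  return `quest/${questId}/task/${taskId}`;
}

function worktreeAddCommand(questId: string, taskId: string): string {
  // New branch off the quest staging branch, checked out into its own folder,
  // so the worker never shares a working directory with anyone else
  return `git worktree add -b ${workerBranch(questId, taskId)} ../worktrees/${taskId} quest/${questId}`;
}

function worktreeRemoveCommand(taskId: string): string {
  // Ephemeral: run after QA validation and lead verification
  return `git worktree remove ../worktrees/${taskId}`;
}
```

The merge back to `quest/{quest-id}` stays outside these helpers on purpose: it is the one coordinated step, gated on verification rather than performed by the worker itself.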

Why I Used ArcadeDB Instead of Postgres for Agent Memory

I needed a way to persist agent memory across sessions — not just what tasks were completed, but the relationships between them. Which tasks depended on which others? What context did one agent pass to the next? How did a decision made three sessions ago affect the current state of the codebase?

That’s inherently graph-shaped data, and a relational database would have made the schema awkward. But I also needed document-style storage for raw task payloads, which made a pure graph database feel like overkill in the other direction. ArcadeDB supports graph, document, and relational paradigms in a single engine, which meant I didn’t have to split my persistence layer across two systems or write an abstraction to unify them. It’s also fully open source, which matters for a research project where I can’t rely on a managed cloud service.
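A rough sketch of what the hybrid model buys: the raw task payload lives as a document on a vertex, while dependencies are first-class edges. The vertex and edge type names below (`TaskNode`, `DependsOn`) are my own illustrations, not the project's actual schema; the command shapes follow ArcadeDB's SQL dialect.

```typescript
// Sketch of long-term memory writes in ArcadeDB's SQL dialect.
// Type names (TaskNode, DependsOn) and the payload shape are assumptions.
function createTaskCommand(taskId: string, payload: object): string {
  // Document-style storage: the raw payload lives on the vertex itself
  return `CREATE VERTEX TaskNode SET taskId = '${taskId}', payload = ${JSON.stringify(JSON.stringify(payload))}`;
}

function createDependencyCommand(fromTask: string, toTask: string): string {
  // Graph-style storage: a dependency is an edge, so "what did this task
  // depend on three sessions ago?" becomes a traversal, not a join
  return (
    `CREATE EDGE DependsOn ` +
    `FROM (SELECT FROM TaskNode WHERE taskId = '${fromTask}') ` +
    `TO (SELECT FROM TaskNode WHERE taskId = '${toTask}')`
  );
}
```

In a relational schema the same question would need a self-referencing join table and recursive queries; here both the document and the relationship live in one engine.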

Why agent-producer Uses Network Tool Discovery and Recursive Tool Calling

The hardest design problem in agent-producer wasn’t any single feature — it was finding the balance between smart and rigid behavior. agent-producer is the bridge between the human, the agent-quest dashboard, and the entire multi-agent system. It has to be intelligent enough to decompose creative goals, but constrained enough to behave predictably.

The solution was a combination of five mechanisms: network-scoped tool discovery (so agent-producer only sees tools relevant to its current network context), recursive tool calling (so it can chain operations like plan → analyze → reflect → split without hardcoded sequences), short-term memory injection (so it retains conversation context within a session), system prompt injection (so its behavior adapts to the current quest phase), and tool schemas that contain next-step hints (so the LLM gets gentle guidance without losing flexibility). Together, these make agent-producer behave as expected — responsive to the human, capable of complex decomposition, but not so flexible that it hallucinates tool calls or drifts off-task.
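The recursive tool calling piece can be sketched as a loop: invoke the LLM, execute whatever tool calls it emits, inject the results back into context, and repeat until the model stops calling tools. Every type and name below is illustrative; this is the control-flow shape, not agent-producer's real implementation.

```typescript
// Minimal recursive tool-calling loop. All types and names are illustrative.
interface ToolCall { name: string; args: Record<string, unknown>; }
interface LlmTurn { toolCalls: ToolCall[]; text?: string; }

type Llm = (history: string[]) => LlmTurn;
type ToolRunner = (call: ToolCall) => string;

function runRecursive(llm: Llm, runTool: ToolRunner, goal: string, maxDepth = 8): string[] {
  const history: string[] = [goal];
  for (let depth = 0; depth < maxDepth; depth++) {
    const turn = llm(history);
    if (turn.toolCalls.length === 0) break; // model is done chaining
    for (const call of turn.toolCalls) {
      // Short-term memory injection: each tool result goes back into context,
      // so the model can decide its next step instead of following a script
      history.push(`${call.name} -> ${runTool(call)}`);
    }
  }
  return history;
}
```

The `maxDepth` cap is one of the "rigid" constraints: the model chooses the chain, but the runtime bounds how long that chain can get.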

Challenges

Why Agents Made Inconsistent Tool Calls — and What Actually Fixed It

The most instructive failure so far came from a design assumption I made early and held onto too long: that a single agent could handle a large number of tools effectively.

My initial thinking was reasonable — give each agent access to everything it might need, and let it figure out what’s relevant to the current task. In practice, the agent started making inconsistent tool calls. It would choose the wrong tool for a given context, or invoke tools in a sequence that didn’t make sense. The behavior wasn’t random, but it wasn’t reliable either.

The problem turned out to be the size of the tool surface itself. When an LLM-backed agent has too many tools registered, the decision quality degrades — there’s too much to reason over, and the model’s attention gets diluted. The fix was to narrow each agent’s tool set to only what’s relevant to its specific role and network context. KADI has a concept of tool visibility per network rather than globally, and leaning into that feature brought the behavior back in line. It also had a secondary benefit: it made the system easier for me to reason about. When something goes wrong, a smaller tool surface means a smaller space to debug.
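The shape of the fix is simple to illustrate: tools declare which networks they belong to, and an agent's visible tool set is filtered by its current network context before anything reaches the LLM. The field names and example registry below are illustrative, not KADI's actual schema.

```typescript
// Sketch of network-scoped tool visibility. Field names are illustrative.
interface ToolDef { name: string; networks: string[]; }

function visibleTools(registry: ToolDef[], network: string): ToolDef[] {
  // Only tools registered for this network context reach the LLM,
  // shrinking the decision surface it has to reason over
  return registry.filter((t) => t.networks.includes(network));
}

const registry: ToolDef[] = [
  { name: "quest_plan_task", networks: ["producer"] },
  { name: "qa_review_diff", networks: ["qa"] },
  { name: "worktree_commit", networks: ["worker"] },
];
```

The same filter that improves the model's tool selection also shrinks the debugging surface: a misbehaving producer agent can only have called producer-network tools.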

Debugging at the Boundary Between Two Evolving Systems

AGENTS is built on top of KADI, and both are in active development simultaneously. When something fails, I often can’t immediately tell whether the bug is in my orchestration logic or in the KADI layer beneath me.

I’ve had to develop a more disciplined approach to isolation — forming a hypothesis, writing a minimal reproduction, and then deciding whether to dig deeper myself or bring it to the KADI team. Working at the boundary between two evolving systems has made me more careful about what assumptions I document and what I leave implicit. The lesson isn’t technical — it’s about the discipline required when your foundation is also moving.

Code

The core of agent-producer’s behavior comes from how it chains tool calls during task decomposition. After a quest is approved, agent-producer doesn’t just split the goal into tasks in one shot — it runs a four-step pipeline: plan the tasks, analyze dependencies and agent capabilities, reflect on whether the decomposition is sound, then split into concrete assignments. Each step is a separate tool call to mcp-server-quest, and the output of each feeds into the next via short-term memory injection. This matters because a single-shot decomposition consistently produced tasks that were either too coarse or missed cross-role dependencies.

```typescript
// Quest decomposition pipeline — each step feeds the next
// via short-term memory, not hardcoded chaining
const questId = event.payload.questId;

// 1. Survey available agents and their capabilities
const agents = await callTool("quest_list_agent", { questId });

// 2. Four-step decomposition: plan → analyze → reflect → split
const plan = await callTool("quest_plan_task", {
  questId,
  agents,
});
const analysis = await callTool("quest_analyze_task", {
  questId,
  plan,
});
const reflection = await callTool("quest_reflect_task", {
  questId,
  analysis,
});
const tasks = await callTool("quest_split_task", {
  questId,
  reflection,
});
```

Technical Specifications

| Component | Technology |
| --- | --- |
| Orchestration | agent-producer (TypeScript, LLM-driven) |
| Task Dashboard | agent-quest (React frontend + WebSocket bridge) |
| Broker Protocol | kadi-broker (event routing, message hub, cross-language) |
| Worker Runtime | TypeScript agents in isolated git worktrees |
| QA Validation | agent-qa (diff review, vision-based validation, structured scoring) |
| Long-term Memory | ArcadeDB (graph + document hybrid) |
| Short-term Memory | Local JSON |
| Client Interface | Discord, Slack, agent-quest Web Dashboard |
| LLM Provider | Anthropic Claude (via KADI tool calls) |
| Deployment | Podman containers, KADI CLI (kadi build / kadi deploy) |
| Target Platforms | Local dev, Akash Network, DigitalOcean |