A Reading · Research Companion
What is Neil?
An autonomous engineer is the surface of a system, not a prompt in a box. A reading of the applied paper — five layers of optimization, one learning engine, and the architectural reasons that "AI engineer" is not a thing you can buy.
Neil is the name a deployment team gave to their AI engineer. Underneath the name is an autonomous agent deployed as a peer member of a software development team. It picks up tickets, investigates issues, writes code, ships pull requests with tests and evidence, posts standups, and responds to code review. Human approvals are required before any of its work merges.
That is the surface. Underneath it is a vertically integrated engineering system — designed to do the work and to get better at doing the work, automatically, with measurable outcomes.
Neil is not an AI coding assistant. Neil is an AI engineer, and the distinction is real because the visible behavior is the surface expression of simultaneous optimization across five distinct layers of the stack.
§ 01 · FOUNDATION
Principles, taken as given.
The paper is the applied companion to earlier theoretical work on the ontological foundations of agentic systems. That prior work argues, at length, for a particular set of design commitments. This paper takes those commitments as given and shows what it looks like to actually build with them.
The principles, briefly. Identity precedes function — define what agents are before what they do; behavior follows from ontological clarity rather than from behavioral instruction. Bounded epistemology — agents should know the limits of their knowledge and operate strictly within their epistemic domain; epistemic overreach is a dominant failure mode. Perspectival knowing — knowledge is role-dependent; what an agent knows is a function of its boundaries, not a global truth claim. Separation of what things are from what is known about them — the entity, its description, and reasoning over it are distinct concerns and should live in distinct representations.
The theoretical case for these principles has been made elsewhere. The novelty here is that they are operational — that there is a running system, with measurable outcomes, in which the principles are not aspirations but architectural facts.
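To make "architectural facts" concrete before the tour begins, here is a minimal sketch, in illustrative Python, of what it looks like when identity and epistemic bounds are data the runtime checks rather than sentences in a prompt. Every name below is invented for illustration; none of it is the paper's actual API.

```python
from dataclasses import dataclass

# Illustrative only. The point: the agent's identity and epistemic domain
# are structured data enforced by the runtime, not behavioral instructions
# a model might ignore.

@dataclass(frozen=True)
class Identity:
    role: str                          # what the agent *is*, defined first
    epistemic_domain: frozenset[str]   # topics it may reason about

@dataclass
class BoundedAgent:
    identity: Identity

    def answer(self, topic: str, question: str) -> str:
        # Bounded epistemology enforced structurally: out-of-domain
        # questions are refused before any model call is made.
        if topic not in self.identity.epistemic_domain:
            return f"outside my domain ({self.identity.role}); deferring"
        return f"[{self.identity.role}] reasoning about: {question}"

bug_analyst = BoundedAgent(Identity(
    role="bug-investigation specialist",
    epistemic_domain=frozenset({"failure-paths", "stack-traces"}),
))
print(bug_analyst.answer("failure-paths", "why does the retry loop hang?"))
print(bug_analyst.answer("ui-design", "should the button be blue?"))
```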
§ 02 · THE OPTIMIZATION HIERARCHY
Five layers. Each one a ceiling.
Every AI-driven use case has five layers that determine its quality, and the layers are not equal. Each one sets a ceiling on what the layers below it can achieve. Optimizing at any single layer reaches a local maximum; the absolute maximum requires simultaneous optimization across all five.
The metaphor is worth making explicit. The model is not a foundation that supports weight from above, like a building's base. It is a ceiling. The other four layers determine how close you get to it, but they cannot exceed it. A weak model with perfect prompts, context, and tooling produces polished mediocrity — every downstream optimization working within a low upper bound. A strong model with no optimization underneath stands on the floor. The whole point of the system is to raise the ceiling and then reach it.
The Five-Layer Optimization Hierarchy
Each layer constrains the layers below it. Compounding gains require simultaneous optimization, not depth at any single layer.
- Layer 01 · Model
The Research Lab.
↑ ceiling for everything below — which model runs the workload
Owned hardware on Blackwell architecture. Multiple DGX Spark units, a dedicated fabric for multi-node tensor parallelism, frontier-class open models served and fine-tuned without a cloud provider. The control surface is wide: model selection by domain benchmark, fine-tuning on real codebases and ticket patterns, inference configuration tunable per workload, unconstrained experimentation after hardware amortization.
- Layer 02 · System Prompt
Per-stage identity.
↑ orders of magnitude within the model's ceiling
A custom agent harness provides per-stage, per-archetype control. A bug investigation agent receives a fundamentally different orientation than a feature implementation agent. Identity precedes function; behavioral expectations flow from it. System prompts are continuously optimized — not a one-time authoring effort but a tunable parameter under empirical pressure.
- Layer 03 · User Prompt
Structured task framing.
↑ significant, but bounded by system prompt quality
If the system prompt defines who the agent is, the user prompt defines what it does in this specific execution within that identity. In a pipeline architecture, user prompts are mandates — precisely engineered task specifications, not freeform chat. Each pipeline stage's mandate is itself a tunable parameter, scored against the same evaluation corpus that scores system prompts.
- Layer 04 · Context
Lux and Corpus.
↑ determines whether the right answer is reachable
Two systems. Lux understands what the code is — semantic retrieval, structural code intelligence, relation-aware retrieval, domain and expert discovery, with LSP as one of several structural lenses. Corpus understands what is known about it — a structured, git-backed repository of architectural decisions, methodology, runbooks, accumulated research findings. Together they answer the questions a senior engineer would ask before writing any code.
- Layer 05 · Tooling
Purpose-built capabilities.
↑ determines whether the right answer is executable
The harness controls tool exposure per stage and per task type. Counterintuitively but empirically: models given fewer, more relevant tools often outperform the same models given full access to a broad tool set. Constrained tools force more disciplined reasoning. Tool composition is itself an optimization target — discovering load-bearing tools versus ancillary ones, finding combinations that produce emergent quality neither tool alone provides.
A weak model with perfect everything else produces polished mediocrity. A strong model with nothing under it stands on the floor. The compounding gains live in the simultaneity.
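The ceiling argument can be stated as toy arithmetic. A minimal sketch with invented numbers: the model sets the upper bound, and the four layers below determine what fraction of that bound the system actually reaches, so weakness anywhere multiplies through.

```python
from dataclasses import dataclass

# A toy model of the ceiling argument, not the paper's math. All numbers
# are illustrative: the model sets the bound, the other layers capture
# some fraction of it.

@dataclass(frozen=True)
class StackConfig:
    model_ceiling: float   # layer 01: the best quality the model can reach
    system_prompt: float   # layers 02-05: each in [0, 1], the fraction of
    user_prompt: float     #   the ceiling's headroom actually captured
    context: float
    tooling: float

    def realized_quality(self) -> float:
        reach = (self.system_prompt * self.user_prompt
                 * self.context * self.tooling)
        return self.model_ceiling * reach

# The section's three cases: polished mediocrity, standing on the floor,
# and raising the ceiling and then reaching it.
weak_model_tuned   = StackConfig(0.55, 0.95, 0.95, 0.95, 0.95)
strong_model_bare  = StackConfig(0.95, 0.40, 0.40, 0.40, 0.40)
strong_model_tuned = StackConfig(0.95, 0.95, 0.95, 0.95, 0.95)

for cfg in (weak_model_tuned, strong_model_bare, strong_model_tuned):
    print(round(cfg.realized_quality(), 3))
```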
§ 03 · THE LEARNING ENGINE
An automated research platform.
The infrastructure described above is static without a mechanism to improve it. Researcher — the internal name for the learning engine — is that mechanism. It is an automated research platform that runs experiments across the full optimization hierarchy, treating every prompt, every configuration, every context strategy, and every tool exposure as a tunable parameter to be empirically calibrated.
The loop is simple in shape and demanding in practice. Identify a tunable parameter. Generate variants. Run experiments against a corpus of test cases with known-good baselines. Score the results quantitatively, against defined metrics, not subjective assessments. The winning variant becomes the new baseline. Losing variants — and the reasons they failed — become institutional knowledge. Repeat.
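In code, the shape of that loop is small. A minimal sketch with stand-in variant generation and scoring; the real system scores against a corpus of test cases with defined metrics, and the names here are invented for illustration.

```python
import random
from dataclasses import dataclass
from typing import Callable

@dataclass
class Experiment:
    variant: str
    score: float

def optimize(baseline: str,
             generate_variants: Callable[[str], list[str]],
             score: Callable[[str], float],
             rounds: int = 3) -> tuple[str, list[Experiment]]:
    # Every variant and its score is recorded; losing variants, and the
    # reasons they failed, become institutional knowledge.
    history: list[Experiment] = []
    best, best_score = baseline, score(baseline)
    for _ in range(rounds):
        for variant in generate_variants(best):
            s = score(variant)
            history.append(Experiment(variant, s))
            if s > best_score:           # the winner becomes the new baseline
                best, best_score = variant, s
    return best, history

# Toy stand-ins for the corpus-backed pieces.
def generate_variants(prompt: str) -> list[str]:
    return [f"{prompt} [v{random.randint(0, 999)}]" for _ in range(4)]

def score(prompt: str) -> float:
    return random.random()               # stand-in for corpus evaluation

best, history = optimize("trace the failure path", generate_variants, score)
print(best, "after", len(history), "experiments")
```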
What this loop runs over splits into two surfaces. Pipeline optimization — every prompt-driven decision point in the development lifecycle, from triage and routing through estimation and review. Execution optimization — the actual code that ships, with the considerable advantage that approximately 80% of execution quality metrics are binary and automated. Tests pass or fail. Types check or don't. The build succeeds or it doesn't. Expensive frontier models are not required to judge whether code is good. The code itself tells you.
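The binary gates are literally exit codes. A sketch, assuming a Python project that happens to use pytest, mypy, and ruff; the gate set in practice is whatever the project's own test, typecheck, and build commands are.

```python
import subprocess

# Each gate is a command whose exit code is the metric: pass or fail,
# no judge model required. Commands are illustrative assumptions.
GATES = {
    "tests": ["pytest", "-q"],
    "types": ["mypy", "."],
    "lint":  ["ruff", "check", "."],
}

def run_gates() -> dict[str, bool]:
    results = {}
    for name, cmd in GATES.items():
        proc = subprocess.run(cmd, capture_output=True)
        results[name] = proc.returncode == 0   # binary and automated
    return results

if __name__ == "__main__":
    print(run_gates())
```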
§ 04 · THE TIER ESCALATION MODEL
Each tier builds on the one below.
Researcher operates across seven optimization tiers, each providing a different kind of leverage over quality — and each building on the signal produced by the tiers before it. Lower tiers are cheap and fast. Higher tiers are expensive and slow, but they consume the corpus of findings that lower tiers accumulate, which is what makes them possible at all.
- Tier 0 · Prompt
Prompt variant optimization.
Generate variants of the subject prompt, test against the evaluation corpus, select the best performer. Automated variant generation discovers things intuition misses — phrasing patterns that consistently outperform, framing choices that reduce hallucination, constraint injection that improves structural consistency.
Finding: Framing a bug investigation as "trace the failure path" rather than "investigate this bug" reduces surface-level analysis that misses the root cause by 23% on the evaluation corpus.
- Tier 1 · Hyper
Hyperparameter optimization.
Temperature, top-p, top-k, repetition penalties, sampling strategy. These interact with subject matter in non-obvious ways.
Finding: Temperature 0.4 outperforms 0.1 on feasibility investigation — the task benefits from exploratory reasoning. But 0.1 outperforms 0.4 on structured output stages where consistency matters more than creativity.
- Tier 2 · Routing
Model routing.
Which model handles which stage. Not all stages require the same capability, and capability scales with cost. The boundary between models is discovered empirically by testing each stage across the candidate model set, and it shifts as models improve. A sketch after the tier list shows the shape of the configuration this discovery produces.
Finding: Routine decomposition of well-specified feature tickets runs at quality parity on a smaller, cheaper model. Bug investigation of complex concurrency failures consistently requires the higher-capability one.
- Tier 3 · RAG
Retrieval optimization.
The wrong context is worse than no context — it consumes reasoning budget on irrelevant information and can actively mislead. Tier 3 targets retrieval strategy: what to inject, how much, in what form, from which sources, at what dependency depth, with what temporal weighting.
Finding: Injecting only interface definitions and docstrings of direct dependencies — rather than full file contents — reduces token consumption by 60% while maintaining 95% of investigation quality on the majority of tickets.
- Tier 4 · Tools
Tool configuration.
Which tools are available to the agent during a given stage. More tools are not always better; broader tool sets can produce unfocused behavior. Researcher experiments with composition per stage and per archetype, discovering load-bearing tools versus ancillary ones.
Finding: During the Architecture stage, combining graph-based dependency traversal with semantic similarity search against prior architectural decisions produces qualitatively better analysis than either tool alone.
- Tier 5 · LoRA
Fine-tuning.
Researcher moves from optimizing how a model is used to optimizing the model itself. Training signal comes from the evaluation corpus the lower tiers have accumulated: high-scoring evaluations become positive training examples, low-scoring evaluations with known failure modes become negative ones. This is how lab-tuned models have exceeded frontier API performance — not because the base model is more capable in general, but because it has been specifically trained on the patterns that matter for the evaluation criteria.
- Tier 6 · Data
Dataset curation.
Fine-tuning is only as good as the training data. Tier 6 is the most human-intensive layer — deliberate curation of the dataset that shapes what Tier 5 learns from. Positive example selection, negative example selection, anti-pattern injection, distribution management across archetypes, quality floor enforcement. This is the layer where human judgment about what constitutes excellent work gets encoded into the training signal that shapes the model's future behavior.
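Concretely, the lower tiers converge on configuration artifacts like the one sketched here, which encodes the Tier 1 and Tier 2 findings above. Stage names, model names, and values are illustrative stand-ins for what the system discovers empirically.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StageConfig:
    model: str          # tier 2: routing, discovered per stage
    temperature: float  # tier 1: hyperparameters, discovered per stage

# Illustrative baseline mirroring the findings above; in the system these
# values are empirical, not hand-set, and shift as models improve.
ROUTING_BASELINE = {
    # exploratory reasoning benefits from higher temperature (tier 1)
    "feasibility_investigation": StageConfig("large-model", 0.4),
    # structured output wants consistency over creativity (tier 1)
    "structured_scoping":        StageConfig("large-model", 0.1),
    # well-specified decomposition runs at parity on the cheaper model (tier 2)
    "feature_decomposition":     StageConfig("small-model", 0.1),
}

def config_for(stage: str) -> StageConfig:
    # The baseline holds until an experiment beats it.
    return ROUTING_BASELINE[stage]

print(config_for("feasibility_investigation"))
```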
The flywheel is the mechanism. Every ticket that flows through the system enriches every layer. The evaluation corpus grows automatically. Researcher gets better at optimizing each stage as more work flows through. The fine-tuning dataset gets richer. The model improves. The improved model produces better evaluations. Better evaluations produce better training signal. The cycle is self-reinforcing — and the asymmetry between cost and value compounds with every iteration.
The system is designed to improve itself, with quantitative evidence at every step.
§ 05 · RECON
An investigation engine.
Researcher is the learning engine. Recon is its most heavily optimized consumer — the investigation system that evaluates development work before implementation begins. When a ticket enters the pipeline, the first question is not how do we implement this. It is do we understand this well enough to implement it correctly. Recon answers that question. Its output is a set of targeted pushback questions — the questions a senior engineer would ask during planning, generated systematically and grounded in actual codebase analysis.
Recon is where bounded epistemology becomes architectural. Each investigation stage is a distinct agent with a single mandate, typed input and output, and a bounded epistemic domain. No stage sees the full investigation. No stage is permitted to reason outside its scope. The pipeline is the mechanism that enforces the principle structurally, rather than relying on prompt-level instructions a model might ignore. The exact stage count, branch structure, and archetype taxonomy are the current baseline, downstream of Researcher's optimization rather than architectural constants. A sketch of the per-stage contract follows the stage list below.
- Stage 1 · Reconnaissance: codebase intelligence gathering
- Stage 2 · Archetype Classification: classify the kind of work
- Stage 3 · Problem Scoping: constrain the problem
- Stage 4 · Investigation: branches for bugs · features · spikes
Output: targeted pushback questions, grounded in codebase signals and cross-source evidence.
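A minimal sketch of the per-stage contract: typed input, typed output, one mandate. The stage names and fields are illustrative, not the actual Recon schema; the point is that the bound is enforced by what the stage can see, not by what it is told.

```python
from dataclasses import dataclass
from typing import Protocol

@dataclass(frozen=True)
class ScopedProblem:
    ticket_id: str
    archetype: str              # from stage 2: bug · feature · spike
    constraints: tuple[str, ...]

@dataclass(frozen=True)
class PushbackQuestions:
    ticket_id: str
    questions: tuple[str, ...]

class InvestigationStage(Protocol):
    # The pipeline, not the prompt, keeps the stage in bounds: it can only
    # ever see a ScopedProblem, never the full investigation.
    def run(self, problem: ScopedProblem) -> PushbackQuestions: ...

class BugInvestigation:
    def run(self, problem: ScopedProblem) -> PushbackQuestions:
        qs = tuple(f"What is the failure path implied by: {c}?"
                   for c in problem.constraints)
        return PushbackQuestions(problem.ticket_id, qs)

stage: InvestigationStage = BugInvestigation()
out = stage.run(ScopedProblem("T-123", "bug", ("retry loop hangs under load",)))
print(out.questions)
```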
The faculty system.
Pipeline stages are not staffed by a single generic agent. Each stage has a faculty of specialist agents — per-stage, per-archetype roles that accumulate domain-specific expertise over many tickets. A bug investigation specialist that has analyzed two hundred tickets in a particular service develops empirical knowledge about that service's failure patterns, its flaky dependencies, its historical pain points. That knowledge is captured, versioned, and becomes a first-class retrieval source for future investigations of the same service.
Faculty specialists are what make the bounded-epistemology principle productive rather than merely restrictive. A specialist knows its domain deeply — and knows the limits of its domain. Outside its scope, it defers. Inside its scope, it has access to accumulated expertise that a generic agent would have to rediscover every time. Which specialists exist, which stages they staff, and how expertise is partitioned across them are themselves Researcher optimization targets.
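A sketch of what "captured, versioned, first-class retrieval source" can reduce to structurally: findings keyed by stage, archetype, and service, recalled only inside that key. All names here are illustrative.

```python
from collections import defaultdict

class FacultyMemory:
    def __init__(self) -> None:
        self._findings: dict[tuple[str, str, str], list[str]] = defaultdict(list)

    def record(self, stage: str, archetype: str, service: str,
               finding: str) -> None:
        self._findings[(stage, archetype, service)].append(finding)

    def recall(self, stage: str, archetype: str, service: str) -> list[str]:
        # Only this specialist's accumulated domain knowledge is returned;
        # everything outside the key never enters the agent's context.
        return self._findings[(stage, archetype, service)]

memory = FacultyMemory()
memory.record("investigation", "bug", "billing-svc",
              "flaky dependency: payment-gateway sandbox times out under load")
print(memory.recall("investigation", "bug", "billing-svc"))
print(memory.recall("investigation", "bug", "auth-svc"))  # empty: other domain
```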
The data fabric.
Real investigations cross-reference code with tickets, communication threads, error monitoring, deployment logs, and database state. A bug whose root cause is visible only by correlating a Sentry error cluster with a Slack thread and a recent deployment is invisible to an agent that can only see source code. The data fabric is the architectural surface that solves this — a workspace-scoped configuration of external data sources that investigation agents can query at will during analysis. Each connector represents a persistent data ecosystem around a codebase: Linear, Slack, Sentry, a read-only database replica, a CloudWatch log group, a documentation store. Credentials are resolved per-workspace and injected as ephemeral, read-only access for the duration of the investigation.
Lux is the first and most heavily exercised fabric connector — code intelligence as a data source, the same abstraction as Linear or Slack. An agent investigating a ticket queries Lux for structural understanding, then queries Linear for related ticket history, then queries Slack for discussion context, then queries Sentry for error patterns — all during a single investigation, all with read-only access scoped to the current evaluation. The fabric itself is tunable. Adding a source gives investigation agents access to information they couldn't reason about before. Removing a noisy source improves signal quality. The fabric directly controls signal-to-noise ratio, and that ratio is itself a Tier 3 RAG optimization target.
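The fabric abstraction can be sketched as a single query protocol plus an ephemeral session. Everything here, interface included, is an illustration of the described behavior, not the real connector API.

```python
from contextlib import contextmanager
from typing import Iterator, Protocol

class Connector(Protocol):
    name: str
    def query(self, question: str) -> list[str]: ...

class StubConnector:
    def __init__(self, name: str, answers: list[str]) -> None:
        self.name, self._answers = name, answers
    def query(self, question: str) -> list[str]:
        return self._answers

@contextmanager
def fabric_session(connectors: list[Connector]) -> Iterator[dict[str, Connector]]:
    # Stand-in for per-workspace credential resolution: read-only access is
    # granted on entry and revoked on exit, whatever happens in between.
    session = {c.name: c for c in connectors}
    try:
        yield session
    finally:
        session.clear()   # ephemeral: nothing outlives the investigation

with fabric_session([
    StubConnector("lux", ["retry loop defined in worker.py:142"]),
    StubConnector("sentry", ["error cluster: TimeoutError x412 since deploy"]),
    StubConnector("slack", ["thread: 'timeouts started after Tuesday deploy'"]),
]) as fabric:
    for source in ("lux", "sentry", "slack"):
        print(source, "->", fabric[source].query("why are retries hanging?"))
```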
§ 06 · CONTEXT EXHAUSTION
The agent drowns in its own history.
One of the most persistent failure modes in production agentic AI is not that the model can't write code. It is that the model drowns in its own accumulated context before it can finish the job. Every interaction has a context window — a fixed amount of information the model can attend to at once. In a long-running task, context fills with previous tool calls, file contents, error messages, conversation history. Eventually the model spends its reasoning budget on navigating its own history rather than solving the problem. Quality degrades. This is context exhaustion, and it reliably appears in any agentic system that accumulates context without constraint.
The architecture attacks this problem at four levels simultaneously. None of the four is sufficient on its own; together they make the problem tractable.
- Level 1
Persona segmentation.
Different concerns get different agents with isolated contexts. A coordinating agent handles task management. A deep investigation agent analyzes a specific bug. An implementation agent writes code against a specific task. No single agent accumulates the full context of an entire workflow. They hand off typed outputs, not raw conversation history.
- Level 2
Ontological segmentation.
The system separates what things are into distinct, structured representations: code structure via Lux, domain entities via Corpus, work state via project management integration. Each representation is queryable independently. An agent requests the specific knowledge it needs for the current reasoning step. The rest does not enter context.
- Level 3
Epistemological segmentation.
The system further separates what is known about things from the things themselves. Specialist faculty carry accumulated expertise about specific domains. Research findings capture what experiments have revealed. Historical patterns record what has succeeded and failed in analogous situations. This knowledge exists permanently but enters context selectively — a specialist's findings about estimation patterns get injected during estimation, not during code review.
- Level 4
DAG orchestration.
Recon's pipeline is the structural embodiment of context management. Each stage receives typed input, has a single mandate, produces typed output, and operates in a fresh bounded context. WorkStream provides DAG orchestration at the implementation level: each task in its own context, dependencies as explicit edges, independent tasks running in parallel, cross-task context minimal and typed. Not discipline — a constraint the architecture cannot bypass.
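Level 4 is mechanical enough to sketch with the standard library: graphlib supplies the topological order, and each task's context is rebuilt fresh from only its declared dependencies. Task names and payloads are illustrative.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter

@dataclass
class Task:
    name: str
    deps: tuple[str, ...] = ()

def run_dag(tasks: dict[str, Task]) -> dict[str, str]:
    outputs: dict[str, str] = {}
    order = TopologicalSorter({t.name: set(t.deps) for t in tasks.values()})
    for name in order.static_order():
        task = tasks[name]
        # Fresh bounded context: only declared dependency outputs enter it.
        # Context cannot accumulate across tasks; the structure forbids it.
        context = {d: outputs[d] for d in task.deps}
        outputs[name] = f"{name} done (saw: {sorted(context) or 'nothing'})"
    return outputs

dag = {
    "scope":     Task("scope"),
    "impl_api":  Task("impl_api", deps=("scope",)),
    "impl_ui":   Task("impl_ui", deps=("scope",)),       # parallelizable
    "integrate": Task("integrate", deps=("impl_api", "impl_ui")),
}
for line in run_dag(dag).values():
    print(line)
```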
§ 07 · ANATOMY OF A TICKET
What actually happens.
The team sees a developer who picks up tickets, writes solid code, and ships pull requests. The infrastructure sees an optimization target that improves with every iteration. Both are true. The lifecycle below is what is happening underneath the surface activity.
- 01
Recon evaluates the ticket.
The deterministic pipeline runs: reconnaissance, archetype classification, problem scoping, archetype-specific investigation. The output is targeted pushback questions, grounded in Lux's structural analysis and the data fabric's cross-source signals. If the questions surface a misunderstanding, the ticket goes back before any code is written.
- 02
WorkStream orchestrates implementation.
The work is decomposed into bounded tasks via DAG, with explicit dependency edges and identification of parallel execution opportunities. Each task will run in its own bounded context. The DAG is the constraint that prevents context from accumulating across tasks.
- 03
Harness-driven agents execute.
Each task runs with custom system prompts, curated context from Lux and Corpus, and constrained tool sets — every configuration calibrated by Researcher's empirical findings accumulated from prior evaluations. The agent is not improvising. It is executing within a configuration that has been measured and tuned.
- 04
The code ships.
With tests, type checks, linter compliance, and before/after visual evidence. Human approvals are required before merge. Neil is a peer member of a team, not an autonomous deployer. Authority over what reaches production stays with humans.
- 05
The completed work feeds back.
Into Researcher's evaluation corpus — enriching the training data for the next round of optimization across all seven tiers. Every ticket that flows through the system makes the next ticket easier. The flywheel is not metaphorical. It is the literal mechanism by which the system improves.
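Stitched together, the lifecycle is one loop. A toy sketch with stand-ins for Recon, WorkStream, the harness, and the evaluation corpus; every function here is invented for illustration, and the real pushback, decomposition, and gates are the systems described above.

```python
def recon(ticket: str) -> list[str]:
    # Stand-in for the investigation pipeline's pushback output.
    return [f"Is '{ticket}' reproducible outside the retry path?"]

def decompose(ticket: str) -> list[str]:
    return ["scope", "implement", "test"]        # stand-in WorkStream DAG

def execute(task: str) -> dict[str, bool]:
    return {"tests": True, "types": True}        # stand-in harness + gates

def handle_ticket(ticket: str, corpus: list[dict]) -> None:
    questions = recon(ticket)
    if questions:
        # In the real flow, unresolved pushback sends the ticket back
        # before any code is written.
        print("pushback before any code:", questions)
    results = [execute(task) for task in decompose(ticket)]
    # Human approval gates the merge; the system never self-deploys.
    print("ready for human review:", all(all(r.values()) for r in results))
    corpus.extend(results)                        # step 05: feed the flywheel

evaluation_corpus: list[dict] = []
handle_ticket("retry loop hangs under load", evaluation_corpus)
print("corpus size:", len(evaluation_corpus))
```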
§ 08 · VERTICAL INTEGRATION
Built ourselves. Top to bottom.
smpl is not a wrapper around a public API or an integration of off-the-shelf parts. It is a vertically integrated stack we built ourselves — owned compute, fine-tuned models, codebase intelligence, an investigation pipeline, institutional memory, and a learning engine that calibrates all of it. When you engage with smpl, that is the system on the other side of the conversation.
Owned compute. The model layer is ours, not a vendor's. Multiple DGX Spark units on Blackwell architecture, frontier-class open models served and fine-tuned in our lab. Models can be specialized to specific codebases, ticket patterns, and investigation archetypes — none of which is reachable through a public API.
Codebase intelligence. Lux fuses semantic retrieval, structural code intelligence, relation-aware retrieval, and domain discovery into a single continuously maintained structural view of the codebase. Built and tuned for software work — not adapted from a generic vector store.
Investigation pipeline. Recon runs typed, bounded-epistemology stages with archetype-specialized faculty agents. Built because rigorous investigation of development work is not, in practice, solvable by improvising prompt chains. The pipeline is the constraint that produces the rigor.
Automated research platform. Researcher runs experiments across the seven optimization tiers, scored quantitatively against an evaluation corpus that grows with every real ticket the system handles. Every prompt, every routing decision, every retrieval strategy is calibrated empirically — not by intuition or vendor defaults.
Institutional memory. Corpus captures investigations, decisions, and accumulated faculty expertise as a queryable substrate. The 50th investigation in a service is grounded in everything the previous 49 surfaced — operational compounding, not metaphor.
Orchestration. WorkStream decomposes work into a DAG with explicit dependency edges, runs each task in a bounded context, and structurally prevents context exhaustion. The constraint lives in the architecture, not in agent discipline.
Each layer is independently useful. The compound — fine-tuned models on real ticket patterns, prompts tuned by Researcher against a real evaluation corpus, context curated by Lux and Corpus, tools composed for specific archetypes — only emerges when the full stack runs together. That is what homegrown means at the architecture level: not a brand claim, but the artifact every smpl engagement runs on.
§ 09 · SYNTHESIS
The surface of a system, not a prompt in a box.
Neil is not a chatbot with a GitHub account. The team interacting with Neil sees an engineer who picks up tickets, investigates carefully, writes solid code, and ships pull requests with tests and evidence. That experience is real. The mechanism that produces it is not a clever prompt. It is a vertically integrated stack — a research lab, an investigation pipeline, a codebase intelligence layer, an institutional memory, an orchestration substrate, and a research engine that optimizes all of them simultaneously.
The competitive surface is not visible to anyone using the system. It is the layer below the surface — the model fine-tuned on the patterns that matter, the prompts measured and tuned by Researcher, the context curated by Lux and Corpus, the tools composed to fit the work rather than to demonstrate breadth. The output looks like an engineer; the engineering is what produces it.
Neil is the visible surface of an engineering system that optimizes itself. The system is what makes the surface possible — and the system, not the surface, is the actual artifact.
The paper remains canonical. Source: what-is-neil.pdf →