Denial-of-Wallet: The Economics of AI Compute Exhaustion

Denial-of-Service is a well-modelled attack class. Denial-of-Wallet is not. Where DoS targets availability — making a service unreachable — DoW targets economics: making a service too expensive to operate. For AI systems billing by the token, the gap between those two outcomes is measured in the amplification ratio between attacker input and compute consumed.

The Amplification Ratio

Every AI system has a cost structure with the following variables:

C_in: cost per input token
C_out: cost per output token
K: retrieval cost per document chunk (for RAG systems)
N: number of LLM calls per user request (for agentic systems)
T_in / T_out: token counts for input and output

The attacker's goal is to maximise the operator's per-request cost, (C_in × T_in + C_out × T_out + K × chunks) × N, while minimising their own cost per request. Any mechanism that increases one of these variables without a corresponding increase in the attacker's cost is a DoW vector.
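As a rough illustration, the expression above can be written as a small function. This is a minimal sketch: the names `Rates` and `request_cost` are hypothetical, and the rates are placeholders rather than any provider's current prices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    c_in: float   # C_in: dollars per input token
    c_out: float  # C_out: dollars per output token
    k: float      # K: dollars per retrieved chunk


def request_cost(rates: Rates, t_in: int, t_out: int, chunks: int, n_calls: int) -> float:
    """Operator cost of one user request: (C_in*T_in + C_out*T_out + K*chunks) * N."""
    per_call = rates.c_in * t_in + rates.c_out * t_out + rates.k * chunks
    return per_call * n_calls


# Illustrative rates only (assumption); substitute your provider's prices.
rates = Rates(c_in=0.0025 / 1000, c_out=0.015 / 1000, k=0.0)
single = request_cost(rates, t_in=16_000, t_out=500, chunks=0, n_calls=1)
chained = request_cost(rates, t_in=16_000, t_out=500, chunks=0, n_calls=8)
print(f"single call: ${single:.4f}, 8 chained calls: ${chained:.4f}")
```

Because N multiplies the whole per-call cost, anything that raises the number of chained calls scales every other term with it.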

A single GPT-4o call at 128K context costs approximately $0.64 in input tokens (at $0.005 per 1K input tokens). A user request that triggers 8 chained tool calls, each with 16K context, costs $0.82, and it can come from a $0.00 API key (free tier or stolen credential) or a $5/month subscription.

Attack Vectors by Category

Context Window Stuffing

The attacker submits a request containing a large body of text — a legal document, a code file, a transcript — followed by a trivial question. The model processes the full context. The attacker's cost is network bandwidth. The operator's cost is input tokens at scale.

| Scenario | Attacker Input | Model Context | Amplification |
|---|---|---|---|
| Short question | 12 tokens | 12 tokens | 1× |
| + 50K token document | 50,012 tokens | 50,012 tokens | 4,167× |
| + RAG retrieval (top-8 chunks) | 50,012 tokens | 58,012 tokens | 4,834× |

Context stuffing is trivially mitigated by input length limits. The more interesting vectors are ones where the amplification happens after the request is received.

Recursive Tool-Chain Abuse

Agentic systems with tool access can be prompted to call tools that produce outputs that trigger further tool calls. An adversary who can influence the system prompt or the agent's task context can construct loops.

BREACH SCENARIO

During a LogicLeak assessment of an agentic research assistant, we demonstrated a task specification that caused the agent to: (1) search the web for a topic, (2) extract URLs from the results, (3) fetch each URL, (4) summarise each page, (5) cross-reference summaries for contradictions, (6) fetch the source documents of each contradiction. From a single user request, the agent made 34 external calls and consumed ~280K tokens — approximately $1.80 per attacker request, at an attacker cost of zero beyond the initial query.

Retrieval Amplification

RAG systems with large top-k settings (top-20, top-50 chunks) and no per-document cost caps can be triggered to retrieve the maximum number of chunks on every request via low-information queries ("tell me everything about X") or adversarial queries designed to score high similarity against many documents.

Combined with large chunk sizes and expensive reranking passes, a single query can retrieve megabytes of content and run it through multiple model calls before the response is assembled.

Output Length Manipulation

Models can be instructed to produce long outputs. "Write a comprehensive analysis of every aspect of..." followed by any topic will produce output constrained only by the model's maximum output length. At GPT-4o output pricing ($0.015 per 1K tokens), saturating a 4K output limit costs $0.06 per request; saturating 16K costs $0.24. At 10,000 requests per day, the 16K case comes to $2,400 per day in output costs from a single attack pattern.

Cost Modelling a Real Exposure

The following model approximates daily DoW exposure for a typical mid-market RAG deployment:

Attack parameters:
- Requests per day:         10,000  (attacker)
- Input tokens per request: 50,000  (context stuffing)
- RAG chunks retrieved:     20
- Tokens per chunk:         500
- Model calls per request:  3       (query + rerank + generate)
- Output tokens:            1,500

Cost per request:
  Input:  (50,000 + 20×500) × $0.0025/1K = $0.15
  Output: 1,500 × $0.015/1K              = $0.023
  Total per request:                        $0.173

Daily exposure:
  10,000 × $0.173 = $1,730/day
  Monthly:          $51,900

This is a conservative model. It bills the input context once rather than once per model call, and it does not account for agentic recursion, retry logic, or model fallback chains.
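The worksheet above can be reproduced in a few lines, which makes it easier to plug in your own traffic and prices. This is a sketch under the same parameters; the function name and the `billed_calls` argument are assumptions added for illustration, with `billed_calls=1` matching the worksheet.

```python
def per_request_cost(input_tokens, chunks, tokens_per_chunk, output_tokens,
                     price_in_per_1k, price_out_per_1k, billed_calls=1):
    """Dollar cost of one request; billed_calls=1 mirrors the worksheet."""
    context_tokens = input_tokens + chunks * tokens_per_chunk
    input_cost = context_tokens / 1000 * price_in_per_1k
    output_cost = output_tokens / 1000 * price_out_per_1k
    return (input_cost + output_cost) * billed_calls


cost = per_request_cost(50_000, 20, 500, 1_500, 0.0025, 0.015)
daily = 10_000 * cost      # 10,000 attacker requests per day
monthly = 30 * daily       # assumes a 30-day month
print(f"per request ${cost:.4f}, daily ${daily:,.0f}, monthly ${monthly:,.0f}")
```

The unrounded results ($0.1725 per request, about $1,725 per day) sit slightly below the worksheet's figures, which round the per-request cost to $0.173 before multiplying. Passing `billed_calls=3` shows how quickly the exposure grows if every model call in the pipeline is billed for the full context.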

Mitigation Controls

NOTE

No single control eliminates DoW risk. The goal is to increase attacker cost, reduce amplification ratio, and add friction at each layer.

Per-user token budgets cap the total tokens a single authenticated user can consume per hour or day. Effective against volume attacks; does not address unauthenticated endpoints.
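One way to picture a per-user budget is a fixed-window counter keyed by user ID. The sketch below is an in-memory illustration only; the class name `TokenBudget` and its parameters are hypothetical, and a production deployment would typically keep the counters in a shared store.

```python
import time


class TokenBudget:
    """Fixed-window token budget per user (in-memory sketch)."""

    def __init__(self, limit, window_seconds, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self._usage = {}  # user_id -> (window_start, tokens_used)

    def try_consume(self, user_id, tokens):
        """Return True and record usage if the request fits the budget."""
        now = self.clock()
        start, used = self._usage.get(user_id, (now, 0))
        if now - start >= self.window:
            start, used = now, 0  # window expired: start a fresh one
        if used + tokens > self.limit:
            self._usage[user_id] = (start, used)
            return False
        self._usage[user_id] = (start, used + tokens)
        return True


budget = TokenBudget(limit=200_000, window_seconds=3600)
print(budget.try_consume("user-1", 50_000))
```

Checking the budget before the model call, using an estimate of the request's tokens, keeps a rejected request from incurring any model cost.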

Input length limits reject oversized requests with a hard HTTP 413 before they reach the model. Set them at the API gateway layer, not in application code.
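Gateways usually express this as a configuration setting, but the logic is simple enough to sketch. The WSGI middleware below is an illustration of the check, not a substitute for a gateway rule; the name `limit_body_size` and the 64 KiB limit are assumptions, and byte length is only a proxy for token count.

```python
MAX_BODY_BYTES = 64 * 1024  # hypothetical limit; tune to your token ceiling


def limit_body_size(app, max_bytes=MAX_BODY_BYTES):
    """Wrap a WSGI app and reject oversized bodies before they are read."""

    def middleware(environ, start_response):
        try:
            length = int(environ.get("CONTENT_LENGTH") or 0)
        except ValueError:
            length = max_bytes + 1  # unparseable length: treat as too large
        if length > max_bytes:
            body = b"Request body too large\n"
            start_response(
                "413 Payload Too Large",
                [("Content-Type", "text/plain"),
                 ("Content-Length", str(len(body)))],
            )
            return [body]
        # Note: requests without a Content-Length (e.g. chunked uploads)
        # pass this check and need a separate streaming limit.
        return app(environ, start_response)

    return middleware
```

The rejection happens on the declared length alone, so the oversized body is never read, tokenised, or forwarded.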

Retrieval top-k caps enforce a maximum number of chunks retrieved per query, regardless of how the query is phrased. Combine with a minimum similarity threshold: below 0.72, no chunk is returned.
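Both limits can be applied as a post-filter on whatever the vector store returns. A minimal sketch, assuming the retriever yields `(chunk, similarity)` pairs; the function name `cap_retrieval` and the default `max_k` are hypothetical.

```python
def cap_retrieval(scored_chunks, max_k=8, min_similarity=0.72):
    """Keep at most max_k chunks, dropping any below the similarity floor."""
    eligible = [(chunk, score) for chunk, score in scored_chunks
                if score >= min_similarity]
    eligible.sort(key=lambda pair: pair[1], reverse=True)
    return eligible[:max_k]


results = [("a", 0.91), ("b", 0.70), ("c", 0.85), ("d", 0.72)]
print(cap_retrieval(results, max_k=2))
```

Applying the threshold first means a low-information query that matches many documents weakly can return fewer than `max_k` chunks, or none at all.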

Tool-call depth limits prevent agent recursion beyond N levels. Every tool invocation increments a counter; at the limit, the agent is instructed to halt and summarise what it has.
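The counter described above might look like the following loop. This is a sketch, not a specific framework's API: `next_tool_call` is a hypothetical callable standing in for the agent's planner, returning either a zero-argument tool invocation or `None` when the agent is done.

```python
from typing import Callable, List, Optional, Tuple

HALT_INSTRUCTION = (
    "Tool budget exhausted. Stop calling tools and summarise what you have."
)


def run_agent(
    next_tool_call: Callable[[List[str]], Optional[Callable[[], str]]],
    max_tool_calls: int,
) -> Tuple[List[str], int]:
    """Run tool calls until the agent finishes or the budget is spent."""
    transcript: List[str] = []
    calls = 0
    while True:
        tool = next_tool_call(transcript)
        if tool is None:
            break  # agent decided it is finished
        if calls >= max_tool_calls:
            # Limit reached: do not execute the tool, tell the agent to halt.
            transcript.append(HALT_INSTRUCTION)
            break
        transcript.append(tool())
        calls += 1
    return transcript, calls
```

The limit is enforced by the loop, outside the model, so a prompt that asks the agent to keep fetching cannot talk its way past it.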

Cost-based circuit breakers monitor spend-per-request at the gateway layer and reject requests from sources that exceed a rolling cost threshold. Requires per-request cost instrumentation, which most deployments don't have.
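A rolling cost threshold can be sketched as a per-source queue of recent spend. The class below is an in-memory illustration that assumes per-request cost is already measured and reported after each call; the name `CostCircuitBreaker` and its parameters are hypothetical.

```python
import time
from collections import defaultdict, deque


class CostCircuitBreaker:
    """Reject sources whose spend in a rolling window exceeds a threshold."""

    def __init__(self, max_cost, window_seconds, clock=time.monotonic):
        self.max_cost = max_cost
        self.window = window_seconds
        self.clock = clock
        self._events = defaultdict(deque)  # source -> deque of (time, cost)

    def _spend(self, source, now):
        events = self._events[source]
        while events and now - events[0][0] >= self.window:
            events.popleft()  # drop spend that has aged out of the window
        return sum(cost for _, cost in events)

    def allow(self, source):
        """Call before serving a request."""
        return self._spend(source, self.clock()) < self.max_cost

    def record(self, source, cost):
        """Call after a request completes, with its measured cost."""
        self._events[source].append((self.clock(), cost))
```

A gateway would call `allow` before forwarding a request and `record` once the cost is known, which is why the control depends on per-request cost instrumentation.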

Output length steering uses a system prompt instruction that sets a maximum response length. Combine it with hard truncation at the API response layer; don't rely on the model to self-enforce.

What You Should Measure

If you cannot answer the following questions for your production AI system, you have blind spots:

What is the 99th percentile token count for requests over the last 30 days?
What is the maximum number of tool calls observed in a single agent run?
What is your per-user daily cost cap, and how is it enforced?
At what token velocity per minute does your alerting fire?
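The first two questions can be answered from request logs with very little code. A minimal sketch, assuming each log record is a dict with hypothetical `tokens` and `tool_calls` fields:

```python
from statistics import quantiles


def p99_tokens(records):
    """99th percentile of per-request token counts (needs >= 2 records)."""
    values = [r["tokens"] for r in records]
    # n=100 yields 99 cut points; index 98 is the 99th percentile.
    return quantiles(values, n=100, method="inclusive")[98]


def max_tool_calls(records):
    """Largest number of tool calls seen in a single agent run."""
    return max(r["tool_calls"] for r in records)


# Synthetic records for illustration only.
records = [{"tokens": i, "tool_calls": i % 7} for i in range(101)]
print(p99_tokens(records), max_tool_calls(records))
```

Tracking these two numbers over time gives a baseline against which stuffed contexts and runaway agent loops stand out.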

DoW is a slow-burn attack. It often looks like legitimate traffic until the billing cycle closes.

LogicLeak includes Denial-of-Wallet testing in every Neural Infrastructure Hardening engagement.