What Is a Denial-of-Wallet Attack?

One motivated user can drain your entire monthly API budget in hours. Here's how wallet-drain attacks work and the enforcement pattern that stops them.

A Denial-of-Service attack tries to knock your server offline. Denial-of-Wallet is quieter. Nothing crashes. Your infrastructure stays up. Your LLM endpoint keeps responding correctly to every request.

But the requests keep coming — valid, correctly formatted, exactly the way your system was designed to handle them — until your API budget is gone. By the time you notice, every user of your app is blocked. Not because the service is down. Because the money ran out.

How it works

When you build an LLM-powered feature, you typically have one API key. Every request from every user hits that same key. Your LLM provider bills you for all of it.

There is no mechanism built into the LLM API layer that says “this user has spent $X today.” That's your responsibility to implement. Most apps don't — not because developers are careless, but because the feature ships fine without it, and the problem only becomes visible once someone exploits it.

So when a single user sends thousands of requests in a day, whether intentionally or because they found a loop in your UI, the cost comes out of the same shared budget that serves every other user, until that budget is exhausted.

At that point you have two bad options: let it drain to zero and block everyone, or raise your billing limit and keep paying.

Why rate limiting doesn't stop it

Rate limiting controls request frequency — how many requests per second, per minute, per hour. It's the right tool for protecting your infrastructure from being overwhelmed. It's the wrong tool for protecting your budget.

The math: a user limited to 60 requests per minute can still make 86,400 requests in a day (60 × 1,440 minutes). At $0.003 per request, that's about $259 from a single user. In one day. On your free tier. All within your rate limit.
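The arithmetic above can be checked with a few lines of TypeScript. This is only a sketch: the figures are the ones from the paragraph, and the function and constant names are illustrative.

```typescript
// Worst-case daily spend for one user who stays exactly at the rate limit.
const REQUESTS_PER_MINUTE = 60;
const MINUTES_PER_DAY = 24 * 60; // 1,440
const COST_PER_REQUEST_USD = 0.003;

function maxDailyRequests(requestsPerMinute: number): number {
  return requestsPerMinute * MINUTES_PER_DAY;
}

function maxDailySpendUsd(requestsPerMinute: number, costPerRequestUsd: number): number {
  return maxDailyRequests(requestsPerMinute) * costPerRequestUsd;
}

console.log(maxDailyRequests(REQUESTS_PER_MINUTE)); // 86400
console.log(maxDailySpendUsd(REQUESTS_PER_MINUTE, COST_PER_REQUEST_USD).toFixed(2)); // 259.20
```

Tightening the rate limit only scales this number down. It never turns it into a per-user spending cap.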

Rate limiting answers “how fast?” It doesn't answer “how much?” Closing the gap between those two questions is where budget enforcement lives.

The same logic applies to token limits. Capping tokens per request limits individual request cost, not cumulative daily spend per user. A user sending 10,000 short requests still drains the budget; the per-request cap just makes each withdrawal smaller.

The enforcement gap

Every individual request looks valid. Each passes your rate limit. Each uses a reasonable number of tokens. None of them individually looks suspicious.

The damage is cumulative, and it accumulates in a blind spot: the gap between “technically allowed” and “economically safe.”

Closing this gap requires tracking spend per user — and the check has to happen before the LLM call fires, not after. By the time the call returns, the tokens are already consumed. A post-hoc check can log the overage but can't undo it.

The only enforcement that works: read the user's current spend, check it against their limit, and either block or allow, all before the request executes. Not after. Before.

The race condition most implementations miss

If you implement per-user spend tracking naively — read the current spend, check if it's under the limit, make the LLM call, write the updated spend — you have a race condition.

Two concurrent requests from the same user arrive simultaneously. Both read the same spend value. Both see “under budget.” Both go through. Both calls fire. Both costs get logged.

You just allowed two requests where the budget had room for one, and the overshoot grows with every additional concurrent request. At low request rates, this is theoretical. At high request rates, which are exactly the conditions a Denial-of-Wallet attack creates, it breaks your enforcement reliably.
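The race is easy to reproduce. The sketch below uses an in-memory `Map` as a stand-in for a spend store and a timer as a stand-in for the LLM call; the limit, the per-request cost, and all names are hypothetical and chosen only to make the overshoot visible.

```typescript
type SpendStore = Map<string, number>;

const DAILY_LIMIT_CENTS = 100; // illustrative daily limit
const REQUEST_COST_CENTS = 60; // illustrative cost of one request

// Stand-in for the latency of an LLM call.
const delay = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Naive flow: read, check, call, write. Nothing ties the read to the write.
async function naiveHandle(store: SpendStore, userId: string): Promise<boolean> {
  const current = store.get(userId) ?? 0; // 1. read
  if (current + REQUEST_COST_CENTS > DAILY_LIMIT_CENTS) {
    return false; // 2. check
  }
  await delay(10); // 3. "LLM call": other requests run while this one waits
  store.set(userId, (store.get(userId) ?? 0) + REQUEST_COST_CENTS); // 4. write
  return true;
}

async function demoRace(): Promise<{ allowed: boolean[]; finalSpend: number }> {
  const store: SpendStore = new Map([["user-1", 0]]);
  const allowed = await Promise.all([
    naiveHandle(store, "user-1"),
    naiveHandle(store, "user-1"),
  ]);
  return { allowed, finalSpend: store.get("user-1") ?? 0 };
}

demoRace().then(({ allowed, finalSpend }) => {
  console.log(allowed, finalSpend); // [ true, true ] 120
});
```

Both requests read a spend of 0 before either one writes, so both pass the check and the user ends the window at 120 cents against a limit of 100.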

The fix is atomic operations: the read, check, and write have to happen as one indivisible step, such as a single transaction. No request sees a stale spend value. If two requests arrive simultaneously, the second one sees the first one's reservation and, if it no longer fits in the budget, is blocked before the first request has even returned.
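A minimal sketch of that fix, under the assumption of a single Node.js process: a synchronous function with no `await` between read, check, and write cannot be interleaved by another request. Across several server instances the same step would have to be atomic in the shared store itself (for example a Redis Lua script or transaction). The store, limit, and names here are hypothetical.

```typescript
type SpendStore = Map<string, number>;

const DAILY_LIMIT_CENTS = 100; // illustrative daily limit
const REQUEST_COST_CENTS = 60; // illustrative cost of one request

// Stand-in for the latency of an LLM call.
const delay = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Read, check, and write in one synchronous step.
function tryReserve(store: SpendStore, userId: string, costCents: number): boolean {
  const current = store.get(userId) ?? 0;
  if (current + costCents > DAILY_LIMIT_CENTS) {
    return false;
  }
  store.set(userId, current + costCents);
  return true;
}

async function guardedHandle(store: SpendStore, userId: string): Promise<boolean> {
  // The budget is charged before the call starts, so concurrent requests see it.
  if (!tryReserve(store, userId, REQUEST_COST_CENTS)) {
    return false;
  }
  await delay(10); // "LLM call"
  return true;
}

async function demoAtomic(): Promise<{ allowed: boolean[]; finalSpend: number }> {
  const store: SpendStore = new Map([["user-1", 0]]);
  const allowed = await Promise.all([
    guardedHandle(store, "user-1"),
    guardedHandle(store, "user-1"),
  ]);
  return { allowed, finalSpend: store.get("user-1") ?? 0 };
}

demoAtomic().then(({ allowed, finalSpend }) => {
  console.log(allowed, finalSpend); // [ true, false ] 60
});
```

With the same two concurrent requests, the second one now sees the first one's 60 cents and is rejected, so spend stays within the limit.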

Two-phase enforcement: closing the gap

The pattern that solves both problems — the enforcement gap and the race condition — is a two-phase check:

Phase A — before the LLM call fires

  • Check this user's current spend against their daily limit
  • Atomically reserve an estimated cost for this request
  • If over budget: block immediately — before any tokens are consumed, before any compute runs
  • If under budget: allow the request, carry the reservation forward

Phase B — after the LLM call completes

  • Settle the actual cost (replace the estimate with the real token count)
  • Update the user's running total for the day

The reservation in Phase A is what prevents the race condition. Two concurrent requests don't both see “under budget” — the second sees the first's reservation and is blocked correctly. The actual cost settles in Phase B, but the budget was already protected from the moment the first request arrived.
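The two phases can be sketched as a small in-memory ledger. This is an illustration of the pattern, not the implementation of any particular library: the class, its method names, and the amounts are hypothetical, and it leaves out daily rollover and persistence.

```typescript
type ReserveResult = { allowed: true; requestId: string } | { allowed: false };

class TwoPhaseBudget {
  private readonly dailyLimitCents: number;
  private readonly settled = new Map<string, number>(); // actual spend per user
  private readonly reserved = new Map<string, number>(); // outstanding estimates per user
  private readonly open = new Map<string, { userId: string; estimateCents: number }>();
  private nextId = 1;

  constructor(dailyLimitCents: number) {
    this.dailyLimitCents = dailyLimitCents;
  }

  // Phase A: check settled spend plus open reservations, then reserve the estimate.
  reserve(userId: string, estimateCents: number): ReserveResult {
    const committed = (this.settled.get(userId) ?? 0) + (this.reserved.get(userId) ?? 0);
    if (committed + estimateCents > this.dailyLimitCents) {
      return { allowed: false };
    }
    const requestId = `req-${this.nextId++}`;
    this.open.set(requestId, { userId, estimateCents });
    this.reserved.set(userId, (this.reserved.get(userId) ?? 0) + estimateCents);
    return { allowed: true, requestId };
  }

  // Phase B: release the estimate and record the actual cost.
  settle(requestId: string, actualCents: number): void {
    const reservation = this.open.get(requestId);
    if (!reservation) {
      throw new Error(`unknown or already settled request: ${requestId}`);
    }
    this.open.delete(requestId);
    const { userId, estimateCents } = reservation;
    this.reserved.set(userId, (this.reserved.get(userId) ?? 0) - estimateCents);
    this.settled.set(userId, (this.settled.get(userId) ?? 0) + actualCents);
  }

  spent(userId: string): number {
    return this.settled.get(userId) ?? 0;
  }
}

const budget = new TwoPhaseBudget(100);
const a = budget.reserve("user-1", 60); // allowed: 60 reserved
const b = budget.reserve("user-1", 60); // blocked: 60 reserved + 60 > 100
if (a.allowed) {
  budget.settle(a.requestId, 45); // actual cost came in under the estimate
}
const c = budget.reserve("user-1", 50); // allowed: 45 settled + 50 <= 100
console.log(a.allowed, b.allowed, c.allowed, budget.spent("user-1")); // true false true 45
```

Settling with the real cost matters in both directions: an overestimate gives headroom back to the user, and an underestimate is corrected before the next check.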

What this looks like in practice

app/api/chat/route.ts
import { Thskyshield } from '@thsky-21/thskyshield'

const shield = new Thskyshield({
  siteId: process.env.SITE_ID!,
  apiKey:  process.env.SHIELD_API_KEY!,
})

export async function POST(req: Request) {
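  // NOTE: userId comes from the request body here for brevity. In production, derive it
  // from the authenticated session so a caller cannot spend under someone else's ID.
  const { prompt, userId } = await req.json()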

  // Phase A — checked BEFORE the LLM call fires
  const { allowed, reason, requestId } = await shield.check({
    externalUserId:  userId,
    model:           'gpt-4o-mini',
    estimatedTokens: { input: 500, output: 200 },
  })

  if (!allowed) {
    // blocked before any tokens were consumed
    return Response.json({ error: reason }, { status: 402 })
  }

  const result = await callYourLLM(prompt)

  // Phase B — settle actual cost after the call
  await shield.log({
    requestId,
    externalUserId: userId,
    model:          'gpt-4o-mini',
    tokens:         { input: result.usage.input, output: result.usage.output },
  })

  return Response.json({ text: result.content })
}

The check() call is Phase A. It runs before your LLM call and blocks the request atomically if the user is over their daily budget. No tokens consumed, no cost incurred.

The log() call is Phase B. It settles the actual cost after the response returns, releasing the reservation and recording the real spend.

The requestId links the two phases together. You need it in log() to match the settlement back to the reservation.

Who's most exposed

Denial-of-Wallet attacks require no special skill. Any user who understands that your app makes LLM calls and has a motive to abuse it — or just a habit of hammering the “Generate” button — is a potential source. The apps most at risk:

  • Free tiers and freemium products. Free users have the least to lose and the most incentive to extract maximum value. The gap between “free plan” and “unlimited usage” is often just a marketing boundary, not a technical one.
  • Public-facing AI demos and playgrounds. No authentication, no per-user tracking, often no spend visibility at all. A single viral share can generate thousands of requests in hours.
  • Internal tools with shared API keys. One team member who doesn't realize they're in a loop can exhaust the budget for the rest of the org.
  • Any app where LLM calls are user-triggered. If a user can make your app call an LLM, they can make it call that LLM a lot.

The common thread: shared budget, no per-user accounting, no pre-call enforcement. Any app matching that description is one motivated user away from a billing crisis.

Thskyshield for LLM Apps

Per-user daily budgets with atomic enforcement

Two-phase check/log that blocks before the call fires. Atomic Redis operations that close the race condition. Works with any LLM provider.