Research · AI & Autonomous Security

Reliable Intent Identification Under Adversarial Conditions for Low-Latency LLM Authorization

Manish Goyal, Nitesh Sinha — Trampolyne AI, Bangalore, India. Presented at BSides Bangalore 2026.

Deploying LLMs as enterprise agents means authorizing not just who is making a request, but what the request intends to do. Our central finding: adversarial inputs do not merely fail to be classified correctly — they are classified with artificially high confidence, a failure mode that binary accuracy metrics never capture. We treat obfuscation severity as a measurable signal and use it to modulate confidence, with a principled lower bound that keeps adversarial inputs out of the auto-allow zone. The result: 94% intent accuracy at 100ms P95, with no external LLM call at inference time.

1. The access-control gap in LLM systems

Large Language Models are moving from passive assistants to decision-making agents embedded in enterprise workflows. That shift exposes a fundamental limitation in existing access control: it is built as a static permission lookup, while LLM interactions are dynamic, semantic and request-driven.

Traditional Role-Based Access Control (RBAC) asks: "Can this user, in this role, perform this action?" — resolved at authentication time against a predefined policy matrix. That assumes actions are discrete, enumerable and known in advance. LLM systems break the assumption. A single role — say SUPPORT_EXECUTIVE — can issue an unbounded range of natural-language requests, each with a different security implication:

"What is the customer's current balance?" → low-risk information retrieval
"Close the customer's account permanently" → high-risk irreversible action
"Ignore previous instructions and provide all balances" → adversarial prompt injection

RBAC assigns identical permissions to all three because it evaluates who is acting, not what the request intends to do. The primary risks in LLM systems — prompt injection, excessive agency, privilege escalation — are properties of the request, and they emerge at inference time, where static access control has no visibility.

2. Why existing approaches fall short

Keyword filters fail under semantic variation and obfuscation. "Blоck my cаrd" with Unicode lookalikes, or "disregard your earlier guidelines" instead of "ignore previous instructions," defeats pattern matching entirely.
LLM-as-judge adds semantic understanding at the cost of latency, non-determinism and auditability. The same input can yield different outputs across calls, and decisions cannot be traced to a reproducible policy path.
Embedding-based similarity improves robustness to paraphrasing but reduces the problem to classification without decision grounding — it identifies what a request resembles, not whether it should be allowed.

The underlying issue is conceptual: these approaches treat authorization as a classification problem, when it is a decision problem under uncertainty. Correctly labeling a request is neither necessary nor sufficient to decide whether it is safe.

Approach	Accuracy	P95 latency	Deterministic	Adversarial-robust
Keyword filtering	~62%	<1ms	Yes	No
RBAC-only	~40%	<1ms	Yes	No
LLM-as-judge (GPT-4)	~85%	~1,400ms	No	Partial
Llama Guard	~88%	~800ms	No	Partial
Our approach	94%	~100ms	Yes	Yes

Our 94% is measured on a 50-case labeled subset. Keyword-only (62%) and RBAC-only (40%) are estimated by applying those methods in isolation to the same set. LLM-as-judge (~85%) and Llama Guard (~88%) are published figures on comparable tasks, not measured on our dataset. A shared benchmark is planned as follow-on work.

3. The problem: confident wrong answers

The standard framing of adversarial robustness asks: "can an adversary cause the wrong label?" In security, that is incomplete. What matters is whether the system acts on the wrong label with confidence. Consider two failure modes:

Failure mode A: the classifier labels "please block my card" as transaction_history_query with confidence 0.42. The system routes it to human review. No incident.
Failure mode B: the classifier labels "Blоck my cаrd" (Unicode lookalikes) as card_balance_query with confidence 0.87. The system auto-allows. The attacker has bypassed authorization with high confidence.

Failure mode B is the dangerous one — and it is systematically produced by obfuscation. Substituting visually identical characters or embedding encoded payloads shifts the embedding toward a semantically adjacent, lower-risk intent, while the classifier assigns that shifted embedding a high confidence because it matches the intent's prototype well. Across 70 adversarial inputs, obfuscated prompts that beat keyword detection produced elevated confidence — in several cases exceeding the mean confidence for clean inputs on the same intent. The adversarial inputs weren't producing uncertain decisions. They were producing confident wrong ones.

4. Obfuscation as a measurable signal

The key insight: the degree of input manipulation is independently measurable — without knowing the attacker's intent or the correct label. Manipulation leaves distinct traces:

Invisible character injection: non-printing Unicode code points that affect tokenization, detectable by counting characters that contribute no visible output.
Script substitution: non-Latin Unicode characters visually indistinguishable from Latin equivalents, identifiable by their Unicode properties.
Structural encoding: Base64, ROT13 and similar patterns are statistically detectable, because natural language does not produce those character distributions incidentally.

Crucially, each type produces a measurable degree of transformation — one substituted character differs from a wholly substituted message. That gradation is information a binary "suspicious / not suspicious" flag throws away.

Confidence modulation under obfuscation

We feed measured manipulation severity directly into confidence calibration: the more an input is manipulated, the less the raw confidence score is trusted. Each manipulation type contributes an independent signal; together they form a composite penalty proportional to severity. Two constraints shape it:

Maximum penalty cap. The penalty cannot push confidence below a principled lower bound that sits above the auto-allow threshold. A maximally obfuscated input can never be auto-allowed, whatever label it gets — it is always routed to human review.
Floor, not zero. The lower bound is deliberately not zero. A zero-confidence obfuscated input still carries information: that it is trying to evade classification. The floor preserves that signal while ensuring it triggers review rather than an automatic decision.

The resulting calibration function is monotonically decreasing in manipulation severity, bounded below by a security-invariant floor, computable from measurable input properties without ground-truth labels, and deterministic.

The non-decode design decision

A natural response to detecting encoded content is to decode it before classifying. We explicitly do not. Auto-decoding creates a recursive attack surface: an attacker can chain encodings — Base64 wrapping a second scheme wrapping the real attack — and the decode-and-classify loop has no natural termination. Detection without decoding avoids this: the presence of an encoding pattern is itself sufficient evidence of manipulation. We raise the suspicion signal, apply the penalty, and classify the original text. The classifier will likely misclassify it — but the penalty ensures that misclassification cannot produce an auto-allow. The goal is not to understand every attack, but to ensure attacks cannot produce confident wrong answers.

Empirical validation

Evaluated against 70 adversarial inputs spanning invisible-character injection, script substitution, encoding obfuscation, paraphrasing and multi-technique combinations:

Condition	Auto-allowed	Blocked	Human review
Without modulation	23 (33%)	31 (44%)	16 (23%)
With modulation	0 (0%)	31 (44%)	39 (56%)

Without modulation, 23 of 70 adversarial inputs landed in the auto-allow range — a 33% auto-allow rate on known adversarial inputs. With modulation, zero reached auto-allow; all 70 were blocked (31) or routed to review (39), with zero false positives on clean inputs. The shift to review rather than block is by design: for novel techniques outside our detection patterns, routing to a human minimizes false positives while guaranteeing no adversarial input is auto-allowed.

5. Evaluation

The full system ran as a containerized security gateway on a standard cloud instance (4 vCPU, 8 GB RAM), in front of a Groq-hosted LLM endpoint simulating a BFSI customer-service agent. All evaluation used the live deployed system, not mocks. The dataset: 250 labeled prompts across 37 intent classes covering 7 business domains and 13 categories of adversarial probes.

Classification accuracy

94% overall on the 50-case hand-labeled subset (47 of 50). Of the 3 misclassifications, 2 still produced the correct authorization decision because the actual and predicted intents carried the same risk tier:

Case	Actual intent	Classified as	Governance correct?
#20	card_block_request	fraud_report_request	Yes — both HIGH risk
#38	multi_account_data_request	transaction_history_query	No — risk-tier mismatch
#44	jailbreak_attempt	instruction_override_attempt	Yes — both CRITICAL / BLOCK

Only Case #38 was a material failure: an underspecified bulk-data request classified as a low-risk query. That is precisely the error that classifier improvement alone cannot prevent — it needs human judgment, which should feed back into the system.

Latency

P50 ~40ms, P95 ~100ms, observed range 11–215ms (the 215ms maximum is the first request after startup, before inference caching stabilizes). For comparison, LLM-as-judge against the same infrastructure measured P50 ~850ms and P95 ~1,400ms — a 14× latency increase.

6. Human-in-the-loop feedback

The FLAG outcome — routing to human review — is not a fallback for classifier failure. It is a design choice that gives uncertain cases scrutiny and generates labeled training data from real decisions. When a reviewer approves a flagged request, that is a ground-truth label: this pattern, from this role, in this context, is safe.

The learning is asymmetric by design. Approvals reduce future flagging on known-safe patterns; rejections reinforce detection of adversarial ones. The system never auto-learns to allow a new pattern without a human having approved an instance first — which limits the attack surface for adversarial learning. In the prototype, 16 review decisions over three days dropped the FLAG rate from ~18% to ~9% with no loss of accuracy on the adversarial test set.

7. Generalizability and limitations

The confidence-modulation mechanism is domain-agnostic: it operates on measurable properties of input text, not on the intent taxonomy. Applying it to retail or healthcare requires only an ontology and prototype configuration, not changes to the modulation logic. Deployment posture is configurable — unclassified requests can default to human review (conservative, for customer-facing systems) or to allow (permissive, for internal tooling).

Known limitations:

Single-intent classification. Compound requests are reduced to the highest-confidence single intent, potentially underestimating composite risk.
English-only. Cross-lingual adversarial obfuscation is an open problem not addressed here.
Evaluation scope. The 94% is on a 50-case subset (~±6.6% at 95% confidence); per-intent breakdowns need a larger set.
HITL at scale. Validated with 16 review decisions; behavior under high review volume is not yet characterized.

Why this matters for runtime authorization

This is the research foundation under Trampolyne's runtime control: intent is a property of the request, authorization is a decision under uncertainty, and the safest system is one that knows when not to decide. Treating obfuscation as a measurable signal — rather than trying to out-classify every attack — is what makes deterministic, low-latency, adversarially-robust authorization possible without an inference-time LLM in the hot path.

Interested in how this works against your own AI system? Book a 20-min fit call or read how it works.