Reliable Intent Identification Under Adversarial Conditions for Low-Latency LLM Authorization

Manish Goyal, Nitesh Sinha — Trampolyne AI, Bangalore, India. Presented at BSides Bangalore 2026.

Deploying LLMs as enterprise agents means authorizing not just who is making a request, but what the request intends to do. Our central finding: adversarial inputs are not merely misclassified; they are classified with artificially high confidence, a failure mode that binary accuracy metrics never capture. We treat obfuscation severity as a measurable signal and use it to modulate confidence, with a principled lower bound that keeps adversarial inputs out of the auto-allow zone. The result: 94% intent accuracy on a 50-case labeled subset at ~100ms P95, with no external LLM call at inference time.

1. The access-control gap in LLM systems

Large Language Models are moving from passive assistants to decision-making agents embedded in enterprise workflows. That shift exposes a fundamental limitation in existing access control: it is built as a static permission lookup, while LLM interactions are dynamic, semantic and request-driven.

Traditional Role-Based Access Control (RBAC) asks: "Can this user, in this role, perform this action?" — resolved at authentication time against a predefined policy matrix. That assumes actions are discrete, enumerable and known in advance. LLM systems break the assumption. A single role — say SUPPORT_EXECUTIVE — can issue an unbounded range of natural-language requests, each with a different security implication:

RBAC assigns identical permissions to all three because it evaluates who is acting, not what the request intends to do. The primary risks in LLM systems — prompt injection, excessive agency, privilege escalation — are properties of the request, and they emerge at inference time, where static access control has no visibility.

2. Why existing approaches fall short

The underlying issue is conceptual: the approaches compared below treat authorization as a classification problem, when it is a decision problem under uncertainty. Correctly labeling a request is neither necessary nor sufficient to decide whether it is safe.

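| Approach | Accuracy | P95 latency | Deterministic | Adversarial-robust |
| --- | --- | --- | --- | --- |
| Keyword filtering | ~62% | <1ms | Yes | No |
| RBAC-only | ~40% | <1ms | Yes | No |
| LLM-as-judge (GPT-4) | ~85% | ~1,400ms | No | Partial |
| Llama Guard | ~88% | ~800ms | No | Partial |
| Our approach | 94% | ~100ms | Yes | Yes |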

Our 94% is measured on a 50-case labeled subset. Keyword-only (62%) and RBAC-only (40%) are estimated by applying those methods in isolation to the same set. LLM-as-judge (~85%) and Llama Guard (~88%) are published figures on comparable tasks, not measured on our dataset. A shared benchmark is planned as follow-on work.

3. The problem: confident wrong answers

The standard framing of adversarial robustness asks: "can an adversary cause the wrong label?" In security, that is incomplete. What matters is whether the system acts on the wrong label with confidence. Consider two failure modes:

Failure mode B is the dangerous one — and it is systematically produced by obfuscation. Substituting visually identical characters or embedding encoded payloads shifts the embedding toward a semantically adjacent, lower-risk intent, while the classifier assigns that shifted embedding a high confidence because it matches the intent's prototype well. Across 70 adversarial inputs, obfuscated prompts that beat keyword detection produced elevated confidence — in several cases exceeding the mean confidence for clean inputs on the same intent. The adversarial inputs weren't producing uncertain decisions. They were producing confident wrong ones.

4. Obfuscation as a measurable signal

The key insight: the degree of input manipulation is independently measurable — without knowing the attacker's intent or the correct label. Manipulation leaves distinct traces:

Crucially, each type produces a measurable degree of transformation — one substituted character differs from a wholly substituted message. That gradation is information a binary "suspicious / not suspicious" flag throws away.
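As a minimal sketch of what a graded signal could look like, the functions below score three of the manipulation types named in this article (invisible characters, script substitution, encoding patterns) as a fraction between 0 and 1. The character sets, the Base64 heuristic and all function names are illustrative assumptions, not the detection patterns used in the evaluated system.

```python
import re

# Illustrative constants; the article does not publish its detection patterns.
INVISIBLE = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}
# Tiny sample of Cyrillic letters that resemble Latin a, e, o, p, c, x.
CONFUSABLES = set("\u0430\u0435\u043e\u0440\u0441\u0445")
# A run of 24+ Base64-alphabet characters with optional padding (assumed heuristic).
BASE64_RUN = re.compile(r"[A-Za-z0-9+/]{24,}={0,2}")


def invisible_severity(text: str) -> float:
    """Share of all characters that are zero-width or otherwise invisible."""
    return sum(ch in INVISIBLE for ch in text) / len(text) if text else 0.0


def script_severity(text: str) -> float:
    """Share of letters drawn from the look-alike set."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(ch in CONFUSABLES for ch in letters) / len(letters)


def encoding_severity(text: str) -> float:
    """Share of characters covered by Base64-looking runs; nothing is decoded."""
    if not text:
        return 0.0
    covered = sum(len(m.group(0)) for m in BASE64_RUN.finditer(text))
    return covered / len(text)


def measure(text: str) -> dict[str, float]:
    """Collect the per-type severities for one input, without any label."""
    return {
        "invisible": invisible_severity(text),
        "script": script_severity(text),
        "encoding": encoding_severity(text),
    }
```

With these definitions, a message containing one substituted letter scores far lower on `script` than a message written entirely in look-alike characters, which is the gradation a binary flag would discard.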

Confidence modulation under obfuscation

We feed measured manipulation severity directly into confidence calibration: the more an input is manipulated, the less the raw confidence score is trusted. Each manipulation type contributes an independent signal; together they form a composite penalty proportional to severity. Two constraints shape it:

The resulting calibration function is monotonically decreasing in manipulation severity, bounded below by a security-invariant floor, computable from measurable input properties without ground-truth labels, and deterministic.
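One function with these four properties can be sketched as follows. This is an illustration under stated assumptions, not the calibration used in the evaluated system: the weights, the floor value and the way independent signals are combined (a product of "keep" factors) are all hypothetical, and the floor is read here as a lower bound on the calibrated confidence.

```python
from typing import Mapping

# All constants are illustrative assumptions, not values from the article.
WEIGHTS = {"invisible": 4.0, "script": 3.0, "encoding": 2.0}
CONFIDENCE_FLOOR = 0.05  # calibrated confidence never drops below this


def composite_penalty(severity: Mapping[str, float]) -> float:
    """Combine independent per-type severities in [0, 1] into one penalty in [0, 1]."""
    keep = 1.0
    for name, weight in WEIGHTS.items():
        s = min(1.0, max(0.0, severity.get(name, 0.0)))
        keep *= 1.0 - min(1.0, weight * s)  # each signal can only reduce trust
    return 1.0 - keep


def calibrate(raw_confidence: float, severity: Mapping[str, float]) -> float:
    """Non-increasing in every severity, bounded below, label-free, deterministic."""
    penalized = raw_confidence * (1.0 - composite_penalty(severity))
    return max(CONFIDENCE_FLOOR, penalized)
```

For example, `calibrate(0.95, {})` leaves a clean input at 0.95, while raising any single severity can only lower the result until it reaches the floor. The floor must sit well below whatever threshold the deployment uses for auto-allow.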

The non-decode design decision

A natural response to detecting encoded content is to decode it before classifying. We explicitly do not. Auto-decoding creates a recursive attack surface: an attacker can chain encodings — Base64 wrapping a second scheme wrapping the real attack — and the decode-and-classify loop has no natural termination. Detection without decoding avoids this: the presence of an encoding pattern is itself sufficient evidence of manipulation. We raise the suspicion signal, apply the penalty, and classify the original text. The classifier will likely misclassify it — but the penalty ensures that misclassification cannot produce an auto-allow. The goal is not to understand every attack, but to ensure attacks cannot produce confident wrong answers.
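The control flow described above can be sketched in a few lines. The regular expression, the flat penalty value and the function names are assumptions made for illustration; `classify` stands in for whatever intent classifier is deployed and is passed in by the caller.

```python
import re
from typing import Callable, Tuple

# Assumed heuristic: a run of 24+ Base64-alphabet characters, optional padding.
BASE64_RUN = re.compile(r"[A-Za-z0-9+/]{24,}={0,2}")
ENCODING_PENALTY = 0.5  # hypothetical flat penalty for any encoded span

Classifier = Callable[[str], Tuple[str, float]]


def looks_encoded(text: str) -> bool:
    """Single pass over the raw text; no decoding, so no recursion to bound."""
    return BASE64_RUN.search(text) is not None


def classify_without_decoding(text: str, classify: Classifier) -> Tuple[str, float]:
    """Classify the original text, then distrust the score if encoding is present."""
    intent, raw_confidence = classify(text)
    if looks_encoded(text):
        return intent, raw_confidence * (1.0 - ENCODING_PENALTY)
    return intent, raw_confidence
```

Because the check never unwraps the payload, a doubly or triply encoded input costs the same single pass as a singly encoded one, and the returned intent may still be wrong while its confidence is reduced.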

Empirical validation

Evaluated against 70 adversarial inputs spanning invisible-character injection, script substitution, encoding obfuscation, paraphrasing and multi-technique combinations:

| Condition | Auto-allowed | Blocked | Human review |
| --- | --- | --- | --- |
| Without modulation | 23 (33%) | 31 (44%) | 16 (23%) |
| With modulation | 0 (0%) | 31 (44%) | 39 (56%) |

Without modulation, 23 of 70 adversarial inputs landed in the auto-allow range — a 33% auto-allow rate on known adversarial inputs. With modulation, zero reached auto-allow; all 70 were blocked (31) or routed to review (39), with zero false positives on clean inputs. The shift to review rather than block is by design: for novel techniques outside our detection patterns, routing to a human minimizes false positives while guaranteeing no adversarial input is auto-allowed.

5. Evaluation

The full system ran as a containerized security gateway on a standard cloud instance (4 vCPU, 8 GB RAM), in front of a Groq-hosted LLM endpoint simulating a BFSI customer-service agent. All evaluation used the live deployed system, not mocks. The dataset: 250 labeled prompts across 37 intent classes covering 7 business domains and 13 categories of adversarial probes.

Classification accuracy

94% overall on the 50-case hand-labeled subset (47 of 50). Of the 3 misclassifications, 2 still produced the correct authorization decision because the actual and predicted intents carried the same risk tier:

| Case | Actual intent | Classified as | Governance correct? |
| --- | --- | --- | --- |
| #20 | card_block_request | fraud_report_request | Yes (both HIGH risk) |
| #38 | multi_account_data_request | transaction_history_query | No (risk-tier mismatch) |
| #44 | jailbreak_attempt | instruction_override_attempt | Yes (both CRITICAL / BLOCK) |

Only Case #38 was a material failure: an underspecified bulk-data request classified as a low-risk query. That is precisely the error that classifier improvement alone cannot prevent — it needs human judgment, which should feed back into the system.
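The distinction between a wrong label and a wrong decision can be made concrete with a small check. The intent names come from the three cases above; the tier for `multi_account_data_request` is an assumption (the article only states that its tier differs from the low-risk query), and the function name is hypothetical.

```python
# Risk tiers for the six intents involved in the three misclassifications.
RISK_TIER = {
    "card_block_request": "HIGH",
    "fraud_report_request": "HIGH",
    "multi_account_data_request": "HIGH",  # assumed tier; only the mismatch is reported
    "transaction_history_query": "LOW",    # described in the text as a low-risk query
    "jailbreak_attempt": "CRITICAL",
    "instruction_override_attempt": "CRITICAL",
}


def governance_correct(actual: str, predicted: str) -> bool:
    """A misclassification is harmless when both intents share a risk tier."""
    return RISK_TIER[actual] == RISK_TIER[predicted]


cases = {
    20: ("card_block_request", "fraud_report_request"),
    38: ("multi_account_data_request", "transaction_history_query"),
    44: ("jailbreak_attempt", "instruction_override_attempt"),
}
material_failures = [n for n, (a, p) in cases.items() if not governance_correct(a, p)]
```

Under these assumptions `material_failures` contains only case 38, matching the table.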

Latency

P50 ~40ms, P95 ~100ms, observed range 11–215ms (the 215ms maximum is the first request after startup, before inference caching stabilizes). For comparison, LLM-as-judge against the same infrastructure measured P50 ~850ms and P95 ~1,400ms: roughly 14× slower at P95 and about 21× slower at P50.

6. Human-in-the-loop feedback

The FLAG outcome — routing to human review — is not a fallback for classifier failure. It is a design choice that gives uncertain cases scrutiny and generates labeled training data from real decisions. When a reviewer approves a flagged request, that is a ground-truth label: this pattern, from this role, in this context, is safe.

The learning is asymmetric by design. Approvals reduce future flagging on known-safe patterns; rejections reinforce detection of adversarial ones. The system never auto-learns to allow a new pattern without a human having approved an instance first — which limits the attack surface for adversarial learning. In the prototype, 16 review decisions over three days dropped the FLAG rate from ~18% to ~9% with no loss of accuracy on the adversarial test set.
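A minimal sketch of that asymmetry, assuming a pattern is keyed by a (role, intent) pair and that outcomes are the strings ALLOW, FLAG and BLOCK. The class, its method names and the rule that a rejection overrides an earlier approval are illustrative assumptions, not the prototype's implementation.

```python
from dataclasses import dataclass, field

Pattern = tuple[str, str]  # hypothetical key: (role, intent)


@dataclass
class FeedbackStore:
    approved: set[Pattern] = field(default_factory=set)
    rejected: set[Pattern] = field(default_factory=set)

    def record(self, pattern: Pattern, reviewer_approved: bool) -> None:
        """Store a human decision; a rejection always wins over an approval."""
        if reviewer_approved:
            if pattern not in self.rejected:
                self.approved.add(pattern)
        else:
            self.rejected.add(pattern)
            self.approved.discard(pattern)

    def adjust(self, pattern: Pattern, decision: str) -> str:
        """Relax FLAG to ALLOW only for human-approved patterns; never relax BLOCK."""
        if pattern in self.rejected:
            return "BLOCK"
        if decision == "FLAG" and pattern in self.approved:
            return "ALLOW"
        return decision
```

Nothing in this sketch can move a pattern to ALLOW unless `record` was called with a reviewer approval for that exact pattern, which is the property that limits adversarial learning.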

7. Generalizability and limitations

The confidence-modulation mechanism is domain-agnostic: it operates on measurable properties of input text, not on the intent taxonomy. Applying it to retail or healthcare requires only an ontology and prototype configuration, not changes to the modulation logic. Deployment posture is configurable — unclassified requests can default to human review (conservative, for customer-facing systems) or to allow (permissive, for internal tooling).
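The separation between domain configuration and modulation logic might look like the sketch below. The type names, field names and example values are hypothetical; the only behavior taken from the text is that unclassified requests default to review under a conservative posture and to allow under a permissive one.

```python
from dataclasses import dataclass
from enum import Enum


class Posture(Enum):
    CONSERVATIVE = "conservative"  # e.g. customer-facing systems
    PERMISSIVE = "permissive"      # e.g. internal tooling


@dataclass(frozen=True)
class DomainConfig:
    domain: str
    intents: tuple[str, ...]  # the domain's intent ontology
    posture: Posture


def route_unclassified(config: DomainConfig) -> str:
    """Decide what happens to a request that matches no configured intent."""
    return "FLAG" if config.posture is Posture.CONSERVATIVE else "ALLOW"


# Hypothetical example: a new domain supplies only configuration.
retail = DomainConfig("retail", ("order_status_query", "refund_request"), Posture.CONSERVATIVE)
```

Moving to a new domain changes only the `DomainConfig` instance; the severity measurement and calibration code would be reused unchanged.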

Known limitations:

Why this matters for runtime authorization

This is the research foundation under Trampolyne's runtime control: intent is a property of the request, authorization is a decision under uncertainty, and the safest system is one that knows when not to decide. Treating obfuscation as a measurable signal — rather than trying to out-classify every attack — is what makes deterministic, low-latency, adversarially-robust authorization possible without an inference-time LLM in the hot path.