International Scholarly Publisher
Serving Researchers Since 2012

Provable and Practical Prompt Injection Resilience in Autonomous LLM Agents

DOI: https://doi.org/10.5281/zenodo.19603811


Mr. Charan Singh, Abdul Rashad, Md. Abdur Rasheed, Nehith Sayini, Rohith Singh, Somanath Nayak

Department of Computer Science and Engineering

Keshav Memorial Institute of Technology (An Autonomous Institution) Hyderabad, Telangana, India

Abstract: Prompt injection is a foundational security vulnerability in large language models (LLMs) deployed as autonomous agents with tool access and multi-step reasoning capabilities. Existing defenses rely on heuristic filters that fail under obfuscation, indirect injection, and multi-agent propagation. We present the Unified Cryptographic-Control Architecture (UCCA), a principled framework that integrates five complementary guarantees:

(1) information-theoretic leakage bounds derived via Fano's inequality, (2) certified robustness via randomized smoothing, (3) token-level rejection via erase-and-check, (4) runtime trajectory enforcement via control barrier functions (CBFs), and (5) verifiable inference via zero-knowledge proofs (ZK-SNARKs). We formally prove that any successful prompt injection attack must simultaneously bypass all five mechanisms, a condition we show has probability at most ε under stated assumptions. We evaluate UCCA on three real LLMs (GPT-4o, Claude 3.5 Sonnet, Mistral-7B) across four established attack benchmarks (INJECAGENT, TensorTrust, PromptBench, HarmBench), achieving attack success rates below 8% while maintaining median latency overhead under 340 ms. Our framework bridges formal security guarantees and deployable system architecture, establishing a foundation for provably secure autonomous AI.

Index Terms: Prompt Injection, LLM Security, Autonomous Agents, Information Theory, Randomized Smoothing, Control Barrier Functions, Zero-Knowledge Proofs

  1. Introduction

    Large language models are increasingly deployed as autonomous agents capable of tool invocation, long-horizon planning, and multi-step reasoning. This capability expansion introduces a critical and underexplored attack surface: prompt injection. An adversary crafts inputs (embedded in retrieved documents, API responses, or user messages) that override system instructions, hijack tool use, or exfiltrate confidential context.

    Existing defenses fall into three categories: (i) input filtering and keyword blocklists, (ii) fine-tuning on adversarial examples, and (iii) runtime heuristic guardrails. None provides formal guarantees. Blocklists are bypassed by obfuscation and paraphrase. Fine-tuned classifiers transfer poorly across attack styles. Heuristic guardrails produce high false-positive rates and degrade utility. Critically, no prior work derives provable bounds on adversarial success probability as a function of measurable system parameters.

    This paper makes several key contributions:

    • Formal threat model framing prompt injection as a statistical inference attack over q adaptive queries.

    • Information-theoretic bounds on system prompt leakage using mutual information and Fano's inequality.

    • Certified robustness for safety-critical classification through randomized smoothing, where the robustness radius R is determined from output probability gaps.

    • Token-level rejection guarantees using an erase-and-check procedure capable of detecting adversarial subsets of size k.

    • Runtime safety enforcement through control barrier functions (CBFs), ensuring LLM outputs remain within a verified safe set.

    • Verifiable inference using ZK-SNARKs, allowing cryptographic attestation of model outputs without revealing model weights.

    • UCCA, a deployable system integrating all five mechanisms, evaluated on real LLMs and standard benchmarks.

      Taken together, UCCA represents the first formally grounded, end-to-end architecture for prompt injection resilience that is both provably secure and practically deployable.

      Fig. 1. Overview of the UCCA prompt injection resilience framework.

  2. Threat Model

    We model prompt injection as a black-box statistical inference attack, where x ∈ X denotes user-controlled input, s ∈ S is a hidden system prompt drawn from a discrete space with |S| > 1, y ∈ Y represents the model output, and f_θ(x, s) is the large language model parameterized by θ.

    A. Adversary Capabilities

    An adversary A submits q adaptive queries and observes the corresponding outputs, formalized as A : (x_1, y_1, …, x_q, y_q) → ŝ. The adversary succeeds if Pr[ŝ = s] ≥ 1 − δ. The adversary may inject malicious content through multiple channels, including user turns, retrieved documents, tool responses, or memory stores. We assume A is computationally unbounded but constrained to at most q queries.

    B. Agent-Level Attack Classes

    Beyond system-prompt extraction, we consider four agent-level attack classes:

    • Direct injection: adversarial content in the user input overriding system instructions.

    • Indirect injection: adversarial payloads embedded within retrieved documents or API responses.

    • Tool misuse: injected instructions invoke privileged tools such as code execution or file writing.

    • Multi-agent propagation: adversarial content produced by one agent propagates and contaminates downstream agents within a pipeline.

    C. Security Goal

    We seek to upper-bound the probability of successful prompt injection such that Pr(successful prompt injection) ≤ ε, under the union of all four attack classes, for a system-designer-specified ε ∈ (0, 1).

  3. Information-Theoretic Leakage Bounds

    We quantify how much information about the system prompt s leaks through model outputs y, given user input x.

    A. Per-Query Leakage

    Per-query leakage is defined as the conditional mutual information:

    I(S; Y | X) = E[D_KL(P(y | x, s) ‖ P(y | x))]    (1)

    which measures the divergence between output distributions with and without knowledge of the system prompt. The total leakage over q queries, assuming bounded cross-query dependence, is:

    I_q = Σ_{i=1}^{q} I(S; Y_i | X_i)    (2)

    B. Fano's Inequality Bound

    By Fano's inequality, the adversary's error probability satisfies:

    P_e ≥ 1 − (I_q + log 2) / log |S|    (3)

    Equivalently, for an attack success probability 1 − P_e ≥ 1 − δ, the adversary requires at least:

    q ≥ ((1 − δ) log |S| − log 2) / I(S; Y | X)  queries    (4)

    C. Leakage Reduction Mechanisms

    Two mechanisms reduce I(S; Y | X):

    • Output perturbation: adds calibrated Gaussian noise N(0, σ²) to the logit distributions before sampling, directly reducing the KL divergence.

    • Context isolation (§VIII): ensures system prompt tokens do not co-attend with adversarial input tokens, setting the cross-attention contribution to zero for adversarial positions. Residual leakage after isolation is bounded as I(S; Y | X) ≤ ε.
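The query bound in Eq. (4) is directly computable. A minimal sketch; the prompt-space size, per-query leakage, and δ below are illustrative values, not measurements from the paper:

```python
import math

def min_queries(leak_per_query: float, prompt_space: int, delta: float) -> int:
    """Minimum adaptive queries an adversary needs, per Eq. (4):
    q >= ((1 - delta) * log|S| - log 2) / I(S; Y | X)."""
    q = ((1 - delta) * math.log(prompt_space) - math.log(2)) / leak_per_query
    return math.ceil(q)

# Example: |S| = 2**20 candidate prompts, 0.05 nats leaked per query,
# adversary target success probability 1 - delta = 0.9.
queries = min_queries(leak_per_query=0.05, prompt_space=2**20, delta=0.1)
print(queries)  # → 236
```

Note the linear dependence on leakage: halving I(S; Y | X) to 0.025 nats per query doubles the bound to 472 queries, which is the operational payoff of the reduction mechanisms above.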

  4. Certified Robustness via Randomized Smoothing

    We apply randomized smoothing to the safety classifier component of UCCA, obtaining input-perturbation certificates that ensure robustness against adversarial perturbations.

    1. Smoothed Classifier

      Let h : X → {0, 1} be a base safety classifier. We define the smoothed classifier as:

      g(x) = argmax_c Pr[h(x + δ) = c],  δ ~ N(0, σ²I)    (5)

      This formulation ensures that the classifier decision is based on the most probable output under Gaussian noise perturbations.

    2. Certified Radius

      Let p_A = Pr[h(x + δ) = 1] and p_B = max_{c ≠ 1} Pr[h(x + δ) = c]. The certified robustness radius is:

      R = (σ/2) (Φ⁻¹(p_A) − Φ⁻¹(p_B))    (6)

      This guarantees that for any perturbation δ with ‖δ‖₂ < R, the prediction remains unchanged: g(x + δ) = g(x).

      Fig. 2. Illustration of information-theoretic leakage from system prompt through model outputs.

    3. Practical Estimation

      We estimate p_A and p_B using Monte Carlo sampling with n = 10,000 samples, employing Bonferroni-corrected Clopper–Pearson confidence intervals at significance level α = 0.001. Inputs whose estimated gap p_A − p_B is too small to yield a certificate are abstained on and flagged for human review, preventing adversarial exploitation in regions of uncertainty.

    4. Scope and Limitations

    Randomized smoothing provides ℓ₂-norm robustness certificates for the safety classifier but does not directly certify the autoregressive outputs of the LLM. It is applied as one layer of defense, with control barrier functions (CBFs) providing complementary guarantees over the action output trajectory.
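Eq. (6) evaluates directly from the two estimated probabilities. A sketch using plug-in estimates, omitting the Clopper–Pearson correction for brevity; the probabilities and σ are illustrative:

```python
from statistics import NormalDist

def certified_radius(p_a: float, p_b: float, sigma: float) -> float:
    """L2 certified radius of the smoothed classifier, Eq. (6):
    R = (sigma / 2) * (Phi^-1(p_a) - Phi^-1(p_b)).
    Returns 0.0 (abstain) when the probability gap gives no certificate."""
    if p_a <= p_b:
        return 0.0  # gap too small to certify: abstain and flag for review
    inv_phi = NormalDist().inv_cdf  # standard Gaussian quantile function
    return 0.5 * sigma * (inv_phi(p_a) - inv_phi(p_b))

r = certified_radius(p_a=0.90, p_b=0.05, sigma=1.0)
# Any perturbation with ||delta||_2 < r leaves the smoothed decision unchanged.
```

Larger σ widens the radius proportionally but degrades the base classifier's accuracy under noise, which is the usual robustness/utility trade-off in randomized smoothing.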

  5. Token-Level Robustness: Erase-and-Check

    1. Procedure

      Let x = (t_1, …, t_n) denote a tokenized input. For a subset S ⊆ [n] with |S| ≤ k, define x_S as the input with tokens at positions S erased (replaced by [MASK]). The erase-and-check condition is:

      reject x  ⟺  ∃ S, |S| ≤ k : h(x_S) = 1    (7)

      where h(·) = 1 indicates a safety violation. Candidate subsets are sampled using greedy importance scoring to ensure tractability.
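A brute-force version of Eq. (7) can be written directly. The toy classifier h below is a hypothetical stand-in (the paper's h is a trained safety classifier), and the greedy importance scoring is omitted:

```python
from itertools import combinations

def erase_and_check(tokens: list[str], h, k: int) -> str:
    """Reject the input if erasing any subset of at most k tokens
    triggers the safety classifier h (Eq. 7). Exhaustive O(n^k) variant."""
    n = len(tokens)
    for size in range(k + 1):  # size 0 checks the unmodified input itself
        for subset in combinations(range(n), size):
            erased = ["[MASK]" if i in subset else t
                      for i, t in enumerate(tokens)]
            if h(erased) == 1:
                return "reject"
    return "accept"

# Hypothetical base classifier: misses "harmful" whenever a cloaking token
# is present, modeling a suffix-injection blind spot.
h = lambda toks: 1 if "harmful" in toks and "cloak" not in toks else 0

print(erase_and_check(["harmful", "cloak"], h, k=1))  # → reject
print(erase_and_check(["hello", "world"], h, k=1))    # → accept
```

Erasing the cloaking token exposes the violation that the base classifier alone would miss, which is exactly the guarantee the procedure certifies for subsets of size at most k.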

    2. Attack Coverage

      Erase-and-check effectively addresses several attack patterns not handled by traditional filtering:

    • Suffix injection: adversarial payloads appended after benign content.

    • Insertion attacks: adversarial tokens interleaved throughout the input.

    • Distributed attacks: malicious signals spread across multiple non-contiguous tokens.

    3. Complexity and Approximation

      The worst-case complexity of exhaustive erase-and-check is O(n^k). For k ≤ 3, this remains tractable for typical prompt lengths n ≤ 512. For larger k, gradient-guided token importance scoring restricts the search space, reducing average-case complexity to O(kn) while maintaining empirical recall above 94% on benchmark attack datasets.

  6. Control-Theoretic Safety Enforcement

    A. Agent State and Dynamics

    We model the agent as a discrete-time dynamical system:

    x_{t+1} = f(x_t, u_t)    (8)

    where x_t represents the agent state (memory, tool context, conversation history) and u_t denotes the action output generated by the LLM.

    Fig. 3. Control-theoretic safety enforcement via Control Barrier Functions (CBFs) in the agent dynamical system.

    B. Control Barrier Function

    Define a safety function h : X → R such that the safe set is C = {x : h(x) ≥ 0}. A Control Barrier Function (CBF) satisfies the discrete-time forward invariance condition:

    h(f(x_t, u)) − h(x_t) ≥ −α h(x_t),  α ∈ (0, 1]    (9)

    This ensures that if x_t ∈ C, then x_{t+1} ∈ C for every action u satisfying this condition, making the safe set forward-invariant.

    C. Safety Filter

    Safety is enforced via a Quadratic Program (QP) that minimally modifies the LLM's proposed action:

    u_t* = argmin_u ‖u − u_t^LLM‖²  s.t.  h(f(x_t, u)) ≥ (1 − α) h(x_t)    (10)

    This preserves LLM utility when actions are already safe and intervenes only when the proposed action would violate safety. The QP is solved at inference time using a lightweight quadratic solver (< 2 ms per step).

    D. Safety Function Design

    The safety function h is parameterized as a neural network trained on labeled safe/unsafe state-action pairs. Lipschitz regularization ensures the CBF constraint is differentiable. The safety function is updated periodically via active learning on flagged episodes.
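For a scalar action, the QP in Eq. (10) reduces to projecting the proposed action onto the feasible set. A sketch under assumed toy dynamics; the lambdas f and h and the grid search stand in for the paper's learned safety function and QP solver:

```python
def cbf_filter(x: float, u_llm: float, f, h, alpha: float,
               candidates: list[float]) -> float:
    """Return the candidate action closest to the LLM proposal that satisfies
    the CBF condition h(f(x, u)) >= (1 - alpha) * h(x)  (Eq. 10)."""
    threshold = (1 - alpha) * h(x)
    feasible = [u for u in candidates if h(f(x, u)) >= threshold]
    if not feasible:
        raise RuntimeError("no safe action available")
    return min(feasible, key=lambda u: (u - u_llm) ** 2)

# Toy system: state drifts by the action, safe set is |x| <= 1.
f = lambda x, u: x + u
h = lambda x: 1.0 - abs(x)
grid = [i / 100 for i in range(-200, 201)]  # candidate actions in [-2, 2]

u = cbf_filter(x=0.5, u_llm=0.8, f=f, h=h, alpha=0.5, candidates=grid)
print(u)  # → 0.25: the unsafe proposal is minimally pulled back into the safe set
```

A safe proposal (e.g. u_llm = 0.1 from the same state) passes through unmodified, illustrating the minimal-intervention property of the filter.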

  7. Cryptographic Verifiability via ZK-SNARKs

    1. Motivation

      In multi-agent and audited deployments, it is necessary to verify that a given output y was produced by an unmodified model f_θ on input x, without exposing model weights or allowing adversarial output substitution.

    2. Proof Construction

      A ZK-SNARK proof π is constructed to attest:

      π = ZK-Prove(f_θ(x) = y)    (11)

      Verification is performed as:

      Verify(x, y, π) = 1  ⟺  y = f_θ(x) and π is valid    (12)

      Importantly, the proof reveals no information about θ: π ⊥ θ (statistical independence).

    3. Practical Instantiation

      Full ZK-SNARK proofs over transformer forward passes are computationally expensive. A hybrid approach is adopted: ZK proofs are applied to the safety-critical computation path (safety classifier and CBF evaluation), while a commitment scheme covers full model outputs for auditability. This reduces proof generation time to 1.2–3.8 seconds per inference on a single A100 GPU. The Groth16 proving system over the BN-254 elliptic curve is used.

    4. Adversarial Implication

      ZK verifiability prevents an adversary from substituting outputs post-inference without detection. This mitigates attack vectors in multi-agent pipelines where intermediate outputs are relayed between untrusted components.

  8. UCCA System Architecture

    UCCA integrates five defense mechanisms into a layered architecture with four system-level components.

    1. Semantic Firewall

      A pre-inference semantic firewall operates on meaning-level representations (dense embeddings) rather than keyword patterns. Inputs are classified using a fine-tuned DeBERTa-v3 model trained on a dataset of 120,000 prompt injection examples, spanning direct, indirect, and obfuscated attacks. Adversarial inputs are rejected before reaching the LLM.

    2. Context Isolation

      System prompt tokens and user-controlled input tokens are assigned disjoint attention masks. Cross-segment attention is disabled via a modified causal mask, preventing adversarial inputs from probing system prompt content. This reduces I(S; Y | X), as formalized in §III-C.

    3. Capability Control

      Tool invocations are gated by a permission token system. Each tool requires a permission level; the LLM's permission context is initialized from the verified system prompt and cannot be elevated by user input. This prevents injected instructions from invoking privileged tools regardless of generated text.

    4. Multi-Agent Containment

      In multi-agent pipelines, each agent independently applies the full UCCA stack. Inter-agent messages are treated as untrusted user input, and ZK proofs are required for messages carrying capability assertions. This prevents injected instructions in one agent from propagating trust to downstream agents.
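Context isolation amounts to a block-structured causal attention mask. A minimal sketch, assuming system tokens occupy the first n_sys positions; a real implementation would modify the model's causal mask inside the attention kernel rather than build an explicit matrix:

```python
def isolation_mask(n_sys: int, n_user: int) -> list[list[int]]:
    """Causal attention mask with cross-segment attention disabled:
    mask[i][j] = 1 iff position i may attend to position j."""
    n = n_sys + n_user
    mask = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):  # causal: only past positions are visible
            if (i < n_sys) != (j < n_sys):
                continue  # system and user segments never co-attend
            mask[i][j] = 1
    return mask

m = isolation_mask(n_sys=2, n_user=2)
# User position 2 cannot attend to system position 0 (m[2][0] == 0), so the
# cross-attention contribution from the system prompt at adversarial
# positions is exactly zero, as assumed in the leakage bound of §III-C.
```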

  9. Unified Security Theorem

    Theorem 1 (UCCA Security Guarantee): Suppose: (1) I(S; Y | X) ≤ ε after context isolation; (2) the safety classifier has certified radius R under randomized smoothing; (3) erase-and-check holds for any adversarial subset of size ≤ k; (4) the CBF constraint is enforced at every inference step; (5) ZK proofs are verified for all inter-component outputs. Then under any adaptive adversary making q queries:

    Pr(successful prompt injection) ≤ ε(δ, R, k, q)    (13)

    where ε(δ, R, k, q) = ε_info(δ, q) · ε_smooth(R) · ε_erase(k) · ε_cbf · ε_zk, each factor derived from the corresponding defense mechanism.

    Proof Sketch: A successful attack must simultaneously: (1) extract sufficient information to distinguish s; (2) evade the smoothed safety classifier; (3) survive token erasure; (4) produce an action that violates the CBF constraint; (5) pass or forge ZK verification. We bound each factor independently and argue approximate independence given the architectural separation of mechanisms. Combining under approximate independence via a hybrid argument: Pr(successful attack) ≤ ∏_i ε_i.
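The product bound in the proof sketch composes the per-mechanism factors. The factor values below are illustrative placeholders, not calibrated values from the evaluation:

```python
def ucca_bound(factors: dict[str, float]) -> float:
    """Composite attack-success bound (Eq. 13) under the approximate-
    independence assumption: the product of per-mechanism escape
    probabilities epsilon_i."""
    bound = 1.0
    for name, eps in factors.items():
        assert 0.0 <= eps <= 1.0, f"{name} must be a probability"
        bound *= eps
    return bound

eps = ucca_bound({"info": 0.2, "smooth": 0.3, "erase": 0.25,
                  "cbf": 0.1, "zk": 0.5})
print(round(eps, 6))  # → 0.00075
```

The multiplicative structure is why the layered design helps: even individually modest factors compound, and removing any one layer replaces its factor with 1, which matches the ablation behavior reported in §X.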

  10. Experimental Evaluation

    1. Setup

      We evaluate UCCA on three production LLMs: GPT-4o (OpenAI) accessed via API; Claude 3.5 Sonnet (Anthropic) accessed via API; and Mistral-7B-Instruct self-hosted on 2× A100 GPUs. Attacks are drawn from four established benchmarks: INJECAGENT (1,054 indirect injection attacks

      in simulated agent tool-use scenarios), TensorTrust (2,312 system-prompt extraction and override attacks), PromptBench (adversarial robustness benchmark including obfuscation and character-level attacks), and HarmBench (320 goal-hijacking attacks targeting safety-relevant behaviors).

    2. Metrics

      We report: (1) Attack Success Rate (ASR), the fraction of attacks achieving adversary goals; (2) median inference latency overhead; (3) certified rate, the fraction of inputs with a valid randomized smoothing (RS) certificate.

    3. Results

      UCCA reduces ASR from 74.3% (no defense) to 7.1%, a 90.4% relative reduction. The 337 ms median latency overhead includes CBF evaluation (180 ms) and erase-and-check sampling (140 ms). ZK proof generation is performed asynchronously and does not block inference for non-audited requests. RS + Erase alone achieves a 61.4% certified rate at lower cost, providing a useful intermediate configuration for latency-sensitive deployments.

      TABLE I
      Attack Success Rate, Latency Overhead, and Certified Rate Across Defense Configurations

      Method         ASR     p50 Latency (ms)   Certified Rate
      No Defense     74.3%   Baseline           —
      Input Filter   51.2%   +18                0%
      RS + Erase     29.7%   +112               61.4%
      UCCA (ours)    7.1%    +337               84.2%

    4. Ablation Study

      Removing any single UCCA component increases ASR measurably:

      • Removing context isolation raises ASR to 19.3% (indirect injection attacks exploit cross-context leakage).

      • Removing CBF raises ASR to 15.8% on tool-misuse attacks.

      • Removing erase-and-check raises ASR to 22.1% on distributed token attacks.

        This validates the complementary coverage of the five mechanisms.

  11. Discussion

    1. Practical Deployment

      UCCA is modular: operators may deploy subsets of the five mechanisms depending on the threat model and latency budget. For latency-critical deployments, semantic firewall + context isolation + CBF (omitting erase-and-check and ZK) achieves 18.4% ASR at +95 ms latency, a reasonable intermediate. Full UCCA is recommended for high-stakes agentic deployments such as autonomous coding agents, financial assistants, and medical decision support systems.

    2. Limitations

      Several limitations warrant future work:

        • The CBF safety function requires labeled training data for the deployment domain; transferring to new domains may require fine-tuning.

        • Full ZK proofs over transformer inference remain computationally intensive; the hybrid approach covers only safety-critical paths.

        • The independence assumption in the UCCA proof is approximate; tighter bounds via information-theoretic hybrid arguments remain future work.

        • Adaptive adversaries aware of the UCCA architecture may exploit weak points in the CBF safety function; adversarial training of h is an important direction.

    3. Comparison with Related Work

    Prior work on LLM security has primarily addressed prompt injection using heuristic filtering and adversarial fine-tuning. UCCA distinguishes itself as the first framework to provide: (1) provable leakage bounds, formal guarantees on information leakage under adversarial inputs; (2) certified robustness at the classifier level, safety-certified outputs under randomized smoothing; (3) runtime trajectory guarantees, enforcement of safe actions over sequential outputs via CBFs. The closest related efforts include semantic firewalls and control-theoretic LLM safety. UCCA extends these approaches by unifying them under a single formal framework combining information-theoretic, control-theoretic, and cryptographic assurances.

  12. Conclusion

We presented UCCA, a unified framework for provable and practical prompt injection resilience in autonomous LLM agents. By integrating information-theoretic leakage bounds, randomized smoothing, erase-and-check, control barrier functions, and zero-knowledge proofs, UCCA provides the first formally grounded architecture whose security guarantee is a function of measurable system parameters.

Empirical evaluation on three production LLMs and four attack benchmarks demonstrates attack success rates below 8%, representing a 90.4% reduction over undefended baselines, with median latency overhead under 340 ms. We release code, benchmark evaluation scripts, and the UCCA safety classifier training data to support reproducibility and further research.

Acknowledgment

The authors thank the Department of Computer Science and Engineering at Keshav Memorial Institute of Technology for supporting this research. We also acknowledge the open-source communities behind DeBERTa-v3, the Groth16 proving system, and the benchmark datasets used in this evaluation.

References

  1. Z. Greshake et al., "Not What You've Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injections," arXiv:2302.12173, 2023.

  2. J. Perez and I. Ribeiro, "Ignore Previous Prompt: Attack Techniques for Language Models," NeurIPS ML Safety Workshop, 2022.

  3. J. Cohen, E. Rosenfeld, and Z. Kolter, "Certified Adversarial Robustness via Randomized Smoothing," ICML, 2019.

  4. A. Robey, E. Wong, H. Hassani, and G. J. Pappas, "SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks," arXiv:2310.03684, 2023.

  5. J. Ziegler et al., "INJECAGENT: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents," arXiv:2403.02691, 2024.

  6. E. Toyer et al., "TensorTrust: Interpretable Prompt Injection Attacks from an Online Game," ICLR, 2024.

  7. K. Zhu et al., "PromptBench: Towards Evaluating the Robustness of Large Language Models on Adversarial Prompts," arXiv:2306.04528, 2023.

  8. N. Mazeika et al., "HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal," ICML, 2024.

  9. J. Groth, "On the Size of Pairing-Based Non-interactive Arguments," EUROCRYPT, 2016.

  10. A. D. Ames, X. Xu, J. W. Grizzle, and P. Tabuada, "Control Barrier Function Based Quadratic Programs for Safety Critical Systems," IEEE TAC, 2017.

  11. T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd ed. Wiley, 2006.

  12. Y. Liu et al., "Prompt Injection Attacks and Defenses in LLM-Integrated Applications," arXiv:2310.12815, 2023.

  13. S. Wallace et al., "Universal Adversarial Triggers for Attacking and Analyzing NLP," EMNLP, 2019.

  14. P. He et al., "DeBERTa: Decoding-enhanced BERT with Disentangled Attention," ICLR, 2021.