
Behavioral Stability of Quantized Large Language Models Under Prompt Drift: The Resilio Evaluation Framework

DOI: https://doi.org/10.5281/zenodo.19854991

Vedant Barhate

School of Computing, MIT ADT University, Pune, India

Yash Pandey

School of Computing, MIT ADT University, Pune, India

Achyut Yesare

School of Computing, MIT ADT University, Pune, India

Vibhor Dev

School of Computing, MIT ADT University, Pune, India

Abstract – Large Language Models (LLMs) are increasingly deployed in resource-constrained environments using quantization techniques such as 8-bit and 4-bit precision, which reduce memory footprint and inference costs. While these models perform well on clean benchmarks, real-world deployment frequently encounters Prompt Drift: typographical errors, informal phrasing, and structural degradation.

This paper introduces Resilio, a systematic evaluation framework investigating the interaction between model quantization and five progressive levels of prompt quality degradation. We evaluate Llama 3.1 8B, Mistral 7B, and Phi-3 Mini using two novel metrics: the Task Performance Score (TPS) and the Behavioral Stability Score (BSS).

Our study reveals a Quantization Amplification effect, where 4-bit models exhibit disproportionately higher sensitivity to noise compared to FP16 baselines, particularly in reasoning-intensive tasks.

Index Terms – Quantization, Large Language Models, Prompt Drift, Behavioral Stability, BSS, TPS, Robustness, Deployment

  1. Introduction

    The deployment of Large Language Models (LLMs) has shifted from centralized, high-performance data centers to resource-constrained edge environments, including mobile hardware and local workstations. This transition is driven by the demand for reduced latency, enhanced data privacy, and lower operational costs. To enable this scaling, post-training quantization (PTQ) has become the industry-standard optimization strategy, compressing model weights from 16-bit floating point (FP16) to 8-bit (INT8) or 4-bit (INT4) precision. While these methods achieve substantial memory compression with minimal impact on clean-input accuracy, they introduce a poorly understood layer of behavioral risk.

    1. The Phenomenon of Prompt Drift

      Standard evaluation protocols assess LLMs using ideal benchmarks characterized by curated, well-structured syntax. However, production environments expose models to a continuous spectrum of input degradation, a phenomenon we formalize as Prompt Drift. Prompt Drift reflects the naturally occurring noise in human-AI interaction, spanning five escalating severity levels:

      • L1 Typographical Noise: Introduction of character-level errors such as misspellings, adjacency mistakes, and letter transpositions while preserving overall structure and meaning.

      • L2 Formatting Loss: Removal of capitalization, punctuation, and standard formatting cues, leading to reduced structural clarity without altering core semantics.

      • L3 Linguistic Informality: Incorporation of slang, abbreviations, and colloquial expressions (e.g., u, pls, bc), introducing deviations from formal language patterns.

      • L4 Structural Degradation: Breakdown of grammatical structure through word merging, spacing errors, and syntactic inconsistencies that affect sentence readability.

      • L5 Semantic Fragmentation: Severe degradation characterized by incomplete phrases, truncated context, and fragmented instructions, leading to partial or ambiguous semantic representation.

    2. The Critical Evaluation Gap

    A fundamental gap exists in current AI research: the interaction between model quantization and prompt drift remains largely unmeasured. Traditional benchmarks fail to capture how a reduction in internal numerical precision affects a model's capacity to parse and reason over noisy data. There is a significant concern that quantization does not merely decrease absolute performance but actively amplifies a model's sensitivity to input degradation. This amplification can lead to silent failures, where a quantized model maintains high expressed confidence while providing factually incorrect or logically incoherent responses.

    The Resilio framework is designed to address this gap. By moving beyond binary accuracy and implementing multi-dimensional stability scoring, we aim to establish empirical safety thresholds for the deployment of quantized models in real-world, noisy environments.

  2. Related Work

    The pursuit of deploying Large Language Models (LLMs) on edge devices has catalyzed research into two previously distinct domains: computational optimization through quantization and the linguistic robustness of models under perturbation. The Resilio framework operates at the intersection of these fields.

    1. LLM Quantization for Efficient Deployment

      The necessity for on-device LLM execution has led to the rapid maturation of Post-Training Quantization (PTQ) techniques. As noted by Dettmers et al. [1], 8-bit matrix multiplication allows for significant memory reduction with negligible performance loss on curated benchmarks. The introduction of 4-bit quantization further pushes these boundaries, enabling models like Phi-3, Llama 3.1, and Mistral 7B to operate within the strict memory constraints of mobile hardware [2], [3], [9]. However, while prior analyses provide extensive evaluation of the resource efficiency of INT4 versus INT8 representations [1], [9], these studies typically rely on clean input distributions. Recent work suggests that such extreme compression levels may introduce hidden instabilities or alter model safety alignment, a phenomenon sometimes termed quantization-based vulnerability [9].

    2. Robustness to Input Perturbations and Prompt Drift

      The sensitivity of LLMs to surface-level changes in input text is a well-documented challenge. Research into typographical noise [11] and naturally occurring errors in training data [8] confirms that even slight character-level deviations can produce significant variance in model output. Frameworks such as PromptBench have begun to standardize the measurement of model resilience against adversarial attacks [5].

      Furthermore, studies on noisy instructions indicate that the quality of the prompt is a primary determinant of zero-shot generalization [10]. Despite these insights, existing robustness studies primarily utilize full-precision (FP16 or FP32) models, leaving the stability of quantized architectures under similar Prompt Drift largely unmeasured [4].

    3. Behavioral Stability and Runtime Monitoring

      As AI agents move into production, behavioral stability has emerged as a more critical metric than simple accuracy. This stability can be defined as the consistency of model behavior across intra-prompt variations [6]. In real-world deployments, data drift, the shift in the distribution of user inputs over time, presents a constant risk to model reliability [7].

      While the research community has proposed various runtime monitoring strategies to detect behavioral shifts in AI systems [12], there remains a lack of unified metrics that measure stability relative to a clean baseline. The interaction between acceleration techniques and these hidden instabilities remains an open area of inquiry [9].

    4. Identification of Research Gaps

      A systematic review of the literature reveals a critical tripartite gap:

      • Isolation of Factors: Quantization is typically evaluated on clean inputs, while robustness is tested on full-precision models. The interaction between bit-precision and input drift is largely omitted.

      • Absence of Graduated Drift Frameworks: Most studies rely on binary noise conditions (clean vs. noisy) rather than a structured, multi-level taxonomy (L1–L5) as proposed in this work.

      • Metric Limitations: Standard accuracy metrics fail to capture silent failures, where a model maintains high confidence despite producing logically inconsistent or incorrect outputs.

    By introducing the Task Performance Score (TPS) and Behavioral Stability Score (BSS), the Resilio framework provides a multi-factor analysis of task-specific stability patterns under quantized prompt drift.

  3. The Resilio Methodology

    The architecture of the Resilio framework is centered on a modular, reproducible pipeline that bridges the gap between synthetic benchmarks and the stochastic nature of production-level input noise. The methodology is executed through three distinct phases: custom dataset curation, object-oriented drift simulation, and semantic validation.

    1. Dataset Architecture

      To facilitate a granular analysis of model stability, we curated a specialized dataset comprising 40 base prompts. While these prompts are strategically categorized to mirror established linguistic challenges, they were custom-developed to target specific behavioral boundaries in Large Language Models:

      • Reasoning: 10 prompts engineered to evaluate multi-step mathematical logic and sequential deduction.

      • Logical Inference: 10 prompts structured to assess complex deductive reasoning and syllogistic consistency.

      • Question Answering: 10 prompts focused on the precision of factual retrieval and context-window utilization.

      • Sentiment Classification: 10 prompts designed to measure the robustness of pattern recognition under lexical noise.

    2. OOP-based Drift Generation Framework

      The technical core of Resilio is an Object-Oriented Programming (OOP) based drift generator. This modular engine allows for the application of graduated, additive transformations that simulate realistic user-driven degradation. Each base prompt was subjected to 16 unique variations to ensure statistical diversity in the results.

      The framework formalizes Prompt Drift into a five-tier severity taxonomy:

      • L1 (Character-level Perturbation): Simulates typographical errors via keyboard adjacency mapping and character transpositions.

      • L2 (Form Degradation): Implements systematic removal of casing, standard punctuation, and formal whitespace.

      • L3 (Linguistic Shift): Replaces formal lexical units with slang, non-standard abbreviations, and colloquial expressions.

      • L4 (Grammatical Breakdown): Induces word merging, omission of essential function words, and overall syntactic collapse.

      • L5 (Structural Truncation): Combines aggressive informality with content reduction, forcing the model to infer intent from fragmented inputs.
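      The additive, per-level design of this taxonomy lends itself to a small object-oriented generator. The following sketch illustrates the idea for L1 and L2 only; the class names, the adjacency map, and the perturbation rate are illustrative assumptions, not the authors' implementation:

```python
import random

# Hypothetical keyboard-adjacency map (illustrative subset of a full layout)
ADJACENT = {"a": "qs", "e": "wr", "o": "ip", "t": "ry", "n": "bm"}

class TypoNoise:
    """L1: character-level perturbation via keyboard adjacency."""
    def __init__(self, rate: float = 0.08):
        self.rate = rate

    def apply(self, text: str, rng: random.Random) -> str:
        return "".join(
            rng.choice(ADJACENT[c.lower()])
            if c.lower() in ADJACENT and rng.random() < self.rate else c
            for c in text
        )

class FormatLoss:
    """L2: remove casing and punctuation while preserving core semantics."""
    def apply(self, text: str, rng: random.Random) -> str:
        return "".join(c for c in text.lower() if c.isalnum() or c.isspace())

def generate_variants(prompt, transforms, n=16, seed=42):
    """Additively apply all transforms up to the target level, n variants per prompt."""
    out = []
    for i in range(n):
        rng = random.Random(seed + i)  # distinct seed per variant for diversity
        text = prompt
        for t in transforms:
            text = t.apply(text, rng)
        out.append(text)
    return out

# L2 drift = L1 noise plus format loss, 16 variants as in the paper
variants = generate_variants("What is the capital of France?",
                             [TypoNoise(), FormatLoss()])
```

Higher levels (L3–L5) would slot in as further classes in the same chain, which is what makes the additive, graduated design convenient.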

    3. Mathematical Validation of Realism

      To ensure that the generated drift represents realistic noisy input rather than incoherent gibberish, we mathematically validate the quality of perturbations using the all-MiniLM-L6-v2 SentenceTransformer model. We compute the cosine similarity between the clean baseline (L0) and each drifted variant.

      Formally, for embedding vectors u (clean prompt) and v (drifted prompt), cosine similarity is defined as:

      CosineSimilarity(u, v) = (u · v) / (‖u‖ ‖v‖)    (1)

      The generated variants are required to satisfy strict semantic similarity ranges:

      • L1: 93.2% similarity (high surface-level fidelity)

      • L2: 88.1% similarity (loss of structural metadata)

      • L3: 83.4% similarity (lexical divergence)

      • L4: 77.8% similarity (syntactic fragmentation)

      • L5: 71.5% similarity (boundary of semantic coherence)

      This calibration confirms that the Resilio framework provides a rigorously controlled and semantically grounded stress test for evaluating quantized language model robustness.

  4. Metric Formulation (TPS & BSS)

    The core analytical contribution of the Resilio framework is its departure from simplistic accuracy metrics toward a multidimensional assessment of model reliability. By decoupling peak capability from structural consistency, we introduce a dual-metric system designed to capture the granular degradation of quantized architectures.

    A. Task Performance Score (TPS): The 5-Factor Behavioral Matrix

    Traditional benchmarking often fails to identify silent failures: instances where a model maintains linguistic fluency while undergoing logical or factual collapse. To quantify these effects, each response is evaluated across a weighted five-dimensional behavioral vector:

    • Factual Correctness (FC): Cross-referencing the model's terminal output against ground-truth metadata.

    • Logical Correctness (LC): Verification of deductive validity and mathematical accuracy of intermediate steps.

    • Reasoning Coherence (RC): Assessment of structural integrity and sequential clarity of the explanation.

    • Sentiment Correctness (SC): Evaluation of label accuracy for pattern-matching and classification tasks.

    • Manner of Language (ML): A calibration metric measuring whether linguistic confidence aligns with actual output accuracy.

    The Task Performance Score (TPS) is computed as:

      TPS = Σ_{i=1}^{5} (w_i · Score_i)    (2)

    where w_i denotes task-specific weight coefficients. These weights are dynamically adjusted; for example, Logical Correctness (LC) is prioritized in reasoning tasks, while Sentiment Correctness (SC) is emphasized in classification settings.

    B. Behavioral Stability Score (BSS)

    While TPS captures absolute performance, the Behavioral Stability Score (BSS) measures resilience to input degradation. By normalizing degraded performance against a clean baseline, BSS provides a relative stability index independent of model scale.

    For a given drift level d, BSS is defined as:

      BSS_d = TPS_d / TPS_0    (3)

    where TPS_d represents the mean Task Performance Score across all variants at drift level d, and TPS_0 denotes performance on the clean baseline (L0).

    C. Stability Threshold and Behavioral Collapse

    A central empirical finding of this work is the identification of a stability threshold defined by BSS < 0.7. This boundary distinguishes safe degradation from behavioral collapse.

    The choice of the 0.7 threshold is grounded in the observed divergence between linguistic fluency and factual correctness. Below this threshold, the frequency of silent failures increases sharply: models maintain high Manner of Language (ML) scores while exhibiting significant degradation in Factual Correctness (FC) and Logical Correctness (LC).

    By formalizing this threshold, the Resilio framework provides a practical safety metric for determining acceptable quantization levels in real-world deployment scenarios.

  5. Experimental Setup and Implementation

    The empirical validation of the Resilio framework involved a large-scale, multi-threaded computational pipeline designed to measure the intersection of model architecture, bit-precision depth, and input degradation. The implementation was architected to ensure strict environmental parity across all test configurations to isolate the variables of quantization and drift.

    1. Technical Infrastructure and Model Configurations

      The study utilized a high-performance compute environment leveraging the Hugging Face ecosystem and the BitsAndBytes library for precision management. We selected three open-source models representing diverse parameter scales and architectural designs: Llama 3.1 8B, Mistral 7B, and Phi-3 Mini. Each model was instantiated across three distinct quantization variants:

      • FP16: The 16-bit floating-point baseline.

      • INT8: 8-bit quantization using vector-wise quantization logic.

      • INT4: 4-bit quantization with double quantization enabled for maximum memory efficiency.

        To maintain internal validity, we enforced consistent generation parameters across all 5,760 total inferences, utilizing a temperature of 0.7, a top-p value of 0.9, and a maximum token limit of 350.
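        With Hugging Face Transformers and BitsAndBytes, the three precision variants can be instantiated roughly as follows. This is a sketch under assumptions: the model ID is one plausible choice among the three families, and the flags follow the public library API rather than the authors' exact configuration:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

MODEL_ID = "mistralai/Mistral-7B-Instruct-v0.3"  # illustrative; any of the three families

def load_variant(precision: str):
    """Instantiate one of the three precision variants used in the study."""
    if precision == "fp16":
        quant = None                                   # 16-bit baseline
    elif precision == "int8":
        quant = BitsAndBytesConfig(load_in_8bit=True)  # vector-wise INT8
    elif precision == "int4":
        quant = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_use_double_quant=True,            # double quantization
            bnb_4bit_compute_dtype=torch.float16,
        )
    else:
        raise ValueError(f"unknown precision: {precision}")
    return AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        torch_dtype=torch.float16,
        quantization_config=quant,
        device_map="auto",
    )

# Generation parameters held constant across all 5,760 inferences
GEN_KWARGS = dict(do_sample=True, temperature=0.7, top_p=0.9, max_new_tokens=350)
```

Holding GEN_KWARGS fixed across variants is what isolates quantization and drift as the only moving variables.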

    2. The LLM-as-Judge Evaluation Pipeline

      Manual annotation of 5,760 multi-factor responses was determined to be logistically unfeasible; we therefore engineered an automated evaluation pipeline utilizing GPT-OSS 20B as a neural rater, accessed via the Groq API. This LLM-as-Judge system was prompted with a structured JSON protocol to extract a five-dimensional scoring vector, comprising Factual Correctness (FC), Logical Correctness (LC), Reasoning Coherence (RC), Sentiment Correctness (SC), and Manner of Language (ML), for each response.

      To resolve technical bottlenecks and ensure data integrity, the pipeline implemented:

      • Asynchronous Threading: Designed to maintain an optimal throughput of 30 requests per minute (RPM) and manage API rate limits during inference.

      • Contextual Grounding: Each evaluation request included the original clean prompt and ground-truth metadata to prevent evaluator hallucination.

      • Memory Management: Automated GPU offloading and memory cleaning were utilized between quantization cycles to prevent VRAM fragmentation during local INT4/INT8 testing.
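      The judging protocol can be sketched as two small helpers, one building the grounded request and one validating the returned scoring vector. The rubric wording and message layout are illustrative assumptions; the authors' exact prompt is not reproduced here:

```python
import json

# Illustrative rubric sent as the system message; the paper's exact wording
# is not reproduced.
RUBRIC = (
    "Score the RESPONSE on five factors, each 0-10, and reply with JSON only: "
    '{"FC": 0, "LC": 0, "RC": 0, "SC": 0, "ML": 0}'
)

def build_judge_request(clean_prompt: str, ground_truth: str, response: str) -> list:
    """Ground each request with the clean prompt and ground-truth metadata,
    which is what curbs evaluator hallucination."""
    user = (
        f"CLEAN PROMPT: {clean_prompt}\n"
        f"GROUND TRUTH: {ground_truth}\n"
        f"MODEL RESPONSE: {response}"
    )
    # These messages would be sent to the judge model (GPT-OSS 20B via the
    # Groq API in the paper) with a JSON response format enforced, paced to
    # stay within the 30 RPM budget.
    return [{"role": "system", "content": RUBRIC},
            {"role": "user", "content": user}]

def parse_scores(raw_json: str) -> dict:
    """Validate the five-dimensional scoring vector returned by the judge."""
    scores = json.loads(raw_json)
    missing = {"FC", "LC", "RC", "SC", "ML"} - scores.keys()
    if missing:
        raise ValueError(f"judge omitted factors: {missing}")
    return scores
```

Keeping these helpers pure makes the pipeline easy to unit-test independently of the API client and its rate limits.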

    3. TPS and BSS Computation & Visualization

    Following the evaluation phase, the extracted scoring vectors were aggregated to compute the primary metrics: Task Performance Score (TPS) and Behavioral Stability Score (BSS). For every individual response, the TPS was derived from a weighted combination of the five evaluation factors, with category-specific weights applied to reflect the unique requirements of Reasoning, Logic, Question Answering, and Classification tasks.

    The Behavioral Stability Score (BSS) was subsequently computed at each drift level as the ratio between the mean TPS of degraded prompt variants (TPS_d) and the TPS of the corresponding clean baseline (TPS_0). This normalization enabled direct comparison of degradation trajectories relative to ideal conditions, independent of model scale.

    To facilitate granular analysis, the results were structured across dimensions of model architecture, quantization level, task category, and drift level. Visualization techniques, including line plots for BSS trajectories, bar charts for prompt-level variation, and heatmaps for category-wise stability patterns, were utilized to identify degradation trends and quantization-induced amplification effects.

    This multi-level visualization framework enabled clear interpretation of behavioral instability across all experimental conditions.

  6. Results, Analysis, and Discussion

    As shown in Fig. 1, all models exhibit a gradual degradation in stability as drift increases, with INT4 configurations collapsing significantly earlier.

    Fig. 2a highlights task-specific divergence across categories, while Fig. 2b shows the global degradation trend across quantization levels.

    Fig. 3a demonstrates that INT4 degradation is disproportionately higher, while Fig. 3b confirms statistical reliability.

    Fig. 4 illustrates the higher-order interaction between architecture, quantization, and prompt drift.

    The empirical evaluation of 5,760 model-response pairs reveals a complex, non-linear relationship between numerical precision and linguistic robustness. Our analysis centers on three primary phenomena: the amplification of stability loss through quantization, the inherent fragility of specific cognitive tasks, and the emergence of silent failures in highly compressed models.

    1. The Quantization Amplification Effect

      A central finding of this research is the Quantization Amplification effect, wherein reducing bit-precision does not merely lower absolute performance but actively accelerates the rate of behavioral degradation under input noise.

      The empirical results reveal a consistent but nuanced Quantization Amplification effect. While the absolute BSS difference between INT4 and FP16 configurations remains modest (1–3 percentage points) at lower drift levels, the divergence becomes more pronounced at higher severity levels (L4–L5), particularly in reasoning-intensive tasks. For Phi-3 Mini, INT4 configurations show a steeper degradation slope compared to FP16, reaching BSS < 0.75 at L5 while FP16 maintains BSS > 0.80 in classification tasks. This non-linear acceleration in degradation rate, rather than a large absolute gap, constitutes the amplification effect we identify.

      This pattern suggests that at 4-bit precision, the model's tolerance for token-level noise is reduced. Rather than a catastrophic collapse, the effect manifests as an earlier onset of degradation and a steeper decline trajectory under high drift, particularly for architectures with smaller parameter counts such as Phi-3 Mini.

    2. Task-Specific Fragility: Reasoning vs. Classification

      Fig. 1. All models exhibit progressive degradation with increasing drift severity. INT4 configurations demonstrate an earlier onset and steeper decline trajectory, particularly in reasoning tasks.

      The results demonstrate that behavioral stability is highly task-dependent, as illustrated by category-wise mean BSS trends. A clear divergence emerges between pattern-matching tasks and computational reasoning tasks:

      • Logical and Reasoning Fragility: Tasks requiring multi-step deduction (e.g., GSM8K and LogiQA) are the most vulnerable. In these domains, INT4 models reach the behavioral collapse threshold (BSS < 0.7) as early as L3 drift. The loss of precision disrupts chain-of-thought consistency, where a single perturbed token can redirect the reasoning trajectory entirely.

      • Classification Resilience: In contrast, sentiment classification (SST-2) exhibits strong robustness. Even INT4 models maintain BSS > 0.8 through L3 drift across most architectures. This suggests that classification relies on global semantic signals that are less sensitive to localized perturbations introduced by quantization.

    3. The Silent Failure Phenomenon and Calibration Loss

      One of the most critical safety concerns identified in this study is the emergence of silent failures. This phenomenon is characterized by a divergence between linguistic fluency and factual correctness.

      Our multi-factor evaluation shows that while Manner of Language (ML) scores often remain high, both Factual Correctness (FC) and Logical Correctness (LC) degrade significantly under high drift. In INT4 configurations, models frequently produce syntactically fluent and highly confident responses that are nevertheless factually incorrect or logically inconsistent.

      This confidence-accuracy divergence represents a critical deployment risk. Systems that fail with low confidence are manageable; however, systems that fail with high confidence introduce substantial risk in real-world applications.

      Fig. 2. (a) Category-wise BSS showing task-dependent robustness differences. (b) Aggregated quantization trends demonstrating a consistent decline in stability with reduced precision.

      Fig. 3. (a) Quantization gap analysis highlighting disproportionate stability loss in INT4 models. (b) Statistical validation confirming that degradation trends are significant across drift levels.

    4. Discussion: Implications for Deployment

    The results indicate the existence of a practical stability ceiling for quantized models. While INT4 quantization remains viable for low-complexity classification tasks under moderate noise, it is fundamentally unsuitable for reasoning-intensive applications without robust input preprocessing.

    These findings highlight a critical trade-off: while aggressive quantization yields substantial memory and efficiency gains, it significantly increases the risk of behavioral collapse under realistic prompt conditions. Developers must therefore carefully balance compression benefits against robustness requirements when deploying LLMs in production environments.

  7. Deployment Guidelines

    The findings of the Resilio study provide a definitive evidence base for optimizing the trade-off between computational efficiency and behavioral reliability. To assist practitioners in navigating these trade-offs, we propose a structured deployment matrix.

    1. Deployment Decision Matrix

      Based on the observed stability thresholds (BSS < 0.7), we recommend the following precision-task mapping for production environments:

      TABLE I
      Deployment Decision Matrix

      Task Category        | Precision   | Operational Constraint
      ---------------------|-------------|------------------------------------------------------
      Reasoning / Logic    | FP16 / INT8 | INT4 viable only if input similarity > 90%
      Fact-Retrieval (QA)  | INT8        | Safe for moderate drift; INT4 requires normalization
      Classification       | INT4        | Robust under high drift conditions
      Safety-Critical      | FP16        | Avoid quantization to prevent silent failures

    2. Strategic Recommendations for Practitioners

      1. Prioritize Input Normalization: For reasoning-intensive tasks, applying a lightweight spell-correction or grammar-normalization layer prior to inference can significantly improve stability in low-precision models.

      2. Monitor Silent Failures: Confidence scores should not be treated as reliable indicators of correctness in quantized models. Low-precision systems may retain high linguistic fluency despite logical or factual errors.

        Fig. 4. 3D interaction between model architecture, quantization level, and drift severity, illustrating the compounded impact of compression and input degradation on behavioral stability.

      3. Implement Similarity Gates: Production pipelines should track cosine similarity between incoming prompts and a clean reference distribution. If similarity falls below 0.80, requests should be escalated to higher-precision model variants.
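      A similarity gate of this kind amounts to a cosine-similarity check against reference embeddings. A minimal sketch follows; the routing tiers and the escalation logic are illustrative, and in practice the vectors would come from an embedder such as all-MiniLM-L6-v2:

```python
import numpy as np
# In production the vectors below would come from an embedding model, e.g.:
# from sentence_transformers import SentenceTransformer
# embedder = SentenceTransformer("all-MiniLM-L6-v2")

GATE_THRESHOLD = 0.80  # below this, escalate to a higher-precision variant

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """cos(u, v) = (u . v) / (||u|| ||v||), as in the drift-validation step."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def route(prompt_vec: np.ndarray, reference_vecs: np.ndarray) -> str:
    """Compare an incoming prompt embedding against the clean reference
    distribution and pick a precision tier (tier names are illustrative)."""
    best = max(cosine_similarity(prompt_vec, r) for r in reference_vecs)
    return "int4" if best >= GATE_THRESHOLD else "fp16"
```

Requests whose best similarity to any clean reference falls below 0.80 are thus served by the full-precision variant, trading latency for stability only when drift is detected.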

  8. Conclusion

This paper introduced the Resilio framework to quantify the interaction between model quantization and prompt drift. Through large-scale evaluation, we demonstrated the Quantization Amplification effect, showing that low-precision models are significantly more sensitive to input degradation.

By introducing the Task Performance Score (TPS) and Behavioral Stability Score (BSS), this work establishes a standardized methodology for evaluating model reliability beyond traditional accuracy metrics.

The results highlight a critical trade-off: while aggressive quantization enables scalable deployment, it introduces non-linear risks in reasoning accuracy and model calibration. The Resilio framework provides a principled approach to navigating this trade-off, ensuring that efficiency gains do not compromise the behavioral integrity of deployed AI systems.

References

  1. T. Dettmers et al., LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale, Advances in Neural Information Processing Systems (NeurIPS), vol. 35, 2022.

  2. Microsoft Research, Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone, arXiv:2404.14219, 2024.

  3. A. Zhu et al., Robustness in Large Language Models: A Survey of Mitigation Strategies and Evaluation Metrics, arXiv:2401.12926, 2024.

  4. Meta AI, Llama 2: Open Foundation and Fine-Tuned Chat Models, arXiv:2307.09288, 2023.

  5. K. Zhu et al., PromptBench: Towards Evaluating the Robustness of Large Language Models on Adversarial Prompts, arXiv:2306.04528, 2023.

  6. J. Wang et al., On the Robustness of ChatGPT: An Adversarial and Out-of-distribution Perspective, arXiv:2302.12095, 2023.

  7. S. Rabanser, S. Günnemann, and Z. C. Lipton, Failing Loudly: An Empirical Study of Methods for Detecting Dataset Shift, Advances in Neural Information Processing Systems (NeurIPS), vol. 32, 2019.

  8. J. Wei et al., Finetuned Language Models Are Zero-Shot Learners, in Proc. ICLR, 2022, arXiv:2109.01652.

  9. T. Dettmers et al., QLoRA: Efficient Finetuning of Quantized LLMs, Advances in Neural Information Processing Systems (NeurIPS), vol. 36, 2023, arXiv:2305.14314.

  10. Y. Wang et al., Self-Instruct: Aligning Language Models with Self-Generated Instructions, in Proc. ACL, 2023, arXiv:2212.10560.

  11. B. Kim et al., LLM-based Edge Intelligence: A Comprehensive Survey on Opportunities and Challenges, arXiv:2405.00379, 2024.

  12. J. Chang et al., A Survey on Evaluation of Large Language Models, ACM Transactions on Intelligent Systems and Technology, 2024, arXiv:2307.03109.