DOI : 10.5281/zenodo.20626608
- Open Access

- Authors : Praveen Kumar, Parul Shrivastava
- Paper ID : IJERTV15IS060211
- Volume & Issue : Volume 15, Issue 06 , June – 2026
- Published (First Online): 10-06-2026
- ISSN (Online) : 2278-0181
- Publisher Name : IJERT
- License:
This work is licensed under a Creative Commons Attribution 4.0 International License
ProtoFuzz-AI: An AI-Assisted Fuzzing Framework for Proprietary Industrial IoT Protocols
Praveen Kumar[0009-0003-8842-6240], Parul Shrivastava[0009-0007-0234-6688]
Department of Electronics and Communication Engineering (IoT Specialization), Oriental College of Technology, Bhopal, India
Abstract – Proprietary protocols carried over Industrial Internet of Things (IIoT) deployments are among the hardest network surfaces to test, because their message formats and session rules are undocumented and rarely accompanied by public vulnerability data. Mutation-based fuzzers spend most of their budget on inputs that are discarded at the first parsing stage, whereas grammar-based fuzzers achieve high input validity only after a human writes a specification for each target. This paper presents ProtoFuzz-AI, an AI-assisted fuzzing framework that infers enough structure from captured traffic to fuzz an unknown industrial protocol without a hand-written grammar and without training any model locally. The framework combines lightweight statistical structure analysis (entropy and field-frequency profiling, candidate length-field and session-token detection) with an external Large Language Model (LLM) that is queried, through its public Application Programming Interface (API), to interpret the recovered fields and to propose protocol-aware mutations and boundary values. The resulting hypotheses drive a coverage-guided fuzzing engine, while a monitoring module records timeouts, crashes, and abnormal responses. We implement the framework in Python using Scapy, Wireshark/Tshark, Boofuzz, and SQLite, and evaluate it on a Modbus/TCP testbed augmented with a synthetic proprietary wrapper, hosted on OpenPLC v3 inside Docker. Across five independent four-hour campaigns on commodity hardware (Intel Core Ultra 7, 32 GB RAM, RTX 5060 GPU), ProtoFuzz-AI reaches a mean statement coverage of 52% and a 76% valid-input rate, and surfaces six unique faults, compared with 44%/61%/4 for an AFLNet-style baseline and 37%/58%/3 for Boofuzz. The improvements are moderate but consistent, and we discuss in detail the threats to validity that follow from the synthetic wrapper, the soft-PLC target, and the single-protocol evaluation.
Index Terms – Industrial Internet of Things, fuzz testing, proprietary protocols, protocol reverse engineering, large language models, Modbus, SCADA security, coverage-guided fuzzing.
I. Introduction
Industrial control networks were built decades ago for closed, trusted environments, and many of the devices on them are expected to run for fifteen to twenty years. As these networks are connected to corporate information-technology systems and cloud analytics under the banner of the Industrial Internet of Things (IIoT), they inherit a modern attack surface while keeping their legacy assumptions. A single malformed frame that a desktop application would silently reject can, on a Programmable Logic Controller (PLC), cause a watchdog reset, freeze a control loop, or place an actuator in an unexpected state. Incidents such as the Stuxnet worm [17], the Industroyer toolkit [18], and the TRITON attack on a safety instrumented system [19] have shown that defects in industrial communication software have physical consequences.
Most defensive testing effort has been directed at documented protocols such as Modbus, DNP3, and OPC-UA, for which message grammars and public Common Vulnerabilities and Exposures (CVE) records exist. A large share of real traffic, however, is carried by vendor-specific protocols layered on top of, or alongside, the standard ones. These proprietary formats are deliberately opaque, change quietly between firmware releases, and are seldom covered by any public test suite. The result is an uneven security posture: standard protocols are repeatedly examined, while proprietary ones accumulate latent defects that may only be discovered after an incident.
Fuzz testing remains one of the most practical ways to find such defects [1], [2], but two well-known styles each have a weakness in this setting. Purely random or mutation-based fuzzing produces inputs that the target rejects at the earliest parsing stage, so the campaign budget is spent before reaching deeper protocol logic. Grammar-based fuzzers such as Boofuzz [6] and Peach [7] solve the validity problem, but only after an analyst manually encodes the message format, which is exactly the reverse-engineering cost that the fuzzer was supposed to save when the protocol is proprietary. Recent work has begun to use machine learning to reduce this cost: Learn&Fuzz [9] learns input structure for file formats, NEUZZ [10] approximates coverage feedback with a neural model, and ChatAFL [11] drives mutations with a large language model. These systems, however, either assume compile-time instrumentation, target stateless file parsing, or are tuned to text protocols, and most require non-trivial training or compute.
This paper takes a deliberately lightweight position. We ask how far a single researcher, using only commodity hardware and freely available tooling, can get toward fuzzing an unknown industrial protocol without writing a grammar and without training any model. Our answer is ProtoFuzz-AI, a framework that recovers coarse structure from captured traffic using simple statistics, asks an external LLM (through its public API) to interpret that structure and to suggest mutations, and feeds the suggestions into a coverage-guided fuzzing engine. The LLM is used purely as a stateless reasoning service; no fine-tuning, reinforcement learning, or local Graphics Processing Unit (GPU) is required.
The contributions of this paper are as follows.
- A five-module fuzzing framework for proprietary IIoT protocols that needs neither a hand-written grammar nor a locally trained model, and that is implementable by a single researcher with Python, Scapy, Wireshark/Tshark, Boofuzz, Docker, and SQLite.
- A structure-then-reason pipeline that pairs inexpensive statistical analysis (entropy, field frequency, candidate length-field and session-token detection) with an external LLM that interprets fields and proposes protocol-aware mutations, keeping the cost of inference low and the design reproducible.
- An empirical evaluation on a Modbus/TCP testbed with a synthetic proprietary wrapper hosted on OpenPLC, comparing the framework against random mutation, Boofuzz, and an AFLNet-style baseline over five independent four-hour campaigns.
- A frank threats-to-validity discussion that states the limits imposed by the synthetic wrapper, the soft-PLC target, the single-protocol scope, and the use of coverage as a proxy metric.
The rest of the paper is organised as follows. Section II reviews related work. Section III states the problem and assumptions. Section IV describes the architecture, and Section V the workflow and metrics. Section VI covers implementation, Section VII the experimental setup, and Section VIII the results. Section IX discusses the findings, Section X the threats to validity, Section XI future work, and Section XII concludes.
II. Related Work
Coverage-guided and network fuzzing. American Fuzzy Lop (AFL) [3] popularised greybox, coverage-guided fuzzing, and AFLFast [4] improved its scheduling by modelling path exploration as a Markov chain. AFLNet [5] extends greybox fuzzing to stateful network protocols by inferring a state machine from server response codes; this works well for documented protocols but tends to merge distinct error conditions that share a response code, which is common in industrial protocols where a wrong checksum and a wrong session token may produce the same reply. Boofuzz [6] and Peach [7] are generation-based fuzzers that require a user-supplied block grammar per protocol. PULSAR [8] performs stateful black-box fuzzing of proprietary protocols by first building a Markov model of observed messages, and is closest in spirit to the present work, although it does not use a language model to interpret recovered structure.
Machine-learning-assisted fuzzing. Learn&Fuzz [9] trains a recurrent model to generate structured file inputs, and NEUZZ [10] learns a smooth surrogate of the program's branch behaviour to guide mutation; both, however, assume access to instrumentation or a large training corpus. ChatAFL [11] uses a large language model as a mutation engine for network protocols and demonstrates that LLM guidance improves state and code coverage, but its public form is oriented toward text protocols rather than binary industrial frames with strict length and checksum constraints. ProtoFuzz-AI differs from these in that the LLM is used only as a stateless interpreter of pre-computed statistics and as a source of mutation hints, so no training, fine-tuning, or local GPU is needed.
Protocol reverse engineering. Discoverer [12] infers message formats from network traces, and Netzob [13] adds semantic clustering and partial field labelling. These tools produce static descriptions that must then be connected to a fuzzer by hand. Our framework instead treats inference and fuzzing as one loop: the structural summary is produced automatically, interpreted by the LLM, and immediately consumed by the mutation engine.
Industrial protocol and ICS security. The Modbus protocol has well-known security weaknesses, and attack taxonomies for it have been catalogued in the literature [15]. National guidance such as NIST SP 800-82 [16] sets out defensive practice for industrial control systems. OpenPLC [14] provides an open, IEC 61131-3-compliant soft-PLC that is widely used as a research testbed because real vendor firmware is rarely available for analysis. We use OpenPLC for the same reason, and we are explicit in Section X about the gap between a soft-PLC and production hardware.
Positioning. Prior systems each address part of the proprietary-IIoT problem (validity, statefulness, coverage feedback, or structure recovery), but not all of them within a design that a single researcher can run on a laptop. ProtoFuzz-AI's contribution is the integration of cheap statistical structure recovery, external-LLM interpretation, and coverage-guided mutation into one reproducible pipeline.
III. Problem Statement and Assumptions
Goal. Given a target device D that speaks an unknown proprietary protocol P, and a finite set T of passively captured, well-formed P packets, we wish to generate test inputs that (i) are accepted past the first parsing stage often enough to reach deep logic, and (ii) are likely to trigger faults in D. We do this without a specification of P, without source code for D, and without training a model on a private dataset.
Assumptions. We assume the analyst can observe a modest amount of legitimate traffic (on the order of a few thousand packets) on a span port or host interface, can send packets to D in a controlled laboratory network, and can observe D's responses, connectivity, and (on a soft-PLC such as OpenPLC) a coverage signal from the instrumented server. We do not assume the ability to read internal program state, and we treat the LLM as an external service reachable over the network through its API.
Non-goals. We do not attempt to fully reconstruct P, to prove safety, or to fuzz live production equipment. The framework is intended as a laboratory tool for proactive defect discovery on protocols that current grammar-based tooling cannot reach without manual effort.
IV. Framework Architecture
ProtoFuzz-AI is organised as five modules connected in a loop, shown in Fig. 1. Traffic flows from capture through analysis, interpretation, and fuzzing to monitoring, and feedback from monitoring returns to the fuzzing engine to bias the next round of test generation. Table I summarises the modules and their responsibilities.
[Fig. 1 diagram. Boxes: M1 Traffic Collection; M2 Structure Analysis; M3 AI Understanding; M4 Fuzzing Engine; M5 Monitoring & Detection; Target device (DUT); External LLM API. Labelled edge: feedback.]
Fig. 1. High-level architecture of ProtoFuzz-AI. The five modules form a closed loop; the only external dependency is a hosted LLM accessed through its public API.
TABLE I
Framework Components and Their Responsibilities
Module | Name | Responsibilities
M1 | Traffic Collection | Packet capture; session extraction; packet normalisation
M2 | Protocol Structure Analysis | Entropy analysis; field-frequency analysis; candidate length-field detection; session-token detection; repeated-pattern discovery
M3 | AI-Assisted Protocol Understanding | Field interpretation; protocol explanation; mutation-suggestion generation; boundary-value generation (external LLM, no training)
M4 | Fuzzing Engine | Protocol-aware mutations; state-aware fuzzing; test scheduling; input prioritisation
M5 | Monitoring and Detection | Timeout detection; crash detection; abnormal-response detection; logging; reporting
A. Module 1: Traffic Collection
Traffic is captured with Wireshark/Tshark or directly with Scapy from a span port or host interface and stored as a packet-capture (PCAP) file. The collector strips lower-layer headers, groups packets into sessions by the standard four-tuple, and normalises volatile fields such as transaction identifiers so that recurring structure becomes visible. Each normalised payload is stored, together with its session identifier, direction, and inter-arrival time, in a local SQLite database. In our setup roughly two thousand packets across a few dozen sessions were enough to expose stable structure in the synthetic wrapper.
B. Module 2: Protocol Structure Analysis
This module computes inexpensive statistics over the captured payloads and needs no machine learning. For each byte offset it computes the Shannon entropy across the corpus; offsets with near-zero entropy are candidate constant or magic fields, and offsets with high entropy are candidate tokens, counters, or checksums. Field-frequency analysis records, per offset, the distribution of byte values. Candidate length fields are detected by correlating each two-byte offset (in both byte orders) with the observed packet length; an offset whose value tracks the length is flagged as a probable length field. Candidate session tokens are detected as high-entropy byte ranges that are constant within a session but differ across sessions. Repeated n-gram patterns are mined to suggest message-type prefixes. The module emits a compact, human-readable structural summary: a per-offset table of entropy, the candidate constant, length, token, and counter regions, and a small set of representative packets in hexadecimal.
C. Module 3: AI-Assisted Protocol Understanding
The structural summary is sent to an external LLM through its public API (for example, the Claude or OpenAI API). The prompt asks the model to interpret each candidate region (constant, length, type, sequence number, token, payload, checksum), to give a short natural-language explanation of the likely message format, and to propose concrete mutation strategies and boundary values for each field. The model returns its answer in a fixed JSON schema, which the framework parses into a structured protocol hypothesis: a list of fields with inferred roles, recommended mutation operators per field, and candidate boundary values (for example, zero, maximum, off-by-one, and oversized lengths). Crucially, no model is trained or fine-tuned; the LLM is a stateless interpreter, and each query is independent. Because the model may be wrong, the hypothesis is treated as advisory: it biases the fuzzer but does not constrain it, and any field can still be mutated blindly if the guided strategy stops producing new coverage.
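As a rough illustration of the boundary values mentioned for Module 3 (zero, maximum, off-by-one, and oversized lengths), the sketch below enumerates candidates for an unsigned length field. The function name `length_boundaries`, its parameters, and the choice to model "oversized" as twice the observed length are assumptions made for illustration; the paper does not specify how the framework or the LLM derives these values.

```python
def length_boundaries(width_bytes, observed_length):
    """Candidate boundary values for an unsigned length field.

    Both arguments are hypothetical inputs: the inferred field width in
    bytes and the true length seen in one captured packet.
    """
    max_value = (1 << (8 * width_bytes)) - 1
    candidates = [
        0,                    # zero
        1,
        observed_length - 1,  # off-by-one below the real length
        observed_length,      # the legitimate value, kept as a control
        observed_length + 1,  # off-by-one above the real length
        observed_length * 2,  # oversized relative to the real payload (assumed factor)
        max_value - 1,
        max_value,            # maximum representable value
    ]
    # Keep values that fit in the field, drop duplicates, preserve order.
    seen, result = set(), []
    for value in candidates:
        if 0 <= value <= max_value and value not in seen:
            seen.add(value)
            result.append(value)
    return result
```

For a two-byte field and an observed length of 16, this yields eight candidates ranging from 0 to 65535; values that do not fit in the field are silently dropped.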
D. Module 4: Fuzzing Engine
The engine turns the protocol hypothesis into concrete test cases. It maintains a seed pool of captured packets ranked by how much new coverage each has produced, and applies protocol-aware mutations driven by the hypothesis: corrupting or overflowing the inferred length field, substituting the inferred type field, replaying or mangling the inferred session token, and injecting boundary values into payload fields. Generic operators (bit-flip, byte-flip, arithmetic increment/decrement, block insertion and removal, and splicing) remain available for fields the hypothesis does not explain. State-aware fuzzing is achieved by replaying the captured handshake before mutating later messages, so that post-authentication message types can be reached. A simple scheduler allocates more of the budget to operators and seeds with a higher recent coverage yield, giving a coverage-guided behaviour without requiring source-level instrumentation of the target.
[Fig. 2 diagram. Steps: PCAP capture; Normalise / segment sessions; Per-offset entropy & field frequency; Candidate length / token / counter detection; Structural summary; LLM query (API); JSON protocol hypothesis; Mutation strategies & boundary values.]
E. Module 5: Monitoring and Detection
After each test case the monitor checks four signals: a timeout (no response within a configurable window, default two seconds), a crash (loss of connectivity or a watchdog-triggered restart of the OpenPLC runtime), an abnormal response (a reply whose length or status code falls outside the envelope observed during legitimate traffic), and, where available, a coverage delta read from the instrumented Modbus server. Every test case, its outcome, and any anomaly flag are written to SQLite. Suspected crashes are confirmed by inspecting the OpenPLC console log and are minimised by bisecting the triggering packet sequence, so that the final report lists deduplicated, reproducible faults.
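The first three monitor signals described for Module 5 (timeout, crash, abnormal response) can be sketched as a small classifier. This is only a sketch under stated assumptions: the `Envelope` structure, the `classify` function, and the idea that the status code is the first byte of the reply are all hypothetical, since the paper does not describe the wrapper's response layout. The two-second default timeout is taken from the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Envelope:
    """Response envelope learned from legitimate traffic (hypothetical structure)."""
    min_length: int
    max_length: int
    status_codes: frozenset


def classify(response: Optional[bytes], elapsed_s: float, target_alive: bool,
             envelope: Envelope, timeout_s: float = 2.0) -> str:
    """Return 'crash', 'timeout', 'abnormal', or 'ok' for one test case."""
    if not target_alive:
        return "crash"      # connectivity lost or runtime restarted
    if response is None or elapsed_s > timeout_s:
        return "timeout"    # nothing arrived inside the window
    if not envelope.min_length <= len(response) <= envelope.max_length:
        return "abnormal"   # reply length outside the observed envelope
    # Assumption: the status code is the first byte of the reply.
    status = response[0] if response else None
    if status not in envelope.status_codes:
        return "abnormal"   # status code never seen in legitimate traffic
    return "ok"
```

Checking for a crash before a timeout is a design choice of this sketch: a dead target also produces no reply, and the stronger label is usually the more useful one to log.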
V. Methodology and Metrics
A. Workflow
The operational workflow has seven steps; the last three repeat until the time budget is exhausted.
1) Capture protocol traffic. Record legitimate traffic between the engineering client and the target into a PCAP file.
2) Extract packet fields. Normalise and segment the capture, and load payloads with metadata into SQLite.
3) Identify candidate structures. Run entropy, field-frequency, length-field, and token analysis to produce the structural summary.
4) Generate protocol-aware mutations. Query the LLM for a protocol hypothesis, then expand it into a prioritised set of mutation strategies and boundary values.
5) Execute the fuzzing campaign. Replay the handshake, issue mutated test cases under the scheduler, and respect the configured rate limit.
6) Collect feedback. Record responses, timeouts, coverage deltas, and anomaly flags for each test case.
7) Analyse anomalies. Confirm and minimise suspected crashes, deduplicate faults, and feed coverage and anomaly signals back into scheduling.
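Step 3 of the workflow (identifying candidate structures) can be sketched with standard-library code. The sketch below computes per-offset Shannon entropy and looks for candidate length fields. It makes two simplifying assumptions that are not from the paper: function names are hypothetical, and the length-field test requires the field value to differ from the packet length by an exact constant, which is stricter than the correlation the paper describes.

```python
import math
from collections import Counter


def per_offset_entropy(packets):
    """Shannon entropy in bits of the byte at each offset across the corpus."""
    longest = max(len(p) for p in packets)
    entropies = []
    for offset in range(longest):
        values = [p[offset] for p in packets if len(p) > offset]
        total = len(values)
        counts = Counter(values)
        entropies.append(sum(c / total * math.log2(total / c) for c in counts.values()))
    return entropies


def length_field_candidates(packets, width=2):
    """Offsets whose integer value equals len(packet) minus a constant.

    Both byte orders are tried. Returns (offset, byte_order, constant) tuples.
    """
    if len({len(p) for p in packets}) < 2:
        return []  # lengths never vary, so nothing can be inferred
    shortest = min(len(p) for p in packets)
    hits = []
    for offset in range(shortest - width + 1):
        for order in ("big", "little"):
            deltas = {len(p) - int.from_bytes(p[offset:offset + width], order)
                      for p in packets}
            if len(deltas) == 1:
                hits.append((offset, order, deltas.pop()))
    return hits
```

On a toy corpus with a two-byte magic, a big-endian body length, and bodies of varying size, the constant offsets report zero entropy and the length field is flagged with its header overhead as the constant.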
B. Metrics
We deliberately use simple, transparent metrics. Let $N_{\mathrm{valid}}$ be the number of test cases accepted past the first parsing stage and $N_{\mathrm{sent}}$ the number sent. The valid-input rate is
$$\mathrm{VIR} = \frac{N_{\mathrm{valid}}}{N_{\mathrm{sent}}}. \tag{1}$$
Given the coverage $C_a$ achieved by ProtoFuzz-AI and $C_b$ by a baseline, the relative coverage improvement is
$$\Delta C = \frac{C_a - C_b}{C_b} \times 100\%. \tag{2}$$
Finally, with $U$ unique confirmed faults over a campaign of $H$ hours, the fault-discovery rate is
$$\mathrm{FDR} = \frac{U}{H}. \tag{3}$$
These three quantities (validity, coverage, and fault rate) together describe how well a fuzzer reaches deep logic and how productively it finds defects, and they avoid any dependence on advanced modelling assumptions.
Fig. 2. Protocol analysis workflow from raw capture to the structured protocol hypothesis consumed by the fuzzing engine.
VI. Implementation
The framework is implemented in roughly 3,500 lines of Python 3.11. Packet handling and crafting use Scapy; offline capture and field extraction use Wireshark/Tshark. The generic mutation operators and the session-replay harness are built on top of Boofuzz primitives, while the scheduler, coverage bookkeeping, and prioritisation follow ideas from AFLNet-style stateful fuzzing without reusing its instrumentation. All run data (test cases, responses, coverage deltas, and anomaly flags) are stored in a single SQLite file, which makes a campaign self-contained and easy to re-analyse.
The structure-analysis module is plain NumPy and standard-library code; entropy and frequency tables for two thousand packets are computed in under a second.
TABLE II
Experimental Configuration
Item | Value
CPU | Intel Core Ultra 7
Memory | 32 GB
GPU | Nvidia RTX 5060
Operating system | Ubuntu 22.04 LTS
Language | Python 3.11
Soft-PLC runtime | OpenPLC v3 (in Docker)
Base protocol | Modbus/TCP (pymodbus server)
Proprietary layer | Synthetic wrapper protocol
Captured traffic | 2,000 packets
Seed packets | 120
Campaign duration | 4 hours
Independent runs | 5
LLM access | External API (no local training)
Prompt and schema design. The AI module is a thin client around a hosted LLM API. The prompt has three parts: a short instruction that frames the task as protocol field interpretation, the structural summary (per-offset entropy, the candidate constant, length, token, and counter regions, and three to five representative packets in hexadecimal with their lengths), and an explicit output contract. The model is required to return a JSON object whose top-level key holds a list of fields, each with an offset range, an inferred role drawn from a closed vocabulary (magic, type, length, sequence, token, payload, checksum, unknown), a list of recommended mutation operators, and a list of candidate boundary values. Constraining the role vocabulary and the schema keeps responses machine-parseable and makes the validator simple: malformed or out-of-vocabulary output triggers a single retry with the schema restated, and a second failure falls back to a purely statistical hypothesis so that the campaign never blocks on the external service. Because each query is independent and stateless, the same prompt can be replayed for auditing, which partially offsets the non-determinism of the model.
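The validator and its retry-then-fallback behaviour can be sketched as follows. The closed role vocabulary is taken from the text; everything else is an assumption for illustration, including the JSON key names (`fields`, `start`, `end`, `role`, `mutations`, `boundaries`), the wording of the retry prompt, and the `ask_llm` and `statistical_fallback` callables, which stand in for the real API client and the statistics-only hypothesis builder.

```python
import json

# Closed role vocabulary listed in the paper.
ROLES = {"magic", "type", "length", "sequence", "token", "payload", "checksum", "unknown"}


def validate_hypothesis(raw_text):
    """Parse the model's reply; return the list of fields or raise ValueError."""
    try:
        doc = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    fields = doc.get("fields") if isinstance(doc, dict) else None
    if not isinstance(fields, list) or not fields:
        raise ValueError("missing or empty 'fields' list")
    for field in fields:
        if not isinstance(field, dict):
            raise ValueError("each field must be an object")
        start, end = field.get("start"), field.get("end")
        if not (isinstance(start, int) and isinstance(end, int) and 0 <= start < end):
            raise ValueError("bad offset range")
        role = field.get("role")
        if not isinstance(role, str) or role not in ROLES:
            raise ValueError(f"role outside vocabulary: {role!r}")
        if not isinstance(field.get("mutations"), list):
            raise ValueError("'mutations' must be a list")
        if not isinstance(field.get("boundaries"), list):
            raise ValueError("'boundaries' must be a list")
    return fields


def get_hypothesis(ask_llm, prompt, statistical_fallback):
    """Try once, retry once with the schema restated, then fall back."""
    restated = prompt + "\n\nReturn ONLY a JSON object matching the schema above."
    for attempt in (prompt, restated):
        try:
            return validate_hypothesis(ask_llm(attempt)), "llm"
        except ValueError:
            continue
    return statistical_fallback(), "statistical"
```

Passing the API client in as a plain callable keeps the sketch independent of any particular provider and makes the retry path easy to exercise with canned replies.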
Scheduling heuristic. The fuzzing engine keeps, for each (seed, operator) pair, a running estimate of recent coverage yield and recent anomaly yield. At each step it samples a seed with probability proportional to its yield, then samples an operator favouring those the protocol hypothesis marked as relevant to the seed's dominant message type, with a fixed exploration floor so that no operator is ever fully starved. Seeds that produce new coverage or a new anomaly are promoted; seeds that have produced nothing for a long window are demoted. This gives the coverage-guided behaviour of greybox fuzzers using only externally observable feedback, without instrumenting the target at the binary level.
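A minimal sketch of such a scheduler is shown below. It is one possible reading of the heuristic, not the framework's code: the class and method names are hypothetical, coverage and anomaly yield are merged into a single reward, the running estimate is an exponential moving average, and the decay (0.9), floor (0.05), and hint bonus (2x) are arbitrary placeholder values.

```python
import random


class YieldScheduler:
    """Samples seeds by recent yield and operators with an exploration floor."""

    def __init__(self, seeds, operators, floor=0.05, decay=0.9, rng=None):
        self.seeds = list(seeds)
        self.operators = list(operators)
        self.floor = floor
        self.decay = decay
        self.rng = rng or random.Random()
        # Optimistic start so that every (seed, operator) pair gets tried.
        self.yields = {(s, o): 1.0 for s in self.seeds for o in self.operators}

    def operator_weights(self, seed, hinted_ops=()):
        """Probabilities over operators; each keeps at least floor / n."""
        raw = [self.yields[(seed, o)] * (2.0 if o in hinted_ops else 1.0)
               for o in self.operators]
        total = sum(raw)
        n = len(self.operators)
        if total <= 0.0:
            return [1.0 / n] * n
        # Mix the yield-proportional distribution with a uniform one.
        return [(1.0 - self.floor) * r / total + self.floor / n for r in raw]

    def pick(self, hinted_ops=()):
        seed_weights = [sum(self.yields[(s, o)] for o in self.operators) + 1e-9
                        for s in self.seeds]
        seed = self.rng.choices(self.seeds, weights=seed_weights)[0]
        weights = self.operator_weights(seed, hinted_ops)
        return seed, self.rng.choices(self.operators, weights=weights)[0]

    def update(self, seed, op, reward):
        """reward is 1.0 for new coverage or a new anomaly, otherwise 0.0."""
        key = (seed, op)
        self.yields[key] = self.decay * self.yields[key] + (1.0 - self.decay) * reward
```

Mixing with a uniform distribution is what implements the floor: even an operator whose estimated yield has decayed to nearly zero is still chosen with probability at least `floor / n`.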
A typical campaign issues only a few dozen LLM queries in total (during bootstrap and whenever the structural summary changes materially), so the monetary and latency cost of the external service is negligible relative to the four-hour campaign. The entire stack, including the OpenPLC runtime and the Modbus server, is packaged with Docker so that the testbed can be rebuilt from a single compose file. No GPU is used at any point, and the framework runs comfortably within 16 GB of RAM.
VII. Experimental Setup
Hardware and software. All experiments run on a single workstation: an Intel Core Ultra 7, 32 GB of RAM, and Ubuntu 22.04. The workstation's Nvidia RTX 5060 GPU, listed in Table II, is not used by the framework. The software stack is Python 3.11, OpenPLC v3, Boofuzz, Wireshark/Tshark, Docker, Scapy, and SQLite. Table II lists the configuration.
Target and proprietary wrapper. The target is a Modbus/TCP server (pymodbus) running on OpenPLC v3, executing a ladder-logic program that models a small two-tank level-control process. Around the standard Modbus frame we add a synthetic proprietary wrapper that is unknown to every fuzzer and must be inferred from captured traffic. The wrapper has a four-byte magic prefix, a one-byte message-type field, a two-byte length field using a non-standard byte order, an eight-byte session token established during a short handshake, and a two-byte checksum computed with a non-standard polynomial. Four message types exercise different code paths, two of which are reachable only after a valid handshake.
[Fig. 3 diagram. Boxes: Orchestrator (Intel Core Ultra 7, 32 GB, Ubuntu 22.04); OpenPLC v3 + pymodbus server with synthetic wrapper; External LLM API. Link labels: Docker bridge network; HTTPS.]
Fig. 3. Experimental testbed. The orchestrator, OpenPLC runtime, and Modbus server run on one workstation inside Docker; only the LLM is external.
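To make the wrapper layout concrete, the sketch below builds a frame with the five fields listed above and shows a length-overflow mutation that keeps the checksum valid. Most specifics are assumptions, because the paper does not publish them: the magic bytes, the field order, little-endian as the "non-standard" byte order, the length field covering only the body, the checksum covering header plus body, and the CRC polynomial `0xA2EB` are all placeholders.

```python
import struct

MAGIC = b"PFZ1"   # hypothetical 4-byte magic prefix
POLY = 0xA2EB     # placeholder polynomial, not the wrapper's real one


def crc16(data: bytes, poly: int = POLY, init: int = 0x0000) -> int:
    """Bitwise MSB-first CRC-16 with a configurable polynomial."""
    crc = init
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ poly) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def build_frame(msg_type: int, token: bytes, body: bytes, length_override=None) -> bytes:
    """magic(4) | type(1) | length(2) | token(8) | body | checksum(2)."""
    if len(token) != 8:
        raise ValueError("session token must be 8 bytes")
    length = len(body) if length_override is None else length_override
    # '<' selects little-endian, assumed here to stand in for the
    # non-standard byte order of the length field.
    header = MAGIC + struct.pack("<BH", msg_type, length & 0xFFFF) + token
    return header + body + struct.pack("<H", crc16(header + body))
```

Calling `build_frame(2, token, body, length_override=0xFFFF)` yields a frame whose length field lies about the body size while the checksum is still recomputed over the mutated header, which is the kind of well-formed but adversarial input the fuzzing engine aims for.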
Baselines. We compare four configurations: (B1) uniform random mutation with no structural knowledge and no feedback; (B2) Boofuzz with a hand-authored block definition for the standard Modbus frame but none for the proprietary wrapper; (B3) an AFLNet-style stateful fuzzer with a custom adapter that infers state from response codes; and (B4) ProtoFuzz-AI, the proposed framework.
Coverage measurement. Because the Modbus server is Python, we measure statement coverage on the server using the standard coverage instrumentation, reported as a percentage of reachable statements. We treat this as a proxy for how deeply each fuzzer penetrates the target's logic and return to its limitations in Section X.
Protocol. Each configuration is run for four hours in five independent trials with fresh random seeds. The target is reset to a known-good state before each trial, and the captured traffic used for analysis is identical across trials. We report means with standard deviations and assess pairwise differences in coverage with a Mann-Whitney U test at α = 0.05.
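For samples as small as five runs per fuzzer, a Mann-Whitney U test can be evaluated exactly by enumerating every way of splitting the pooled values into two groups. The sketch below does this in pure Python. The paper does not state which variant of the test was used, so the exact two-sided form here is an assumption, and the function names are hypothetical.

```python
from itertools import combinations


def u_statistic(a, b):
    """U for sample a: count of pairs (x, y) with x > y; ties count one half."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)


def mann_whitney_exact(a, b):
    """Return (U, two-sided exact p-value) by full enumeration of relabellings."""
    pooled = list(a) + list(b)
    n_a = len(a)
    mean_u = n_a * len(b) / 2.0
    u_observed = u_statistic(a, b)
    observed_gap = abs(u_observed - mean_u)
    extreme = total = 0
    for idx in combinations(range(len(pooled)), n_a):
        chosen = set(idx)
        group_a = [pooled[i] for i in chosen]
        group_b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        # Count relabellings at least as far from the mean as the observation.
        if abs(u_statistic(group_a, group_b) - mean_u) >= observed_gap - 1e-12:
            extreme += 1
    return u_observed, extreme / total


# Made-up placeholder values, NOT the paper's per-run measurements.
runs_a = [52.1, 50.3, 54.0, 51.2, 53.4]
runs_b = [44.5, 42.0, 46.1, 43.3, 45.9]
u, p = mann_whitney_exact(runs_a, runs_b)
print(f"U = {u}, p = {p:.4f}")
```

With five values per group there are only 252 relabellings, so the enumeration is instantaneous.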
VIII. Results and Analysis
Coverage. Table III reports end-of-campaign statement coverage. ProtoFuzz-AI reaches a mean of 52%, against 44% for the AFLNet-style baseline, 37% for Boofuzz, and 22% for random mutation. Using Eq. (2), this is a relative improvement of about 18% over AFLNet and 41% over Boofuzz. The pairwise differences between ProtoFuzz-AI and each baseline are significant at p < 0.05. The gap is moderate rather than dramatic, which is what we expect on a single, modest testbed: the proprietary wrapper has a limited number of reachable paths, and once the handshake and the four message types are reached, all reasonable fuzzers plateau.
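The relative improvements quoted above follow directly from Eq. (2) and the reported means. The short check below reproduces them; the function name and dictionary are illustrative, while the coverage values are the means reported in the paper.

```python
def relative_improvement(c_a, c_b):
    """Eq. (2): relative coverage improvement of c_a over c_b, in percent."""
    return (c_a - c_b) / c_b * 100.0


# Mean statement coverage (%) as reported in the paper.
mean_coverage = {
    "Random mutation": 22,
    "Boofuzz": 37,
    "AFLNet-style": 44,
    "ProtoFuzz-AI": 52,
}

for baseline in ("AFLNet-style", "Boofuzz", "Random mutation"):
    gain = relative_improvement(mean_coverage["ProtoFuzz-AI"], mean_coverage[baseline])
    print(f"ProtoFuzz-AI vs. {baseline}: +{gain:.0f}%")
```

The three printed gains round to 18%, 41%, and 136%, matching the text and the last row of Table III.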
TABLE III
Coverage Results (mean ± std, 5 four-hour runs)
Fuzzer | Statement coverage (%) | ΔC vs. random
Random mutation | 22 ± 2.1 | -
Boofuzz | 37 ± 2.6 | +68%
AFLNet-style | 44 ± 2.3 | +100%
ProtoFuzz-AI | 52 ± 2.0 | +136%
Validity. Table IV shows the valid-input rate from Eq. (1) alongside throughput and time-to-first-fault. ProtoFuzz-AI has 76% of its test cases accepted past the first parsing stage, compared with 61% for AFLNet-style, 58% for Boofuzz, and only 10% for random mutation. Boofuzz's rate reflects the fact that it understands the standard Modbus layer but not the wrapper, whose length field and checksum it cannot satisfy; once those constraints apply, most of its mutated frames are dropped. ProtoFuzz-AI's advantage comes directly from the inferred length, type, token, and checksum fields, which let it keep frames well-formed while still mutating payloads. Random mutation rarely produces a frame the parser will accept, which explains both its low validity and its low coverage.
TABLE IV
Performance Comparison
Fuzzer | Valid rate (%) | Throughput (cases/s) | Time to first fault (min)
Random mutation | 10 | 410 |
Boofuzz | 58 | 260 | 96
AFLNet-style | 61 | 230 | 61
ProtoFuzz-AI | 76 | 215 | 34
TABLE V
Fault Discovery Results (unique confirmed faults, 5 runs)
Fuzzer | Unique faults | FDR (faults/hour)
Random mutation | 1 | 0.25
Boofuzz | 3 | 0.75
AFLNet-style | 4 | 1.00
ProtoFuzz-AI | 6 | 1.50
Faults. Table V lists the number of unique confirmed faults found by each fuzzer over the five runs, together with the fault-discovery rate. ProtoFuzz-AI surfaces six distinct faults, AFLNet-style four, Boofuzz three, and random mutation one. The faults are of moderate severity and are consistent with the wrapper's design: a length-field integer-handling bug, an out-of-bounds read in the message-type dispatch, a checksum-validation error path, a denial-of-service triggered by an oversized payload, a session-token handling fault reachable only after the handshake, and a state-desynchronisation issue when message order is violated. The last two are reached only by ProtoFuzz-AI in our trials, which we attribute to its handshake replay and token-aware mutation rather than to any deep search advantage. Applying Eq. (3), ProtoFuzz-AI's fault-discovery rate is 1.5 faults per hour, against 1.0 for AFLNet-style and 0.75 for Boofuzz.
Coverage over time. Fig. 4 sketches how coverage grows over the four-hour campaign. All fuzzers rise quickly in the first thirty minutes as the easily reachable, pre-handshake message types are covered. ProtoFuzz-AI then separates from the baselines between roughly thirty and ninety minutes, which is when the protocol hypothesis lets it satisfy the length and checksum constraints and reach the post-handshake types; after about two hours every configuration flattens, and the residual gap reflects the two post-handshake message types that only the guided fuzzer exercises reliably.
[Fig. 4 plot. x-axis: Time (minutes), 0 to 240; y-axis: Statement coverage (%); one curve per fuzzer.]
Fig. 4. Statement coverage over the four-hour campaign for each fuzzer (mean of five runs).
Per-message-type coverage. Decomposing coverage by wrapper message type clarifies where the differences arise. The two pre-handshake types are reached by every fuzzer, since they require only a well-formed magic prefix and length field, and even random mutation occasionally satisfies these. The two post-handshake types tell a different story: reaching them requires replaying a valid session token and computing the non-standard checksum over the mutated body. Boofuzz, lacking a wrapper definition, essentially never exercises them; the AFLNet-style baseline reaches one of the two intermittently by chance when its mutated token happens to be accepted, but covers it shallowly; and only ProtoFuzz-AI exercises both reliably, because its hypothesis identifies the token region for replay and the checksum region for recomputation. This concentration of the advantage in the post-handshake region, rather than a uniform lift across all types, is consistent with the validity story in Table IV and is the reason the headline coverage gap is moderate rather than large: the pre-handshake region is easy for everyone, and only a minority of the code lies behind the handshake.
Cost of the LLM. Across a full campaign the framework issues on the order of forty LLM queries, almost all during the first half hour while the structural summary is still changing. At commodity API pricing this is a few cents per campaign and a few seconds of cumulative latency, which is immaterial against four hours of fuzzing. This confirms the design intent: the LLM is consulted sparingly as an interpreter, not invoked per test case.
-
Discussion
Why guidance helps here. The main lever in this testbed is input validity. A fuzzer that cannot satisfy the wrappers length eld and non-standard checksum spends its budget being rejected, and the post-handshake message types stay
unreached. ProtoFuzz-AIs structural analysis recovers the length and token regions cheaply, and the LLMs interpretation turns those regions into actionable mutation strategies, so frames stay well-formed enough to be accepted while still being adversarial in their payloads. The valid-input rate in Table IV is therefore the clearest single explanation for the coverage and fault differences.
Comparison with AFLNet-style fuzzing. The AFLNet- style baseline benets from response-code state inference and is the strongest baseline, but it conates distinct error conditions that share a reply codewrong checksum and wrong token both look the same to itso it over-explores the pre-handshake region. ProtoFuzz-AIs explicit token replay sidesteps this and reaches the post-handshake types sooner, which is visible in its shorter time-to-rst-fault.
Comparison with Boofuzz. Boofuzz performs well on the standard Modbus frame it was given, but the proprietary wrapper reintroduces exactly the manual-grammar cost that a proprietary protocol is supposed to impose; without a wrapper denition its validity collapses against the wrapped frames. The frameworks value proposition is that this denition is recovered automatically rather than written by hand.
Reproducibility and cost. A practical attraction of the design is that the entire pipeline (capture, analysis, LLM interpretation, fuzzing, and monitoring) fits on a single commodity workstation and needs no GPU, and the only state a campaign produces is one SQLite file and one PCAP. The external LLM is the one source of non-determinism, but because it is queried sparingly and constrained to a fixed schema, its influence is confined to which mutation strategies are tried first rather than to the fuzzing mechanics themselves; a campaign that received a poor hypothesis still degrades gracefully toward statistical fuzzing rather than failing. For a Master's-level researcher this matters: the framework can be stood up, run, and re-analysed without specialised infrastructure, industrial partnerships, or access to vendor firmware.
Honest reading of the numbers. The improvements are real and consistent across five runs but moderate in size, and they come from a single synthetic protocol on a soft-PLC. We are careful not to read them as evidence that the approach will transfer unchanged to real vendor firmware; the next section states why.
-
Threats to Validity
We discuss the principal threats to validity so that the reported numbers are read with appropriate caution.
Synthetic proprietary wrapper. The proprietary layer used in our evaluation is synthetic. It was designed to resemble common industrial patterns (a magic prefix, a non-standard-endian length field, a session token, and a custom checksum), but it is not a real vendor protocol such as S7Comm or UMAS. Real protocols may include multi-message transactional dependencies, time-varying authentication, or vendor-specific encoding that our wrapper does not reproduce, and the framework's effectiveness on them is unverified.
Soft-PLC target. OpenPLC is a community-developed soft-PLC and is not equivalent to closed-source firmware running on real hardware. Its behaviour under malformed input (how it crashes, whether a watchdog restarts it, how it recovers) may differ materially from production controllers, so the fault classes and discovery rates we observe may not carry over.
Single-protocol, single-base scope. We evaluate one synthetic wrapper over a single base protocol (Modbus/TCP). Protocols with substantially different structure, such as DNP3 with secure authentication or OPC-UA binary transport, are not represented, so generalisation across protocol families remains an open question.
Small-scale testbed and short campaigns. The testbed is a single workstation, and each campaign lasts four hours. Faults that only appear over multi-day campaigns, or behaviour that only emerges under load or in distributed deployments, are out of scope by construction. The five-run design gives reasonable confidence in the means but only coarse variance estimates.
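As a rough indication of how coarse, consider standard 95% intervals for n = 5 runs with sample mean x̄ and sample standard deviation s. The normality of per-run results is an assumption made here for illustration; it is not established by the evaluation.

```latex
\[
\bar{x} \pm t_{0.975,\,4}\,\frac{s}{\sqrt{5}}
  = \bar{x} \pm 2.776\,\frac{s}{\sqrt{5}}
  \approx \bar{x} \pm 1.24\,s ,
\qquad
\sqrt{\frac{4}{\chi^2_{0.975,\,4}}}\; s \;\le\; \sigma \;\le\; \sqrt{\frac{4}{\chi^2_{0.025,\,4}}}\; s
\;\;\Longrightarrow\;\;
0.60\,s \;\lesssim\; \sigma \;\lesssim\; 2.87\,s .
\]
```

Under that assumption the mean is located to within about 1.24 sample standard deviations on either side, whereas the true standard deviation is bracketed only to within a factor of nearly five between its lower and upper bounds.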
Coverage is a proxy. Statement coverage on the Modbus server is a proxy for how deeply a fuzzer reaches into the target, not a direct measure of exploitable defects; a fuzzer could score well on coverage while finding few real bugs, and the ranking of fuzzers could change under a different coverage notion. The valid-input rate is likewise a proxy for structural understanding and does not distinguish meaningful inputs from merely well-formed ones.
Dependence on an external service. The framework relies on a hosted LLM whose outputs are not deterministic and may change between model versions, which affects reproducibility. We mitigate this by constraining the model to a fixed JSON schema and by treating its hypotheses as advisory rather than binding, but a different model could yield somewhat different mutation strategies and therefore different coverage.
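A minimal sketch of what schema-constrained, advisory handling could look like is given below. The schema itself (a `fields` list with `offset`, `length`, and `role` keys), the role names, and the fallback record are hypothetical; the framework's actual schema is not reproduced in this section.

```python
import json

# Hypothetical fixed schema: key names and allowed roles are illustrative assumptions.
ALLOWED_ROLES = {"magic", "length", "token", "checksum", "payload"}


def parse_hypothesis(raw: str, frame_len: int) -> dict:
    """Accept an LLM reply only if it matches the fixed schema; otherwise fall back."""
    try:
        data = json.loads(raw)
        fields = data["fields"]
        if not isinstance(fields, list) or not fields:
            raise ValueError("fields must be a non-empty list")
        checked = []
        for f in fields:
            offset, length, role = f["offset"], f["length"], f["role"]
            if type(offset) is not int or type(length) is not int:
                raise ValueError("offset and length must be integers")
            if offset < 0 or length <= 0 or offset + length > frame_len:
                raise ValueError("field lies outside the frame")
            if role not in ALLOWED_ROLES:
                raise ValueError("unknown role")
            checked.append({"offset": offset, "length": length, "role": role})
        return {"source": "llm", "fields": checked}
    except (ValueError, KeyError, TypeError):
        # Advisory only: a malformed or implausible reply never stops the campaign;
        # the fuzzer continues with statistics-driven mutation and no field hints.
        return {"source": "statistical", "fields": []}
```

For example, `parse_hypothesis('{"fields": [{"offset": 2, "length": 2, "role": "length"}]}', 16)` is accepted, while free-form text, an out-of-range offset, or an unlisted role all produce the statistical fallback. Rejecting unknown keys and roles outright, rather than trying to repair them, keeps the model's influence limited to well-typed hints.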
Results may not generalise. Taken together, these threats mean the results should be read as evidence that cheap structure recovery plus external-LLM interpretation is a promising direction on this class of target, not as a claim of broad superiority. Validation on real firmware and additional protocols is required before stronger conclusions are warranted.
-
Future Work
Several directions follow naturally from the threats above. First, evaluation on real vendor firmware under controlled disclosure would test whether the testbed observations transfer. Second, benchmarking across at least three protocol families would characterise where structural inference helps and where it does not. Third, replacing the single LLM query with a short interactive dialogue, in which the framework reports back which hypotheses produced coverage and asks the model to refine them, may improve later-campaign exploration at modest extra cost. Fourth, caching and local distillation of the LLM's structural hypotheses would reduce the dependence on an external service and improve reproducibility. Finally, integrating a lightweight safety filter, so that the tool can be applied cautiously to higher-fidelity simulated equipment without issuing physically dangerous commands, would broaden its applicability.
-
Conclusion
We presented ProtoFuzz-AI, an AI-assisted fuzzing framework for proprietary Industrial IoT protocols that needs neither a hand-written grammar nor a locally trained model. The framework recovers coarse protocol structure from captured traffic with simple statistics, asks an external LLM to interpret that structure and propose protocol-aware mutations, and feeds the result into a coverage-guided fuzzing engine with crash and anomaly monitoring. On a Modbus/TCP testbed with a synthetic proprietary wrapper hosted on OpenPLC, and using only commodity hardware, the framework reached a mean statement coverage of 52%, a 76% valid-input rate, and six unique faults: moderate but consistent gains over an AFLNet-style baseline (44%/61%/4) and Boofuzz (37%/58%/3). The contribution is a reproducible, single-researcher-implementable pipeline that lowers the manual cost of fuzzing undocumented industrial protocols. The principal limitation is one of scope: the evaluation rests on a synthetic wrapper, a soft-PLC, and a single base protocol, and confirming the approach on real firmware and additional protocol families is the most important next step.
Acknowledgment
The author would like to express sincere gratitude to his guide, Prof. Neenansha Jain (ORCID: 0009-0005-5081-7339), Department of Electronics & Communication Engineering, Oriental College of Technology, Bhopal, for her invaluable guidance, constant encouragement, and insightful feedback throughout the course of this work. Her expertise and patient support were instrumental in shaping both the direction and the execution of this research.
The author is equally grateful to his co-guide, Sonu Kumar (ORCID: 0009-0003-4773-4968), Department of Civil Engineering, National Institute of Technology (NIT) Jalandhar, for the valuable suggestions, technical insights, and continued support that greatly contributed to the quality of this paper.
The author also thanks the Department of Electronics & Communication Engineering, Oriental College of Technology, for providing the resources and environment that made this work possible.
References
-
B. P. Miller, L. Fredriksen, and B. So, "An empirical study of the reliability of UNIX utilities," Commun. ACM, vol. 33, no. 12, pp. 32–44, 1990.
-
V. J. M. Manès et al., "The art, science, and engineering of fuzzing: A survey," IEEE Trans. Softw. Eng., vol. 47, no. 11, pp. 2312–2331, 2021.
-
M. Zalewski, "American Fuzzy Lop (AFL)," 2014. [Online]. Available: https://lcamtuf.coredump.cx/afl/
-
M. Böhme, V.-T. Pham, and A. Roychoudhury, "Coverage-based greybox fuzzing as Markov chain," in Proc. ACM SIGSAC Conf. Computer and Communications Security (CCS), 2016, pp. 1032–1043.
-
V.-T. Pham, M. Böhme, and A. Roychoudhury, "AFLNet: A greybox fuzzer for network protocols," in Proc. IEEE Int. Conf. Software Testing, Validation and Verification (ICST), 2020, pp. 460–465.
-
J. Pereyda, "Boofuzz: Network protocol fuzzing for humans," 2017. [Online]. Available: https://github.com/jtpereyda/boofuzz
-
M. Eddington, "Peach fuzzing platform," Peach Tech, Tech. Rep., 2011.
-
H. Gascon, C. Wressnegger, F. Yamaguchi, D. Arp, and K. Rieck, "PULSAR: Stateful black-box fuzzing of proprietary network protocols," in Proc. Int. Conf. Security and Privacy in Communication Systems (SecureComm), 2015, pp. 330–347.
-
P. Godefroid, H. Peleg, and R. Singh, "Learn&Fuzz: Machine learning for input fuzzing," in Proc. IEEE/ACM Int. Conf. Automated Software Engineering (ASE), 2017, pp. 50–59.
-
D. She, K. Pei, D. Epstein, J. Yang, B. Ray, and S. Jana, "NEUZZ: Efficient fuzzing with neural program smoothing," in Proc. IEEE Symp. Security and Privacy (S&P), 2019, pp. 803–817.
-
R. Meng, M. Mirchev, M. Böhme, and A. Roychoudhury, "Large language model guided protocol fuzzing," in Proc. Network and Distributed System Security Symp. (NDSS), 2024.
-
W. Cui, J. Kannan, and H. J. Wang, "Discoverer: Automatic protocol reverse engineering from network traces," in Proc. 16th USENIX Security Symp., 2007, pp. 199–212.
-
G. Bossert, F. Guihéry, and G. Hiet, "Towards automated protocol reverse engineering using semantic information," in Proc. ACM Asia Conf. Computer and Communications Security (ASIA CCS), 2014, pp. 51–62.
-
T. Alves, R. Buratto, F. M. de Souza, and T. V. Rodrigues, "OpenPLC: An open source alternative to automation," in Proc. IEEE Global Humanitarian Technology Conf. (GHTC), 2014, pp. 585–589.
-
P. Huitsing, R. Chandia, M. Papa, and S. Shenoi, "Attack taxonomies for the Modbus protocols," Int. J. Critical Infrastructure Protection, vol. 1, pp. 37–44, 2008.
-
K. Stouffer, V. Pillitteri, S. Lightman, M. Abrams, and A. Hahn, "Guide to industrial control systems (ICS) security," NIST Special Publication 800-82 Rev. 2, 2015.
-
N. Falliere, L. O. Murchu, and E. Chien, "W32.Stuxnet dossier," Symantec Security Response, Tech. Rep., 2011.
-
A. Cherepanov, "WIN32/INDUSTROYER: A new threat for industrial control systems," ESET, Tech. Rep., 2017.
-
A. Di Pinto, Y. Dragoni, and A. Carcano, "TRITON: The first ICS cyber attack on safety instrument systems," in Proc. Black Hat USA, 2018.
