
What is adaptive simulation?

James Wall
Co-founder, Adversarial Risk Group

Adaptive simulation is adversarial testing where each round learns from prior outcomes and changes technique, lure, or timing instead of repeating fixed scripts.

Key takeaways

Adaptive simulation is defined by the feedback loop. Each test informs the next. The system never repeats the exact same test against the same target.
It replaces scripted security-awareness platforms, which deliver fixed template libraries on schedules and degrade to noise within months.
The adaptation operates on multiple dimensions: technique, pretext, channel, timing, target, and difficulty.
It produces operational data with real signal: detection rate over time, by department, by technique, by pretext family, controlled for prior exposure.
For mid-market manufacturers, adaptive simulation is the only practical way to keep social engineering simulation useful past the second month of any program.

What makes a simulation "adaptive" rather than scripted?

Most "phishing simulation" today is scripted. A platform ships a library of templates, the customer schedules campaigns, and every employee receives one of N templates per quarter. The metric is click rate. Click rates drop over time because the workforce has learned the templates, not because it has learned to detect phishing. The signal degrades into a vanity metric.

Adaptive simulation operates differently. It has six properties:

Closed loop. Every test outcome (delivered, opened, clicked, credential submitted, escalated, blocked, missed) feeds back into the system as input.
Per-target memory. The system tracks what each named individual has been exposed to, when, and with what outcome. It does not retest the same person with the same pretext.
Technique rotation. Within a campaign, technique varies: vendor-invoice pretext, then HR-policy pretext, then IT-help-desk pretext, then OAuth-consent. The workforce sees a moving target.
Channel rotation. Email, then voice, then SMS, then physical pretext during on-site engagements. See What is vishing?, What is smishing?, and What is spear phishing?.
Difficulty calibration. Tests adjust difficulty based on prior outcomes for that person and their department. A team that surfaced the last three pretexts gets a harder pretext next; a team that fell for the last three gets a different pretext at the same difficulty until detection improves.
Detection-aware. The system knows what the defending stack caught and shapes the next round of tests to probe coverage gaps, not to repeat known-detected techniques.
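
The difficulty-calibration rule can be sketched in a few lines of Python. This is a minimal illustration, not ARG's implementation: the 1-to-5 difficulty scale, the three-test window, and the function name next_difficulty are assumptions made for the example. The source does not say what happens on mixed results, so the sketch holds difficulty steady in that case.

```python
from typing import Sequence


def next_difficulty(
    current: int,
    recent_detected: Sequence[bool],
    window: int = 3,      # assumed: "the last three" tests
    max_level: int = 5,   # assumed: hypothetical 1-5 difficulty scale
) -> int:
    """Return the difficulty for the next test.

    recent_detected holds one flag per past test, oldest first:
    True means the team surfaced the test, False means it did not.
    """
    recent = list(recent_detected)[-window:]
    if len(recent) == window and all(recent):
        # Surfaced every test in the window: raise difficulty, capped at the top level.
        return min(current + 1, max_level)
    # Missed some or all tests, or too little history: hold difficulty.
    # The pretext still changes, because per-target memory forbids repeats.
    return current
```

A team at level 2 that surfaced its last three tests moves to level 3; a team that fell for the last three stays at level 2 and receives a different pretext.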

Adaptive simulation is the operating principle behind continuous adversarial simulation and AI-personalized spear phishing. It also extends to deepfake vishing and voice cloning fraud where voice characteristics, calling time, and pretext vary per round.

The learning loop: how adaptive simulation iterates on what works

The loop has four stages.

1. Observe. The previous round's outcomes are normalized into a record per test: target, technique, lure family, pretext, channel, timing, delivery status, recipient action, detection event, escalation. Outcomes from continuous penetration testing and BAS feed the same record format where they intersect.

2. Score. Outcomes are scored on multiple axes: success rate for the technique, escalation rate from the workforce, time to detection by the defending team, false-positive rate for the corresponding control. Scores are tracked by department, by tenure, by role, and per individual.

3. Generate. The system generates the next round of tests by sampling from technique and pretext libraries weighted against history. Recently exposed pretexts are down-weighted for that target. Recently caught techniques are down-weighted for the workforce in general (to probe other coverage areas). Targets who have not been tested in the current cycle are prioritized.

4. Execute. The new round runs. Outcomes flow back into step 1. The loop never stops; it runs continuously between engagements.
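
The Generate step can be illustrated with a small weighting sketch. Everything here is an assumption made for the example: the function names, the 60-day cooldown, the linear recovery curve, and the 0.25 penalty for recently caught techniques. The source only states that recently exposed pretexts and recently caught techniques are down-weighted and that untested targets go first.

```python
import random
from datetime import date


def pretext_weights(pretexts, target_history, caught_techniques, today,
                    cooldown_days=60, caught_penalty=0.25):
    """Return a sampling weight per pretext for one target.

    pretexts: dict mapping pretext name -> technique name
    target_history: dict mapping pretext name -> date this target last saw it
    caught_techniques: set of techniques recently caught across the workforce
    """
    weights = {}
    for pretext, technique in pretexts.items():
        weight = 1.0
        last_seen = target_history.get(pretext)
        if last_seen is not None:
            age_days = max((today - last_seen).days, 0)
            # Assumed linear recovery: 0.0 on the day of exposure, 1.0 after the cooldown.
            weight *= min(age_days / cooldown_days, 1.0)
        if technique in caught_techniques:
            # Down-weight known-detected techniques so other coverage areas get probed.
            weight *= caught_penalty
        weights[pretext] = weight
    return weights


def pick_pretext(weights, rng):
    """Sample one pretext by weight; return None if nothing is eligible."""
    names = list(weights)
    values = [weights[name] for name in names]
    if sum(values) <= 0:
        return None  # every pretext was used today; the library needs new entries
    return rng.choices(names, weights=values, k=1)[0]


def order_targets(targets, tested_this_cycle):
    """Put targets not yet tested in the current cycle first (stable sort)."""
    return sorted(targets, key=lambda target: target in tested_this_cycle)
```

Passing a seeded random.Random instance as rng keeps a round reproducible for audit; a production system would likely use a richer scoring model than two multiplicative factors.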

The lure-generation portion uses language models to produce content tuned to the target's role, tenure, visible projects, communication style, and currently-public context (recent press, conference talks, podcast appearances). The LLM is one tool in the loop; the loop itself is the system. Replacing the LLM with a better one improves lure quality but does not change the adaptive structure.

Why scripted phishing tests stop producing useful data

Scripted phishing platforms produce useful data for roughly two months. After that, three failure modes set in:

Template learning. Employees recognize the templates and the senders. Click rates collapse. The platform reports "improved performance"; the workforce has learned the platform, not phishing.
Selection bias. Employees who fall for templates get more training; the rest get bored. The metric measures the trainable, not the testable.
Detection signal pollution. Email security tools learn to identify the platform's infrastructure and start auto-quarantining its messages. Tests "fail to deliver" and the program looks like it is succeeding when it is being filtered.

A real attacker does not have these problems. Their pretext varies. Their infrastructure rotates. Their timing matches operational pressure. Their lure is personalized to the specific target. The gap between scripted simulation and a real targeted attempt widens every month the scripted platform runs.

Adaptive simulation closes the gap by adopting the structural properties of a real attacker: per-target context, technique rotation, infrastructure rotation, and outcomes-driven iteration.

Examples of adaptive simulation outcomes

What changes when a program shifts from scripted to adaptive, based on ARG engagements:

Vendor-invoice pretext caught by AP for the first time in week 6. After three rounds of progressively more believable vendor-invoice lures, the AP team surfaces the fourth attempt and runs the verification callback correctly. The remediation was not training; it was the team learning what these attempts look like when the attacker tries multiple times.
Help desk vishing failure rate drops from 80 percent to 25 percent across two quarters. Each round uses a different pretext, calling time, and voice profile (What is vishing?). The metric improvement reflects real workflow change (callback verification, manager confirmation) rather than the team memorizing a single attempted approach.
Detection coverage drift surfaces a Defender for Office configuration regression. A scripted simulation would have continued to be auto-quarantined and reported zero impact. The adaptive simulation switched infrastructure, surfaced the regression, and routed a tuned detection request into the purple team backlog.
OAuth consent-phishing exposure mapped per role. Engineering managers fall for "Add to Project" OAuth grants at a higher rate than finance; finance falls for "Mailbox automation" grants more often. The signal is not "phishing is bad"; it is "different roles have different blind spots, and the controls need to reflect that". See What is consent phishing (OAuth phishing)?.
Wire-fraud-pretext failure rate drops to zero after a policy change. A revised wire-transfer approval workflow (two-person, out-of-band verification, ERP-pulled vendor contact) drives the success rate to zero across multiple pretext variants. The policy change worked; the adaptive simulation proved it under contact.

The pattern across these is that the program does not measure "click rate". It measures detection and response improvement against techniques the workforce has not seen before.

How to evaluate whether a simulation is truly adaptive

When a vendor or service claims adaptive simulation, four diagnostic questions separate marketing from substance:

Show me how next month's tests differ from this month's, for the same target. If the answer is "we cycle through our template library", the system is scripted, not adaptive.
What data does the system track per individual? If the answer is "click rate", it is scripted. Adaptive systems track per-target technique exposure, pretext family, channel history, timing, outcome, and detection event.
What happens when the email security stack starts auto-quarantining the platform's messages? If the answer is "we work with you to allowlist", the system is scripted (and is also corrupting the test). Adaptive systems rotate infrastructure.
How does the simulation respond when a specific department's detection improves? If the answer is "we keep sending the same campaigns", the program is not adapting. Adaptive systems raise difficulty, change channel, or probe a different surface.

A vendor that cannot answer these questions concretely is selling scripted simulation with adaptive marketing.

Best practices for measuring adaptive simulation effectiveness

For organizations operating an adaptive simulation program:

Track surface metrics, not theater metrics. Time to detection, escalation accuracy, repeat exposure, and per-department detection trend matter. Aggregate click rate does not.
Compare cohorts, not absolutes. "How did finance compare to engineering" and "how did this quarter compare to last" are useful framings. "What was our click rate" is not.
Tie outcomes to control changes. When a policy change rolls out (callback verification, conditional access tightening, OAuth governance), the program should measurably re-test against that change.
Distinguish detection improvement from exposure reduction. A test that does not land because the email gateway blocked it is a different outcome than a test that lands and is correctly surfaced. Both are good; they mean different things.
Surface findings about people carefully. Adaptive simulation will name names. Decide in advance how that data is held, who sees it, and how the response stays informational rather than punitive.
Loop findings into operational documents. Incident response playbooks, risk register entries, and insurance renewal evidence should be updated from simulation data, not treated as separate.
Hold the cadence. Adaptive programs compound over time. A program that runs intensively for a quarter and then pauses loses most of its value.
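
The distinction between exposure reduction and detection improvement, broken out by cohort, can be sketched as follows. The function name cohort_summary, the outcome strings, and the two ratio definitions are assumptions for illustration; a real program would also control for prior exposure and technique.

```python
from collections import defaultdict


def cohort_summary(results):
    """Summarize test outcomes per department.

    results: iterable of (department, outcome) pairs, where outcome is a
    string such as "blocked", "clicked", or "escalated".
    """
    counts = defaultdict(lambda: {"blocked": 0, "landed": 0, "escalated": 0})
    for department, outcome in results:
        row = counts[department]
        if outcome == "blocked":
            row["blocked"] += 1       # stopped by a control before reaching the person
        else:
            row["landed"] += 1        # reached the person
            if outcome == "escalated":
                row["escalated"] += 1  # the person surfaced it correctly

    summary = {}
    for department, row in counts.items():
        total = row["blocked"] + row["landed"]
        summary[department] = {
            # Share of tests stopped by controls: exposure reduction.
            "exposure_reduction": row["blocked"] / total,
            # Share of landed tests that people surfaced: detection.
            "detection_rate": (row["escalated"] / row["landed"]
                               if row["landed"] else None),
        }
    return summary
```

Keeping the two ratios separate means a gateway that blocks more tests cannot inflate the human detection figure, and comparing the per-department rows gives the cohort view rather than an aggregate click rate.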

Adaptive simulation FAQs

Is adaptive simulation the same as AI-driven phishing?

No. AI-driven phishing usually refers to using a language model to generate more convincing lure content. Adaptive simulation uses LLMs as one tool inside a feedback loop that decides what to test next based on what worked, what failed, who has been tested recently, and how detection responded. The lure is the smallest part of the system. See What is AI-personalized spear phishing?.

How does adaptive simulation avoid retesting the same people the same way?

Each test logs target, technique, lure family, pretext, channel, timing, and outcome. The next round of tests is generated from this history: the same person sees a different pretext at a different time through a different channel, calibrated to whether prior tests succeeded or were caught.

What data does adaptive simulation collect?

Date and time of each test, named target, technique mapped to MITRE ATT&CK, lure or pretext used, outcome (delivered, opened, clicked, credential submitted, escalated, blocked, missed), time to detection, escalation path, and the change in any of these across consecutive tests for the same target.
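
One way to picture that record is as a typed structure. The sketch below is illustrative only: the class and field names are hypothetical, the channel field is borrowed from the loop's record description, and the two helper functions are assumptions about how the history might be queried.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import List, Optional


class Outcome(Enum):
    DELIVERED = "delivered"
    OPENED = "opened"
    CLICKED = "clicked"
    CREDENTIAL_SUBMITTED = "credential submitted"
    ESCALATED = "escalated"
    BLOCKED = "blocked"
    MISSED = "missed"


@dataclass(frozen=True)
class SimulationRecord:
    sent_at: datetime
    target: str
    attack_technique: str  # MITRE ATT&CK ID, e.g. "T1566.002" (Spearphishing Link)
    pretext: str
    channel: str
    outcome: Outcome
    minutes_to_detection: Optional[float] = None  # None if never detected
    escalation_path: Optional[str] = None


def already_seen(records, target: str, pretext: str) -> bool:
    """True if this target has been exposed to this pretext before."""
    return any(r.target == target and r.pretext == pretext for r in records)


def detection_time_deltas(records, target: str) -> List[float]:
    """Change in time to detection between consecutive detected tests for one target."""
    detected = sorted(
        (r for r in records
         if r.target == target and r.minutes_to_detection is not None),
        key=lambda r: r.sent_at,
    )
    return [b.minutes_to_detection - a.minutes_to_detection
            for a, b in zip(detected, detected[1:])]
```

A negative delta means the second test was detected faster than the first, which is the per-target trend the record is meant to expose.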

Does adaptive simulation work for vishing as well as phishing?

Yes. Adaptive simulation applies to any channel where attempts can be measured: email, voice, SMS, in-person pretexting during physical engagements, and OAuth-based attacks. Vishing adaptation includes calling time, pretext variant, voice characteristics, and the specific person targeted. See What is deepfake vishing?.

How ARG runs adaptive simulation continuously between physical audits

Adaptive simulation is the operating model for ARG's continuous layer, running every week between on-site engagements.

The system is built and operated by James Wall on infrastructure ARG owns. Inputs include: a continuously refreshed OSINT profile of the organization and its named targets, the cumulative test history per individual, MITRE ATT&CK coverage maps, and the defending stack's detection telemetry where the client shares it.

Outputs are rounds of tests calibrated to each target: a vendor-invoice variant to an AP clerk who has not seen one in two months, a help-desk vishing attempt to an IT technician during a known busy week, a deepfake voice call to an executive ahead of a public event, an OAuth-grant prompt to an engineering manager working a new SaaS integration. Each test logs against the history; each outcome feeds the next round.

Findings land in the monthly operational packet and the quarterly review. The packet is written for the owner or CTO: who was tested, what happened, what the trend is, what the recommended action is. The quarterly review is the structural conversation: what changed in the workforce's detection capacity, which controls demonstrably improved, which controls show drift.

For founding clients, the adaptive simulation layer is part of the monthly retainer alongside continuous penetration testing and periodic on-site physical engagements. Pricing is locked for two to three years.

Apply as a founding client or see how the engagement works for the full delivery cycle.

Find what gets through.

ARG runs continuous AI-driven adversarial simulation and on-site physical audits for mid-market manufacturers. Two founding-client spots remain.


Author: James Wall. Updated 2026-05-18. Adversarial Risk Group.