Adversarial Risk Group

What is deepfake vishing?

Deepfake vishing is a voice phishing attack that uses an AI-synthesized clone of a specific person's voice to manipulate the target.

Key takeaways

  • Deepfake vishing combines vishing tradecraft (pretext, time pressure, channel asymmetry) with AI-synthesized voice that matches a specific person's expected sound.
  • Voice cloning has collapsed in cost. Minutes of clear public audio of an executive produce a usable clone; high-quality clones come from public earnings calls and podcasts.
  • Mid-market manufacturer executives are increasingly in scope because their public audio footprint is meaningful and the per-attempt yield from a successful deepfake is large.
  • Real-time deepfake detection tooling is improving but is not yet a primary defense. Workflow controls (callback verification, two-person approval, channel switching protocols) are.
  • ARG simulates deepfake vishing in client engagements where the executive sponsor authorizes it, including controlled use of public-source voice cloning against named targets.

How does deepfake vishing work, technically and operationally?

The attack combines three components: voice cloning, pretext construction, and live-call delivery.

1. Voice cloning. The attacker collects public audio of the person to be impersonated (an executive, finance leader, IT lead, or other named decision-maker) from earnings calls, podcasts, conference talks, and marketing video. The audio is processed through a voice-cloning service (commercial: ElevenLabs, Resemble AI; open-source: various). Modern systems produce a clone good enough for live conversation from minutes of training audio.

2. Pretext construction. Standard vishing tradecraft applies. The attacker picks a target whose workflow would respond to a request from the cloned voice. The pretext fits a current visible business context (a known acquisition discussion, a recent press release, a known travel window). The pretext is calibrated to produce time pressure that suppresses verification. See What is pretexting?.

3. Live-call delivery. The attacker calls the target and speaks through the cloned voice in real time. Modern voice-cloning systems can convert the attacker's own speech or typed text into the cloned voice with sub-second latency, enabling natural conversation. Some attacks use pre-recorded clips for specific phrases plus generated audio for the rest; some use full real-time generation.

The defense gap is in the recipient's cognitive override. The recipient hears their CEO's voice; the voice matches their memory exactly; the request is plausible given visible business context. The instinct to verify is fighting against the instinct to comply with a recognized authority. The instinct to comply usually wins unless the recipient has been trained in (and consistently practices) a verification habit that does not depend on voice recognition.

The data attackers need to clone an executive's voice

Clone quality depends on the quality and the quantity of the source audio. The thresholds for both have collapsed, and public audio now clears them easily.

Quality requirements. Modern cloning works on:

  • Conversational audio (podcast guest appearances, interview clips)
  • Presentation audio (conference talks, webinars, marketing videos)
  • Phone-quality audio (earnings call recordings, recorded calls posted online)

The audio does not need to be studio-quality. A reasonable consumer microphone in a typical environment produces sufficient quality.

Quantity requirements. Different tools have different thresholds:

  • Some commercial tools claim usable results from 30 seconds to 3 minutes of audio.
  • High-quality results typically require 5 to 30 minutes.
  • The longer the training audio, the better the clone handles edge cases (whispered phrases, unusual words, emotional inflection).

Public sources for typical executives. A CEO who has appeared on three industry podcasts in the last twelve months has likely produced 2 to 6 hours of public audio. Where the company holds earnings calls, each adds up to about an hour per quarter. Conference appearances add 30 to 60 minutes each. A typical public-facing executive at a mid-market manufacturer has more public audio available than a high-quality clone requires.

The implication: there is no realistic way for a publicly visible executive to avoid voice-cloning exposure entirely. The defense moves to workflow controls that do not depend on voice recognition.
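
The footprint arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not a measurement tool: the inventory entries and their durations are hypothetical, and the 30-minute threshold is simply the upper end of the high-quality range quoted earlier.

```python
from dataclasses import dataclass

# Upper end of the "high-quality" range quoted above, in minutes (assumption:
# using the conservative end of the 5-to-30-minute range).
HIGH_QUALITY_MINUTES = 30


@dataclass
class AudioSource:
    """One public recording of the executive (hypothetical inventory entry)."""
    label: str
    minutes: float


def total_minutes(sources: list[AudioSource]) -> float:
    """Sum the public audio available across all sources."""
    return sum(s.minutes for s in sources)


def exposure_ratio(sources: list[AudioSource],
                   threshold: float = HIGH_QUALITY_MINUTES) -> float:
    """How many times over the cloning threshold the footprint is."""
    return total_minutes(sources) / threshold


# Illustrative inventory: three podcast appearances and one conference talk.
footprint = [
    AudioSource("podcast A", 40),
    AudioSource("podcast B", 55),
    AudioSource("podcast C", 45),
    AudioSource("conference talk", 40),
]

print(total_minutes(footprint))   # 180
print(exposure_ratio(footprint))  # 6.0
```

Even this modest, made-up inventory exceeds the threshold several times over, which is why the defense cannot rest on limiting exposure.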

Why mid-market manufacturers are now in scope for deepfake attacks

Until recently, deepfake attacks were targeted at the highest-value executives at the largest organizations (Fortune 500 CEOs, public-figure CFOs). The economics have changed. Three reasons mid-market manufacturers are now in scope:

  1. Cloning cost has collapsed. A commercial voice cloning service costs cents to dollars per minute of synthesized speech. Cloning a specific voice takes hours of compute, not days. The marginal cost of one more target is small.
  2. The per-attempt yield is large. A successful deepfake CEO call to a mid-market manufacturer's CFO can route a six-figure wire. Attackers allocate effort accordingly.
  3. Mid-market executives are publicly visible. Industry conferences, podcasts, marketing video, and recorded investor calls all produce public audio, while trade publications and SEC disclosures supply the business context a pretext needs. Mid-market executives have less audio than Fortune 500 CEOs but enough for cloning.

The shift shows up in 2024 to 2026 industry reporting. Vendors such as Pindrop and Group-IB report steep year-over-year growth in deepfake-related fraud attempts, and the mid-market segment is part of that growth. The trend is not slowing.

For mid-market manufacturers, deepfake vishing has moved from "future concern" to "current threat" inside a single year. Programs designed against pre-deepfake threat models are increasingly out of date.

Examples of deepfake vishing incidents in 2024 to 2026

The public record is filling out. Representative examples and patterns:

  • Arup engineering firm (early 2024). Hong Kong finance employee duped by a multi-person video deepfake call into transferring $25 million. The call included multiple cloned colleagues. Demonstrated that video deepfakes were operationally viable for live calls.
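  • Ferrari CEO impersonation attempt (2024). As publicly reported, a cloned voice of the CEO was used in a call to a senior executive after a series of messaging-app messages. The executive asked a question only the real CEO could answer, and the caller hung up. Detection came from a verification habit, not from spotting audio artifacts.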
  • WPP CEO deepfake attempt (2024). Attempted deepfake video call against a senior WPP executive. Detection caught it because the request did not match normal communication patterns.
  • Numerous mid-market events (2024-2026). Steady cadence of deepfake-driven wire fraud attempts at manufacturers and other mid-market organizations. Most do not make public news; aggregate losses are material. Patterns include CEO/CFO voice cloning, help-desk pretext with cloned executive voice for MFA reset, and vendor-side cloning for invoice manipulation.
  • Consumer-side voice cloning fraud at scale (2023-present). Family-emergency scams ("Grandma, I'm in trouble, send money") using cloned voices of family members. Demonstrated the consumer-facing application; the same techniques scale to corporate.
  • Help-desk pretext combined with cloned voice. Vishing call to IT help desk requesting MFA reset for an executive; the voice on the line matches the executive's known voice; the technician complies. See What is MFA fatigue (push bombing)?.

The pattern: deepfake vishing is no longer rare or experimental. The technique is in active operational use; the defending workforce needs to assume any voice call could be cloned.

How to detect a deepfake voice in a live call

Real-time detection is hard. Five signals that suggest a call may be a deepfake:

  1. Subtle audio artifacts. Robotic inflection in unusual words, slightly wrong emphasis on uncommon phrases, audio quality that does not match the expected environment (an executive ostensibly on their cell phone but with no background noise, or with unnatural-sounding background noise).
  2. Mismatched conversational tempo. Deepfake systems have small latencies that can produce subtle gaps or unnatural pacing, especially in back-and-forth conversation.
  3. Resistance to topic change. Attackers using deepfake tools may steer the conversation back to the script if the recipient introduces an unscripted topic. Real conversations flow more naturally.
  4. Inability to answer specific questions. A challenge question ("what did we discuss at lunch yesterday?", "what's the project codename?") that only the real person would know surfaces the deepfake.
  5. Caller refuses video. If the call is voice-only and the recipient suggests video, a deepfake attacker often declines or makes excuses; video deepfakes are harder to produce convincingly in real time. A caller who agrees to video is not thereby verified, as the Arup case shows.

None of these are reliable on their own. A workforce trained to challenge with specific questions, request video, or end-and-callback produces better outcomes than a workforce trained to listen for audio artifacts.

The right framing: the recipient should not trust voice recognition as authentication, period. The verification habit is the defense, not the ability to detect cloned audio.

Best practices for verification workflows that defeat deepfake vishing

The workflow controls that defeat deepfake vishing are the same as those that defeat traditional vishing, with tighter discipline.

  1. End-and-callback to a directory-sourced number. For any high-impact request, the recipient ends the call and dials back through a number sourced from the corporate directory, not from caller ID or anything supplied during the call. The callback breaks the deepfake; the cloned voice cannot answer the recipient's outbound call.
  2. Two-person approval for high-loss actions. Wire transfers, vendor changes, password resets, and badge issuance require approval from a second person via a different channel. The two-person rule is structurally robust against deepfakes because the attacker must compromise two independent verification paths.
  3. Pre-authorized verification questions for high-risk roles. Executive-to-finance interactions include challenge-question protocols. The questions are specific enough to defeat public-source impersonation; the answers are not in any public source.
  4. Channel-switching protocols. A voice request to finance for a wire transfer routes to written confirmation through a known internal channel (Teams, Slack with verified accounts) before action. The channel switch breaks the voice deepfake.
  5. Removed social cost of refusing executive requests. The organization commits in writing that no employee is penalized for refusing to act on an unverified executive request, regardless of urgency. The commitment removes the social-engineering lever.
  6. Continuous adaptive simulation. Adaptive simulation that includes deepfake variants keeps the verification habit sharp. Annual training does not produce durable behavior; quarterly simulation does.
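  7. Insurance alignment. Confirm that the social engineering fraud (SEF) endorsement on the cyber or crime policy explicitly covers deepfake scenarios, and which verification procedures it requires; some older endorsements do not address them.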
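
The first, second, and fourth controls above can be expressed as a small policy check. The sketch below is one possible encoding, not a prescribed implementation: the action names, the `VoiceRequest` fields, and the mapping of controls to actions are illustrative assumptions drawn from the list. The point it makes is that recognizing the voice never counts toward approval.

```python
from dataclasses import dataclass
from enum import Enum


class Control(Enum):
    CALLBACK = "end-and-callback to a directory-sourced number"
    SECOND_APPROVER = "second approver via a different channel"
    WRITTEN_CONFIRMATION = "written confirmation on a verified internal channel"


# Action types the list above names as high-loss (identifiers are hypothetical).
HIGH_LOSS_ACTIONS = {"wire_transfer", "vendor_change", "password_reset", "badge_issuance"}


@dataclass
class VoiceRequest:
    action: str             # e.g. "wire_transfer"
    claimed_identity: str   # who the caller says they are
    voice_recognized: bool  # whether the recipient "recognizes" the voice


def required_controls(request: VoiceRequest) -> set[Control]:
    """Return the controls to complete before acting on a voice request.

    voice_recognized is deliberately ignored: a recognized voice is not
    treated as authentication.
    """
    if request.action not in HIGH_LOSS_ACTIONS:
        return set()
    controls = {Control.CALLBACK, Control.SECOND_APPROVER}
    if request.action == "wire_transfer":
        # Assumption: wire requests also route to written confirmation.
        controls.add(Control.WRITTEN_CONFIRMATION)
    return controls


def may_proceed(request: VoiceRequest, completed: set[Control]) -> bool:
    """True only when every required control has been completed."""
    return required_controls(request) <= completed


req = VoiceRequest("wire_transfer", "CEO", voice_recognized=True)
print(may_proceed(req, {Control.CALLBACK}))  # False
print(may_proceed(req, set(Control)))        # True
```

A perfectly convincing voice changes nothing in this check; only completed controls do.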

Deepfake vishing FAQs

How much audio is needed to clone a voice?

Modern voice cloning produces usable results from a few minutes of clear audio, and some commercial tools claim less; high-quality clones typically need 5 to 30 minutes. Public sources for executives (earnings calls, podcasts, conference talks, marketing videos) routinely produce hours of training material. The bar is low and dropping.

Can deepfake detection software stop these attacks in real time?

Some detection tools exist (Pindrop, Reality Defender, others) and produce useful signal on call recordings. Real-time detection during a live call is harder; the tool has to integrate with the phone infrastructure and produce a signal fast enough for the recipient to act. Detection tools complement workflow controls; they do not replace them.

Does cyber insurance cover deepfake vishing losses?

Sometimes, depending on policy. Most deepfake-driven losses route through the social engineering fraud (SEF) endorsement on cyber or crime policies. The endorsement typically requires specific verification procedures; losses where the verification was not performed are routinely denied. Confirm coverage and required procedures with the broker before a deepfake incident occurs. See What is social engineering fraud (SEF) coverage?.

What is the difference between deepfake vishing and a regular impersonation call?

A regular impersonation call uses a human attacker pretending to be someone else. A deepfake vishing call uses an AI-synthesized voice that matches the target's expectation of how a specific person sounds. The deepfake variant is materially more effective because the voice resemblance overrides skepticism that a generic impersonation would trigger.

How ARG simulates deepfake vishing in client engagements

ARG includes deepfake vishing as a standard component of continuous adversarial simulation where the engagement scope authorizes it. The simulation is operated by James Wall on infrastructure ARG controls.

For each client, after executive sponsor authorization, ARG:

  • Inventories the public voice footprint. What public audio exists for the named in-scope executives, where, how much. The inventory feeds the digital footprint analysis output.
  • Constructs cloned voices from public sources. Voice models built from public audio of authorized targets. The models are used only in the simulation and are destroyed at the end of the engagement.
  • Runs simulated deepfake calls against named workflow targets. Finance, AP, IT help desk, executive assistants. The simulation tests whether the workforce's verification workflow holds.
  • Logs outcomes. Time of call, target, pretext, voice used, recipient response, where the workflow caught it (or where it would have failed). The outcomes feed the monthly operational packet.
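
The fields named in the last item could be captured in a record like the one below. This is a hypothetical sketch: the field names, example values, and JSON-lines output are assumptions for illustration, not ARG's actual log format.

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from typing import Optional


@dataclass
class CallOutcome:
    """One simulated call, using the fields listed above (names are hypothetical)."""
    call_time: str            # ISO 8601 timestamp, UTC
    target_role: str          # e.g. "AP clerk"
    pretext: str
    voice_used: str           # which authorized voice model was used
    recipient_response: str
    caught_at: Optional[str]  # control that stopped the call, or None if none did

    @property
    def workflow_held(self) -> bool:
        """The workflow held if some control caught the simulated call."""
        return self.caught_at is not None


def to_json_line(outcome: CallOutcome) -> str:
    """Serialize one outcome as a single JSON line for an append-only log."""
    return json.dumps(asdict(outcome), sort_keys=True)


outcome = CallOutcome(
    call_time=datetime(2026, 1, 15, 14, 30, tzinfo=timezone.utc).isoformat(),
    target_role="AP clerk",
    pretext="urgent vendor payment",
    voice_used="CEO (authorized clone)",
    recipient_response="ended call and dialed the directory number",
    caught_at="end-and-callback",
)
print(outcome.workflow_held)  # True
print(to_json_line(outcome))
```

Recording the control that caught the call, and not just pass or fail, is what lets findings point at a specific workflow step.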

Findings flow into the broader engagement: verification workflow tightening, executive footprint reduction recommendations, training updates, SEF endorsement compliance documentation, and detection tool evaluation if appropriate.

The engagement explicitly does not use deepfake techniques outside the scoped simulation. Voice clones built for the engagement are not retained, are not used outside the simulation, and are subject to the same rules-of-engagement discipline as the rest of the work.

For founding clients who authorize deepfake simulation, the work is part of the monthly retainer. Clients who do not authorize deepfake-specific testing can still benefit from the broader simulation; deepfake is one channel among many.

Apply as a founding client or see how the engagement works for the full delivery cycle.

Author: James Wall | Updated 2026-05-18 | Adversarial Risk Group