In early 2024, a finance worker in Hong Kong transferred $25 million after a video call with what appeared to be his company's CFO and several colleagues. Every person on that call was a deepfake. This incident, one of the largest deepfake-enabled frauds on record, illustrates how far the technology has advanced and why detecting it matters.
How Deepfake Video Works
Deepfake video technology uses deep learning models to manipulate faces and bodies in video footage. The most common techniques include:
- Face swapping: Replacing one person's face with another's. Modern face-swap models can work from just a few reference photos, re-rendering the inserted face with the expressions, lighting, and head angle of the person in the original video, even in real time (a crude classical version is sketched after this list).
- Lip syncing: Modifying a person's mouth movements to match different audio. This is particularly dangerous because it can make someone appear to say something they never said.
- Full-body puppeteering: Newer models can transfer body movements, gestures, and poses from one person to another, creating entirely fabricated video of someone performing actions they never took.
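To make the face-swap idea concrete, here is a minimal classical sketch in Python with OpenCV, no deep learning involved: it finds the largest face in each of two images and Poisson-blends one onto the other. The file names are hypothetical, and the result is far cruder than a learned swap, which also transfers expression, pose, and lighting.

```python
# Crude classical face swap: Haar-cascade face detection plus Poisson
# (seamless) cloning. Illustrates the idea only; learned deepfake models
# go much further, re-rendering the face to match expression and lighting.
import cv2
import numpy as np

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
)

def largest_face(img):
    """Return (x, y, w, h) of the largest detected face."""
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        raise ValueError("no face found")
    return max(faces, key=lambda f: f[2] * f[3])

# Hypothetical file names; substitute your own images.
src = cv2.imread("source_face.jpg")   # face to insert
dst = cv2.imread("target_scene.jpg")  # image to modify

sx, sy, sw, sh = largest_face(src)
dx, dy, dw, dh = largest_face(dst)

# Resize the source face patch to the target face's size, then blend it
# in around the target face's center (assumes the face isn't at the edge).
patch = cv2.resize(src[sy:sy + sh, sx:sx + sw], (dw, dh))
mask = 255 * np.ones(patch.shape[:2], dtype=np.uint8)
center = (dx + dw // 2, dy + dh // 2)
swapped = cv2.seamlessClone(patch, dst, mask, center, cv2.NORMAL_CLONE)
cv2.imwrite("swapped.jpg", swapped)
```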
What makes modern deepfakes especially concerning is their accessibility. Creating convincing deepfake video no longer requires a PhD or expensive hardware. Consumer-grade tools can produce passable results on a standard laptop, and real-time deepfake technology means manipulated video can be used in live video calls.
How Deepfake Audio Works
Voice cloning technology has advanced even faster than video deepfakes. Modern voice synthesis can clone a person's voice from as little as three seconds of reference audio. These systems capture the timbre and pitch of a voice, as well as speaking patterns, emotional inflections, and accent details.
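As a rough illustration of the "voice fingerprint" intuition behind cloning, the sketch below summarizes a clip's timbre as an average MFCC vector and compares two clips by cosine similarity. Real systems learn speaker embeddings with deep networks; this toy version, with hypothetical file names, only shows the feature-matching idea.

```python
# Toy speaker fingerprint: mean MFCCs as a crude summary of timbre.
import librosa
import numpy as np

def voice_print(path: str) -> np.ndarray:
    """Average MFCC vector of an audio file."""
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)  # shape (20, frames)
    return mfcc.mean(axis=1)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

ref = voice_print("known_speaker.wav")  # hypothetical reference clip
test = voice_print("suspect_clip.wav")  # hypothetical clip to compare
print(f"timbre similarity: {cosine(ref, test):.3f}")
```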
Deepfake audio has been used in CEO fraud schemes, where criminals impersonate executives to authorize wire transfers; in family-emergency scams that use cloned voices of relatives; and to fabricate audio clips attributed to politicians and public figures.
Notable Deepfake Incidents
The real-world impact of deepfakes is growing rapidly:
- Political manipulation: Deepfake robocalls impersonating President Biden were used to discourage voters during the 2024 New Hampshire primary. Deepfake videos of political candidates have surfaced in elections worldwide.
- Financial fraud: Beyond the Hong Kong case, deepfake audio impersonating executives has been used in multiple wire fraud schemes, with individual losses reaching millions of dollars.
- Celebrity exploitation: Unauthorized deepfake content featuring celebrities has proliferated, raising serious consent and privacy concerns.
- Misinformation: Fabricated video of world leaders and public figures making inflammatory statements has been used to undermine trust in journalism and spread disinformation.
How to Spot Deepfake Video
While deepfakes are improving, they still leave detectable traces:
- Face boundary artifacts: Look at where the face meets the hair, ears, and neck. Deepfakes often show subtle blending artifacts: a slight shimmer, color mismatch, or blurriness at these boundaries.
- Inconsistent lighting: The lighting on a swapped face may not perfectly match the rest of the scene. Watch for shadows that fall differently on the face versus the body.
- Unnatural eye movement: Early deepfakes had notable issues with blinking, and while this has improved, eye movement can still appear slightly off, particularly in how the eyes track moving objects and in the rate of blinking.
- Lip sync mismatches: Watch the speaker's mouth closely. Does it perfectly sync with the audio? Are the mouth shapes correct for the sounds being made? Slight delays or incorrect mouth positions are common tells.
- Temporal flickering: Play the video at reduced speed. Deepfakes sometimes show frame-to-frame inconsistencies: slight jumps in facial features, momentary distortions, or brief "glitches" that are invisible at normal speed. A simple frame-differencing check is sketched after this list.
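For the temporal-flickering tell, a simple automated first pass is frame differencing on the face region: crop the face in each frame, align the crops to a fixed size, and measure how much they change frame to frame. The sketch below does this with OpenCV; the file name and the spike threshold are assumptions, and hits are leads to inspect manually, not proof of manipulation.

```python
# Rough flicker check: mean absolute change between consecutive face crops.
import cv2
import numpy as np

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
)

def face_flicker_scores(video_path: str) -> list[float]:
    cap = cv2.VideoCapture(video_path)
    scores, prev = [], None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
        if len(faces) == 0:
            prev = None  # lost the face; don't compare across the gap
            continue
        x, y, w, h = max(faces, key=lambda f: f[2] * f[3])
        crop = cv2.resize(gray[y:y + h, x:x + w], (128, 128))
        if prev is not None:
            scores.append(float(np.abs(crop.astype(int) - prev.astype(int)).mean()))
        prev = crop
    cap.release()
    return scores

scores = face_flicker_scores("suspect_video.mp4")  # hypothetical file
if scores:
    median = float(np.median(scores))
    spikes = [i for i, s in enumerate(scores) if s > 3 * median]  # arbitrary threshold
    print(f"{len(spikes)} suspicious jumps out of {len(scores)} transitions")
```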
How to Spot Deepfake Audio
Deepfake audio detection relies on more subtle cues:
- Breathing patterns: Real speech includes natural breaths, pauses, and filler sounds. Synthesized audio may lack these or include them in unnatural patterns.
- Emotional flatness: Cloned voices often struggle with natural emotional variation. The voice may sound correct in tone but miss the subtle emotional dynamics of real speech.
- Background consistency: Listen for unnatural transitions between speech and background noise. Real recordings have consistent ambient sound; synthesized audio may have abrupt changes in the noise floor (a toy automated check is sketched after this list).
- Pronunciation of unusual words: Voice cloning models may struggle with uncommon names, technical terms, or words outside their training data.
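The pause and noise-floor cues above lend themselves to a quick automated check: find the non-speech gaps with an energy threshold and compare the residual noise level across them. The sketch below uses librosa; the file name and the 30 dB split threshold are assumptions, and a wide spread is only a reason to listen more closely.

```python
# Toy pause/noise-floor check: RMS level of the gaps between speech.
import librosa
import numpy as np

y, sr = librosa.load("suspect_audio.wav", sr=16000)  # hypothetical file

# Voiced intervals; everything between them is treated as a pause.
intervals = librosa.effects.split(y, top_db=30)

gap_levels, last_end = [], 0
for start, end in intervals:
    gap = y[last_end:start]
    if len(gap) > sr // 10:  # ignore gaps shorter than 100 ms
        gap_levels.append(float(np.sqrt(np.mean(gap ** 2))))
    last_end = end

if len(gap_levels) >= 2:
    spread = max(gap_levels) / (min(gap_levels) + 1e-9)
    print(f"{len(gap_levels)} pauses; noise-floor spread x{spread:.1f}")
else:
    print("too few pauses to judge; unnaturally continuous speech is itself a cue")
```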
Protecting Yourself
Beyond detection, practical steps can reduce your vulnerability to deepfake attacks:
- Establish verification protocols for financial transactions that don't rely solely on voice or video.
- Be skeptical of urgent requests received through digital channels; manufactured urgency is a classic social-engineering pressure tactic.
- Verify important communications through a separate, previously established channel, such as a known phone number.
- Stay informed about how to identify AI-generated content across all media types.
As deepfake technology becomes more sophisticated, the arms race between creation and detection continues. Developing a critical eye (and ear) for manipulated media is no longer optional. It's a basic digital literacy skill. For still images that accompany suspicious audio or video, our AI Image Detector can quickly flag whether they were AI-generated.