Audio forensics

Audio forensics is the scientific analysis of sound recordings to determine their authenticity, identify speakers, detect tampering, and distinguish human speech from AI-generated audio. As voice cloning technology reaches the point where synthetic speech can fool human listeners, forensic audio analysis has become a critical capability for law enforcement, journalism, and fraud prevention.

Audio forensics applies spectral analysis, temporal pattern detection, and statistical methods to evaluate the authenticity of audio recordings. Forensic examiners analyze spectrograms for splice points, measure electrical network frequency (ENF) signals for timestamp verification, compare voiceprints for speaker identification, and detect the artifacts that AI speech generators leave in synthetic audio. The field is guided by standards and best-practice documents from the Audio Engineering Society (AES) and the Scientific Working Group on Digital Evidence (SWGDE).

The field has existed for decades. Audio recordings have been presented as evidence in courtrooms since the mid-20th century, and forensic methods for detecting tape splices and recording anomalies were developed in parallel. What has changed dramatically in the past three years is the threat landscape. Voice cloning services can now produce convincing imitations of any voice from just a few seconds of sample audio, and the cost of generating synthetic speech has dropped to effectively zero.

This guide covers both traditional forensic audio analysis methods and the newer techniques developed specifically to detect AI-generated speech.

What is audio forensics

History and evolution of forensic audio analysis

Audio forensics as a formal discipline emerged in the 1960s and 1970s, driven partly by high-profile cases involving recorded evidence. The analysis of the Watergate tapes in the 1970s brought audio forensics into public awareness, with experts examining the famous 18.5-minute gap for signs of deliberate erasure.

Early forensic methods focused on analog recordings: detecting physical splices in tape, measuring playback speed variations, and analyzing background noise for consistency. The transition to digital audio brought new techniques. Digital recordings do not have physical splices, but they carry codec fingerprints, quantization patterns, and metadata that provide forensic signals. The shift to AI-generated audio has since added an entirely new category of analysis.

Legal standards for audio evidence

For audio evidence to be admissible in court, it must meet authentication requirements that vary by jurisdiction but generally require demonstrating that the recording is what it claims to be. In the US, Federal Rule of Evidence 901(b)(9) addresses the authentication of process or system output. The examiner must show that the recording system was functioning properly and that the recording has not been materially altered.

Forensic audio examiners typically hold certifications from organizations like the International Association for Identification (IAI) or follow the standards published by SWGDE. The AES-43 standard addresses the forensic authentication of analog audio tape recordings and provides guidelines for methodology, equipment, and reporting.

Key organizations and certifications (AES, SWGDE)

The Audio Engineering Society (AES) publishes AES-43, its standard for authenticating analog audio tape recordings for forensic purposes. The Scientific Working Group on Digital Evidence (SWGDE) publishes best practices for digital audio evidence handling. The European Network of Forensic Science Institutes (ENFSI) maintains guidelines for forensic speech and audio analysis across European jurisdictions. These standards help ensure that forensic analysis is conducted consistently, reproducibly, and to a level that courts can rely upon.

Spectral analysis techniques

Spectral analysis is the foundation of audio forensics. It transforms a time-domain audio signal into a frequency-domain representation, revealing patterns that are invisible (or inaudible) in the raw waveform.

Spectrogram interpretation

A spectrogram displays audio as a visual map with time on the horizontal axis, frequency on the vertical axis, and amplitude represented by color intensity. Trained forensic examiners can read spectrograms the way radiologists read medical images, spotting anomalies that indicate editing, splicing, or synthesis.

In a spectrogram, a clean recording shows smooth transitions between speech segments, consistent background noise across the recording, and harmonic patterns that follow the physics of human vocal production. A spliced recording may show abrupt changes in the noise floor at edit points, discontinuities in background hum frequencies, or spectral characteristics that shift suddenly in ways that do not correspond to any natural change in the recording environment.
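To make the idea of a spectrogram concrete, here is a minimal sketch that computes one with a short-time Fourier transform using only the Python standard library. The function name `stft_magnitudes`, the frame size, and the hop size are illustrative assumptions; a real examination would use an optimized FFT library and much longer frames.

```python
import cmath
import math

def stft_magnitudes(samples, frame_size=64, hop=32):
    """Return one list of magnitudes (bins 0..frame_size//2) per frame."""
    # Hann window reduces spectral leakage at the frame edges.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_size - 1))
              for n in range(frame_size)]
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        chunk = [samples[start + n] * window[n] for n in range(frame_size)]
        mags = []
        for k in range(frame_size // 2 + 1):
            # Direct DFT of the windowed frame at bin k (slow but dependency-free).
            acc = sum(chunk[n] * cmath.exp(-2j * math.pi * k * n / frame_size)
                      for n in range(frame_size))
            mags.append(abs(acc))
        frames.append(mags)
    return frames

SAMPLE_RATE = 8000  # assumed sample rate for the synthetic test tone
tone = [math.sin(2 * math.pi * 1000 * n / SAMPLE_RATE) for n in range(512)]
spec = stft_magnitudes(tone)

# Each column of the spectrogram is one frame; find the strongest bin in the first one.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * SAMPLE_RATE / 64
```

Each inner list is one vertical slice of the spectrogram. Plotting the slices side by side, with magnitude mapped to color, gives the time-frequency image that examiners inspect for discontinuities.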

Mel-frequency cepstral coefficients (MFCC)

MFCCs are a compact representation of the spectral envelope of an audio signal, modeled on the nonlinear frequency perception of the human ear. They have been the standard feature set for speaker recognition and audio classification for over two decades.

In forensic applications, MFCCs serve two primary purposes. For speaker identification, they capture the spectral shape of a person's voice, which is determined by the physical geometry of their vocal tract. For AI detection, they reveal statistical differences between human and synthetic speech. AI-generated audio tends to produce MFCC distributions that are slightly too smooth or too regular compared to natural speech, because generation models optimize for perceptual quality rather than perfect replication of the underlying physics.
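MFCC pipelines usually pass the spectrum through a bank of filters spaced evenly on the mel scale before taking log energies and a cosine transform. The sketch below shows only the mel-spacing step, using one widely used variant of the mel formula. The helper names are assumptions made for illustration, not the API of any particular library.

```python
import math

def hz_to_mel(hz):
    """Convert frequency in Hz to mels (one common variant of the formula)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10 ** (mel / 2595.0) - 1.0)

def mel_center_frequencies(low_hz, high_hz, n_filters):
    """Centre frequencies (Hz) of n_filters filters equally spaced in mel."""
    low, high = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (high - low) / (n_filters + 1)
    return [mel_to_hz(low + step * (i + 1)) for i in range(n_filters)]

# Ten filters covering a telephone-like band; the band edges are illustrative.
centers = mel_center_frequencies(0.0, 4000.0, 10)
```

The centres are packed closely at low frequencies and spread out at high frequencies, which is the nonlinear frequency resolution the text attributes to human hearing.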

Formant analysis for speaker verification

Formants are the resonant frequencies of the human vocal tract. They are determined by the physical shape and size of the throat, mouth, and nasal passages, and they differ measurably from speaker to speaker, although they are less stable than a fingerprint because they also shift with speaking style, health, and emotion. The first three formants (F1, F2, F3) carry most of the information that distinguishes one speaker from another.

Forensic formant analysis measures these resonant frequencies across a recording and compares them to a reference sample from a known speaker. The analysis accounts for natural variation caused by different speaking styles, emotional states, and recording conditions. A mismatch in formant patterns between two recordings that claim to be the same speaker is strong evidence of either a different speaker or synthetic voice generation.
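The link between vocal tract size and formant frequencies can be illustrated with the textbook idealization of the vocal tract as a uniform tube closed at one end, whose resonances fall at odd multiples of a quarter wavelength. This is a simplified model for a neutral vowel, not a forensic measurement method, and the speed of sound value and tract lengths below are assumptions chosen for illustration.

```python
def tube_resonances(length_m, count=3, speed_of_sound=350.0):
    """Resonances (Hz) of a uniform tube closed at one end: (2n - 1) * c / (4 * L)."""
    return [(2 * n - 1) * speed_of_sound / (4.0 * length_m)
            for n in range(1, count + 1)]

# Assumed tract lengths: about 17.5 cm and a shorter 14.5 cm tract.
longer_tract = tube_resonances(0.175)
shorter_tract = tube_resonances(0.145)
```

In this model a shorter tract raises every resonance, which is one reason formant patterns carry speaker-specific information.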

Electrical network frequency (ENF) analysis

Electrical network frequency analysis is one of the most elegant forensic techniques available. The power grid operates at a nominal frequency of 50 Hz (in most of the world) or 60 Hz (in North America and parts of Asia), but the actual frequency fluctuates slightly around this nominal value due to variations in load and generation. These fluctuations create a unique time-varying signature that is captured by any recording device connected to or near mains power.
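As a rough sketch of how a grid-frequency estimate could be pulled from one frame of audio, the following code searches a narrow band around the nominal frequency for the strongest component. The function name `estimate_enf`, the search span, the step size, and the synthetic hum are all illustrative assumptions. Practical systems typically band-pass filter and downsample first and use more refined frequency estimators.

```python
import math

def estimate_enf(frame, sample_rate, nominal=50.0, span=0.5, step=0.01):
    """Return the frequency within nominal +/- span that carries the most energy."""
    n = len(frame)
    # Apply a Hann window once, then probe each candidate frequency.
    windowed = [frame[i] * (0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)))
                for i in range(n)]
    best_freq, best_power = nominal, -1.0
    for s in range(int(round(2 * span / step)) + 1):
        freq = nominal - span + s * step
        w = 2 * math.pi * freq / sample_rate
        re = sum(x * math.cos(w * i) for i, x in enumerate(windowed))
        im = sum(x * math.sin(w * i) for i, x in enumerate(windowed))
        power = re * re + im * im
        if power > best_power:
            best_freq, best_power = freq, power
    return best_freq

SAMPLE_RATE = 400  # assumed rate after downsampling; enough for a 50 Hz hum
# Synthetic two-second hum at 50.12 Hz standing in for a real recording.
hum = [0.01 * math.sin(2 * math.pi * 50.12 * i / SAMPLE_RATE)
       for i in range(2 * SAMPLE_RATE)]
estimate = estimate_enf(hum, SAMPLE_RATE)
```

Repeating the estimate on successive frames yields the time-varying ENF track that is compared against grid reference data.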

By extracting the ENF signal from a recording and comparing it to a reference database of grid frequency measurements, forensic examiners can determine when the recording was made, sometimes to within a few seconds. This technique can also detect edits: if a segment of the recording has been removed or rearranged, the ENF signal will show discontinuities at the edit points.
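The comparison against a reference database can be sketched as a sliding search: the ENF track from the recording is moved along the reference series and the position with the smallest error is taken as the best time alignment. The mean squared error criterion and the values below are assumptions for illustration; real ENF tracks are far longer, and other similarity measures such as correlation are also used.

```python
def best_alignment(reference, query):
    """Slide query along reference; return (offset, mean squared error) of the best fit."""
    best = (None, float("inf"))
    for offset in range(len(reference) - len(query) + 1):
        err = sum((reference[offset + i] - q) ** 2
                  for i, q in enumerate(query)) / len(query)
        if err < best[1]:
            best = (offset, err)
    return best

# Made-up grid frequency log (Hz), one value per second.
reference = [50.00, 50.01, 50.03, 50.02, 49.98, 49.97, 49.99, 50.02, 50.04, 50.01]
# ENF track extracted from a hypothetical recording.
query = [49.98, 49.97, 49.99, 50.02]
offset, error = best_alignment(reference, query)
```

A low error at one offset supports the claimed recording time. If no offset fits the whole track but separate pieces fit at different offsets, that pattern is consistent with an edit.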

Why ENF matters

ENF analysis is particularly powerful because the signal is captured passively and is difficult to fake. An AI-generated audio clip will typically either lack an ENF signal entirely (because it was never near a power grid) or contain an ENF pattern that does not match any real grid measurement for the claimed time and location. This makes ENF a strong discriminator between recorded and synthesized audio, although a missing signal alone is not conclusive: a battery-powered device recording away from mains wiring may capture no hum either.

Tampering detection methods

Splice and edit detection

Audio splicing involves combining segments from different recordings, or removing segments from a single recording, to change the meaning or content. Forensic splice detection examines multiple signals for discontinuities at potential edit points.

The most reliable indicators are changes in background noise characteristics. Every recording environment has a unique noise profile determined by ambient sounds, room acoustics, and equipment noise. When audio from a different recording environment is spliced in, the noise profile changes abruptly. Even when the splice is smoothed with crossfading, the underlying noise statistics shift in ways that spectral analysis can detect.
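A minimal sketch of the noise-profile check: measure the level of each frame and flag abrupt changes between neighbours. The frame size, the 3 dB threshold, and the synthetic noise are assumptions for illustration. In practice the measurement would be restricted to non-speech segments and would compare spectral shape as well as overall level.

```python
import math
import random

def frame_levels_db(samples, frame_size):
    """RMS level in dB for each non-overlapping frame."""
    levels = []
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[start:start + frame_size]
        rms = math.sqrt(sum(x * x for x in frame) / frame_size)
        levels.append(20 * math.log10(max(rms, 1e-12)))
    return levels

def level_jumps(levels, threshold_db=3.0):
    """Indices i where the level changes by more than threshold_db from frame i-1 to i."""
    return [i for i in range(1, len(levels))
            if abs(levels[i] - levels[i - 1]) > threshold_db]

# Two synthetic background noises with different levels, joined end to end.
rng = random.Random(0)
quiet_room = [rng.gauss(0, 0.001) for _ in range(8000)]
noisier_room = [rng.gauss(0, 0.004) for _ in range(8000)]
levels = frame_levels_db(quiet_room + noisier_room, frame_size=1000)
suspects = level_jumps(levels)
```

Here the only flagged frame is the one where the two noise beds meet. A flagged frame is a candidate for closer inspection rather than proof of a splice.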

Compression artifact analysis (MP3, AAC, Opus)

Lossy audio codecs like MP3, AAC, and Opus compress audio by removing information that the human ear is less sensitive to. This process leaves characteristic artifacts in the frequency domain. Each codec produces a distinct pattern of spectral shaping, and each quality setting produces a different level of artifact intensity.

Forensic compression analysis can determine what codec was used, what quality setting was applied, and whether the audio has been compressed multiple times. Double compression, where audio is decoded and re-encoded, produces a measurable increase in artifact energy that single compression does not. This is useful for detecting audio that has been edited (which requires decoding, modifying, and re-encoding) versus audio that was compressed once during the original recording.

Room acoustics and reverberation consistency

Every room has a unique acoustic signature determined by its size, shape, and the materials of its walls, floor, and ceiling. This signature is captured in the reverberation characteristics of the recording: the pattern of reflections that follow each sound event. Forensic analysis can estimate the room impulse response from a recording and check whether it remains consistent throughout.

If a segment of the recording was captured in a different room, the reverberation pattern will change at the edit point. Even if the direct sound is perfectly matched, the reflected sound will differ. This is extremely difficult to fake convincingly, because synthesizing a realistic room impulse response requires detailed knowledge of the physical space.
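One common way to summarize a room's reverberation is its RT60, the time for sound energy to decay by 60 dB. The sketch below estimates it from an impulse response using Schroeder backward integration and a line fit over the -5 dB to -25 dB range. It assumes an impulse response is already available; estimating one blindly from speech is considerably harder and is not shown. The function names and the synthetic responses are illustrative.

```python
import math

def rt60_from_impulse_response(ir, sample_rate):
    """Estimate RT60 from the -5 dB to -25 dB span of the Schroeder decay curve."""
    # Backward integration: energy remaining from each sample to the end.
    remaining, total = [], 0.0
    for x in reversed(ir):
        total += x * x
        remaining.append(total)
    remaining.reverse()
    decay_db = [10 * math.log10(r / remaining[0]) for r in remaining if r > 0]
    points = [(i / sample_rate, d) for i, d in enumerate(decay_db)
              if -25.0 <= d <= -5.0]
    # Least-squares slope of the decay in dB per second.
    mean_t = sum(t for t, _ in points) / len(points)
    mean_d = sum(d for _, d in points) / len(points)
    slope = (sum((t - mean_t) * (d - mean_d) for t, d in points)
             / sum((t - mean_t) ** 2 for t, _ in points))
    return -60.0 / slope

def synthetic_ir(rt60, sample_rate, seconds=1.0):
    """Idealized exponentially decaying response with a known RT60 (for testing)."""
    rate = 3.0 * math.log(10.0) / rt60   # amplitude falls 60 dB after rt60 seconds
    return [math.exp(-rate * i / sample_rate) for i in range(int(seconds * sample_rate))]

small_room = rt60_from_impulse_response(synthetic_ir(0.4, 8000), 8000)
large_room = rt60_from_impulse_response(synthetic_ir(0.9, 8000), 8000)
```

If two segments of a supposedly continuous recording yield clearly different decay times, that difference is the kind of inconsistency described above.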

Background noise pattern analysis

Background noise is not random. It contains structured signals from HVAC systems (which produce harmonic patterns at specific frequencies), traffic (which has characteristic spectral profiles depending on distance and vehicle types), and electronic equipment (which introduces hum at mains frequency harmonics). These signals form a temporal pattern that should remain consistent within a continuous recording.

Forensic background noise analysis looks for discontinuities in these patterns. A splice point where two different recordings are joined will typically show a change in the noise floor level, the harmonic structure of background hum, or the spectral profile of ambient noise. These changes may be too subtle to hear but are clearly visible in spectral analysis.

AI-generated speech detection

Detecting AI-generated speech is the newest and fastest-evolving area of audio forensics. Modern text-to-speech (TTS) and voice cloning systems produce speech that most human listeners cannot distinguish from real recordings. Forensic methods target the artifacts that these systems leave at a level below human perception.

Voice cloning artifacts: what forensic analysis reveals

Voice cloning systems work by learning a statistical model of a speaker's voice characteristics from sample audio, then using that model to generate new speech. The process introduces several categories of artifacts that forensic analysis can detect.

Spectral smoothing is the most common artifact. Cloned voices tend to have smoother spectral envelopes than real voices, because the generation model averages over the natural variation in the source speaker's voice. Real speech is messy: formant frequencies shift moment to moment, vocal fry introduces irregular low-frequency energy, and micro-variations in breath pressure create subtle amplitude modulations. AI-generated speech captures the average of these patterns but not their full dynamic range.

Text-to-speech vs human speech characteristics

TTS systems, even the best ones, produce speech that differs from human speech in measurable ways. These differences are subtle and inaudible to most listeners, but they show up clearly in forensic analysis:

Pitch contour regularity - Human pitch varies with emotion, emphasis, and breath. TTS pitch contours are smoother and more predictable.
Formant transition speed - Humans transition between phonemes at variable speeds depending on speaking rate and articulatory effort. TTS transitions tend to be more uniform.
Glottal pulse variation - The vocal folds vibrate with natural irregularity. TTS systems model this as controlled jitter, which has different statistical properties than real vocal fold vibration.
Breathing artifacts - Humans breathe. Real speech contains breath sounds at natural intervals determined by lung capacity and speech rate. Many TTS systems either omit breathing entirely or insert it with metronomic regularity.
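Glottal pulse variation is often summarized as jitter: the average absolute difference between consecutive pitch periods, divided by the average period. The sketch below computes that ratio from a list of period lengths. The period values are made up for illustration, and extracting periods from real audio is a separate step not shown here.

```python
def local_jitter(periods):
    """Mean absolute difference between consecutive periods, divided by the mean period."""
    diffs = [abs(a - b) for a, b in zip(periods, periods[1:])]
    return (sum(diffs) / len(diffs)) / (sum(periods) / len(periods))

# Pitch periods in seconds (made-up values around a 100 Hz voice).
natural_like = [0.0100, 0.0102, 0.0099, 0.0101, 0.0098, 0.0103]
too_regular = [0.0100] * 6

jitter_natural = local_jitter(natural_like)
jitter_regular = local_jitter(too_regular)
```

A single jitter value says little on its own. The forensic interest lies in how its distribution across a recording compares with what natural speech produces.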

Prosody and micro-timing anomalies

Prosody refers to the rhythm, stress, and intonation patterns of speech. It carries meaning that is separate from the words themselves: the difference between a statement and a question, the emphasis that signals sarcasm, the pacing that conveys urgency. Human prosody is shaped by emotion, cognitive load, social context, and physical factors like fatigue.

AI-generated speech has improved dramatically in prosodic naturalness, but forensic analysis can still detect anomalies in micro-timing. The gaps between words, the duration of stressed versus unstressed syllables, and the relationship between pitch and timing all follow patterns in human speech that AI models approximate but do not perfectly replicate. Statistical analysis of these micro-timing features across a full recording can distinguish human from synthetic speech even when individual segments sound completely natural.
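One established rhythm measure that could serve as a micro-timing feature is the normalized pairwise variability index (nPVI), which scores how much successive durations differ from one another. The sketch below applies it to syllable durations. Using it as a synthesis indicator is an assumption made for illustration, and the duration values are invented.

```python
def npvi(durations):
    """Normalized pairwise variability index of successive durations (0 = perfectly even)."""
    pairs = list(zip(durations, durations[1:]))
    return 100.0 * sum(abs(a - b) / ((a + b) / 2.0) for a, b in pairs) / len(pairs)

# Syllable durations in seconds (illustrative values only).
varied_timing = [0.18, 0.09, 0.25, 0.11, 0.30, 0.14]
uniform_timing = [0.16, 0.16, 0.16, 0.16, 0.16, 0.16]

score_varied = npvi(varied_timing)
score_uniform = npvi(uniform_timing)
```

As with any single feature, the score is only informative when compared against reference distributions for the same language and speaking style.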

Breathing, pauses, and disfluency patterns

Human speech is full of disfluencies: "um," "uh," false starts, repeated words, and self-corrections. These are not errors. They are a natural part of speech production, and their frequency, placement, and acoustic characteristics follow well-studied patterns. Hesitation disfluencies tend to occur at clause boundaries or before low-frequency words. Self-corrections follow specific patterns of interruption and restart.

Current TTS and voice cloning systems typically produce unnaturally fluent speech. When they do include disfluencies (some newer models add them), the placement and acoustic properties differ from natural patterns. Forensic analysis of disfluency patterns, combined with breathing analysis, provides a signal that is difficult for AI systems to fake convincingly because it requires modeling cognitive processes, not just acoustic properties.

Spectral analysis - Examines the frequency domain for smoothing artifacts, missing harmonics, and unnatural spectral envelopes typical of synthesized speech.
Temporal analysis - Measures micro-timing between phonemes, syllable durations, and pause patterns for statistical regularity that indicates synthesis.
Breathing detection - Analyzes breath placement, duration, and acoustic characteristics. Missing or metronomic breathing is a strong synthesis indicator.
ENF verification - Checks for an electrical network frequency signal. Absence suggests the audio was not recorded near mains power, which is consistent with synthesis but also with battery-powered recording away from the grid.

Voice authentication and speaker identification

Voiceprint comparison methodology

Voiceprint comparison is the process of determining whether two audio recordings contain the same speaker. The analysis extracts features that characterize the speaker's voice, primarily MFCCs, formant frequencies, pitch range, and speaking rate, and computes a similarity score between the two samples.

Modern voiceprint analysis uses deep neural network embeddings (often called "speaker embeddings" or "d-vectors") that compress the speaker's vocal characteristics into a fixed-length vector. The similarity between two recordings is then computed as the cosine similarity between their embedding vectors. This approach is more robust than traditional feature comparison because the neural network is trained to be less sensitive to recording conditions, channel effects, and speaking style variations.
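The comparison step itself is simple once embeddings exist. A minimal sketch, with toy four-dimensional vectors standing in for real embeddings (which have hundreds of dimensions and come from a trained model not shown here):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy embeddings with invented values.
enrolled = [0.12, -0.40, 0.88, 0.05]
questioned = [0.10, -0.35, 0.90, 0.00]
unrelated = [-0.70, 0.20, -0.10, 0.65]

same_speaker_score = cosine_similarity(enrolled, questioned)
different_speaker_score = cosine_similarity(enrolled, unrelated)
```

The score only becomes a decision once a threshold is chosen, and that threshold has to be calibrated on data that resembles the case conditions.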

Cross-language and cross-channel challenges

Speaker identification becomes more difficult when the comparison samples are recorded under different conditions. A phone call sampled at 8 kHz carries only about 300 to 3,400 Hz of usable bandwidth, far less spectral information than a studio recording sampled at 48 kHz. A speaker may use different vocal registers when speaking different languages, or when speaking formally versus casually.

Forensic methods account for these variations through channel normalization techniques that attempt to remove the effects of the recording medium and focus on speaker-specific features. However, extreme channel mismatches (such as comparing a voicemail recording to a studio sample) reduce confidence levels significantly, and responsible forensic practice reports these limitations explicitly.

Limitations and error rates

No speaker identification method is infallible. The field reports performance in terms of equal error rate (EER), the point at which the false acceptance rate equals the false rejection rate. State-of-the-art systems achieve EERs below 1% on clean, controlled data, but performance degrades with background noise, channel mismatch, and short sample duration. Samples under 10 seconds are particularly challenging, and forensic examiners generally require at least 30 seconds of comparable speech for reliable identification.
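To show how an equal error rate is read off a set of comparison scores, the sketch below sweeps a decision threshold and picks the point where the false acceptance and false rejection rates are closest. The scores are invented, and the simple sweep without interpolation is an assumption made for brevity.

```python
def equal_error_rate(genuine, impostor):
    """Sweep thresholds over all observed scores; return (eer, threshold)."""
    best_gap, best = float("inf"), (None, None)
    for threshold in sorted(set(genuine) | set(impostor)):
        # False acceptance: impostor pairs scoring at or above the threshold.
        far = sum(s >= threshold for s in impostor) / len(impostor)
        # False rejection: same-speaker pairs scoring below the threshold.
        frr = sum(s < threshold for s in genuine) / len(genuine)
        if abs(far - frr) < best_gap:
            best_gap, best = abs(far - frr), ((far + frr) / 2.0, threshold)
    return best

# Invented similarity scores for same-speaker and different-speaker pairs.
genuine_scores = [0.91, 0.85, 0.78, 0.66, 0.95, 0.88, 0.59, 0.81, 0.93, 0.74]
impostor_scores = [0.12, 0.33, 0.41, 0.62, 0.25, 0.08, 0.70, 0.19, 0.37, 0.45]
eer, threshold = equal_error_rate(genuine_scores, impostor_scores)
```

With these toy scores, one impostor pair in ten is accepted and one genuine pair in ten is rejected at the chosen threshold, giving an EER of 10%.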

Tools and workflows

Open-source audio forensics software

Several open-source tools support forensic audio analysis. Audacity, while primarily an editor, provides spectrogram visualization, frequency analysis, and noise profiling. Praat is the standard tool for phonetic analysis, offering spectrograms, formant tracking, and pitch extraction. SoX (Sound eXchange) provides command-line audio analysis and transformation. For ENF analysis, research tools from the University of Campinas and NIST provide reference implementations.

Professional forensic analysis workflow

Acquire - Secure the original recording with chain of custody.
Preserve - Create a forensic copy and compute hash values.
Examine - Perform metadata, spectral, temporal, and statistical analysis.
Compare - Run speaker ID, ENF matching, and reference correlation.
Report - Document findings with confidence levels and limitations.
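The hashing part of the preservation step can be sketched with the Python standard library. SHA-256 is an assumed choice here, since the workflow does not name an algorithm, and the file names and contents are placeholders.

```python
import hashlib
import os
import tempfile

def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large recordings do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_copy(original_path, copy_path):
    """True when the working copy is bit-for-bit identical to the original."""
    return sha256_of_file(original_path) == sha256_of_file(copy_path)

# Demonstration with placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as workdir:
    original = os.path.join(workdir, "evidence.wav")        # hypothetical file name
    working = os.path.join(workdir, "evidence_copy.wav")    # hypothetical file name
    for path in (original, working):
        with open(path, "wb") as handle:
            handle.write(b"RIFF placeholder audio bytes")
    original_hash = sha256_of_file(original)
    copies_match = verify_copy(original, working)
```

Recording the hash at acquisition time lets anyone later confirm that the examined file is the one that was collected.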

AFIP forensic audio analysis

AFIP's forensic audio analysis pipeline combines traditional forensic methods with AI speech detection in a single workflow. Audio files are analyzed for splice points, compression signatures, ENF signals, and AI generation artifacts simultaneously. The results are fused into a multi-evidence report that provides confidence levels for each finding and an overall authenticity assessment.

The system is designed to handle the audio formats and quality levels encountered in real-world scenarios, including compressed voice messages, phone call recordings, and social media audio clips. Unlike research tools that work best on clean, uncompressed audio, AFIP's analysis is calibrated for the degraded quality typical of content that has been shared across platforms.

The future of audio forensics

Audio forensics is evolving on multiple fronts simultaneously.

Real-time detection is becoming critical as voice cloning is used in live phone calls for social engineering and fraud. The FBI has reported cases of CEO impersonation scams using real-time voice cloning, where the caller sounds exactly like a company executive. Detecting these attacks requires analysis that runs within the latency constraints of a live call, typically under 500 milliseconds.

Cross-modal analysis combines audio forensics with video forensics for stronger detection. When a video contains both a face and speech, forensic analysis can check whether the lip movements match the audio phonemes, whether the voice characteristics match the visible speaker's physical build, and whether the audio and video share consistent environmental characteristics (room acoustics visible in the video versus audible in the audio).

Adversarial robustness is an active research area. As detection methods improve, voice cloning systems are being designed to specifically avoid known detection artifacts. The forensic community responds by developing new detection features that target deeper characteristics of the generation process. This is a sustained arms race, and the forensic approach of combining multiple independent analysis methods provides the best defense against single-point-of-failure detection.

Frequently asked questions

Can audio forensics detect all AI-generated speech?

Detection rates vary by generation model and audio quality. Current forensic methods achieve detection rates above 90% on most commercial voice cloning systems when given at least 10 seconds of audio. Detection is more challenging on short clips, heavily compressed audio, or speech generated by the most advanced research models. Multi-method analysis combining spectral, temporal, and breathing analysis provides the most reliable results.

How long does a forensic audio examination take?

It depends on the scope. An automated triage analysis for AI detection can complete in seconds. A full forensic examination that includes speaker identification, tampering detection, ENF analysis, and detailed reporting typically takes several hours for a short recording and may take days for longer or more complex cases. Legal cases generally require the longer, more thorough analysis.

What is ENF analysis and why is it important?

Electrical network frequency analysis extracts the power grid frequency signal captured in audio recordings. Because the grid frequency fluctuates in a unique pattern that is logged by reference stations, this signal can timestamp a recording and verify its continuity. AI-generated audio lacks a genuine ENF signal, making ENF analysis a strong indicator of synthetic speech.

Can phone call recordings be forensically analyzed?

Yes, but with reduced confidence compared to higher-quality recordings. Phone calls are typically sampled at 8 kHz (narrowband) or 16 kHz (wideband), which limits the audio to roughly 3.4 kHz or 7 kHz of bandwidth and removes high-frequency spectral information. Forensic analysis adapts by focusing on features that survive band-limiting, such as pitch characteristics, temporal patterns, and low-frequency spectral features. Voice-over-IP calls may retain more bandwidth than traditional phone calls.

Is voiceprint analysis as reliable as fingerprint analysis?

Not quite. Fingerprint analysis works with stable physical features that do not change over time. Voiceprints are influenced by health, age, emotion, recording conditions, and speaking style. Modern voiceprint analysis achieves error rates below 1% under controlled conditions, but real-world performance is lower. Forensic practice treats voiceprint evidence as probabilistic, reporting likelihood ratios rather than definitive identifications.

References

AES-43-2000: AES Standard for Forensic Purposes - Criteria for the Authentication of Analog Audio Tape Recordings.
SWGDE Best Practices for Digital Audio Evidence.
ENFSI Guidelines for Forensic Speech and Audio Analysis.
ASVspoof 2024: Automatic Speaker Verification Spoofing and Countermeasures Challenge.