
Understanding AI in Audio Post: What It Means and How It’s Used
Aug 18
7 min read
At Smart Post, AI in audio post doesn't mean just one thing.
This guide breaks down the four families of tools we actually use – and what each means in audio post – ML on audio, Speech AI (ASR/TTS), LLM text and automation tools, and generative audio tools – and where they fit into our workflows (turnover, prep, cleanup, editorial, design, mixing, metadata, and delivery).
In essence, these tools handle the repeatable tasks so we can make creative decisions more fluidly.
Our "AI" motto has become: Tools speed the chores; people shape the story.

What AI Means to Us
In our practice, when we say “AI” in audio post, we’re referring to four practical families of tools:
Machine learning (ML) on audio – learns from examples to clean, separate, and/or enhance sound or recorded material (think denoising, de-reverb, source separation, basic editorial/conforming, etc.).
Speech AI (ASR/TTS) – ASR (automatic speech recognition) turns speech into text; TTS (text-to-speech) synthesizes voices – used only with union-compliant consent, of course!
Large language models (LLMs) for text, formatting, and workflow automation – summarize, outline, tag, and draft notes and metadata; everything we do here is human-reviewed.
Generative audio/design – references ambiences/textures/sound libraries for concepts or idea generation; never a substitute for a real, human performance, design, or mix decision – EVER.

Where AI Helps Us (Use Cases)
Turnover and Prep
Triage problem clips (noisy, clipping, roomy mics, etc.). This helps us troubleshoot issues before we get into editorial.
Batch loudness leveling and processing; standardize filenames and metadata tags across the project.
Automate cue-sheet helpers and metadata consistency scans.
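As a concrete illustration, much of this prep can be scripted. The sketch below normalizes turnover clip names to a hypothetical house convention (the `SHOW_EP##_SC###` pattern and the regex rules are assumptions for illustration, not our actual spec):

```python
import re

def standardize_filename(raw: str) -> str:
    """Normalize a turnover clip name to a hypothetical house convention:
    all-caps, underscores, zero-padded episode/scene numbers, lowercase ext."""
    if "." in raw:
        stem, _, ext = raw.rpartition(".")
    else:
        stem, ext = raw, "wav"  # assume WAV when no extension is present
    # Collapse spaces/dashes to underscores and strip stray characters
    stem = re.sub(r"[\s\-]+", "_", stem.strip())
    stem = re.sub(r"[^A-Za-z0-9_]", "", stem)
    # Zero-pad episode and scene numbers (ep3 -> EP03, sc7 -> SC007)
    stem = re.sub(r"(?i)ep_?(\d+)", lambda m: f"EP{int(m.group(1)):02d}", stem)
    stem = re.sub(r"(?i)sc_?(\d+)", lambda m: f"SC{int(m.group(1)):03d}", stem)
    return f"{stem.upper()}.{ext.lower()}"

print(standardize_filename("ep3 sc7 - cafe walla take2.WAV"))
# -> EP03_SC007_CAFE_WALLA_TAKE2.wav
```

In practice this kind of rule set lives in a batch tool and is reviewed before anything is renamed in place.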
Dialogue Editorial
Denoise/de-reverb for intelligibility with minimal artifacts.
Source separation to rescue lines buried under music, FX, or noise.
Detect "alternate" dialogue takes that exist in dailies but not in AAF/OMF exports.
Remove fillers and long silences in dialogue-only content.
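Under the hood, a long-silence pass starts with a simple windowed RMS gate. A toy version below shows the core idea (the threshold, window, and minimum-length values are illustrative defaults, not production settings):

```python
import math

def find_silences(samples, rate, threshold_db=-45.0, min_len_s=1.0, win_s=0.05):
    """Return (start_s, end_s) spans where windowed RMS stays below threshold_db
    for at least min_len_s seconds. Toy gate; real tools add hysteresis/fades."""
    win = max(1, int(rate * win_s))
    floor = 10 ** (threshold_db / 20)          # dBFS threshold -> linear amplitude
    spans, start = [], None
    n_wins = len(samples) // win
    for i in range(n_wins):
        chunk = samples[i * win:(i + 1) * win]
        rms = math.sqrt(sum(x * x for x in chunk) / len(chunk))
        t = i * win / rate
        if rms < floor and start is None:      # entering a quiet region
            start = t
        elif rms >= floor and start is not None:  # leaving a quiet region
            if t - start >= min_len_s:
                spans.append((start, t))
            start = None
    if start is not None and n_wins * win / rate - start >= min_len_s:
        spans.append((start, n_wins * win / rate))
    return spans
```

A real pass would also cross-check detected spans against the transcript so fillers ("um", "uh") can be cut by word, not just by level.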
ADR and Continuity
Align ADR timing/pitch to production/recorded audio.
Align boom/lav for phase-coherent compositing.
When cleared: micro-fix line reads with approved voice modeling or pitch/timing adjustments.
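Time alignment at its simplest is a cross-correlation search for the best sample offset. Commercial tools like Auto-Align Post go much further (per-band, drift-tracking, phase-aware), but a brute-force sketch shows the core idea:

```python
def best_lag(ref, tgt, max_lag):
    """Return the lag (in samples) by which `tgt` trails `ref`, found by
    brute-force cross-correlation over -max_lag..+max_lag. A positive result
    means `tgt` is delayed and should be advanced by that many samples."""
    best, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i, r in enumerate(ref):
            j = i + lag
            if 0 <= j < len(tgt):
                score += r * tgt[j]          # correlation at this offset
        if score > best_score:
            best_score, best = score, lag
    return best

boom = [0, 0, 1, 3, 1, 0, 0, 0]
lav = [0, 0, 0, 0, 1, 3, 1, 0]              # same transient, 2 samples late
print(best_lag(boom, lav, 4))                # -> 2
```

Real alignment also needs fractional-sample interpolation and per-frequency phase correction, which is exactly what the dedicated tools above handle.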
Pre-Dub and Mix
First-pass suggestions (dialogue leveling, EQ starting points/matching).
Cleanup/balancing that speeds up mix decisions without changing creative intent.
Add semantic markers for tricky words or off-mic moments that need extra attention from the mixers.
Object placement, upmixing, and format previews before the final mix – especially useful for spatial audio formats.
Delivery and QC
Run loudness-compliance processing for multiple studios/networks at scale.
Generate captions/transcripts with spot checks.
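The loudness QC step reduces to arithmetic once a BS.1770 meter (e.g., NUGEN VisLM or Insight) has produced the numbers. A sketch of that check, with a spec table that is purely illustrative (real targets come from each network's delivery docs):

```python
SPECS = {  # hypothetical examples; always confirm against the actual delivery spec
    "broadcast_us": {"target_lufs": -24.0, "tol_lu": 2.0, "true_peak_max": -2.0},
    "streaming":    {"target_lufs": -27.0, "tol_lu": 2.0, "true_peak_max": -2.0},
}

def check_compliance(measured_lufs, true_peak_dbtp, spec_name):
    """Return (passes, gain_to_target_db) for one deliverable.
    Measurement itself comes from a BS.1770 meter; this only checks numbers."""
    spec = SPECS[spec_name]
    gain = spec["target_lufs"] - measured_lufs     # dB offset to hit target
    loud_ok = abs(gain) <= spec["tol_lu"]          # within tolerance window
    peak_ok = true_peak_dbtp <= spec["true_peak_max"]
    return loud_ok and peak_ok, round(gain, 1)
```

Running this across every deliverable in a batch is how "compliance at scale" stays a spreadsheet problem instead of a re-mix problem.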
Archive and Asset Management
Tag speakers/notes/scenes for faster retrieval.
Keep transcripts searchable and tied to original scripts or assets associated with sound files or recorded material.
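Keeping transcripts searchable can be as simple as an inverted index: each word maps back to a timecode and speaker. The data shapes below are assumptions for illustration:

```python
from collections import defaultdict

def build_index(transcript):
    """transcript: list of (timecode_s, speaker, text) tuples.
    Returns word -> list of (timecode_s, speaker) hits for fast retrieval."""
    index = defaultdict(list)
    for tc, speaker, text in transcript:
        for word in text.lower().split():
            index[word.strip(".,!?")].append((tc, speaker))
    return index

lines = [
    (12.0, "ALEX", "Meet me at the harbor."),
    (15.5, "SAM", "The harbor? At midnight?"),
]
idx = build_index(lines)
print(idx["harbor"])   # -> [(12.0, 'ALEX'), (15.5, 'SAM')]
```

Library tools like Soundminer and BaseHead do this (and far more) against embedded metadata; the point is that a searchable archive is cheap to maintain once transcripts exist.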
Remote Recording, Playbacks, and Review
Low-latency conferencing and recording solutions; hi-res client streaming for playbacks, notes, and faster fixes and approvals.
The Tools We Use (...and Why)
Cleanup and Restoration (these are our current ML workhorses)
iZotope RX – Dialogue Isolate, Repair Assistant, Spectral Recovery; daily restoration backbone.
Steinberg SpectraLayers Pro – visual spectral editing/separation that complements RX.
Adobe Enhanced Speech – fast web cleanup for roughs/remote recordings.
Waves Clarity Vx Pro / Hush – real-time dialogue-focused denoise for speed passes.
Acon Extract:Dialogue / Accentize dxRevive / Supertone Clear – alternate flavors for specific noise/artifact profiling and cleanup.
Absentia DX – batch de-click/de-hum tailored for production dialogue.
CEDAR / VoicEX 2 – adaptive dialogue suppression when transparency is critical.
Auphonic – intelligent leveling and LUFS compliance for voice-heavy content.
Krisp (real-time) – keeps remote VO/ADR/Podcast audio cleaner from the streaming source.
Zynaptiq Unveil / Unchirp, CrumplePop, and Acon Digital DeVerberate – special-use cases.
Alignment and Continuity
Sound Radix Auto-Align Post – dynamic boom/lav mic alignment and spectral phase correction.
Synchro Arts Revoice Pro / Vocalign – ADR timing/pitch alignment and doubling.
DaVinci Resolve Fairlight Voice Isolation and Dialogue Leveler – light-lift tools inside picture editorial (when needed).
Conform and Change Management
The Cargo Cult Matchbox – compare EDLs, find differences, build reconform maps and auto-conform sessions to new picture turnovers.
Sounds In Sync EdiLoad – merge EDLs, create conform lists, spotting.
ADR, Dubbing, ASR/TTS/Voice Cloning and Recording
Sounds In Sync EdiCue / EdiPrompt – cue sheets and on-screen overlay of streamers/beeps/prompting.
Non-Lethal Applications Cue Pro – ADR cueing app with on-screen streamers and live, shareable cue sheets.
VoiceQ – dubbing/ADR with clear syllable timing overlays.
Resemble.ai – create, edit, emote, and localize voices (via web or API). Good for temp ADR, pickups, and localization – when cleared.
ElevenLabs – platform of AI voice tools – high-quality text-to-speech, speech-to-speech, consent-based voice cloning, multilingual dubbing, and huge voice library – handy for temp ADR and quick pickups when cleared.
Google TTS / Custom Voice – wide language coverage; custom models with a thorough review process.
Remote Record and Review
Source-Connect – industry-standard low-latency remote record and playback.
SessionLinkPRO / Cleanfeed – browser-based talent/producer connections.
Audiomovers ListenTo – stream your DAW bus to clients in hi-res for approvals.
ClearView Flex – secure, low-latency streaming for remote review.
Streambox – Spectra (software) and Chroma (hardware) – deliver secure, low-latency, streaming for remote review/collaboration.
Sound Design and Foley
Krotos Studio / Weaponiser / Dehumaniser – performance-driven, layered SFX with quick variation; great for fast, creative ideas and design.
Krotos Reformer Pro – “perform” Foley/textures from mic or track input (cloth, footsteps, creatures, etc.) for natural, organic ideas.
Soundly – our hub for fast search, collections, and consistent SFX tagging/versioning, plus Voice Designer for PA/airport announcements, background conversation, and quick utility VO.
Accentize Chameleon – AI reverb-matching plugin that analyzes a recording’s room acoustics and builds a reverb profile you can apply to ADR, Foley, or dry tracks.
ElevenLabs Sound Effects – generates sound effects from simple text prompts, with quick variations. Used as tools, not inventory – used in context, not resold, and we stick to the provider’s terms.
Library and Metadata
Soundly (also listed above), Soundminer, and BaseHead for deep metadata tagging/scanning, DAW spotting, and alternate search workflows across sound libraries.
Loudness, Metering and Delivery
NUGEN VisLM / LM-Correct / ISL – realtime + one-click compliance and true-peak limiting.
iZotope Insight 2 – complete metering (loudness, surround, spectrum).
Dolby Atmos Production Suite / Renderer – ADM, binaural renders, deliverable checks.
Spatial and Immersive
NUGEN Halo Upmix / Downmix – transparent format moves (stereo↔5.1↔Atmos beds).
Sennheiser Dear Reality – binaural/immersive positioning for previews and temp 3D scenes.
Flux/IRCAM SPAT Revolution – advanced immersive mixing and room modeling.

Union Compliance (SAG-AFTRA): Digital Replicas and Voice Synthesis
Most of the actors we record are SAG-AFTRA members, so any use of a digital voice replica is governed by union rules and contracts. In practice, it's all about:
Consent first – informed, written consent before any creation or use of a digital replica. We are completely transparent and pride ourselves on that.
Defined scope – where, how long, and for what the replica may be used; new uses require new approvals.
Compensation and credit – union-compliant terms per the applicable agreement.
Security – private handling, least-privilege access, documented chain of custody.
Our policy: We only engage TTS/voice-cloning workflows when a project has documented SAG-AFTRA compliance – in accordance with the SAG-AFTRA SOUND RECORDINGS CODE (consent on file, scope defined, compensation arranged). If not cleared, we strongly urge recording traditional ADR.

The Human Element: Why Craft Still Wins
Give ten pros the same tools and you’ll get ten different results... and that’s the point!
Taste and intent – deciding what to fix vs. what to keep is storytelling, not algorithmic decision-making.
Context and judging trade-offs – sometimes a little noise, air, or a slight rustle in the recording is the "life" that keeps a line from sounding lifeless and over-processed. That imperfection will ALWAYS feel more natural to the human ear.
Performance direction – ADR and dialect coaching, mic choices, and room setups shape outcomes before any tool runs a single process.
Session architecture – how editors, assistants, and techs name, route, group, and template each session speeds up creative decision-making and prevents errors.
Problem framing – knowing why a clip sounds wrong guides the right fix – and avoids over-processing and wasted time.
Ears and trust – feedback discussions with directors/producers/talent, fast A/B comparisons, and delivery confidence can’t be automated.
AI can help accelerate some tasks, but it can’t model your taste, your judgment, or your collaboration and vision with a director or producer.
THIS human touch is the difference between “fixed” and finished.
What AI Won’t Do Here
Replace actors, mixers, or editors – taste and storytelling are human.
Decide creative intent – it can suggest; it doesn’t direct.
Break trust – no unapproved cloning of any kind; no shady data usage whatsoever.
Results You Can Expect
Clean dialogue tracks built faster, with fewer artifacts.
Quicker first passes that keep creative momentum flowing.
Consistent loudness/metadata tagging/captioning across multiple versions.
Better exchanges between editorial, recording, mix, and delivery departments.
FAQ
Can you replace a missed word without ADR?
Sometimes — only with documented performer consent under certain union rules and clearances; otherwise we strongly urge using ADR.
Do you use AI to fix bad Zoom/phone/lecture audio?
Oftentimes, yes — send a short sample and we’ll show what’s recoverable without artifacts or "chirping" effects.
Will AI change the sound of my production tracks?
Our bias is preservation. We share before/after examples (when used), explain trade-offs, and you approve any final versions.
Do you train any AI models on our audio?
No. We don’t use your material to train public models. Project audio stays private and is used only for that specific project.
Do you provide logs of any AI steps used?
Yes — upon request we’ll note tools used and provide before/after clips for any key fixes. Our processes are always "non-destructive" and transparent – absolutely no unethical use whatsoever.
Can you match mics?
Often, yes — using Auto-Align Post, Revoice, EQ matching, mic modeling, and/or IRs (impulse responses), we can match production mics or create close tonal matches so dialogue sounds smooth and natural, with no perceived difference in technical character.
Can AI fix bad production audio?
Oftentimes, yes — especially for hiss, hum, noise, clipping, and roomy mics. Heavy wind, waves, reverb, or overlapping sounds set limits, so we test short samples and provide realistic before/after examples before diving in further.
Can you separate dialogue from music/FX if we don’t have stems?
Sometimes. ML separation (e.g., spectral/source separation) can pull voices from mixes – even music and effects. We’ll try a clip, flag artifacts, and recommend the cleanest path forward to split any composite tracks.
Let’s Post Smart.
Ready to hear how any of these workflows could help your next project?
Request a Quote or Contact the Team and we’ll review a sample and outline options for you.




