Case Study · Performance marketing

Speech-to-Creative Pipeline: speech recognition + language swap + lip-sync

One creative is scaled to 5+ languages and GEOs automatically. Whisper transcribes the speech → GPT-4o translates and rewrites it for the market and persona → ElevenLabs clones the voice with the correct accent → wav2lip syncs lip movement to the new audio. What used to require a recording studio and a week of work now happens in an evening.

Industry
Performance / Affiliate / Media buying
Stack
Whisper · GPT-4o · ElevenLabs · ffmpeg
Timeline
~5 business days (MVP)
Outcome
1 video → 10-30 variants
01 · Pain Point

Spy data exists but doesn't scale by hand

In performance marketing, spy services (AdHeart, AdSpy, AdsLibrary, Anstrex) surface thousands of competitors' working creatives, but you can only "borrow" the parts that won't trigger a Copyright Strike. Narrator voice, background, actor's face, lip-sync: everything covered by DMCA has to be recreated.

The workflow used to look like this: watch spy video → write transcript by hand → hand to copywriter to rewrite for your own offer → hire voice actor via Fiverr → wait 1-3 days → edit in Final Cut → iterate. 4-6 hours of work per variation, and that's without lip-sync.

With budgets of $2-5k/day on A/B tests, production-cycle speed is itself a competitive advantage. When you have 8-10 funnels running simultaneously and each needs 20-30 fresh creatives per week, the manual workflow breaks down.

02 · Solution

4-stage pipeline: STT → LLM → TTS → lip-sync

01
STT

Whisper transcription of spy videos

whisper-large-v3 (self-hosted) or OpenAI whisper-1 via API. Handles Russian, English, Spanish, Balkan languages (e.g. Serbian), and Turkic languages (e.g. Turkish). Outputs SRT with timecodes, so we know exactly which phrase falls on which second.
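The SRT step can be sketched in plain Python. This is a minimal illustration, not the pipeline's actual parser: the `Segment` class, `parse_srt` function, and the sample subtitle text are hypothetical names and data. It assumes a transcript was already obtained as SRT text (for example by requesting the `srt` response format from the transcription API) and turns it into per-phrase start, end, and duration values.

```python
import re
from dataclasses import dataclass


@dataclass
class Segment:
    index: int
    start: float  # seconds from the beginning of the video
    end: float    # seconds from the beginning of the video
    text: str

    @property
    def duration(self) -> float:
        return self.end - self.start


# SRT timestamps look like 00:00:02,800 (some tools emit a dot instead of a comma).
_TIME = re.compile(r"(\d+):(\d{2}):(\d{2})[,.](\d{3})")


def _to_seconds(stamp: str) -> float:
    h, m, s, ms = map(int, _TIME.fullmatch(stamp.strip()).groups())
    return h * 3600 + m * 60 + s + ms / 1000


def parse_srt(srt_text: str) -> list[Segment]:
    """Split SRT text into segments; blocks without a text line are skipped."""
    segments = []
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.strip().splitlines()
        if len(lines) < 3:
            continue
        start, end = lines[1].split("-->")
        text = " ".join(line.strip() for line in lines[2:])
        segments.append(Segment(int(lines[0]), _to_seconds(start), _to_seconds(end), text))
    return segments


# Hypothetical sample data, for illustration only.
SAMPLE_SRT = """1
00:00:00,000 --> 00:00:02,800
Tired of diets that never work?

2
00:00:02,800 --> 00:00:06,500
Here is what changed for me.
"""

for seg in parse_srt(SAMPLE_SRT):
    print(f"{seg.index}: {seg.start:.1f}s to {seg.end:.1f}s | {seg.text}")
```

The per-phrase durations are what the later stages need: the rewrite and the synthesized audio both have to fit into these windows.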

02
LLM · TRANSLATE

GPT — translation, localization, persona rewrite

GPT-4o does three tasks in one pass: (a) breaks the transcript into compositional blocks — hook (first 3 sec), pain, proof, CTA; (b) translates into the target language (Russian / English / Serbian / Polish / Turkish) with cultural nuance — not literal but as "a native speaker would say it"; (c) rewrites for the target offer, persona, and GEO, preserving the rhythm and emotional triggers of the original.

Few-shot prompts with ready "before → after" examples per language and niche. The main focus is preserving timecodes: each phrase's duration must match the original or lip-sync will break. Output — 10-30 variants × 5+ languages, ranked by predicted engagement.
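One way to enforce the timecode constraint is a duration budget check on the LLM output before any audio is generated. The sketch below is an assumption about how that could look, not the actual implementation: `Phrase`, `over_budget`, the 10% tolerance, and the characters-per-second rates are all hypothetical placeholders that would need calibrating against real TTS output.

```python
from dataclasses import dataclass

# Assumed average speaking rates in characters per second.
# Illustrative placeholders, not measured values.
CHARS_PER_SECOND = {"en": 15.0, "ru": 14.0, "tr": 14.0}


@dataclass
class Phrase:
    block: str    # compositional block: "hook", "pain", "proof" or "cta"
    start: float  # original start time, seconds
    end: float    # original end time, seconds
    text: str     # rewritten text in the target language


def estimated_seconds(text: str, lang: str) -> float:
    """Rough spoken duration of `text` under the assumed rate for `lang`."""
    return len(text) / CHARS_PER_SECOND[lang]


def over_budget(phrases: list[Phrase], lang: str, tolerance: float = 0.10) -> list[Phrase]:
    """Return phrases whose estimated duration exceeds the original slot plus tolerance."""
    flagged = []
    for p in phrases:
        budget = (p.end - p.start) * (1 + tolerance)
        if estimated_seconds(p.text, lang) > budget:
            flagged.append(p)
    return flagged


# Hypothetical rewrite of a two-phrase script.
script = [
    Phrase("hook", 0.0, 3.0, "Still counting calories?"),
    Phrase("pain", 3.0, 5.0, "I tried every single diet plan out there and nothing ever worked."),
]
for p in over_budget(script, "en"):
    print(f"too long for its slot: {p.block!r}")
```

Flagged phrases can be sent back to the model with a request to shorten them, which is cheaper than discovering the overrun after synthesis.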

03
TTS · CLONE

ElevenLabs — voice cloning with the right accent

ElevenLabs Multilingual v2 supports 29 languages in a single model, and the cloned voice speaks each of them with a native-sounding accent. This matters: if the original speaker is an American with a Southern accent, her clone speaking Serbian won't sound robotic but like a natural native Serbian speaker.

Two strategies: (a) voice cloning — a 30-second sample is enough for a high-quality clone; (b) stock voices for fast A/B by timbre (male/female, age, emotional tone). Stability / similarity / style settings are tuned per niche. Output — audio.wav matching the original's duration (important for the next step).
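The stability / similarity / style settings map onto the request body of the text-to-speech call. The sketch below only builds the HTTP request and does not send it. The endpoint path, the `xi-api-key` header, and the `voice_settings` field names reflect my understanding of the public ElevenLabs API and should be checked against the current reference; `build_tts_request`, the voice ID, and the default values are hypothetical.

```python
import json
import urllib.request

API_ROOT = "https://api.elevenlabs.io/v1/text-to-speech"


def build_tts_request(
    text: str,
    voice_id: str,
    api_key: str,
    stability: float = 0.5,
    similarity: float = 0.75,
    style: float = 0.0,
) -> urllib.request.Request:
    """Build (but do not send) a synthesis request for one phrase."""
    payload = {
        "text": text,
        "model_id": "eleven_multilingual_v2",
        "voice_settings": {
            "stability": stability,
            "similarity_boost": similarity,
            "style": style,
        },
    }
    return urllib.request.Request(
        f"{API_ROOT}/{voice_id}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"xi-api-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical voice ID and key, for illustration only.
req = build_tts_request("Zdravo! Ovo je test.", "VOICE_ID", "API_KEY", stability=0.4)
print(req.get_method(), req.full_url)
```

Sending the request with `urllib.request.urlopen(req)` would return encoded audio bytes. If the response is not already WAV, a conversion step (for example with ffmpeg) is needed before the lip-sync stage.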

04
LIP-SYNC

wav2lip — lip movement synchronization

This is the stage where the "magic" happens: the actress in the video starts speaking the new language as if the video had been re-shot. wav2lip takes the source video plus the new audio.wav and redraws the mouth region frame by frame so the lip movement matches the new speech. It needs a GPU, but that means hours of compute, not days at a studio.

Simple case (voice-over / off-screen): ffmpeg just replaces the audio track. Complex case (talking head): wav2lip or SadTalker for face sync. Output — finished mp4 for ad platforms (FB Ads / TikTok / VK Ads / Yandex Direct).
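The two cases can be expressed as a small command builder. This is a sketch under assumptions: the function names and file paths are hypothetical, the ffmpeg flags are standard stream-mapping options, and the wav2lip arguments follow the `inference.py` script from the public Wav2Lip repository (the checkpoint path is a placeholder). The commands are only constructed here, not executed.

```python
def audio_replace_cmd(video: str, audio: str, out: str) -> list[str]:
    """Voice-over case: keep the video stream, swap in the new audio track."""
    return [
        "ffmpeg", "-y",
        "-i", video,
        "-i", audio,
        "-map", "0:v:0",   # video from the first input
        "-map", "1:a:0",   # audio from the second input
        "-c:v", "copy",    # no video re-encode
        "-c:a", "aac",
        "-shortest",       # stop at the shorter of the two streams
        out,
    ]


def wav2lip_cmd(video: str, audio: str, out: str,
                checkpoint: str = "checkpoints/wav2lip_gan.pth") -> list[str]:
    """Talking-head case: run Wav2Lip inference from inside its repository."""
    return [
        "python", "inference.py",
        "--checkpoint_path", checkpoint,
        "--face", video,
        "--audio", audio,
        "--outfile", out,
    ]


def render_cmd(video: str, audio: str, out: str, talking_head: bool) -> list[str]:
    """Pick the cheap path unless a face is speaking on screen."""
    if talking_head:
        return wav2lip_cmd(video, audio, out)
    return audio_replace_cmd(video, audio, out)


print(" ".join(render_cmd("spy.mp4", "audio.wav", "out_sr.mp4", talking_head=False)))
```

Each list can be passed to `subprocess.run(cmd, check=True)`. Routing voice-over videos to the ffmpeg path keeps GPU time for the creatives that actually need face sync.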

03 · Stack

Technologies and infrastructure

Speech-to-Text
  • OpenAI Whisper API (whisper-1)
  • whisper-large-v3 (self-hosted, at large volumes)
  • SRT parsing for timecodes
LLM orchestration
  • GPT-4o / Claude Sonnet 4.5
  • few-shot prompts with hook → rewrite examples
  • structured outputs (JSON schema)
Text-to-Speech
  • ElevenLabs Multilingual v2
  • Voice cloning (30-second sample is enough)
  • Voice style settings: stability / similarity / style
Video / lip-sync
  • ffmpeg for audio-replace (voice-over)
  • wav2lip / SadTalker for talking head
  • Python orchestration + task queue
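
The "Python orchestration + task queue" item could look roughly like the sketch below, which uses only the standard library. The four stage functions are stand-in stubs (the real ones would call Whisper, the LLM, ElevenLabs, and ffmpeg or wav2lip), and all names here are hypothetical. A production setup might use a dedicated queue system instead of threads.

```python
import queue
import threading
from typing import Callable

Stage = Callable[[dict], dict]


# Stub stages: each takes a job dict and returns an enriched copy.
def stt(job: dict) -> dict:
    return {**job, "transcript": f"transcript of {job['video']}"}

def rewrite(job: dict) -> dict:
    return {**job, "script": f"[{job['lang']}] {job['transcript']}"}

def tts(job: dict) -> dict:
    return {**job, "audio": job["video"].replace(".mp4", f".{job['lang']}.wav")}

def render(job: dict) -> dict:
    return {**job, "output": job["video"].replace(".mp4", f".{job['lang']}.mp4")}


def run_pipeline(job: dict, stages: list[Stage]) -> dict:
    for stage in stages:
        job = stage(job)
    return job


def worker(jobs: queue.Queue, results: list, stages: list[Stage]) -> None:
    while True:
        job = jobs.get()
        if job is None:  # sentinel: no more work for this worker
            jobs.task_done()
            break
        try:
            results.append(run_pipeline(job, stages))
        except Exception as exc:  # keep one failed creative from stopping the batch
            results.append({**job, "error": str(exc)})
        finally:
            jobs.task_done()


def process_all(job_list: list[dict], stages: list[Stage], n_workers: int = 2) -> list[dict]:
    jobs: queue.Queue = queue.Queue()
    results: list[dict] = []
    threads = [threading.Thread(target=worker, args=(jobs, results, stages))
               for _ in range(n_workers)]
    for t in threads:
        t.start()
    for job in job_list:
        jobs.put(job)
    for _ in threads:
        jobs.put(None)  # one sentinel per worker
    jobs.join()
    for t in threads:
        t.join()
    return results


batch = [{"video": "master.mp4", "lang": lang} for lang in ("sr", "tr", "pl")]
for done in process_all(batch, [stt, rewrite, tts, render]):
    print(done.get("output", done.get("error")))
```

Results arrive in completion order, not submission order, so each job carries its own `video` and `lang` keys for identification.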
04 · Results

What changes in the funnel

Production time
4-6 hrs → ~30 min

per creative variation (audio-replace, no lip-sync)

Volume from 1 video
1-2 → 10-30 × 5+

variants × languages (1 actress → 5+ GEOs without re-shoot)

Cost per creative
$40-80 → ~$2

Whisper + GPT + ElevenLabs (at API rates, no lip-sync)

The main effect is iteration speed. An A/B test of 20 hooks instead of 2 launches in an evening, not a week. Winning combinations are identified in the first 24-48 hours, losers are turned off before the budget burns. CPI drops 15-30% from better hook-match with persona.
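
To put the figures side by side, here is a back-of-the-envelope calculation using only numbers quoted on this page: 8-10 funnels, 20-30 creatives per funnel per week, 4-6 hours and $40-80 per manual variation, versus ~30 minutes and ~$2 through the pipeline (audio-replace, no lip-sync). The helper name `span_product` is hypothetical, and the result is illustrative arithmetic, not measured data.

```python
def span_product(a: tuple, b: tuple) -> tuple:
    """Multiply two (low, high) ranges of non-negative values."""
    return (a[0] * b[0], a[1] * b[1])


funnels = (8, 10)
creatives_per_funnel = (20, 30)
weekly_creatives = span_product(funnels, creatives_per_funnel)

manual_hours = span_product(weekly_creatives, (4, 6))
pipeline_hours = span_product(weekly_creatives, (0.5, 0.5))
manual_cost = span_product(weekly_creatives, (40, 80))
pipeline_cost = span_product(weekly_creatives, (2, 2))

print("creatives per week:", weekly_creatives)
print("manual hours:", manual_hours, "| pipeline hours:", pipeline_hours)
print("manual cost, $:", manual_cost, "| pipeline cost, $:", pipeline_cost)
```

At this volume the manual workflow needs hundreds of production hours every week, which is why it stops being viable well before the budget does.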

05 · Where it fits

Where it works, where it doesn't

Good fit
  • Performance agencies with 5+ creative staff
  • Affiliate teams (nutra, e-com, sweepstakes, white GEOs)
  • In-house e-com marketing with regular UGC reels
  • Launches in 5+ countries simultaneously (multi-language from one master script)
  • Podcasters / infobusiness for short-form content cuts
Bad fit
  • Direct copying of other people's creatives (DMCA / Copyright Strike)
  • Regulated niches (medicine, pharma, finance): need media lawyers for compliance
  • Voice cloning without consent of the voice owner (banned by EU AI Act)
  • Long-form (10+ minute videos): TTS cost and editing time comparable to an expensive human actor

Ethical note: the pipeline is intended for scaling your own ideas and original scripts. Using others' videos and voices without permission violates Copyright and the AI Act. I transcribe others' creatives as research to understand the market, then create my own script, my own audio, and my own video.

Ready to start?

Audit for 5,000 ₽, with a concrete report and cost estimate

I'll tell you what to implement in your business first, what the payback will be, and whether you need AI for your task at all (sometimes you don't).

Or just send me your question and I'll reply within 2 hours