---
name: video-polish
description: ElevenLabs audio isolation + Gemini-aligned captions; integrated audio cleanup and captioning pipeline.
---

# video-polish

Cleans a video's audio with ElevenLabs Audio Isolation and burns in narratively grouped, word-level-aligned captions generated by Gemini 3.1 Pro.

The canonical library lives at /root/scripts/video_polish/video_polish.py. This skill wraps it. Do NOT modify the canonical lib from this skill — open a PR against /root/scripts/video_polish/ instead.
## Default invocation

```bash
ELEVENLABS_API_KEY=<key> \
python3 ~/.claude/skills/video-polish/scripts/video_polish.py \
  --input <path-in> \
  --output <path-out>
```
For Floom team videos, the default --known-names covers Federico, Falco, Floom, etc. Override only when the cast is different.
By default the pipeline speeds the audio + video up by 1.15x (pitch preserved via rubberband), which matches Federico's standard pacing for Rocketlist + Floom launch videos. Override with --speed 1.0 to keep the original tempo, or --speed 1.2 for tighter pacing.
## Required env vars

| Var | Source |
|-----|--------|
| ELEVENLABS_API_KEY | Pass in env or --eleven-key. Federico's key in ~/.config/elevenlabs.env if present, else ask. |
| GEMINI_API_KEY | Auto-loaded from /root/.config/gemini/floom.env. Override path with --gemini-env. |

If either is missing, the script fails fast at stage 2 (ElevenLabs) or stage 5 (Gemini name correction). Do not retry — surface to Federico.
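The precedence described above (CLI flag beats environment, environment beats a key file) can be sketched as follows. This is an illustrative helper, not the lib's actual API; `resolve_key` and its signature are assumptions.

```python
import os
from pathlib import Path

def resolve_key(cli_value, env_var, fallback_env_file=None):
    """Resolve an API key: CLI flag > environment > optional env file.

    `fallback_env_file` (e.g. ~/.config/elevenlabs.env) is assumed to
    contain KEY=value lines. Returns None when nothing resolves, so the
    caller can fail fast at the stage that needs the key.
    """
    if cli_value:
        return cli_value
    if os.environ.get(env_var):
        return os.environ[env_var]
    if fallback_env_file and Path(fallback_env_file).exists():
        for line in Path(fallback_env_file).read_text().splitlines():
            if line.startswith(env_var + "="):
                return line.split("=", 1)[1].strip()
    return None
```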
## Optional flags

| Flag | Default | When to change |
|------|---------|----------------|
| --known-names | Fede,Federico,Falco,Floom,YC,UltraRelevant,Max,Mahir | Add cast names + product names + acronyms specific to this video. Passing "" disables correction. |
| --speed | 1.15 | Pitch-preserving playback speed (rubberband). 1.15x is Federico's standard pacing for Rocketlist + Floom launch videos. Pass 1.0 to skip the speedup stage entirely. The audio is sped up BEFORE Whisper transcribes, so all caption timestamps are natively at the output speed; no scaling needed downstream. |
| --no-captions | off | Skip caption burn-in but still compute + cache captions.json. Use when downstream tooling (the Remotion starter) will render its own caption track. See REMOTION_PROTOCOL.md. |
| --whisper-model | small | Use base or tiny if the speaker is clear and you want speed. NEVER use medium or large on AX41 — they OOM-crash tmux (per CLAUDE.md). |
| --gemini-model | gemini-3.1-pro-preview | Use gemini-3-flash-preview only if Pro is throttled. Never drop to 2.x — global rule. |
| --workdir | /tmp/video_polish_work | Override only if /tmp is full or you want a different cache namespace. |
| --eleven-key | env ELEVENLABS_API_KEY | Use to scope a specific key per run. |
| --with-remotion | off | Auto-generate Remotion intro + outro bookends via Gemini storyboard. Wraps final output as intro.mp4 + body.mp4 + outro.mp4. |
| --cta-url | None | CTA URL for the Remotion outro EndCard (e.g. github.com/you/repo). Passed to storyboard.py and overrides Gemini's guess. |
| --brand-name | None | Brand name for the outro EndCard. Overrides Gemini's guess. |
| --storyboard-json | None | Path to a pre-built storyboard.json to skip Gemini storyboard generation entirely. |
| --no-review | off | Skip Gemini quality review after polish. By default the review runs automatically and prints a scored summary to stdout. |
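To see why --known-names matters, here is a toy stand-in for the proper-noun correction idea: it snaps single-token near-misses to the closest known name. The real stage 5 uses Gemini, not fuzzy matching; `correct_known_names` is purely illustrative.

```python
import difflib

def correct_known_names(words, known_names, cutoff=0.8):
    """Snap near-miss transcriptions to the closest known name.

    `known_names` is the comma-separated string passed as --known-names;
    an empty string disables correction, mirroring the flag's behavior.
    """
    known = [n for n in known_names.split(",") if n]
    out = []
    for w in words:
        match = difflib.get_close_matches(w, known, n=1, cutoff=cutoff)
        # keep the original word unless a confident match exists
        out.append(match[0] if match else w)
    return out
```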
Pipeline stages
| # | Stage | Output | |---|-------|--------| | 1 | extract_audio | mono mp3 48kHz | | 2 | isolate_voice | ElevenLabs voice-isolated mp3 (HTTP API call) | | 3 | normalize_voice | ffmpeg loudnorm I=-14 TP=-1 LRA=9 wav | | 3.5 | speedup_voice | rubberband pitch-preserving tempo at --speed (skipped when 1.0) | | 4 | transcribe | faster-whisper word-timestamped JSON (timestamps already at output speed) | | 5 | correct_transcription | Gemini fixes proper nouns + tech terms + number tokenization | | 6 | ai_caption_groups | Gemini 3.1 Pro narrative-grouped windows + emphasis words | | 7 | render_ass | Aegisub .ass with karaoke emphasis tags (skipped with --no-captions) | | 8 | assemble_video | ffmpeg burn-in subs + replace audio + setpts retime to match speed | | 9 | storyboard | Gemini-driven scene decision: palette, intro headline/subline, outro brand/tagline/cta (skipped without --with-remotion) | | 10 | render_remotion | Remotion renders AutoIntro + AutoOutro, then ffmpeg-concats intro + body + outro (skipped without --with-remotion) | | 11 | review | Gemini multimodal quality review: 8 strategic frames + LUFS + freeze probe -> scored JSON + terminal summary (skip with --no-review) |
Each stage caches independently. Re-running the same input skips completed stages. The cache key includes the speed factor, so different --speed values keep separate caches and never collide.
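The skip-if-cached behavior amounts to "run a stage only when its output artifact is missing". A minimal sketch, with an illustrative `run_stage` helper that is not the lib's actual API:

```python
from pathlib import Path

def run_stage(name, output_path, produce):
    """Run a pipeline stage only if its cached output is missing.

    `produce` writes the artifact to `output_path`; an existing file
    counts as a completed stage, so re-runs skip straight past it.
    """
    out = Path(output_path)
    if out.exists():
        print(f"[{name}] cached, skipping")
        return out
    print(f"[{name}] running")
    produce(out)
    return out
```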
## Cache

Layout: `<workdir>/.video_polish_cache/v<CACHE_VERSION>/<sha8(input_bytes)>/[speed_<S>/]`

Top level (per-input, speed-independent):
- .isolation_method
- audio.mp3, voice-isolated.mp3, voice-clean.wav

Per-speed subdir (speed_<S>/):
- voice-clean_s<NN>.wav (sped-up audio, when speed != 1.0)
- transcript.raw.json, transcript.json, captions.json, captions.ass

Cache version was bumped to v2 in May 2026 (added --speed). Old v1 caches are untouched until manually deleted; they will not be reused.
To force re-isolation on the same input (for example, to consume a fresh ElevenLabs run):

```bash
rm /tmp/video_polish_work/.video_polish_cache/v2/<hash>/voice-isolated.mp3
```

To nuke the whole cache for an input:

```bash
rm -rf /tmp/video_polish_work/.video_polish_cache/v2/<hash>
```
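To find `<hash>` for a given input, the digest can be recomputed from the input bytes. A sketch assuming the layout described above; `cache_dir` and the exact `speed_<S>` formatting are illustrative, not the lib's actual API:

```python
import hashlib
from pathlib import Path

CACHE_VERSION = 2  # assumption: matches the v2 layout described above

def cache_dir(workdir, input_path, speed=1.0):
    """Derive the per-input cache dir: sha8 of the input *bytes*,
    plus a speed_<S> subdir when the tempo is changed."""
    digest = hashlib.sha256(Path(input_path).read_bytes()).hexdigest()[:8]
    base = Path(workdir) / ".video_polish_cache" / f"v{CACHE_VERSION}" / digest
    return base / f"speed_{speed}" if speed != 1.0 else base
```

Because the hash is over bytes rather than filenames, renaming or moving the input file still hits the same cache.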
## Verification (mandatory before claiming done)

- Duration parity — ffprobe output ≈ source_duration / speed (within 100ms). At default --speed 1.15, a 65s source becomes ~56.5s.
- LUFS — info["lufs"] should land in -14 ± 2. Speed-up via rubberband can pull it slightly low; if more than 2 LUFS off, re-run loudnorm with two-pass (not yet wired).
- Isolation method — info["isolation_method"] == "elevenlabs". If it says ffmpeg-arnndn the API call failed; surface and stop.
- Caption visibility (when captions are burned in) — extract 3 frames at varied timestamps (~10%, 50%, 90% of duration), READ them with the Read tool, confirm captions are visible AND match the transcript at those timestamps.
- Lip-sync sanity — pick one random timestamp, check the audio peak vs visible mouth motion. Off by >150ms = stop and surface.

```bash
# extract 3 frames
ffmpeg -ss <t> -i <output> -frames:v 1 -y /tmp/frame_<t>s.png
```
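The mechanical half of this checklist (duration, LUFS, isolation method) can be scripted; the frame and lip-sync checks still need eyes. A sketch with an illustrative `check_polish` helper, assuming the tolerances stated above:

```python
def check_polish(source_s, output_s, speed, lufs, isolation_method):
    """Return a list of failed mechanical checks (empty = pass)."""
    failures = []
    if abs(output_s - source_s / speed) > 0.1:   # duration parity, 100ms
        failures.append("duration")
    if abs(lufs - (-14.0)) > 2.0:                # loudnorm target -14 +/- 2
        failures.append("lufs")
    if isolation_method != "elevenlabs":         # arnndn means the API call failed
        failures.append("isolation")
    return failures
```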
## Automatic intro/outro with --with-remotion
Pass --with-remotion (plus optionally --brand-name and --cta-url) to get a fully rendered final video with animated bookends in one command:
```bash
python3 video_polish.py \
  --input source.mov \
  --output final.mp4 \
  --with-remotion \
  --brand-name "iContext" \
  --cta-url "github.com/you/repo"
```
### How storyboard auto-decision works

scripts/storyboard.py calls Gemini 3.1 Pro with a JSON-schema-enforced prompt that reads the full transcript and outputs:

```json
{
  "palette": "dark-orange",
  "intro": {"headline": "I solved *AI memory*.", "subline": "AI forgets everything until it doesn't.", "duration_s": 3.0},
  "outro": {"brand": "iContext", "tagline": "Your AI will never forget anything again.", "cta": "github.com/federicodeponte/icontext", "duration_s": 3.0}
}
```

- palette: one of dark-orange | dark-green | cream | ink. Gemini picks based on content topic.
- intro.headline: hook extracted from the first 3 sentences. Words wrapped in asterisks render in accent color.
- outro.cta: overridden by --cta-url if passed; otherwise Gemini guesses from transcript.
- Result cached at <cache>/<speed_dir>/storyboard.json (idempotent via sha256 of inputs).
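The override precedence (CLI flags beat Gemini's guesses) can be sketched as a small loader. `load_storyboard` is illustrative, not the actual storyboard.py API:

```python
import json

ALLOWED_PALETTES = {"dark-orange", "dark-green", "cream", "ink"}

def load_storyboard(raw_json, cta_url=None, brand_name=None):
    """Parse a storyboard.json and apply CLI overrides on top of
    Gemini's output, mirroring --cta-url / --brand-name precedence."""
    sb = json.loads(raw_json)
    if cta_url:
        sb["outro"]["cta"] = cta_url
    if brand_name:
        sb["outro"]["brand"] = brand_name
    if sb["palette"] not in ALLOWED_PALETTES:
        raise ValueError(f"unknown palette: {sb['palette']}")
    return sb
```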
scripts/render_remotion.py then copies the remotion-starter, writes Root.tsx + index.ts, runs npx remotion render for AutoIntro + AutoOutro, and ffmpeg-concats the three segments. Remotion render output cached at <workdir>/.remotion_render/<hash>/ with a .done stamp.
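The final join uses ffmpeg's concat demuxer, which stream-copies and therefore requires matching codec/fps/resolution across segments. A sketch that only builds the list file and argv (the actual render_remotion.py invocation may differ):

```python
def concat_command(intro, body, outro, final_out, list_path="concat.txt"):
    """Build the concat-demuxer list file contents and ffmpeg argv
    used to join intro + body + outro without re-encoding."""
    listing = "".join(f"file '{p}'\n" for p in (intro, body, outro))
    argv = ["ffmpeg", "-f", "concat", "-safe", "0",
            "-i", list_path, "-c", "copy", "-y", final_out]
    return listing, argv
```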
## Verification checklist (--with-remotion)

- Duration = source_duration / speed + intro_duration_s + outro_duration_s (within 200ms)
- Intro frame at t=1.5s — shows headline in large type with accent color on emphasized word
- Body frame at t=(intro_dur + 2)s — facecam with captions
- Outro frame at t=(total - 1.5)s — brand name, tagline, and cta visible
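The duration check above is a one-line formula; spelled out as a hypothetical helper:

```python
def expected_total_s(source_s, speed, intro_s, outro_s):
    """Expected final duration with bookends: the body is retimed by
    1/speed, then the intro and outro durations are added."""
    return source_s / speed + intro_s + outro_s
```

For the default 1.15x speed with 3s bookends, a 65s source should land near 62.5s.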
## Adding Remotion visuals (b-roll, motion graphics, branded scenes)
When Federico wants the polished video to feel "more produced" — animated problem cards, terminal reveals, demo stills, shareable-link moments, end cards — bolt a Remotion layer on top of the polished facecam.
The starter project at /root/scripts/video_polish/remotion-starter/ ships six scene components, a circular face-cam PIP overlay, and an optional Remotion-rendered caption track. Follow REMOTION_PROTOCOL.md (in this skill's directory) step by step. It is reproducible: any agent can author visuals without inventing the structure each time.
When using Remotion, decide up front whether to keep captions burned into the polished video (default, simpler) or to render captions in Remotion (cleaner control). For the latter, run video_polish.py --no-captions and enable the caption overlay in Composition.tsx.
## Common follow-ups

- Per-speaker resync: if the input was multi-camera, polish each angle separately, then concat. Re-run the polish per cut.
- Intro/outro bookends from a parallel recording: see the audio cross-correlation block in the canonical README at /root/scripts/video_polish/README.md. Matching codec/fps/resolution is mandatory or concat will stutter.
- Vertical reformat (9:16): pad/scale the polished output. Captions are baked in so they will recenter — if margin gets ugly, re-run with a vertical-aware ASS style override (not yet wired into the lib; surface it).
- Floom-app wrapping: see /root/scripts/video_polish/floom-app/. polish_video() is the handler-shaped entry point.
- Adding b-roll / motion graphics: see REMOTION_PROTOCOL.md.
## Auto-review

Every polish run automatically scores the output with scripts/review_video.py. The review:

- Probes metadata (duration, dimensions, LUFS integrated, freeze detection).
- Extracts 8 strategic frames (intro mid, intro/body seam, body 25/50/75%, body/outro seam, outro mid, end).
- Calls Gemini 3.1 Pro multimodal with all frames + metadata + storyboard + transcript excerpt.
- Enforces score = MIN(all 6 dimension scores) client-side (anti-anchoring).
- Applies verdict rules: ship if MIN >= 9 and no critical flaws; fail if MIN < 6 or any critical flaw; else iterate.
- Writes <output>.review.json and prints a terminal summary.
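The verdict rules reduce to a few lines; a sketch with an illustrative `review_verdict` helper (the real review_video.py may structure this differently):

```python
def review_verdict(scores, critical_flaws):
    """Apply the verdict rules: overall score is the MIN of the six
    dimension scores (anti-anchoring), then threshold into a verdict."""
    score = min(scores)
    if critical_flaws or score < 6:
        return score, "fail"
    if score >= 9:
        return score, "ship"
    return score, "iterate"
```

Taking the MIN client-side prevents the model from anchoring on a high average while one dimension quietly fails.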
To run standalone (without re-running the full pipeline):

```bash
python3 ~/.claude/skills/video-polish/scripts/review_video.py \
  --video /path/to/output.mp4 \
  --storyboard /path/to/storyboard.json \
  --out /path/to/review.json
```

To skip (e.g. when iterating quickly):

```bash
python3 video_polish.py --input source.mov --output final.mp4 --no-review
```
## Constraints

- Don't loop on stage failure. If a stage fails twice, STOP and surface the log. Common causes: ElevenLabs credits exhausted, Gemini throttle, fonts missing, rubberband not on PATH (fallback to ffmpeg filter is automatic).
- Don't modify the canonical lib from this skill. All edits go to /root/scripts/video_polish/.
- Cache hash collisions across projects are impossible because the hash is over input bytes — different videos = different caches automatically.
- Speed is part of the cache key. Re-running with a different --speed makes a fresh cache subdir; old caches stay until manually deleted.
- No medium/large Whisper on AX41 — OOM kills tmux.
## Reference run
The introducing-icontext.mov polish (May 2026) ran a 65s 1620x1080 H264 input through the full pipeline at --speed 1.15 in ~140s wallclock with whisper-small + Gemini 3.1 Pro. Output landed at 56.5s, LUFS -17.8, isolation_method elevenlabs, 16 caption groups.
The same video was then processed with --with-remotion --brand-name iContext --cta-url github.com/federicodeponte/icontext in ~92s additional wallclock. Final output: 62.6s total (3s intro + 56.5s body + 3s outro), LUFS -15.4, dark-orange palette, headline "I solved AI memory.", outro brand "iContext".