AI Media Pipelines: UXR Platform & Dubbing Studio
Both of these projects start with the same problem: a human expert sitting in front of hours of video, doing tedious manual work that AI can mostly handle — but “mostly” isn’t good enough when the stakes are high. UX researchers can’t ship hallucinated insights. Dubbing editors can’t ship a mangled translation. So both systems follow the same pattern: AI does the heavy processing, and a purpose-built interface makes every AI decision visible, auditable, and editable.
Agentic UXR Platform
A UX researcher uploads a 2-hour interview video. Five minutes later they have: a verbatim transcript with speaker identification, participant personas, semantic chapters, product walkthrough maps, pain points backed by timestamped evidence quotes, and a synthesized report — all quality-checked by an adversarial judge agent. The previous workflow took about 4 hours of manual analysis per study. 154 commits, piloted with 2 UXR teams.
How it works
Seven specialized agents run sequentially, orchestrated by a KnowledgeSupervisor:
| Agent | Job |
|---|---|
| TranscriberAgent | Gemini watches the video, outputs JSON transcript with speaker IDs and timestamps |
| PersonaAnalyzerAgent | Identifies participant roles from intro segments and multimodal cues |
| ChapterizerAgent | Finds structural boundaries — topic shifts, task transitions |
| WalkthroughMapperAgent | Maps UI interaction steps from screen-share segments |
| InsightExtractorAgent | Pulls pain points and positive signals. Every insight requires a verbatim quote and timestamp |
| SynthesizerAgent | Assembles everything into a final report |
| JudgeAgent | Adversarial audit — scores grounding fidelity and flags unsupported claims |
Multimodal transcription and speaker diarization
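In sketch form, the orchestration is a sequential loop over agents that share accumulated state. This is a minimal illustration, not the actual implementation; the `Agent` protocol and `PipelineState` shape are assumed:

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class PipelineState:
    """Accumulates each agent's output for the agents downstream."""
    artifacts: dict = field(default_factory=dict)


class Agent(Protocol):
    name: str

    def run(self, state: PipelineState) -> dict: ...


class KnowledgeSupervisor:
    """Runs the seven agents in order, Transcriber first, Judge last."""

    def __init__(self, agents: list[Agent]):
        self.agents = agents

    def run(self, state: PipelineState) -> PipelineState:
        for agent in self.agents:
            # Each agent reads prior artifacts (e.g. the transcript)
            # and writes its own output under its name.
            state.artifacts[agent.name] = agent.run(state)
        return state
```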
The hard parts
Making 2-hour videos work. The video is ingested once and cached via Vertex AI Context Caching for 1 hour. All agents run against the cache — 10x faster, significantly cheaper than re-sending the video per step. And instead of dumping the full transcript into every prompt, the supervisor dispatches specific slices via index ranges (EnvPointers). This solves the “Lost in the Middle” problem where models lose track of content buried in long contexts.
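Both tricks in sketch form, using the google-genai SDK. The model name, bucket path, and `EnvPointer` shape are placeholders; only the `caches` / `generate_content` calls reflect the real SDK surface:

```python
from dataclasses import dataclass

from google import genai
from google.genai import types

client = genai.Client(vertexai=True, project="my-project", location="us-central1")

# Ingest the video once; every agent reuses this cache for an hour.
cache = client.caches.create(
    model="gemini-1.5-pro-002",
    config=types.CreateCachedContentConfig(
        contents=[types.Content(role="user", parts=[
            types.Part.from_uri(file_uri="gs://bucket/interview.mp4",
                                mime_type="video/mp4"),
        ])],
        ttl="3600s",
    ),
)

# Agents prompt against the cached video instead of re-sending it.
response = client.models.generate_content(
    model="gemini-1.5-pro-002",
    contents="List every UI interaction in the screen-share segments.",
    config=types.GenerateContentConfig(cached_content=cache.name),
)

# An EnvPointer hands an agent a transcript slice, not the whole thing.
@dataclass
class EnvPointer:
    start: int  # first transcript entry, inclusive
    end: int    # last transcript entry, exclusive

def slice_for_agent(transcript: list[dict], ptr: EnvPointer) -> list[dict]:
    return transcript[ptr.start:ptr.end]
```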
Earning researcher trust. The TraceLog component shows every agent thought and tool call in real-time — researchers can see exactly why the system extracted a particular insight. The EvidenceViewer displays multimodal reasoning with verbatim visual evidence. HITL checkpoints pause the pipeline so researchers can verify transcript accuracy and extracted insights before the next stage runs. The Prompt Depot gives non-technical researchers a versioned prompt library with AI-assisted editing and diff comparison, so they can tune the system’s behavior without touching code.
Pipeline orchestration and real-time trace view
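A checkpoint can be as simple as persisting the stage output and refusing to run the next stage until an approval call flips the status. A hypothetical sketch with illustrative names and file-based storage:

```python
import json
from pathlib import Path

CHECKPOINT_DIR = Path("checkpoints")

def pause_for_review(run_id: str, stage: str, artifacts: dict) -> None:
    """Persist a stage's output and halt until a human approves it."""
    CHECKPOINT_DIR.mkdir(exist_ok=True)
    payload = {"stage": stage, "status": "awaiting_review", "artifacts": artifacts}
    (CHECKPOINT_DIR / f"{run_id}.json").write_text(json.dumps(payload))

def approve(run_id: str) -> dict:
    """Called from the UI; returns the artifacts so the next stage can run."""
    path = CHECKPOINT_DIR / f"{run_id}.json"
    payload = json.loads(path.read_text())
    payload["status"] = "approved"
    path.write_text(json.dumps(payload))
    return payload["artifacts"]
```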
Infrastructure choices. Dual-service Cloud Run deployment: a UI+API service that scales to zero, and a persistent worker polling a GCS task queue (min 2 instances, no cold starts). Separating the two prevents analysis tasks from dying during UI redeployments. I chose Google’s GenAI SDK over ADK for direct Context Caching access and full traceability — documented the comparison in docs/GENAI_VS_ADK.md.
Insight extraction with evidence grounding
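The worker half of that queue is a plain polling loop. The sketch below assumes tasks are JSON blobs under a `tasks/pending/` prefix, claimed by rename; the layout and claim scheme are assumptions, not the project's actual design:

```python
import json
import time

from google.cloud import storage

client = storage.Client()
bucket = client.bucket("uxr-task-queue")

def run_analysis(task: dict) -> None:
    ...  # hand off to the seven-agent pipeline (not shown)

def poll_forever() -> None:
    while True:
        for blob in client.list_blobs(bucket, prefix="tasks/pending/"):
            # Claim the task by moving it out of pending/ so the other
            # worker instance won't pick it up too.
            claimed = bucket.rename_blob(
                blob, blob.name.replace("pending/", "running/"))
            run_analysis(json.loads(claimed.download_as_text()))
        time.sleep(5)  # poll interval
```

One caveat: a GCS rename is copy-then-delete under the hood, so a strict claim with two worker instances would also want a generation-match precondition on the rename.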
Dubbing Editability V1
Same pattern, different domain. Upload an English video, and the system transcribes it, identifies speakers, translates to target language, and generates dubbed audio. A three-panel editor lets a human editor fix what the AI got wrong. 311 commits. No designer assigned — I defined the user journeys, designed the editor, and built the full stack in 2 weeks.
The editor
Three panels, always visible:
- VideoPlayer — playback synced with timeline and transcript
- ScriptEditor — inline editable transcript cards per clip. Edit a translation and the clip gets flagged `isStale` for regeneration
- Timeline — canvas-based waveform visualization with drag-and-drop handles for split/merge and time alignment
- InspectorPanel — voice model, prosody, and emotion controls per clip
Three user journeys: fix a bad translation (Linguistic Correction), adjust timing and lip-sync (Prosody & Lip-Sync), and review AI-flagged quality issues (Proactive Quality Alerts).
Dubbing studio — three-panel editor with waveform timeline
The hybrid TTS/STS workflow
This was the interesting technical problem. You want the initial dub to sound natural — Speech-to-Speech (via Gemini Live) preserves the original speaker’s emotion and cadence. But when an editor changes a word, you can’t re-run STS on the whole clip. You need controlled TTS that respects the prosody data extracted during the original STS pass — duration, pitch contour, voice profile — so the re-dubbed clip still sounds like it belongs in the same sentence.
The `is_stale` flag on each `TimelineClip` makes this surgical: only edited clips get regenerated. Change one word in a 90-minute video and you re-dub that one clip, not the whole project.
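At the data level this is roughly a clip record carrying its STS prosody plus the staleness bit, and a pass that re-synthesizes only flagged clips. Field and function names here are assumptions, and `tts` stands in for a generic synthesis client:

```python
from dataclasses import dataclass

@dataclass
class TimelineClip:
    clip_id: str
    start_s: float
    end_s: float
    translated_text: str
    # Prosody extracted during the original STS pass, reused so a
    # re-dubbed clip still matches its neighbors.
    duration_s: float
    pitch_contour: list[float]
    voice_profile: str
    is_stale: bool = False

def regenerate_stale(clips: list[TimelineClip], tts) -> None:
    """Re-dub only the edited clips; untouched audio is left alone."""
    for clip in clips:
        if clip.is_stale:
            # Result is written back to the clip's audio track (not shown).
            tts.synthesize(
                text=clip.translated_text,
                voice=clip.voice_profile,
                target_duration_s=clip.duration_s,   # fit the original slot
                pitch_contour=clip.pitch_contour,    # match the STS prosody
            )
            clip.is_stale = False
```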
The pipeline and deployment
5-stage media pipeline (FastAPI): ingestion → FFmpeg audio extraction → Pyannote diarization + Whisper transcription → Vertex AI Translation → hybrid TTS/STS dubbing. Six-service CI/CD chain: GitHub Enterprise → Developer Connect → Cloud Build → Artifact Registry → Cloud Run → Vertex AI. Multi-stage Docker build producing a single image that serves both the Vue frontend and FastAPI backend.
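The first three stages, condensed into a sketch; the model size and the midpoint-overlap speaker assignment are simplifications, not the pipeline's exact logic:

```python
import subprocess

import whisper
from pyannote.audio import Pipeline

def extract_audio(video_path: str, wav_path: str) -> None:
    # FFmpeg pulls mono 16 kHz audio for the speech models.
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-ac", "1", "-ar", "16000", wav_path],
        check=True,
    )

def transcribe_and_diarize(wav_path: str) -> list[dict]:
    # Whisper supplies the words, Pyannote supplies who said them.
    asr = whisper.load_model("medium").transcribe(wav_path)
    diarizer = Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.1", use_auth_token="HF_TOKEN")
    diarization = diarizer(wav_path)

    segments = []
    for seg in asr["segments"]:
        # Assign the speaker whose turn covers this segment's midpoint.
        mid = (seg["start"] + seg["end"]) / 2
        speaker = next(
            (label for turn, _, label in diarization.itertracks(yield_label=True)
             if turn.start <= mid <= turn.end),
            "UNKNOWN",
        )
        segments.append({"start": seg["start"], "end": seg["end"],
                         "speaker": speaker, "text": seg["text"]})
    return segments
```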
ScaleCM: Usability Benchmark Dashboard
A different kind of problem but the same impulse — data was trapped in a format that made it hard to act on. Google’s release-critical usability benchmarking lived in a massive multi-tab spreadsheet. I kept the spreadsheet as the data source (researchers were comfortable with it) but built a web dashboard for leadership: product health scores, CUJ success rates, pillar-level deep dives across Gemini, Agents, and APIs.
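Reading straight from the sheet keeps the researchers' workflow untouched. A sketch with gspread, where the spreadsheet, tab, and column names are placeholders:

```python
import gspread

gc = gspread.service_account()
rows = gc.open("Usability Benchmark").worksheet("CUJ Results").get_all_records()

def cuj_success_rate(pillar: str) -> float:
    """Share of critical-user-journey runs marked Pass for one pillar."""
    runs = [r for r in rows if r["Pillar"] == pillar]
    passed = sum(1 for r in runs if r["Result"] == "Pass")
    return passed / len(runs) if runs else 0.0
```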
ScaleCM — KPI scoreboard and trend analysis
Product pillar deep-dive
Resource directory