Script Ingestion Pipeline Architecture (March 2026)
This document outlines the architecture for the Script Ingestion Pipeline, a critical feature that converts raw PDF scripts into structured, interactive data (Acts, Scenes, Characters, Dialogue, Props, Sound Cues). The pipeline supports two modes: Project Import (collaborative, production-level) and Personal Import (individual, for On Book Rehearsal).
1. High-Level Flow
The pipeline uses a Hybrid Architecture combining Mistral OCR (for OCR) and Vertex AI Gemini 2.5 Pro (for semantic extraction via context caching).
Steps:
- Trigger: User uploads a PDF and calls startOcrJob (project) or processPersonalScript (personal).
- OCR (The Eyes): Mistral OCR converts the PDF to Markdown text via signed URL.
- Context Caching: The full script text is cached in Vertex AI (reducing token costs by ~90% for subsequent calls).
- Dispatch: The coordinator dispatches parallel extraction via PubSub topics (extract-batch for blocks, extract-metadata for structure).
- Block Extraction: PubSub workers extract ScriptBlock[] in parallel batches.
- Structure & Elements: Structure is extracted first; the structure worker then chains props and sound cue extraction with resolved scene IDs.
- Merge & Persist: The final batch worker merges all block batches and saves scriptBlocks.json.gz. Metadata workers save classifiedStructure.json, extractedProps.json, and extractedSoundCues.json.
2. Trigger Integration
A. Project Uploads (startOcrJob)
- Source: functions/src/script-import/process-ocr.ts
- Trigger: onCall (Callable Cloud Function, invoked by client after PDF upload).
- Inputs: { projectId, storagePath, forceReOcr? }
- OCR Reuse: Before running Mistral OCR, checks the OCR bucket for cached results. If found (and forceReOcr is false), skips OCR and proceeds directly to extraction dispatch.
- Action: Runs Mistral OCR synchronously → Caches raw OCR JSON → Dispatches parallel extraction via PubSub.
- Output: Extraction results saved to gs://{defaultBucket}/projects/{projectId}/system/.
- Status: Creates a document in script_jobs collection with status processing.
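The OCR reuse rule above reduces to a small decision. The following is a minimal sketch, not the actual implementation: decideOcrAction and the cachedOcrExists flag are hypothetical names introduced for illustration, and the real coordinator presumably looks up the cached file in the OCR bucket itself.

```typescript
// Request shape mirroring the documented inputs of startOcrJob.
interface StartOcrJobRequest {
  projectId: string;
  storagePath: string;
  forceReOcr?: boolean;
}

type OcrAction = "reuse-cached-ocr" | "run-mistral-ocr";

// Hypothetical helper: decide whether to reuse cached OCR or run a fresh pass.
// `cachedOcrExists` stands in for a lookup in the OCR bucket (assumption).
function decideOcrAction(req: StartOcrJobRequest, cachedOcrExists: boolean): OcrAction {
  const force = req.forceReOcr ?? false;
  return cachedOcrExists && !force ? "reuse-cached-ocr" : "run-mistral-ocr";
}
```

Under this sketch, a missing forceReOcr behaves like false, which matches the "or without, to reuse cached OCR" behavior described for manual re-analysis.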
B. Personal Uploads (processPersonalScript)
- Source: functions/src/script-import/process-personal.ts
- Trigger: onCall (Callable Cloud Function, invoked by client).
- Inputs: { scriptId, storagePath, title }
- Action: Runs Mistral OCR → Uses the shared extraction core (extractFromPages) with self-managed cache (no PubSub) → Saves blocks, structure, and characterHints to personal namespace.
- Output: Saves scriptBlocks.json.gz and structure.json to personalScripts/{userId}/scripts/{scriptId}/.
- Status: Creates a document in personalScriptJobs (root collection, not nested under user).
- Scope: Runs both block extraction and structure extraction (no props/sound cues). Includes characterHints for voice profile generation.
C. Manual Re-Analysis
Re-analysis is handled by calling startOcrJob again with forceReOcr: true (or without, to reuse cached OCR). No separate debug endpoint is required.
3. Shared Core Module (core/)
The functions/src/script-import/core/ directory contains shared helpers consumed by both the project and personal pipelines. This centralization eliminates duplication and ensures consistent behavior.
core/mistral-ocr.ts — Mistral OCR Client
- runMistralOcr(gcsUri): Runs Mistral OCR on a PDF via GCS signed URL. Returns OCRPage[] and raw response for caching. Uses tableFormat: null, includeImageBase64: false for lean payloads.
- saveMistralOcrResult(bucketName, projectId, rawResponse): Saves raw OCR JSON to ocr-results/{projectId}/mistral-ocr.json for reuse/repair.
- loadCachedOcrPages(bucketName, projectId): Loads cached OCR pages from GCS. Used by OCR reuse and repair flows.
- Singleton Client: Lazy-initialized Mistral client from MISTRAL_API_KEY secret.
core/ocr-helpers.ts — Backward Compatibility
Re-exports runMistralOcr, saveMistralOcrResult, loadCachedOcrPages from mistral-ocr.ts. The former Document AI helpers (submitBatchOcr, pollDocAiOperation, downloadShards) have been removed.
core/extract.ts — AI Extraction Pipeline
- extractFromPages(options): Core extraction pipeline used by the personal pipeline. Takes OCR pages, creates (or uses external) context cache, runs blocks + structure extraction in parallel, and returns ExtractionResult.
- buildBatches(pages, pagesPerBatch): Pure function that splits pages into batch descriptors.
- parseBlockResponse(raw): Parses Gemini response into ScriptBlock[]. With responseSchema enforcement, output is guaranteed to be { scriptBlocks: [...] }; backward-compatible parsing is retained for safety.
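To illustrate the batching step, here is a minimal sketch of what a pure buildBatches function could look like. The BatchDescriptor shape is an assumption modeled on the fields carried by the extract-batch messages (batchNum, startPage, endPage); the real descriptor may differ.

```typescript
interface OCRPage {
  pageNumber: number;
  text: string;
}

// Assumed descriptor shape; not confirmed by the source.
interface BatchDescriptor {
  batchNum: number;  // 1-based batch index
  startPage: number; // pageNumber of the first page in the batch
  endPage: number;   // pageNumber of the last page in the batch
  pages: OCRPage[];
}

// Split pages into consecutive batches of at most `pagesPerBatch` pages.
function buildBatches(pages: OCRPage[], pagesPerBatch: number): BatchDescriptor[] {
  if (pagesPerBatch < 1 || Math.floor(pagesPerBatch) !== pagesPerBatch) {
    throw new RangeError("pagesPerBatch must be a positive integer");
  }
  const batches: BatchDescriptor[] = [];
  for (let i = 0; i < pages.length; i += pagesPerBatch) {
    const slice = pages.slice(i, i + pagesPerBatch);
    batches.push({
      batchNum: batches.length + 1,
      startPage: slice[0].pageNumber,
      endPage: slice[slice.length - 1].pageNumber,
      pages: slice,
    });
  }
  return batches;
}
```

Because the function has no side effects, the same input always yields the same batch boundaries, which is what lets a repair run re-dispatch a failed batch by number.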
core/ai-schemas.ts — JSON Schema Constraints
Defines both Zod validators (for testing) and JSON Schema constants (for Gemini responseSchema) for all extraction types:
- BLOCK_EXTRACTION_JSON_SCHEMA: Blocks with id, pageNumber, blockIndex, type, text, characterId, lineNumber. Follows the One-Line-Per-Block standard: each script line maps to exactly one ScriptBlock, eliminating the former preserveLineBreaks field and the corresponding multi-line block ambiguity.
- STRUCTURE_EXTRACTION_JSON_SCHEMA: Acts, scenes (with song/parentScene support), characters (with optional definition for voice profiles).
- PROPS_EXTRACTION_JSON_SCHEMA: Props with typed categories (Hand, Set, Personal, Consumable).
- SOUND_CUES_EXTRACTION_JSON_SCHEMA: Sound cues with description, cueLine, cueNumber.
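As an illustration of how schema-enforced output might be consumed, the sketch below parses a block response. It is not the project's parseBlockResponse: the field types are assumptions, only a subset of the listed fields is checked, and treating a bare array as the legacy shape is a guess at what "backward-compatible parsing" covers.

```typescript
// Subset of the block fields listed above; the field types are assumptions.
interface ParsedBlock {
  id: string;
  pageNumber: number;
  blockIndex: number;
  type: string;
  text: string;
  characterId: string | null;
}

// Accepts the schema-enforced shape { scriptBlocks: [...] } and, as a
// fallback, a bare array (assumed here to be the legacy shape).
function parseBlockResponse(raw: string): ParsedBlock[] {
  const parsed: unknown = JSON.parse(raw);
  const candidate = Array.isArray(parsed)
    ? parsed
    : (parsed as { scriptBlocks?: unknown } | null)?.scriptBlocks;
  if (!Array.isArray(candidate)) {
    throw new Error("Response does not contain a scriptBlocks array");
  }
  return candidate.map((item, i) => {
    const b = item as Partial<ParsedBlock> | null;
    if (
      !b ||
      typeof b.id !== "string" ||
      typeof b.type !== "string" ||
      typeof b.text !== "string" ||
      typeof b.pageNumber !== "number" ||
      typeof b.blockIndex !== "number"
    ) {
      throw new Error("Block " + i + " is missing a required field");
    }
    return {
      id: b.id,
      pageNumber: b.pageNumber,
      blockIndex: b.blockIndex,
      type: b.type,
      text: b.text,
      characterId: b.characterId ?? null, // non-character lines carry no speaker
    };
  });
}
```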
core/types.ts — Shared Type Definitions
- OCRPage: { pageNumber, text }
- ScriptBlock: Full block interface with actId, sceneId, characterId (nullable).
- ShowStructure: Acts, scenes, characters (characters include optional definition for voice hints).
- ExtractionResult: Combined result with blocks, structure, characterHints, failedBatches.
- ScriptTarget: 'project' | 'personal' namespace discriminator.
4. The Project Pipeline (startOcrJob + PubSub Workers)
Source: functions/src/script-import/process-ocr.ts, functions/src/script-import/pubsub-workers.ts
The project pipeline uses a PubSub fan-out architecture: the coordinator (startOcrJob) runs OCR synchronously, creates a context cache, then dispatches all extraction tasks as independent PubSub messages. Workers execute in parallel Cloud Function instances.
Key Components
A. OCR (Mistral OCR)
- Model: mistral-ocr-2512 (Mistral OCR 3).
- Access: PDF accessed via GCS signed URL (15-minute expiry), not base64, keeping memory lean.
- Output: Markdown text per page (no HTML tables). Raw JSON response cached in OCR bucket.
- Reuse: Cached OCR results are checked before running fresh OCR. forceReOcr: true bypasses the cache.
B. Context Caching (cache-utils.ts)
To process a 100+ page script efficiently, we create a Vertex AI Context Cache.
- Model: gemini-2.5-pro (supports long context & caching).
- Cache Content: The FULL Markdown text of the script (from Mistral OCR).
- TTL: 60 minutes.
- Benefit: We pay the "input token" cost once. All subsequent API calls for blocks, structure, props, and sound cues pay a fraction of the cost.
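The cost benefit can be made concrete with a rough model. This is a back-of-the-envelope sketch only: it assumes cached tokens are billed at a 90% discount (mirroring the ~90% figure quoted earlier), ignores cache storage fees and the per-call prompt tokens, and uses hypothetical function names. Costs are expressed in "full-price token equivalents" rather than currency.

```typescript
// Relative input cost when every call re-sends the whole script.
function inputCostWithoutCache(scriptTokens: number, calls: number): number {
  return scriptTokens * calls;
}

// Relative input cost with a context cache: the script is billed once at full
// price when the cache is created, then each call reads it at a discount.
// `discountPercent` = 90 is an assumption taken from the ~90% figure.
function inputCostWithCache(scriptTokens: number, calls: number, discountPercent: number = 90): number {
  const createCost = scriptTokens;
  const perCallCost = (scriptTokens * (100 - discountPercent)) / 100;
  return createCost + perCallCost * calls;
}
```

For a 100-page script at 7 pages per batch (15 block calls plus structure, props, and sound cues, so 18 calls) and an assumed 80,000-token script, this model gives 224,000 token equivalents with the cache versus 1,440,000 without it, roughly 84% lower. The saving approaches the full discount as the number of calls grows.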
C. PubSub Dispatch (dispatchExtractionPipeline)
After OCR and cache creation, the coordinator publishes:
- N extract-batch messages: One per batch of pages (7 pages/batch). Each carries { jobId, projectId, cacheName, batchNum, startPage, endPage }.
- 1 extract-metadata message (type structure): Triggers structure extraction, which chains props and sound cues after completion.
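The fan-out described above can be sketched as a pure planning step that produces the messages to publish. planDispatch is a hypothetical name, the metadata message fields other than type are assumptions, and pages are assumed to be numbered 1..totalPages; actual publishing through the Pub/Sub client is left out.

```typescript
interface BatchMessage {
  jobId: string;
  projectId: string;
  cacheName: string;
  batchNum: number;
  startPage: number;
  endPage: number;
}

// Only `type` is documented; the other fields are assumed.
interface MetadataMessage {
  jobId: string;
  projectId: string;
  cacheName: string;
  type: "structure";
}

type Outgoing =
  | { topic: "extract-batch"; data: BatchMessage }
  | { topic: "extract-metadata"; data: MetadataMessage };

// Plan N extract-batch messages plus one extract-metadata message.
function planDispatch(
  jobId: string,
  projectId: string,
  cacheName: string,
  totalPages: number,
  pagesPerBatch: number = 7
): Outgoing[] {
  const out: Outgoing[] = [];
  let batchNum = 1;
  for (let startPage = 1; startPage <= totalPages; startPage += pagesPerBatch) {
    const endPage = Math.min(startPage + pagesPerBatch - 1, totalPages);
    out.push({
      topic: "extract-batch",
      data: { jobId, projectId, cacheName, batchNum, startPage, endPage },
    });
    batchNum++;
  }
  out.push({
    topic: "extract-metadata",
    data: { jobId, projectId, cacheName, type: "structure" },
  });
  return out;
}
```

Under these assumptions a 100-page script yields 15 batch messages (the last covering pages 99 to 100) and one metadata message.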
D. Block Extraction Workers (extractBatchWorker)
- Topic: extract-batch
- Goal: Convert raw pages into ScriptBlock[] objects.
- Granularity: One Line = One Block (strict rule enforced in prompt).
- Schema Enforcement: responseSchema with BLOCK_EXTRACTION_JSON_SCHEMA guarantees structured output.
- Retry Logic: Per-batch retry via re-publishing to same topic with incremented retryCount (max 3), exponential backoff.
- Persistence: Each batch saves to projects/{jobId}/system/batches/batch_{N}.json.
- Final Merge: The last completing batch triggers a transactional merge: downloads all batch files, sorts by batch number, enriches with defaults (isVerified: false, nullable actId/sceneId/characterId), and saves scriptBlocks.json.gz.
- Progress Reporting: Batch completion tracked via completedBatches array in script_jobs document.
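The final merge step can be sketched as a pure function over already-downloaded batch files. The BatchFile and block shapes below are assumptions for illustration; downloading from GCS, gzip compression, and the Firestore transaction are omitted.

```typescript
// Assumed shape of a block as written by a batch worker.
interface RawBlock {
  id: string;
  text: string;
  actId?: string | null;
  sceneId?: string | null;
  characterId?: string | null;
  isVerified?: boolean;
}

// Assumed shape of one batch_{N}.json file.
interface BatchFile {
  batchNum: number;
  blocks: RawBlock[];
}

interface MergedBlock {
  id: string;
  text: string;
  actId: string | null;
  sceneId: string | null;
  characterId: string | null;
  isVerified: boolean;
}

// Sort by batch number (workers finish in arbitrary order), flatten,
// and fill in the documented defaults.
function mergeBatches(files: BatchFile[]): MergedBlock[] {
  const ordered = files.slice().sort((a, b) => a.batchNum - b.batchNum);
  const merged: MergedBlock[] = [];
  for (const file of ordered) {
    for (const b of file.blocks) {
      merged.push({
        id: b.id,
        text: b.text,
        actId: b.actId ?? null,
        sceneId: b.sceneId ?? null,
        characterId: b.characterId ?? null,
        isVerified: b.isVerified ?? false,
      });
    }
  }
  return merged;
}
```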
E. Metadata Extraction Workers (extractMetadataWorker)
- Topic: extract-metadata
- Execution Order: Structure runs first → chains props and sound cues with resolved scene IDs.
- Structure: Extracts acts, scenes (with song/parentScene support), and characters. After saving, publishes props and soundCues messages with the scene list for correct ID assignment.
- Props: Uses scene list from structure to assign correct sceneId. Saves extractedProps.json.
- Sound Cues: Uses scene list from structure. Saves extractedSoundCues.json.
- Completion: After both props and sound cues complete, a transactional check marks the job as completed.
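The completion check can be illustrated with a pure state update. This is only a sketch: the JobState shape and recordPhaseCompletion are hypothetical, any block-merge condition is left out, and in the pipeline the equivalent read-modify-write would run inside a Firestore transaction so that two workers finishing at the same time cannot both miss (or both claim) the last-completer role.

```typescript
type MetadataPhase = "props" | "soundCues";

// Assumed minimal job state; the real script_jobs document has more fields.
interface JobState {
  status: "processing" | "completed";
  completedPhases: MetadataPhase[];
}

// Record one finished phase and flip the status once both are done.
// Recording the same phase twice is a no-op, so redelivered messages are safe.
function recordPhaseCompletion(job: JobState, phase: MetadataPhase): JobState {
  const done =
    job.completedPhases.indexOf(phase) >= 0
      ? job.completedPhases
      : job.completedPhases.concat(phase);
  const allDone = done.indexOf("props") >= 0 && done.indexOf("soundCues") >= 0;
  return {
    status: allDone ? "completed" : job.status,
    completedPhases: done,
  };
}
```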
F. Repair Mechanism (repairScriptImport)
- Source: functions/src/script-import/process-ocr.ts
- Trigger: onCall (user-initiated from the client).
- Purpose: Re-runs only the failed phases from a partial import, using the cached OCR data in GCS. No Mistral OCR reprocessing required.
- Flow: Verifies ownership → Loads cached OCR pages (loadCachedOcrPages) → Creates fresh context cache → Re-processes failed phases → Updates job status.
- Block Repair: Failed batches are re-dispatched via PubSub fan-out (same worker pattern). Cache lifecycle managed by Vertex AI TTL (not deleted by repair function).
- Phase 2 Repair: Structure, props, and sound cues can be individually re-extracted within the same function invocation (sequential).
- Safety Deadline: Writes partial status 30s before Cloud Run kills the function (540s timeout).
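The safety deadline can be sketched as a check between sequential phases. This is an illustrative model, not the repair function itself: runPhasesWithDeadline is a hypothetical name, phases are run synchronously for simplicity, and the clock is injected so the behavior can be exercised without waiting.

```typescript
const FUNCTION_TIMEOUT_MS = 540 * 1000; // documented function timeout
const SAFETY_MARGIN_MS = 30 * 1000;     // documented 30s margin

interface RepairOutcome {
  status: "completed" | "partial";
  finished: string[];
  remaining: string[];
}

// Run phases in order, stopping with a "partial" outcome once the
// safety deadline (timeout minus margin) has been reached.
function runPhasesWithDeadline(
  phases: string[],
  startedAtMs: number,
  now: () => number,
  runPhase: (phase: string) => void
): RepairOutcome {
  const deadline = startedAtMs + FUNCTION_TIMEOUT_MS - SAFETY_MARGIN_MS;
  const finished: string[] = [];
  for (let i = 0; i < phases.length; i++) {
    if (now() >= deadline) {
      return { status: "partial", finished, remaining: phases.slice(i) };
    }
    runPhase(phases[i]);
    finished.push(phases[i]);
  }
  return { status: "completed", finished, remaining: [] };
}
```

A check between phases cannot interrupt a phase that is already running; a real implementation would more likely arm a timer at start so the partial status is written even when a single phase overruns.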
G. Client-Side Structure Application
Since block extraction workers emit blocks without actId/sceneId, the client is responsible for assigning structure to blocks at load time.
- Source: src/features/show-structure/hooks/useScriptBlocks.ts
- Helper: applyStructure (extracted helper function)
- Flow:
  - useScriptBlocks loads both scriptBlocks.json.gz and classifiedStructure.json from GCS via script-storage-api.ts.
  - applyStructure matches each block to an act/scene using page-number ranges from the classified structure.
  - Blocks receive their actId and sceneId fields before being placed in the Zustand store.
  - If classifiedStructure.json is unavailable, a client-side fallback derives structure from blocks:
    - deriveStructureFromBlocks(): scans heading blocks for ACT/SCENE patterns; creates a "Full Show" parent act if no acts are detected.
    - deriveCharactersFromBlocks(): extracts unique names from character_name blocks (stripping parentheticals like "V.O."), and assigns each character to the scenes they appear in.
- Why Client-Side: Moving structure assignment to the client eliminated cross-batch state dependencies, enabling fully parallel extraction without risk of ordering bugs.
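The page-range matching performed at load time can be sketched as follows. The SceneRange shape (startPage and endPage per scene) is an assumption about how the classified structure exposes page ranges, and this applyStructure is an illustration rather than the helper in useScriptBlocks.ts.

```typescript
// Assumed scene shape with an inclusive page range.
interface SceneRange {
  id: string;
  actId: string;
  startPage: number;
  endPage: number;
}

interface StructuredBlock {
  id: string;
  pageNumber: number;
  actId: string | null;
  sceneId: string | null;
}

// Return the first scene whose page range contains the given page.
function findScene(scenes: SceneRange[], pageNumber: number): SceneRange | undefined {
  for (const scene of scenes) {
    if (pageNumber >= scene.startPage && pageNumber <= scene.endPage) {
      return scene;
    }
  }
  return undefined;
}

// Assign actId/sceneId to each block; blocks outside every range are left as-is.
function applyStructure(blocks: StructuredBlock[], scenes: SceneRange[]): StructuredBlock[] {
  return blocks.map((block) => {
    const scene = findScene(scenes, block.pageNumber);
    return scene ? { ...block, actId: scene.actId, sceneId: scene.id } : block;
  });
}
```

When two scenes share a page, this sketch assigns every block on that page to the first matching scene; resolving such pages more precisely would need finer-grained boundaries than page numbers.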
5. The Personal Pipeline (processPersonalScript)
Source: functions/src/script-import/process-personal.ts
A streamlined pipeline for On Book Rehearsal personal script imports. Uses the shared extractFromPages core function (self-managed cache, no PubSub) and runs both block and structure extraction.
Flow
- Trigger: onCall (client calls after uploading PDF to users/{userId}/scripts/{fileName}.pdf).
- OCR: runMistralOcr() with signed URL → saveMistralOcrResult() to OCR bucket.
- Extraction: extractFromPages() runs blocks and structure in parallel (5 pages/batch, concurrency 3).
- Persistence: Saves scriptBlocks.json.gz and structure.json to personalScripts/{userId}/scripts/{scriptId}/.
- Metadata: Writes script metadata (title, counts, characterHints) to Firestore at personalScripts/{userId}/scripts/{scriptId}.
- Status: Tracks progress in root personalScriptJobs collection (not nested under user).
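The in-process concurrency limit of 3 follows a worker-pool pattern. The sketch below shows one common way to write such a pool; it shares a name with the runWithConcurrency helper in json-utils.ts but is not that implementation, and its signature is an assumption.

```typescript
// Run `worker` over `items` with at most `limit` calls in flight.
// Results are returned in input order regardless of completion order.
async function runWithConcurrency<T, R>(
  items: T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>
): Promise<R[]> {
  if (limit < 1) {
    throw new RangeError("limit must be at least 1");
  }
  const results: R[] = new Array(items.length);
  let next = 0;

  // Each runner repeatedly claims the next unprocessed index.
  async function runner(): Promise<void> {
    while (next < items.length) {
      const i = next++;
      results[i] = await worker(items[i], i);
    }
  }

  const runners: Promise<void>[] = [];
  for (let k = 0; k < Math.min(limit, items.length); k++) {
    runners.push(runner());
  }
  await Promise.all(runners);
  return results;
}
```

In this sketch a rejected worker rejects the whole call. A pipeline that reports failedBatches would instead catch errors inside the worker and return a failure marker for that batch.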
Differences from Project Pipeline
| Aspect | Project | Personal |
|---|---|---|
| Trigger | onCall (startOcrJob) | onCall (processPersonalScript) |
| Extraction | PubSub fan-out (separate workers) | In-process (extractFromPages) |
| Phases | Blocks + Structure + Props + Sound Cues | Blocks + Structure only |
| Storage | projects/{id}/system/ | personalScripts/{uid}/scripts/{scriptId}/ |
| Job Collection | script_jobs | personalScriptJobs (root) |
| Repair | Supported (selective phase re-run) | Not supported |
| Character Hints | Not stored in job metadata | Stored for voice profile generation |
| Context Cache | Managed by coordinator, shared via PubSub | Self-managed within extractFromPages |
6. Code Configuration (State as of March 2, 2026)
Files
| Module | Purpose |
|---|---|
| Core (shared) | |
| functions/src/script-import/core/mistral-ocr.ts | runMistralOcr, saveMistralOcrResult, loadCachedOcrPages: Mistral OCR 3 client & GCS caching |
| functions/src/script-import/core/ocr-helpers.ts | Backward-compat re-export of mistral-ocr.ts functions |
| functions/src/script-import/core/extract.ts | extractFromPages, buildBatches, parseBlockResponse: AI extraction pipeline |
| functions/src/script-import/core/ai-schemas.ts | Zod validators + JSON Schema constants for responseSchema enforcement |
| functions/src/script-import/core/types.ts | Shared types (OCRPage, ScriptBlock, ShowStructure, ExtractionResult, ScriptTarget) |
| functions/src/script-import/types.ts | Re-exports core types + MISTRAL_OCR_MODEL constant |
| functions/src/script-import/storage.ts | GCS persistence (saveScriptBlocks, saveStructure, saveProps, saveSoundCues, savePersonalScriptBlocks, savePersonalStructure) |
| functions/src/cache-utils.ts | Vertex AI context caching, withRetry (unified retry logic), generateWithCache (with responseSchema support) |
| functions/src/json-utils.ts | repairJSON, runWithConcurrency (worker-pool pattern) |
| Project Pipeline | |
| functions/src/script-import/process-ocr.ts | Coordinator: startOcrJob (onCall), repairScriptImport (onCall) |
| functions/src/script-import/pubsub-workers.ts | Workers: extractBatchWorker (blocks), extractMetadataWorker (structure → props → soundCues) |
| Personal Pipeline | |
| functions/src/script-import/process-personal.ts | Orchestrator: processPersonalScript (onCall, blocks + structure) |
| Prompts | |
| functions/prompts/*.prompt | Extraction prompts (blocks, structure, props, sound cues, system instructions): plain text, no YAML frontmatter |
| functions/src/index.ts | Re-export hub: all functions are re-exported from here |
| Client | |
| src/features/sync/ai-api.ts | Client API: subscribeToScriptJob, subscribeToLatestScriptJob, ScriptJobStatus type |
| src/features/show-structure/hooks/useScriptJobStatus.ts | Status hook: subscribes to latest job, computes failedPhases, exposes repairJob() |
| src/features/show-structure/components/ScriptJobStatusChip.tsx | Toolbar chip: processing/partial/failed/completed visual states |
| src/features/show-structure/components/ScriptRepairPopover.tsx | Repair popover: phase checkboxes, error summaries, repair action |
| src/contexts/JobProgressContext.tsx | Lightweight "started" toast only; ongoing status delegated to sidebar/toolbar |
Tests
| Test File | Coverage |
|---|---|
| functions/tests/ai-pipeline-integration.test.ts | Integration test for the AI extraction pipeline |
Tunings
| Setting | Value | Reason |
|---|---|---|
| Gemini Model | gemini-2.5-pro | Best balance of reasoning vs. speed for extraction tasks. |
| OCR Model | mistral-ocr-2512 (Mistral OCR 3) | High-quality Markdown output from PDFs. |
| Project Batch Size | 7 pages (PubSub workers) | Each worker processes one batch independently. |
| Personal Batch Size | 5 pages (in-process) | Smaller batches for in-process concurrency. |
| Block Concurrency | Unlimited (PubSub) / 3 (in-process) | PubSub workers run in separate Cloud Function instances; in-process uses worker pool. |
| Metadata Concurrency | Sequential chain | Structure → (props + soundCues in parallel). Props/soundCues need scene IDs from structure. |
| Repair Block Concurrency | PubSub fan-out | Failed batches re-dispatched as independent messages. |
| Max Output Tokens | 65,536 (Blocks), 32,768 (Structure), 16,384 (Props/Sound) | Sized per extraction task. |
| Retry Attempts | 2 (withRetry) / 3 (PubSub batch retry) | withRetry reduced to 2 to prevent 30-minute hangs on 300s timeouts. PubSub retries are per-batch with exponential backoff. |
| Cache TTL | 60 mins | Sufficient for even long script processing. Shared across all PubSub workers. |
| startOcrJob Timeout | 300s (5 min) | Mistral OCR is synchronous but fast. |
| Worker/Repair Timeout | 540s (9 min) | Maximum for Cloud Functions. |
| startOcrJob Memory | 1 GiB | Sufficient for OCR + dispatch overhead. |
| Worker/Repair Memory | 2 GiB | Required for large script extraction. |
Known "Gotchas" & Fixes
- Signed URL Permissions: Mistral OCR accesses PDFs via GCS signed URLs. The Cloud Functions service account must have iam.serviceAccounts.signBlob permission.
- PubSub Ordering: Batch workers complete in arbitrary order. The final merge sorts by batch number before saving.
- Transactional Completion: Both extractBatchWorker and extractMetadataWorker use Firestore transactions to safely detect "last completer" and trigger final status updates.
- OCR Reuse: startOcrJob checks for cached Mistral OCR before running fresh OCR. Use forceReOcr: true to force re-processing.
- Cache Lifecycle: For PubSub fan-out, the coordinator does NOT delete the cache because it is shared by all workers. Vertex AI TTL (1 hour) handles cleanup. For in-process extraction (extractFromPages without external cache), the function deletes its own cache.
- responseSchema Enforcement: All Gemini calls use JSON Schema constraints via responseSchema, eliminating JSON repair utilities and polymorphic parsing. Backward-compatible parsing is retained in parseBlockResponse for safety.
- Scene ID Propagation: Structure extraction chains into props/soundCues extraction. The structure worker formats the scene list and passes it to props/soundCues messages, ensuring correct sceneId assignment.
- Prompt Hygiene: Structure prompt explicitly prevents hallucinated divisions; sound cue prompt preserves original cue numbers instead of generating sequential ones.
- Unified Retry: generateWithCache delegates to withRetry for consistent 429/5xx handling with Retry-After header parsing, AbortSignal.timeout, and exponential backoff with jitter.
- Auth Timing: useScriptJobStatus uses useSyncExternalStore via useAuthUid() to reactively track Firebase Auth. Without this, the hook subscribes before auth resolves and silently receives a no-op unsubscribe that never re-fires.
- Safety Deadline: repairScriptImport writes partial status 30s before the 540s Cloud Run timeout kills the function, preventing silent failures.
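The retry behavior named under Unified Retry (429/5xx handling, Retry-After parsing, exponential backoff with jitter) can be sketched as a delay calculation. This is not the withRetry implementation: the function names, the 1s base, and the 30s cap are assumptions, the jitter strategy shown is "full jitter" (a uniform draw between 0 and the exponential ceiling), and only the seconds form of Retry-After is handled.

```typescript
// Retry only on rate limiting and server errors.
function isRetryableStatus(status: number): boolean {
  return status === 429 || (status >= 500 && status < 600);
}

// Parse a Retry-After header given in seconds. Returns null when the header is
// absent or not a non-negative number (the HTTP-date form is not handled here).
function parseRetryAfterMs(header: string | null | undefined): number | null {
  if (!header) return null;
  const seconds = Number(header);
  return isFinite(seconds) && seconds >= 0 ? seconds * 1000 : null;
}

// Delay before retry number `attempt` (0-based). A server-provided Retry-After
// wins; otherwise use exponential backoff with full jitter.
// baseMs and maxMs are illustrative values, not the project's settings.
function backoffDelayMs(
  attempt: number,
  retryAfter?: string | null,
  random: () => number = Math.random,
  baseMs: number = 1000,
  maxMs: number = 30000
): number {
  const hinted = parseRetryAfterMs(retryAfter);
  if (hinted !== null) return hinted;
  const ceiling = Math.min(maxMs, baseMs * Math.pow(2, attempt));
  return Math.floor(random() * ceiling);
}
```

Passing the random source as a parameter keeps the calculation deterministic under test while defaulting to Math.random in normal use.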
7. Client-Side Status & Repair UI
Architecture
The status communication uses a Hybrid pattern:
- Lightweight toast signals the start of the import (auto-closes after 3s via JobProgressContext).
- Sidebar badge on the "Show Structure" tool icon provides cross-tab awareness.
- Toolbar status chip within the "Show Structure" toolbar shows detailed status and enables repair actions.
Components
useScriptJobStatus (Hook)
- Source: src/features/show-structure/hooks/useScriptJobStatus.ts
- Auth Reactivity: Uses useAuthUid() (via useSyncExternalStore) to track auth state. The userId is included in the useEffect dependency array so the Firestore subscription re-establishes after login.
- Subscription: Calls subscribeToLatestScriptJob(projectId, ...) to watch the most recent script_jobs/{jobId} document for the current project/user.
Client-Side Fallback (Structure Derivation)
- Trigger: useScriptBlocks.ts calls deriveStructureFromBlocks() + deriveCharactersFromBlocks() when classifiedStructure.json is absent. If classified structure exists but has no characters, only deriveCharactersFromBlocks() is called.
- Quality: Adequate for simple shows. AI output is superior for ensemble detection, character notes, and complex multi-act structures. User can run repair to upgrade from fallback to AI-derived structure.
- State Derivation:
  - failedPhases: computed from phaseErrors keys on the job document.
  - indicator: 'processing' (blue pulse), 'warning' (amber), 'success' (green), or null.
- Actions:
  - repairJob(phases): calls repairScriptImport Cloud Function with selected phases.
  - dismiss(): resets local state without affecting Firestore.
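The state derivation above can be sketched as two small pure functions. The job shape is an assumption, and so is the mapping from job status to indicator: in particular, showing 'warning' for both partial and failed jobs is a guess, since the indicator has no red state.

```typescript
type JobStatus = "processing" | "partial" | "failed" | "completed";
type StatusIndicator = "processing" | "warning" | "success" | null;

// Assumed minimal view of a script_jobs document.
interface ScriptJob {
  status: JobStatus;
  phaseErrors?: { [phase: string]: string };
}

// failedPhases is the list of keys present in phaseErrors.
function deriveFailedPhases(job: ScriptJob | null): string[] {
  return job && job.phaseErrors ? Object.keys(job.phaseErrors) : [];
}

// Map job status to the sidebar indicator (mapping is an assumption).
function deriveIndicator(job: ScriptJob | null): StatusIndicator {
  if (!job) return null;
  if (job.status === "processing") return "processing";
  if (job.status === "partial" || job.status === "failed") return "warning";
  return "success";
}
```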
ScriptJobStatusChip (Toolbar)
- Source: src/features/show-structure/components/ScriptJobStatusChip.tsx
- States: Processing (spinner), Partial (amber warning), Failed (red error), Completed (green check → auto-fades after 10s).
- Interaction: Clicking partial/failed opens the ScriptRepairPopover.
- Rendering: Only renders when a job exists (returns null otherwise).
ScriptRepairPopover
- Source: src/features/show-structure/components/ScriptRepairPopover.tsx
- Content: Phase checkboxes (Structure, Props, Sound Cues, Blocks) with error summaries.
- Pre-Selection: Failed phases are pre-checked; user can toggle any phase for selective repair.
- Result Feedback: Shows repair progress, success, or error inline.
- Dismiss: Closes via outside click, escape key, or dismiss button.
WorkspaceSidebar Integration
- Prop: toolStatusMap: Partial<Record<ToolId, StatusIndicator>>
- Behavior: SidebarTool renders a colored dot when statusIndicator is set:
  - processing → blue pulsing dot
  - warning → amber dot
  - success → green dot
- Precedence: Status dot takes visual priority over numeric badges.
JobProgressContext (Simplified)
- Only fires the initial "Script analysis started..." toast (auto-closes after 3s).
- All ongoing/terminal status communication is delegated to the sidebar badge + toolbar chip.
- Still loads completed/partial results into the Zustand store and navigates to Show Structure.
8. Development Guide
How to Monitor
View Firebase Functions logs:
firebase functions:log --only startOcrJob
firebase functions:log --only extractBatchWorker
firebase functions:log --only extractMetadataWorker
firebase functions:log --only repairScriptImport
firebase functions:log --only processPersonalScript
How to Test Prompts
Prompts are standard text files in functions/prompts/.
- extract_blocks.prompt: Source of truth for block extraction.
- extract_structure.prompt: Source for Structure/Characters.
- extract_props.prompt: Source for Props (uses template).
- extract_sound_cues.prompt: Source for Sound Cues (uses template).
- system_instructions_cache.prompt: System instructions cached with the script context.
Deployment
To deploy updates to the pipeline:
cd functions
npm run build # Compiles TS and copies prompts to lib/
firebase deploy --only functions
CRITICAL: Always run npm run build first. The prompts are not TypeScript files, so the compiler does not emit them; the build script copies them to lib/ using cpx.
PubSub Topics: The extract-batch and extract-metadata topics must exist in the GCP project. They are auto-created on first publish if the service account has pubsub.topics.create permission.
Secrets: MISTRAL_API_KEY must be set in Firebase Functions secrets (firebase functions:secrets:set MISTRAL_API_KEY).
Last updated: March 22, 2026 (One-Line-Per-Block standard)