
Script Ingestion Pipeline Architecture (March 2026)

This document outlines the architecture for the Script Ingestion Pipeline, a critical feature that converts raw PDF scripts into structured, interactive data (Acts, Scenes, Characters, Dialogue, Props, Sound Cues). The pipeline supports two modes: Project Import (collaborative, production-level) and Personal Import (individual, for On Book Rehearsal).

1. High-Level Flow

The pipeline uses a Hybrid Architecture combining Mistral OCR (for text recognition) and Vertex AI Gemini 2.5 Pro (for semantic extraction via context caching).

Steps:

  1. Trigger: User uploads a PDF and calls startOcrJob (project) or processPersonalScript (personal).
  2. OCR (The Eyes): Mistral OCR converts the PDF to Markdown text via signed URL.
  3. Context Caching: The full script text is cached in Vertex AI (reducing token costs by ~90% for subsequent calls).
  4. Dispatch: The coordinator dispatches parallel extraction via PubSub topics (extract-batch for blocks, extract-metadata for structure).
  5. Block Extraction: PubSub workers extract ScriptBlock[] in parallel batches.
  6. Structure & Elements: Structure is extracted first; the structure worker then chains props and sound cue extraction, passing along the resolved scene IDs.
  7. Merge & Persist: The final batch worker merges all block batches and saves scriptBlocks.json.gz. Metadata workers save classifiedStructure.json, extractedProps.json, extractedSoundCues.json.

2. Trigger Integration

A. Project Uploads (startOcrJob)

  • Source: functions/src/script-import/process-ocr.ts
  • Trigger: onCall (Callable Cloud Function — invoked by client after PDF upload).
  • Inputs: { projectId, storagePath, forceReOcr? }
  • OCR Reuse: Before running Mistral OCR, checks the OCR bucket for cached results. If found (and forceReOcr is false), skips OCR and proceeds directly to extraction dispatch.
  • Action: Runs Mistral OCR synchronously → Caches raw OCR JSON → Dispatches parallel extraction via PubSub.
  • Output: Extraction results saved to gs://{defaultBucket}/projects/{projectId}/system/.
  • Status: Creates a document in script_jobs collection with status processing.

B. Personal Uploads (processPersonalScript)

  • Source: functions/src/script-import/process-personal.ts
  • Trigger: onCall (Callable Cloud Function — invoked by client).
  • Inputs: { scriptId, storagePath, title }
  • Action: Runs Mistral OCR → Uses the shared extraction core (extractFromPages) with self-managed cache (no PubSub) → Saves blocks, structure, and characterHints to personal namespace.
  • Output: Saves scriptBlocks.json.gz and structure.json to personalScripts/{userId}/scripts/{scriptId}/.
  • Status: Creates a document in personalScriptJobs (root collection, not nested under user).
  • Scope: Runs both block extraction and structure extraction (no props/sound cues). Includes characterHints for voice profile generation.

C. Manual Re-Analysis

Re-analysis is handled by calling startOcrJob again with forceReOcr: true (or without, to reuse cached OCR). No separate debug endpoint is required.
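
The OCR reuse rule described above can be sketched as a small pure function. This is only an illustration: StartOcrJobInput mirrors the documented inputs, while planOcr, OcrPlan, and the cachedOcrExists flag are hypothetical names standing in for the real lookup against the OCR bucket.

```typescript
// Mirrors the documented inputs of startOcrJob.
interface StartOcrJobInput {
  projectId: string;
  storagePath: string;
  forceReOcr?: boolean;
}

// Hypothetical result type for this sketch.
type OcrPlan = "reuse-cached-ocr" | "run-fresh-ocr";

// Decide whether Mistral OCR has to run again.
// `cachedOcrExists` stands in for the real check against the OCR bucket.
function planOcr(input: StartOcrJobInput, cachedOcrExists: boolean): OcrPlan {
  if (input.forceReOcr === true) {
    return "run-fresh-ocr"; // explicit override always wins
  }
  return cachedOcrExists ? "reuse-cached-ocr" : "run-fresh-ocr";
}
```

Calling planOcr with forceReOcr omitted and a cached result present yields "reuse-cached-ocr", which corresponds to the cheap re-analysis path.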


3. Shared Core Module (core/)

The functions/src/script-import/core/ directory contains shared helpers consumed by both the project and personal pipelines. This centralization eliminates duplication and ensures consistent behavior.

core/mistral-ocr.ts — Mistral OCR Client

  • runMistralOcr(gcsUri) — Runs Mistral OCR on a PDF via GCS signed URL. Returns OCRPage[] and raw response for caching. Uses tableFormat: null, includeImageBase64: false for lean payloads.
  • saveMistralOcrResult(bucketName, projectId, rawResponse) — Saves raw OCR JSON to ocr-results/{projectId}/mistral-ocr.json for reuse/repair.
  • loadCachedOcrPages(bucketName, projectId) — Loads cached OCR pages from GCS. Used by OCR reuse and repair flows.
  • Singleton Client: Lazy-initialized Mistral client from MISTRAL_API_KEY secret.

core/ocr-helpers.ts — Backward Compatibility

Re-exports runMistralOcr, saveMistralOcrResult, loadCachedOcrPages from mistral-ocr.ts. The former Document AI helpers (submitBatchOcr, pollDocAiOperation, downloadShards) have been removed.

core/extract.ts — AI Extraction Pipeline

  • extractFromPages(options) — Core extraction pipeline used by the personal pipeline. Takes OCR pages, creates (or uses external) context cache, runs blocks + structure extraction in parallel, and returns ExtractionResult.
  • buildBatches(pages, pagesPerBatch) — Pure function: splits pages into batch descriptors.
  • parseBlockResponse(raw) — Parses Gemini response into ScriptBlock[]. With responseSchema enforcement, output is guaranteed { scriptBlocks: [...] } — backward-compatible parsing retained for safety.
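
A minimal sketch of the two pure helpers, under stated assumptions: the BatchDescriptor shape is hypothetical, batch numbers are assumed to start at 1, and the "backward-compatible" form is assumed to be a bare JSON array. The real implementations in core/extract.ts may differ.

```typescript
interface OCRPage {
  pageNumber: number;
  text: string;
}

// Hypothetical descriptor shape; the real one is not shown in this document.
interface BatchDescriptor {
  batchNum: number;
  startPage: number;
  endPage: number;
  pages: OCRPage[];
}

// Split pages into consecutive batches of at most `pagesPerBatch` pages.
function buildBatches(pages: OCRPage[], pagesPerBatch: number): BatchDescriptor[] {
  if (pagesPerBatch < 1 || Math.floor(pagesPerBatch) !== pagesPerBatch) {
    throw new RangeError("pagesPerBatch must be a positive integer");
  }
  const batches: BatchDescriptor[] = [];
  for (let i = 0; i < pages.length; i += pagesPerBatch) {
    const slice = pages.slice(i, i + pagesPerBatch);
    const first = slice[0];
    const last = slice[slice.length - 1];
    if (first === undefined || last === undefined) continue;
    batches.push({
      batchNum: batches.length + 1, // assumption: 1-based numbering
      startPage: first.pageNumber,
      endPage: last.pageNumber,
      pages: slice,
    });
  }
  return batches;
}

// Accept the schema-enforced shape { scriptBlocks: [...] } and, as an assumed
// legacy form, a bare array. Anything else is rejected.
function parseBlockResponse(raw: string): unknown[] {
  const parsed: unknown = JSON.parse(raw);
  if (Array.isArray(parsed)) return parsed;
  if (typeof parsed === "object" && parsed !== null) {
    const blocks = (parsed as { scriptBlocks?: unknown }).scriptBlocks;
    if (Array.isArray(blocks)) return blocks;
  }
  throw new Error("Unexpected block response shape");
}
```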

core/ai-schemas.ts — JSON Schema Constraints

Defines both Zod validators (for testing) and JSON Schema constants (for Gemini responseSchema) for all extraction types:

  • BLOCK_EXTRACTION_JSON_SCHEMA — Blocks with id, pageNumber, blockIndex, type, text, characterId, lineNumber. Follows the One-Line-Per-Block standard: each script line maps to exactly one ScriptBlock, eliminating the former preserveLineBreaks field and the corresponding multi-line block ambiguity.
  • STRUCTURE_EXTRACTION_JSON_SCHEMA — Acts, scenes (with song/parentScene support), characters (with optional definition for voice profiles).
  • PROPS_EXTRACTION_JSON_SCHEMA — Props with typed categories (Hand, Set, Personal, Consumable).
  • SOUND_CUES_EXTRACTION_JSON_SCHEMA — Sound cues with description, cueLine, cueNumber.

core/types.ts — Shared Type Definitions

  • OCRPage: { pageNumber, text }
  • ScriptBlock: Full block interface with actId, sceneId, characterId (nullable).
  • ShowStructure: Acts, scenes, characters (characters include optional definition for voice hints).
  • ExtractionResult: Combined result with blocks, structure, characterHints, failedBatches.
  • ScriptTarget: 'project' | 'personal' namespace discriminator.
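
For orientation, the shared types might look roughly like the sketch below. Only the field names listed in this document are taken from the source. Every field type, the name fields on acts, scenes, and characters, the shapes of characterHints and failedBatches, and the isScriptTarget guard are assumptions made for illustration.

```typescript
type ScriptTarget = "project" | "personal";

interface OCRPage {
  pageNumber: number;
  text: string;
}

interface ScriptBlock {
  id: string;
  pageNumber: number;
  blockIndex: number;
  type: string; // assumption: values such as "heading" or "character_name"
  text: string;
  lineNumber: number | null; // assumption: nullable
  actId: string | null;
  sceneId: string | null;
  characterId: string | null;
}

interface ShowStructure {
  acts: { id: string; name: string }[];
  scenes: { id: string; actId: string; name: string }[];
  // `definition` is the optional voice-profile hint mentioned above.
  characters: { id: string; name: string; definition?: string }[];
}

interface ExtractionResult {
  blocks: ScriptBlock[];
  structure: ShowStructure | null; // assumption: null when structure extraction fails
  characterHints: Record<string, string>; // assumption: character name to hint text
  failedBatches: number[]; // assumption: batch numbers
}

// Hypothetical runtime guard for values read from untyped sources.
function isScriptTarget(value: unknown): value is ScriptTarget {
  return value === "project" || value === "personal";
}
```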

4. The Project Pipeline (startOcrJob + PubSub Workers)

Source: functions/src/script-import/process-ocr.ts, functions/src/script-import/pubsub-workers.ts

The project pipeline uses a PubSub fan-out architecture: the coordinator (startOcrJob) runs OCR synchronously, creates a context cache, then dispatches all extraction tasks as independent PubSub messages. Workers execute in parallel Cloud Function instances.

Key Components

A. OCR (Mistral OCR)

  • Model: mistral-ocr-2512 (Mistral OCR 3).
  • Access: PDF accessed via GCS signed URL (15-minute expiry) — not base64, keeping memory lean.
  • Output: Markdown text per page (no HTML tables). Raw JSON response cached in OCR bucket.
  • Reuse: Cached OCR results are checked before running fresh OCR. forceReOcr: true bypasses the cache.

B. Context Caching (cache-utils.ts)

To process a 100+ page script efficiently, we create a Vertex AI Context Cache.

  • Model: gemini-2.5-pro (supports long context & caching).
  • Cache Content: The FULL Markdown text of the script (from Mistral OCR).
  • TTL: 60 minutes.
  • Benefit: We pay the "input token" cost once. All subsequent API calls for blocks, structure, props, and sound cues pay a fraction of the cost.

C. PubSub Dispatch (dispatchExtractionPipeline)

After OCR and cache creation, the coordinator publishes:

  • N extract-batch messages — One per batch of pages (7 pages/batch). Each carries { jobId, projectId, cacheName, batchNum, startPage, endPage }.
  • 1 extract-metadata message (type structure) — Triggers structure extraction, which chains props and sound cues after completion.
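
The fan-out can be pictured as a pure planning step that turns a page count into message payloads. This is a sketch under assumptions: planDispatch and OutboundMessage are hypothetical names, pages are assumed to be numbered from 1, batch numbers are assumed to start at 1, and the fields of the extract-metadata payload other than type are guesses. Actual publishing to PubSub is left out.

```typescript
interface DispatchContext {
  jobId: string;
  projectId: string;
  cacheName: string;
}

interface ExtractBatchMessage extends DispatchContext {
  batchNum: number;
  startPage: number;
  endPage: number;
}

// Assumption: the metadata message carries the same context plus its type.
interface ExtractMetadataMessage extends DispatchContext {
  type: "structure";
}

type OutboundMessage =
  | { topic: "extract-batch"; data: ExtractBatchMessage }
  | { topic: "extract-metadata"; data: ExtractMetadataMessage };

// Build N extract-batch messages plus one extract-metadata message.
function planDispatch(
  ctx: DispatchContext,
  totalPages: number,
  pagesPerBatch: number = 7
): OutboundMessage[] {
  if (pagesPerBatch < 1) throw new RangeError("pagesPerBatch must be >= 1");
  const messages: OutboundMessage[] = [];
  let batchNum = 1;
  for (let start = 1; start <= totalPages; start += pagesPerBatch) {
    const end = Math.min(start + pagesPerBatch - 1, totalPages);
    messages.push({
      topic: "extract-batch",
      data: { ...ctx, batchNum, startPage: start, endPage: end },
    });
    batchNum++;
  }
  messages.push({ topic: "extract-metadata", data: { ...ctx, type: "structure" } });
  return messages;
}
```

For a 20-page script this plan contains three extract-batch messages (pages 1-7, 8-14, 15-20) followed by one extract-metadata message.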

D. Block Extraction Workers (extractBatchWorker)

  • Topic: extract-batch
  • Goal: Convert raw pages into ScriptBlock[] objects.
  • Granularity: One Line = One Block (strict rule enforced in prompt).
  • Schema Enforcement: responseSchema with BLOCK_EXTRACTION_JSON_SCHEMA guarantees structured output.
  • Retry Logic: Per-batch retry via re-publishing to same topic with incremented retryCount (max 3), exponential backoff.
  • Persistence: Each batch saves to projects/{jobId}/system/batches/batch_{N}.json.
  • Final Merge: The last completing batch triggers a transactional merge — downloads all batch files, sorts by batch number, enriches with defaults (isVerified: false, nullable actId/sceneId/characterId), and saves scriptBlocks.json.gz.
  • Progress Reporting: Batch completion tracked via completedBatches array in script_jobs document.
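
The final merge step can be sketched as follows. The BatchFile and block shapes are assumptions for illustration, and the download of batch files and the gzip upload are omitted; the sketch only shows the ordering and default-enrichment logic.

```typescript
// Assumed shape of a block as emitted by a batch worker.
interface ExtractedBlock {
  id: string;
  pageNumber: number;
  blockIndex: number;
  type: string;
  text: string;
  actId?: string | null;
  sceneId?: string | null;
  characterId?: string | null;
}

// Assumed shape of one persisted batch file.
interface BatchFile {
  batchNum: number;
  blocks: ExtractedBlock[];
}

interface MergedBlock extends ExtractedBlock {
  actId: string | null;
  sceneId: string | null;
  characterId: string | null;
  isVerified: boolean;
}

// Sort batch files by batch number, then flatten and fill in defaults.
function mergeBatches(files: BatchFile[]): MergedBlock[] {
  const ordered = files.slice().sort((a, b) => a.batchNum - b.batchNum);
  const merged: MergedBlock[] = [];
  for (const file of ordered) {
    for (const block of file.blocks) {
      merged.push({
        ...block,
        actId: block.actId ?? null,
        sceneId: block.sceneId ?? null,
        characterId: block.characterId ?? null,
        isVerified: false,
      });
    }
  }
  return merged;
}
```

Sorting on a copy keeps the input untouched, which makes the function easy to test with batch files supplied in arrival order.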

E. Metadata Extraction Workers (extractMetadataWorker)

  • Topic: extract-metadata
  • Execution Order: Structure runs first → chains props and sound cues with resolved scene IDs.
  • Structure: Extracts acts, scenes (with song/parentScene support), and characters. After saving, publishes props and soundCues messages with the scene list for correct ID assignment.
  • Props: Uses scene list from structure to assign correct sceneId. Saves extractedProps.json.
  • Sound Cues: Uses scene list from structure. Saves extractedSoundCues.json.
  • Completion: After both props and sound cues complete, a transactional check marks the job as completed.
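
The completion rule can be modelled as a pure state transition, which is roughly what the body of a Firestore transaction would compute after reading the job document. The completedPhases field and the MetadataJob shape are hypothetical, and the real check may also take block batch completion into account.

```typescript
type MetadataPhase = "structure" | "props" | "soundCues";

// Hypothetical slice of the script_jobs document.
interface MetadataJob {
  status: "processing" | "completed";
  completedPhases: MetadataPhase[];
}

// Record one finished phase and mark the job completed once both props and
// sound cues are done. Recording the same phase twice has no further effect.
function recordPhaseCompletion(job: MetadataJob, phase: MetadataPhase): MetadataJob {
  const completedPhases =
    job.completedPhases.indexOf(phase) === -1
      ? job.completedPhases.concat([phase])
      : job.completedPhases;
  const done =
    completedPhases.indexOf("props") !== -1 &&
    completedPhases.indexOf("soundCues") !== -1;
  return { status: done ? "completed" : job.status, completedPhases };
}
```

Because props and sound cues run in parallel, either worker can be the last completer; running this read-modify-write inside a transaction is what keeps the two from both missing the transition.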

F. Repair Mechanism (repairScriptImport)

  • Source: functions/src/script-import/process-ocr.ts
  • Trigger: onCall (user-initiated from the client).
  • Purpose: Re-runs only the failed phases from a partial import, using the cached OCR data in GCS. No Mistral OCR reprocessing required.
  • Flow: Verifies ownership → Loads cached OCR pages (loadCachedOcrPages) → Creates fresh context cache → Re-processes failed phases → Updates job status.
  • Block Repair: Failed batches are re-dispatched via PubSub fan-out (same worker pattern). Cache lifecycle managed by Vertex AI TTL (not deleted by repair function).
  • Phase 2 Repair: Structure, props, and sound cues can be individually re-extracted within the same function invocation (sequential).
  • Safety Deadline: Writes partial status 30s before Cloud Run kills the function (540s timeout).
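
The safety deadline can be illustrated with a simplified sequential loop that checks the clock before each phase. This is a sketch, not the actual implementation: runPhasesWithDeadline, the injected clock, and the synchronous runPhase callback are assumptions, and a real function would more likely race asynchronous work against a timer.

```typescript
const FUNCTION_TIMEOUT_MS = 540 * 1000; // documented worker/repair timeout
const SAFETY_MARGIN_MS = 30 * 1000; // write partial status this long before the kill

type RepairPhase = "structure" | "props" | "soundCues";

interface RepairOutcome {
  status: "completed" | "partial";
  finished: RepairPhase[];
  skipped: RepairPhase[];
}

// Run phases one after another and stop early once the safety deadline has passed.
// `now` is injected so that the logic can be tested with a fake clock.
function runPhasesWithDeadline(
  phases: RepairPhase[],
  runPhase: (phase: RepairPhase) => void,
  startedAt: number,
  now: () => number
): RepairOutcome {
  const deadline = startedAt + FUNCTION_TIMEOUT_MS - SAFETY_MARGIN_MS;
  const finished: RepairPhase[] = [];
  for (let i = 0; i < phases.length; i++) {
    if (now() >= deadline) {
      // Out of time: report what is left so the caller can persist "partial".
      return { status: "partial", finished, skipped: phases.slice(i) };
    }
    const phase = phases[i];
    if (phase === undefined) continue;
    runPhase(phase);
    finished.push(phase);
  }
  return { status: "completed", finished, skipped: [] };
}
```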

G. Client-Side Structure Application

Since block extraction workers emit blocks without actId/sceneId, the client is responsible for assigning structure to blocks at load time.

  • Source: src/features/show-structure/hooks/useScriptBlocks.ts
  • Helper: applyStructure (extracted helper function)
  • Flow:
    1. useScriptBlocks loads both scriptBlocks.json.gz and classifiedStructure.json from GCS via script-storage-api.ts.
    2. applyStructure matches each block to an act/scene using page-number ranges from the classified structure.
    3. Blocks receive their actId and sceneId fields before being placed in the Zustand store.
    4. If classifiedStructure.json is unavailable, a client-side fallback derives structure from blocks:
       • deriveStructureFromBlocks(): scans heading blocks for ACT/SCENE patterns; creates a "Full Show" parent act if no acts are detected.
       • deriveCharactersFromBlocks(): extracts unique names from character_name blocks (stripping parentheticals like "V.O."), and assigns each character to the scenes they appear in.
  • Why Client-Side: Moving structure assignment to the client eliminated cross-batch state dependencies, enabling fully parallel extraction without risk of ordering bugs.
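
A minimal sketch of page-range matching in the spirit of applyStructure. It assumes each scene in the classified structure exposes startPage and endPage fields, which this document does not confirm, and it resolves overlaps by taking the first matching scene.

```typescript
interface LoadedBlock {
  id: string;
  pageNumber: number;
  actId: string | null;
  sceneId: string | null;
}

// Assumption: scenes carry an inclusive page range and a parent act id.
interface SceneRange {
  id: string;
  actId: string;
  startPage: number;
  endPage: number;
}

// Return the first scene whose page range contains the given page, if any.
function findSceneForPage(scenes: SceneRange[], pageNumber: number): SceneRange | null {
  for (const scene of scenes) {
    if (pageNumber >= scene.startPage && pageNumber <= scene.endPage) {
      return scene;
    }
  }
  return null;
}

// Assign actId and sceneId to each block; unmatched blocks are returned unchanged.
function applyStructure(blocks: LoadedBlock[], scenes: SceneRange[]): LoadedBlock[] {
  return blocks.map((block) => {
    const scene = findSceneForPage(scenes, block.pageNumber);
    return scene === null ? block : { ...block, actId: scene.actId, sceneId: scene.id };
  });
}
```

Page ranges alone cannot separate two scenes that share a page, so a real implementation would need a finer key (for example a block index) at scene boundaries.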

5. The Personal Pipeline (processPersonalScript)

Source: functions/src/script-import/process-personal.ts

A streamlined pipeline for On Book Rehearsal personal script imports. Uses the shared extractFromPages core function (self-managed cache, no PubSub) and runs both block and structure extraction.

Flow

  1. Trigger: onCall — client calls after uploading PDF to users/{userId}/scripts/{fileName}.pdf.
  2. OCR: runMistralOcr() with signed URL → saveMistralOcrResult() to OCR bucket.
  3. Extraction: extractFromPages() — blocks and structure in parallel (5 pages/batch, concurrency 3).
  4. Persistence: Saves scriptBlocks.json.gz and structure.json to personalScripts/{userId}/scripts/{scriptId}/.
  5. Metadata: Writes script metadata (title, counts, characterHints) to Firestore at personalScripts/{userId}/scripts/{scriptId}.
  6. Status: Tracks progress in root personalScriptJobs collection (not nested under user).
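
The in-process concurrency limit in step 3 follows a worker-pool pattern. The sketch below shows one common way to write such a pool; the signature is an assumption and may not match the actual runWithConcurrency helper in json-utils.ts.

```typescript
// Run `worker` over all items with at most `limit` calls in flight.
// Results are returned in input order.
async function runWithConcurrency<T, R>(
  items: T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>
): Promise<R[]> {
  const results: R[] = new Array<R>(items.length);
  let next = 0;

  // Each lane pulls the next unclaimed index until the queue is drained.
  async function lane(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index] as T, index);
    }
  }

  const laneCount = Math.max(1, Math.min(limit, items.length));
  const lanes: Promise<void>[] = [];
  for (let i = 0; i < laneCount; i++) {
    lanes.push(lane());
  }
  await Promise.all(lanes);
  return results;
}
```

With 5 pages per batch and a limit of 3, a 40-page script would be processed as 8 batches with no more than 3 extraction calls outstanding at a time. In this sketch a rejected worker call rejects the overall promise; collecting failures into a failedBatches list would need a try/catch inside the lane.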

Differences from Project Pipeline

| Aspect | Project | Personal |
| --- | --- | --- |
| Trigger | onCall (startOcrJob) | onCall (processPersonalScript) |
| Extraction | PubSub fan-out (separate workers) | In-process (extractFromPages) |
| Phases | Blocks + Structure + Props + Sound Cues | Blocks + Structure only |
| Storage | projects/{id}/system/ | personalScripts/{uid}/scripts/{scriptId}/ |
| Job Collection | script_jobs | personalScriptJobs (root) |
| Repair | Supported (selective phase re-run) | Not supported |
| Character Hints | Not stored in job metadata | Stored for voice profile generation |
| Context Cache | Managed by coordinator, shared via PubSub | Self-managed within extractFromPages |

6. Code Configuration (State as of March 2, 2026)

Files

| Module | Purpose |
| --- | --- |
| Core (shared) | |
| functions/src/script-import/core/mistral-ocr.ts | runMistralOcr, saveMistralOcrResult, loadCachedOcrPages: Mistral OCR 3 client & GCS caching |
| functions/src/script-import/core/ocr-helpers.ts | Backward-compat re-export of mistral-ocr.ts functions |
| functions/src/script-import/core/extract.ts | extractFromPages, buildBatches, parseBlockResponse: AI extraction pipeline |
| functions/src/script-import/core/ai-schemas.ts | Zod validators + JSON Schema constants for responseSchema enforcement |
| functions/src/script-import/core/types.ts | Shared types (OCRPage, ScriptBlock, ShowStructure, ExtractionResult, ScriptTarget) |
| functions/src/script-import/types.ts | Re-exports core types + MISTRAL_OCR_MODEL constant |
| functions/src/script-import/storage.ts | GCS persistence (saveScriptBlocks, saveStructure, saveProps, saveSoundCues, savePersonalScriptBlocks, savePersonalStructure) |
| functions/src/cache-utils.ts | Vertex AI context caching, withRetry (unified retry logic), generateWithCache (with responseSchema support) |
| functions/src/json-utils.ts | repairJSON, runWithConcurrency (worker-pool pattern) |
| Project Pipeline | |
| functions/src/script-import/process-ocr.ts | Coordinator: startOcrJob (onCall), repairScriptImport (onCall) |
| functions/src/script-import/pubsub-workers.ts | Workers: extractBatchWorker (blocks), extractMetadataWorker (structure → props → soundCues) |
| Personal Pipeline | |
| functions/src/script-import/process-personal.ts | Orchestrator: processPersonalScript (onCall, blocks + structure) |
| Prompts | |
| functions/prompts/*.prompt | Extraction prompts (blocks, structure, props, sound cues, system instructions); plain text, no YAML frontmatter |
| functions/src/index.ts | Re-export hub: all functions are re-exported from here |
| Client | |
| src/features/sync/ai-api.ts | Client API: subscribeToScriptJob, subscribeToLatestScriptJob, ScriptJobStatus type |
| src/features/show-structure/hooks/useScriptJobStatus.ts | Status hook: subscribes to latest job, computes failedPhases, exposes repairJob() |
| src/features/show-structure/components/ScriptJobStatusChip.tsx | Toolbar chip: processing/partial/failed/completed visual states |
| src/features/show-structure/components/ScriptRepairPopover.tsx | Repair popover: phase checkboxes, error summaries, repair action |
| src/contexts/JobProgressContext.tsx | Lightweight "started" toast only; ongoing status delegated to sidebar/toolbar |

Tests

| Test File | Coverage |
| --- | --- |
| functions/tests/ai-pipeline-integration.test.ts | Integration test for the AI extraction pipeline |

Tunings

| Setting | Value | Reason |
| --- | --- | --- |
| Gemini Model | gemini-2.5-pro | Best balance of reasoning vs. speed for extraction tasks. |
| OCR Model | mistral-ocr-2512 (Mistral OCR 3) | High-quality Markdown output from PDFs. |
| Project Batch Size | 7 pages (PubSub workers) | Each worker processes one batch independently. |
| Personal Batch Size | 5 pages (in-process) | Smaller batches for in-process concurrency. |
| Block Concurrency | Unlimited (PubSub) / 3 (in-process) | PubSub workers run in separate Cloud Function instances; in-process uses worker pool. |
| Metadata Concurrency | Sequential chain | Structure → (props + soundCues in parallel). Props/soundCues need scene IDs from structure. |
| Repair Block Concurrency | PubSub fan-out | Failed batches re-dispatched as independent messages. |
| Max Output Tokens | 65,536 (Blocks), 32,768 (Structure), 16,384 (Props/Sound) | Sized per extraction task. |
| Retry Attempts | 2 (withRetry) / 3 (PubSub batch retry) | withRetry reduced to 2 to prevent 30-minute hangs on 300s timeouts. PubSub retries are per-batch with exponential backoff. |
| Cache TTL | 60 mins | Sufficient for even long script processing. Shared across all PubSub workers. |
| startOcrJob Timeout | 300s (5 min) | Mistral OCR is synchronous but fast. |
| Worker/Repair Timeout | 540s (9 min) | Maximum for Cloud Functions. |
| startOcrJob Memory | 1 GiB | Sufficient for OCR + dispatch overhead. |
| Worker/Repair Memory | 2 GiB | Required for large script extraction. |

Known "Gotchas" & Fixes

  1. Signed URL Permissions: Mistral OCR accesses PDFs via GCS signed URLs. The Cloud Functions service account must have iam.serviceAccounts.signBlob permission.
  2. PubSub Ordering: Batch workers complete in arbitrary order. The final merge sorts by batch number before saving.
  3. Transactional Completion: Both extractBatchWorker and extractMetadataWorker use Firestore transactions to safely detect "last completer" and trigger final status updates.
  4. OCR Reuse: startOcrJob checks for cached Mistral OCR before running fresh OCR. Use forceReOcr: true to force re-processing.
  5. Cache Lifecycle: For PubSub fan-out, the coordinator does NOT delete the cache — it's shared by all workers. Vertex AI TTL (1 hour) handles cleanup. For in-process extraction (extractFromPages without external cache), the function deletes its own cache.
  6. responseSchema Enforcement: All Gemini calls use JSON Schema constraints via responseSchema, eliminating JSON repair utilities and polymorphic parsing. Backward-compatible parsing is retained in parseBlockResponse for safety.
  7. Scene ID Propagation: Structure extraction chains into props/soundCues extraction. The structure worker formats the scene list and passes it to props/soundCues messages, ensuring correct sceneId assignment.
  8. Prompt Hygiene: Structure prompt explicitly prevents hallucinated divisions; sound cue prompt preserves original cue numbers instead of generating sequential ones.
  9. Unified Retry: generateWithCache delegates to withRetry for consistent 429/5xx handling with Retry-After header parsing, AbortSignal.timeout, and exponential backoff with jitter.
  10. Auth Timing: useScriptJobStatus uses useSyncExternalStore via useAuthUid() to reactively track Firebase Auth. Without this, the hook subscribes before auth resolves and silently receives a no-op unsubscribe that never re-fires.
  11. Safety Deadline: repairScriptImport writes partial status 30s before the 540s Cloud Run timeout kills the function, preventing silent failures.
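
The retry policy in item 9 can be sketched with three small pure functions. The base delay, the cap, the "full jitter" strategy, and all function names are assumptions chosen for illustration; the actual withRetry may use different constants and may also handle the HTTP-date form of Retry-After, which this sketch ignores.

```typescript
// 429 and any 5xx response are treated as retryable.
function isRetryableStatus(status: number): boolean {
  return status === 429 || (status >= 500 && status <= 599);
}

// Parse a Retry-After header given in seconds. Returns null when the header is
// absent or not a plain non-negative number.
function parseRetryAfterMs(header: string | undefined): number | null {
  if (header === undefined) return null;
  const trimmed = header.trim();
  if (trimmed === "") return null;
  const seconds = Number(trimmed);
  if (!isFinite(seconds) || seconds < 0) return null;
  return seconds * 1000;
}

// Delay before retry number `attempt` (0-based). A server hint wins; otherwise
// use exponential backoff with full jitter, capped at `capMs`.
function backoffDelayMs(
  attempt: number,
  retryAfter: string | undefined,
  random: () => number = Math.random,
  baseMs: number = 1000, // assumed base delay
  capMs: number = 30000 // assumed cap
): number {
  const hinted = parseRetryAfterMs(retryAfter);
  if (hinted !== null) return hinted;
  const ceiling = Math.min(capMs, baseMs * Math.pow(2, attempt));
  return Math.floor(random() * ceiling);
}
```

Injecting the random source keeps the jitter deterministic in tests while production code falls back to Math.random.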

7. Client-Side Status & Repair UI

Architecture

The status communication uses a Hybrid pattern:

  • Lightweight toast signals the start of the import (auto-closes after 3s via JobProgressContext).
  • Sidebar badge on the "Show Structure" tool icon provides cross-tab awareness.
  • Toolbar status chip within the "Show Structure" toolbar shows detailed status and enables repair actions.

Components

Client-Side Fallback (Structure Derivation)

  • Trigger: useScriptBlocks.ts calls deriveStructureFromBlocks() + deriveCharactersFromBlocks() when classifiedStructure.json is absent. If classified structure exists but has no characters, only deriveCharactersFromBlocks() is called.
  • Quality: Adequate for simple shows. AI output is superior for ensemble detection, character notes, and complex multi-act structures. User can run repair to upgrade from fallback to AI-derived structure.

useScriptJobStatus (Hook)

  • Source: src/features/show-structure/hooks/useScriptJobStatus.ts
  • Auth Reactivity: Uses useAuthUid() (via useSyncExternalStore) to track auth state. The userId is included in the useEffect dependency array so the Firestore subscription re-establishes after login.
  • Subscription: Calls subscribeToLatestScriptJob(projectId, ...) to watch the most recent script_jobs/{jobId} document for the current project/user.
  • State Derivation:
    • failedPhases: computed from phaseErrors keys on the job document.
    • indicator: 'processing' (blue pulse), 'warning' (amber), 'success' (green), or null.
  • Actions:
    • repairJob(phases): calls repairScriptImport Cloud Function with selected phases.
    • dismiss(): resets local state without affecting Firestore.
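
The state derivation performed by useScriptJobStatus can be sketched as two pure functions over a job snapshot. The ScriptJobSnapshot shape, the string-valued phaseErrors map, and the mapping of the failed status to the 'warning' indicator are assumptions; only the indicator values and the use of phaseErrors keys come from this document.

```typescript
type StatusIndicator = "processing" | "warning" | "success" | null;
type JobStatus = "processing" | "partial" | "failed" | "completed";

// Hypothetical slice of the script_jobs document as seen by the client.
interface ScriptJobSnapshot {
  status: JobStatus;
  phaseErrors?: Record<string, string>;
}

// failedPhases is simply the list of keys in phaseErrors.
function deriveFailedPhases(job: ScriptJobSnapshot): string[] {
  return Object.keys(job.phaseErrors ?? {});
}

// Map the job status to the sidebar/toolbar indicator.
function deriveIndicator(job: ScriptJobSnapshot | null): StatusIndicator {
  if (job === null) return null; // no job yet, nothing to show
  switch (job.status) {
    case "processing":
      return "processing";
    case "partial":
    case "failed":
      return "warning"; // assumption: both map to the amber indicator
    case "completed":
      return "success";
  }
}
```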

ScriptJobStatusChip (Toolbar)

  • Source: src/features/show-structure/components/ScriptJobStatusChip.tsx
  • States: Processing (spinner), Partial (amber warning), Failed (red error), Completed (green check → auto-fades after 10s).
  • Interaction: Clicking partial/failed opens the ScriptRepairPopover.
  • Rendering: Only renders when a job exists (returns null otherwise).

ScriptRepairPopover

  • Source: src/features/show-structure/components/ScriptRepairPopover.tsx
  • Content: Phase checkboxes (Structure, Props, Sound Cues, Blocks) with error summaries.
  • Pre-Selection: Failed phases are pre-checked; user can toggle any phase for selective repair.
  • Result Feedback: Shows repair progress, success, or error inline.
  • Dismiss: Closes via outside click, escape key, or dismiss button.

WorkspaceSidebar Integration

  • Prop: toolStatusMap: Partial<Record<ToolId, StatusIndicator>>
  • Behavior: SidebarTool renders a colored dot when statusIndicator is set:
    • processing → blue pulsing dot
    • warning → amber dot
    • success → green dot
  • Precedence: Status dot takes visual priority over numeric badges.

JobProgressContext (Simplified)

  • Only fires the initial "Script analysis started..." toast (auto-closes after 3s).
  • All ongoing/terminal status communication is delegated to the sidebar badge + toolbar chip.
  • Still loads completed/partial results into the Zustand store and navigates to Show Structure.

8. Development Guide

How to Monitor

View Firebase Functions logs:

```bash
firebase functions:log --only startOcrJob
firebase functions:log --only extractBatchWorker
firebase functions:log --only extractMetadataWorker
firebase functions:log --only repairScriptImport
firebase functions:log --only processPersonalScript
```

How to Test Prompts

Prompts are standard text files in functions/prompts/.

  • extract_blocks.prompt: Source of truth for block extraction.
  • extract_structure.prompt: Source for Structure/Characters.
  • extract_props.prompt: Source for Props (uses template).
  • extract_sound_cues.prompt: Source for Sound Cues (uses template).
  • system_instructions_cache.prompt: System instructions cached with the script context.

Deployment

To deploy updates to the pipeline:

```bash
cd functions
npm run build   # Compiles TS and copies prompts to lib/
firebase deploy --only functions
```

CRITICAL: Always run npm run build first. The prompts are not TypeScript files, so the compiler does not emit them; the build script copies them to lib/ using cpx.

PubSub Topics: The extract-batch and extract-metadata topics must exist in the GCP project. They are auto-created on first publish if the service account has pubsub.topics.create permission.

Secrets: MISTRAL_API_KEY must be set in Firebase Functions secrets (firebase functions:secrets:set MISTRAL_API_KEY).


Last updated: March 22, 2026 (One-Line-Per-Block standard)