Script Ingestion Pipeline Architecture (March 2026)

This document outlines the architecture for the Script Ingestion Pipeline, a critical feature that converts raw PDF scripts into structured, interactive data (Acts, Scenes, Characters, Dialogue, Props, Sound Cues). The pipeline supports two modes: Project Import (collaborative, production-level) and Personal Import (individual, for On Book Rehearsal).

1. High-Level Flow

The pipeline uses a hybrid architecture: Mistral OCR converts the PDF to text, and Vertex AI Gemini 2.5 Pro performs semantic extraction against a cached copy of that text (context caching).

Steps:

Trigger: User uploads a PDF and calls startOcrJob (project) or processPersonalScript (personal).
OCR (The Eyes): Mistral OCR converts the PDF to Markdown text via signed URL.
Context Caching: The full script text is cached in Vertex AI, so subsequent calls read it at the discounted cached-token rate (roughly 90% cheaper for the cached input).
Dispatch: The coordinator dispatches parallel extraction via PubSub topics (extract-batch for blocks, extract-metadata for structure).
Block Extraction: PubSub workers extract ScriptBlock[] in parallel batches.
Structure & Elements: Structure is extracted first; the structure worker then chains props and sound cue extraction, passing along the resolved scene IDs.
Merge & Persist: The final batch worker merges all block batches and saves scriptBlocks.json.gz. Metadata workers save classifiedStructure.json, extractedProps.json, extractedSoundCues.json.

2. Trigger Integration

A. Project Uploads (`startOcrJob`)

Source: functions/src/script-import/process-ocr.ts
Trigger: onCall (Callable Cloud Function — invoked by client after PDF upload).
Inputs: { projectId, storagePath, forceReOcr? }
OCR Reuse: Before running Mistral OCR, checks the OCR bucket for cached results. If found (and forceReOcr is false), skips OCR and proceeds directly to extraction dispatch.
Action: Runs Mistral OCR synchronously → Caches raw OCR JSON → Dispatches parallel extraction via PubSub.
Output: Extraction results saved to gs://{defaultBucket}/projects/{projectId}/system/.
Status: Creates a document in script_jobs collection with status processing.

B. Personal Uploads (`processPersonalScript`)

Source: functions/src/script-import/process-personal.ts
Trigger: onCall (Callable Cloud Function — invoked by client).
Inputs: { scriptId, storagePath, title }
Action: Runs Mistral OCR → Uses the shared extraction core (extractFromPages) with self-managed cache (no PubSub) → Saves blocks, structure, and characterHints to personal namespace.
Output: Saves scriptBlocks.json.gz and structure.json to personalScripts/{userId}/scripts/{scriptId}/.
Status: Creates a document in personalScriptJobs (root collection, not nested under user).
Scope: Runs both block extraction and structure extraction (no props/sound cues). Includes characterHints for voice profile generation.

C. Manual Re-Analysis

Re-analysis is handled by calling startOcrJob again with forceReOcr: true (or without, to reuse cached OCR). No separate debug endpoint is required.

3. Shared Core Module (`core/`)

The functions/src/script-import/core/ directory contains shared helpers consumed by both the project and personal pipelines. This centralization eliminates duplication and ensures consistent behavior.

`core/mistral-ocr.ts` — Mistral OCR Client

runMistralOcr(gcsUri) — Runs Mistral OCR on a PDF via GCS signed URL. Returns OCRPage[] and raw response for caching. Uses tableFormat: null, includeImageBase64: false for lean payloads.
saveMistralOcrResult(bucketName, projectId, rawResponse) — Saves raw OCR JSON to ocr-results/{projectId}/mistral-ocr.json for reuse/repair.
loadCachedOcrPages(bucketName, projectId) — Loads cached OCR pages from GCS. Used by OCR reuse and repair flows.
Singleton Client: Lazy-initialized Mistral client from MISTRAL_API_KEY secret.

`core/ocr-helpers.ts` — Backward Compatibility

Re-exports runMistralOcr, saveMistralOcrResult, loadCachedOcrPages from mistral-ocr.ts. The former Document AI helpers (submitBatchOcr, pollDocAiOperation, downloadShards) have been removed.

`core/extract.ts` — AI Extraction Pipeline

extractFromPages(options) — Core extraction pipeline used by the personal pipeline. Takes OCR pages, creates (or uses external) context cache, runs blocks + structure extraction in parallel, and returns ExtractionResult.
buildBatches(pages, pagesPerBatch) — Pure function: splits pages into batch descriptors.
parseBlockResponse(raw) — Parses Gemini response into ScriptBlock[]. With responseSchema enforcement, output is guaranteed { scriptBlocks: [...] } — backward-compatible parsing retained for safety.
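
As an illustration of the batching step, here is a minimal sketch of what a pure buildBatches could look like. The BatchDescriptor shape and the 1-based batchNum are assumptions made for the example, not the actual definitions in core/extract.ts.

```typescript
// Sketch only: these shapes are assumptions, not the types from core/types.ts.
interface OCRPage {
  pageNumber: number;
  text: string;
}

interface BatchDescriptor {
  batchNum: number;  // 1-based position of the batch (assumed convention)
  startPage: number; // first page number in the batch, inclusive
  endPage: number;   // last page number in the batch, inclusive
  pages: OCRPage[];
}

// Pure function: no I/O, the same input always yields the same batches.
function buildBatches(pages: OCRPage[], pagesPerBatch: number): BatchDescriptor[] {
  if (!Number.isInteger(pagesPerBatch) || pagesPerBatch < 1) {
    throw new RangeError("pagesPerBatch must be a positive integer");
  }
  const batches: BatchDescriptor[] = [];
  for (let i = 0; i < pages.length; i += pagesPerBatch) {
    const slice = pages.slice(i, i + pagesPerBatch);
    const first = slice[0];
    const last = slice[slice.length - 1];
    if (!first || !last) continue; // not reachable for a non-empty slice
    batches.push({
      batchNum: batches.length + 1,
      startPage: first.pageNumber,
      endPage: last.pageNumber,
      pages: slice,
    });
  }
  return batches;
}
```

With 12 pages and pagesPerBatch set to 5 (the personal pipeline's batch size), this yields three batches covering pages 1-5, 6-10, and 11-12.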

`core/ai-schemas.ts` — JSON Schema Constraints

Defines both Zod validators (for testing) and JSON Schema constants (for Gemini responseSchema) for all extraction types:

BLOCK_EXTRACTION_JSON_SCHEMA — Blocks with id, pageNumber, blockIndex, type, text, characterId, lineNumber. Follows the One-Line-Per-Block standard: each script line maps to exactly one ScriptBlock, eliminating the former preserveLineBreaks field and the corresponding multi-line block ambiguity.
STRUCTURE_EXTRACTION_JSON_SCHEMA — Acts, scenes (with song/parentScene support), characters (with optional definition for voice profiles).
PROPS_EXTRACTION_JSON_SCHEMA — Props with typed categories (Hand, Set, Personal, Consumable).
SOUND_CUES_EXTRACTION_JSON_SCHEMA — Sound cues with description, cueLine, cueNumber.

`core/types.ts` — Shared Type Definitions

OCRPage — { pageNumber, text }
ScriptBlock — Full block interface with actId, sceneId, characterId (nullable).
ShowStructure — Acts, scenes, characters (characters include optional definition for voice hints).
ExtractionResult — Combined result with blocks, structure, characterHints, failedBatches.
ScriptTarget — 'project' | 'personal' namespace discriminator.
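
For orientation, the shared types might be declared roughly as follows. The field names come from this document; the concrete TypeScript types, the nullability of lineNumber, and the shapes of acts, scenes, and characterHints are assumptions for illustration.

```typescript
// Sketch only: field names follow this document, exact types are assumed.
type ScriptTarget = "project" | "personal";

interface OCRPage {
  pageNumber: number;
  text: string;
}

interface ScriptBlock {
  id: string;
  pageNumber: number;
  blockIndex: number;
  type: string;               // e.g. "heading" or "character_name"
  text: string;
  lineNumber: number | null;  // nullability assumed
  characterId: string | null;
  actId: string | null;       // assigned on the client at load time
  sceneId: string | null;     // assigned on the client at load time
  isVerified: boolean;
}

interface ShowStructure {
  acts: { id: string; title: string }[];
  scenes: { id: string; actId: string; title: string }[];
  characters: { id: string; name: string; definition?: string }[];
}

interface ExtractionResult {
  blocks: ScriptBlock[];
  structure: ShowStructure;
  characterHints: Record<string, string>; // shape assumed
  failedBatches: number[];                // batch numbers that failed (assumed)
}

// A freshly extracted block, before structure has been applied.
const exampleBlock: ScriptBlock = {
  id: "b-1",
  pageNumber: 1,
  blockIndex: 0,
  type: "character_name",
  text: "HAMLET",
  lineNumber: 1,
  characterId: null,
  actId: null,
  sceneId: null,
  isVerified: false,
};

const exampleTarget: ScriptTarget = "personal";
const exampleResult: ExtractionResult = {
  blocks: [exampleBlock],
  structure: { acts: [], scenes: [], characters: [] },
  characterHints: {},
  failedBatches: [],
};
const examplePage: OCRPage = { pageNumber: 1, text: "HAMLET" };
```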

4. The Project Pipeline (`startOcrJob` + PubSub Workers)

Source: functions/src/script-import/process-ocr.ts, functions/src/script-import/pubsub-workers.ts

The project pipeline uses a PubSub fan-out architecture: the coordinator (startOcrJob) runs OCR synchronously, creates a context cache, then dispatches all extraction tasks as independent PubSub messages. Workers execute in parallel Cloud Function instances.

Key Components

A. OCR (Mistral OCR)

Model: mistral-ocr-2512 (Mistral OCR 3).
Access: PDF accessed via GCS signed URL (15-minute expiry) — not base64, keeping memory lean.
Output: Markdown text per page (no HTML tables). Raw JSON response cached in OCR bucket.
Reuse: Cached OCR results are checked before running fresh OCR. forceReOcr: true bypasses the cache.

B. Context Caching (`cache-utils.ts`)

To process a 100+ page script efficiently, we create a Vertex AI Context Cache.

Model: gemini-2.5-pro (supports long context & caching).
Cache Content: The FULL Markdown text of the script (from Mistral OCR).
TTL: 60 minutes.
Benefit: We pay the full input-token cost for the script once, when the cache is created. All subsequent API calls for blocks, structure, props, and sound cues read the script at the discounted cached-token rate.
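
A rough worked estimate shows the shape of the saving. Every number here is an assumption chosen for illustration: an 80,000-token script, 18 calls (15 block batches for 100 pages at 7 pages per batch, plus structure, props, and sound cues), and a cached-token price of 10% of the normal input price. The sketch ignores cache storage charges, per-call prompt tokens, and output tokens.

```typescript
// Sketch only: all inputs are illustrative assumptions, not measured values.
interface CacheCostInputs {
  scriptTokens: number;       // tokens in the cached script text
  calls: number;              // extraction calls that reuse the cache
  cachedRateFraction: number; // cached-token price relative to normal input price
}

// Costs are in "full-price input token" units, so only the ratio matters.
function relativeInputCost(inputs: CacheCostInputs) {
  const { scriptTokens, calls, cachedRateFraction } = inputs;
  // Without a cache, every call resends the whole script at full price.
  const withoutCache = scriptTokens * calls;
  // With a cache, the script is paid in full once, then at the cached rate per call.
  const withCache = scriptTokens + scriptTokens * calls * cachedRateFraction;
  return { withoutCache, withCache, savings: 1 - withCache / withoutCache };
}

const estimate = relativeInputCost({
  scriptTokens: 80000,
  calls: 18,
  cachedRateFraction: 0.1,
});
```

Under these assumptions the script-related input cost drops from 1,440,000 to about 224,000 token units, a saving of roughly 84%. The overall figure sits below the per-call discount because the cache creation pass is paid at full price.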

C. PubSub Dispatch (`dispatchExtractionPipeline`)

After OCR and cache creation, the coordinator publishes:

N extract-batch messages — One per batch of pages (7 pages/batch). Each carries { jobId, projectId, cacheName, batchNum, startPage, endPage }.
1 extract-metadata message (type structure) — Triggers structure extraction, which chains props and sound cues after completion.
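
The fan-out can be pictured as a pure planning step that produces the message payloads before anything is published. This is a sketch under assumptions: planDispatch is a hypothetical helper, pages are assumed to be numbered 1..pageCount, and only the type field of the metadata message is documented (the other fields are guesses).

```typescript
// Sketch only: planDispatch is hypothetical; payload fields beyond those
// listed in this document are assumptions.
interface ExtractBatchMessage {
  jobId: string;
  projectId: string;
  cacheName: string;
  batchNum: number;
  startPage: number;
  endPage: number;
}

interface ExtractMetadataMessage {
  jobId: string;
  projectId: string;
  cacheName: string;
  type: "structure";
}

interface DispatchPlan {
  batchMessages: ExtractBatchMessage[];
  metadataMessage: ExtractMetadataMessage;
}

function planDispatch(
  jobId: string,
  projectId: string,
  cacheName: string,
  pageCount: number,
  pagesPerBatch = 7,
): DispatchPlan {
  const batchMessages: ExtractBatchMessage[] = [];
  for (let startPage = 1; startPage <= pageCount; startPage += pagesPerBatch) {
    batchMessages.push({
      jobId,
      projectId,
      cacheName,
      batchNum: batchMessages.length + 1,
      startPage,
      endPage: Math.min(startPage + pagesPerBatch - 1, pageCount),
    });
  }
  // One structure message; props and sound cues are chained later by the worker.
  return {
    batchMessages,
    metadataMessage: { jobId, projectId, cacheName, type: "structure" },
  };
}
```

For a 20-page script this plans three extract-batch messages (pages 1-7, 8-14, 15-20) and one extract-metadata message.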

D. Block Extraction Workers (`extractBatchWorker`)

Topic: extract-batch
Goal: Convert raw pages into ScriptBlock[] objects.
Granularity: One Line = One Block (strict rule enforced in prompt).
Schema Enforcement: responseSchema with BLOCK_EXTRACTION_JSON_SCHEMA guarantees structured output.
Retry Logic: Per-batch retry via re-publishing to same topic with incremented retryCount (max 3), exponential backoff.
Persistence: Each batch saves to projects/{jobId}/system/batches/batch_{N}.json.
Final Merge: The last completing batch triggers a transactional merge — downloads all batch files, sorts by batch number, enriches with defaults (isVerified: false, nullable actId/sceneId/characterId), and saves scriptBlocks.json.gz.
Progress Reporting: Batch completion tracked via completedBatches array in script_jobs document.
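
The merge step described above (sort by batch number, then enrich with defaults) can be sketched as a pure function. The BatchFile shape is an assumption about what each batch_{N}.json contains, and mergeBatches is a hypothetical name.

```typescript
// Sketch only: BatchFile and mergeBatches are illustrative assumptions.
interface ExtractedBlock {
  id: string;
  pageNumber: number;
  blockIndex: number;
  type: string;
  text: string;
  characterId?: string | null;
}

interface MergedBlock extends ExtractedBlock {
  characterId: string | null;
  actId: string | null;
  sceneId: string | null;
  isVerified: boolean;
}

interface BatchFile {
  batchNum: number;
  blocks: ExtractedBlock[];
}

function mergeBatches(files: BatchFile[]): MergedBlock[] {
  // Workers finish in arbitrary order, so restore script order first.
  const ordered = files.slice().sort((a, b) => a.batchNum - b.batchNum);
  const merged: MergedBlock[] = [];
  for (const file of ordered) {
    for (const block of file.blocks) {
      merged.push({
        ...block,
        characterId: block.characterId ?? null,
        actId: null,   // assigned later on the client
        sceneId: null, // assigned later on the client
        isVerified: false,
      });
    }
  }
  return merged;
}
```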

E. Metadata Extraction Workers (`extractMetadataWorker`)

Topic: extract-metadata
Execution Order: Structure runs first → chains props and sound cues with resolved scene IDs.
Structure: Extracts acts, scenes (with song/parentScene support), and characters. After saving, publishes props and soundCues messages with the scene list for correct ID assignment.
Props: Uses scene list from structure to assign correct sceneId. Saves extractedProps.json.
Sound Cues: Uses scene list from structure. Saves extractedSoundCues.json.
Completion: After both props and sound cues complete, a transactional check marks the job as completed.
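
The completion rule can be modeled as a small state transition. This sketch covers only the decision; in the worker the read-modify-write would run inside a Firestore transaction so that two workers finishing at the same moment cannot both miss the final update. The completedPhases field name is an assumption.

```typescript
// Sketch only: models the decision, not the Firestore transaction itself.
type MetadataPhase = "structure" | "props" | "soundCues";

interface JobSnapshot {
  status: "processing" | "completed";
  completedPhases: MetadataPhase[]; // field name assumed
}

function recordPhaseCompletion(job: JobSnapshot, phase: MetadataPhase): JobSnapshot {
  // Idempotent: a redelivered message does not add the phase twice.
  const completedPhases =
    job.completedPhases.indexOf(phase) >= 0
      ? job.completedPhases.slice()
      : job.completedPhases.concat(phase);
  const done =
    completedPhases.indexOf("props") >= 0 &&
    completedPhases.indexOf("soundCues") >= 0;
  return { status: done ? "completed" : job.status, completedPhases };
}
```

Whichever of props or sound cues finishes second sees both phases present and flips the status, so the order in which the two parallel workers complete does not matter.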

F. Repair Mechanism (`repairScriptImport`)

Source: functions/src/script-import/process-ocr.ts
Trigger: onCall (user-initiated from the client).
Purpose: Re-runs only the failed phases from a partial import, using the cached OCR data in GCS. No Mistral OCR reprocessing required.
Flow: Verifies ownership → Loads cached OCR pages (loadCachedOcrPages) → Creates fresh context cache → Re-processes failed phases → Updates job status.
Block Repair: Failed batches are re-dispatched via PubSub fan-out (same worker pattern). Cache lifecycle managed by Vertex AI TTL (not deleted by repair function).
Phase 2 Repair: Structure, props, and sound cues can be individually re-extracted within the same function invocation (sequential).
Safety Deadline: Writes partial status 30s before Cloud Run kills the function (540s timeout).
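
The safety deadline amounts to simple clock arithmetic. A minimal sketch, assuming a hypothetical createDeadline helper and the 540s timeout and 30s margin stated above:

```typescript
// Sketch only: createDeadline is a hypothetical helper.
const FUNCTION_TIMEOUT_MS = 540 * 1000; // function timeout from this document
const SAFETY_MARGIN_MS = 30 * 1000;     // write partial status this long before the kill

interface Deadline {
  deadlineMs: number;
  remainingMs(nowMs: number): number;
  expired(nowMs: number): boolean;
}

function createDeadline(
  startedAtMs: number,
  timeoutMs = FUNCTION_TIMEOUT_MS,
  marginMs = SAFETY_MARGIN_MS,
): Deadline {
  const deadlineMs = startedAtMs + timeoutMs - marginMs;
  return {
    deadlineMs,
    remainingMs: (nowMs) => Math.max(0, deadlineMs - nowMs),
    expired: (nowMs) => nowMs >= deadlineMs,
  };
}
```

A function started at t = 0 would have its deadline at 510s. The value of remainingMs could drive a timer that writes the partial status, or expired could be checked between sequential phases.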

G. Client-Side Structure Application

Since block extraction workers emit blocks without actId/sceneId, the client is responsible for assigning structure to blocks at load time.

Source: src/features/show-structure/hooks/useScriptBlocks.ts
Helper: applyStructure (extracted helper function)
Flow:
1. useScriptBlocks loads both scriptBlocks.json.gz and classifiedStructure.json from GCS via script-storage-api.ts.
2. applyStructure matches each block to an act/scene using page-number ranges from the classified structure.
3. Blocks receive their actId and sceneId fields before being placed in the Zustand store.
4. If classifiedStructure.json is unavailable, a client-side fallback derives structure from blocks:
- deriveStructureFromBlocks() — scans heading blocks for ACT/SCENE patterns; creates a "Full Show" parent act if no acts are detected.
- deriveCharactersFromBlocks() — extracts unique names from character_name blocks (stripping parentheticals like "V.O."), and assigns each character to the scenes they appear in.
Why Client-Side: Moving structure assignment to the client eliminated cross-batch state dependencies, enabling fully parallel extraction without risk of ordering bugs.
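
To make the page-range matching concrete, here is a minimal sketch of how applyStructure could work. The SceneRange shape (startPage and endPage on each scene) is an assumption about what classifiedStructure.json provides, not its confirmed schema.

```typescript
// Sketch only: SceneRange is an assumed shape for the classified structure.
interface SceneRange {
  id: string;
  actId: string;
  startPage: number; // inclusive
  endPage: number;   // inclusive
}

interface StructuredBlock {
  id: string;
  pageNumber: number;
  actId: string | null;
  sceneId: string | null;
}

function applyStructure(blocks: StructuredBlock[], scenes: SceneRange[]): StructuredBlock[] {
  return blocks.map((block) => {
    // First scene whose page range contains the block's page wins.
    const scene = scenes.find(
      (s) => block.pageNumber >= s.startPage && block.pageNumber <= s.endPage,
    );
    return scene ? { ...block, actId: scene.actId, sceneId: scene.id } : block;
  });
}
```

Blocks on pages outside every range keep their null IDs. Because the match is page-granular, a page shared by two scenes resolves to whichever scene is listed first in this sketch.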

5. The Personal Pipeline (`processPersonalScript`)

Source: functions/src/script-import/process-personal.ts

A streamlined pipeline for On Book Rehearsal personal script imports. Uses the shared extractFromPages core function (self-managed cache, no PubSub) and runs both block and structure extraction.

Flow

Trigger: onCall — client calls after uploading PDF to users/{userId}/scripts/{fileName}.pdf.
OCR: runMistralOcr() with signed URL → saveMistralOcrResult() to OCR bucket.
Extraction: extractFromPages() — blocks and structure in parallel (5 pages/batch, concurrency 3).
Persistence: Saves scriptBlocks.json.gz and structure.json to personalScripts/{userId}/scripts/{scriptId}/.
Metadata: Writes script metadata (title, counts, characterHints) to Firestore at personalScripts/{userId}/scripts/{scriptId}.
Status: Tracks progress in root personalScriptJobs collection (not nested under user).
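
The in-process concurrency limit of 3 follows a worker-pool pattern. The sketch below shows one common way to write such a pool; the signature is an assumption and may differ from the actual runWithConcurrency in json-utils.ts.

```typescript
// Sketch only: signature assumed, not copied from json-utils.ts.
async function runWithConcurrency<T, R>(
  items: T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new RangeError("limit must be a positive integer");
  }
  const results = new Array<R>(items.length);
  let next = 0; // index of the next unclaimed item

  // Each lane repeatedly claims the next item until none remain.
  async function lane(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index] as T, index);
    }
  }

  const laneCount = Math.min(limit, items.length);
  const lanes: Promise<void>[] = [];
  for (let i = 0; i < laneCount; i++) lanes.push(lane());
  await Promise.all(lanes);
  return results; // same order as items, regardless of completion order
}
```

With a limit of 3, at most three batches are in flight at once, and a slow batch occupies only one lane while the other two keep draining the queue.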

Differences from Project Pipeline

| Aspect | Project | Personal |
| --- | --- | --- |
| Trigger | `onCall` (`startOcrJob`) | `onCall` (`processPersonalScript`) |
| Extraction | PubSub fan-out (separate workers) | In-process (`extractFromPages`) |
| Phases | Blocks + Structure + Props + Sound Cues | Blocks + Structure only |
| Storage | `projects/{id}/system/` | `personalScripts/{uid}/scripts/{scriptId}/` |
| Job Collection | `script_jobs` | `personalScriptJobs` (root) |
| Repair | Supported (selective phase re-run) | Not supported |
| Character Hints | Not stored in job metadata | Stored for voice profile generation |
| Context Cache | Managed by coordinator, shared via PubSub | Self-managed within `extractFromPages` |

6. Code Configuration (State as of March 2, 2026)

Files

| Module | Purpose |
| --- | --- |
| Core (shared) | |
| `functions/src/script-import/core/mistral-ocr.ts` | `runMistralOcr`, `saveMistralOcrResult`, `loadCachedOcrPages`: Mistral OCR 3 client & GCS caching |
| `functions/src/script-import/core/ocr-helpers.ts` | Backward-compat re-export of `mistral-ocr.ts` functions |
| `functions/src/script-import/core/extract.ts` | `extractFromPages`, `buildBatches`, `parseBlockResponse`: AI extraction pipeline |
| `functions/src/script-import/core/ai-schemas.ts` | Zod validators + JSON Schema constants for `responseSchema` enforcement |
| `functions/src/script-import/core/types.ts` | Shared types (`OCRPage`, `ScriptBlock`, `ShowStructure`, `ExtractionResult`, `ScriptTarget`) |
| `functions/src/script-import/types.ts` | Re-exports core types + `MISTRAL_OCR_MODEL` constant |
| `functions/src/script-import/storage.ts` | GCS persistence (`saveScriptBlocks`, `saveStructure`, `saveProps`, `saveSoundCues`, `savePersonalScriptBlocks`, `savePersonalStructure`) |
| `functions/src/cache-utils.ts` | Vertex AI context caching, `withRetry` (unified retry logic), `generateWithCache` (with `responseSchema` support) |
| `functions/src/json-utils.ts` | `repairJSON`, `runWithConcurrency` (worker-pool pattern) |
| Project Pipeline | |
| `functions/src/script-import/process-ocr.ts` | Coordinator: `startOcrJob` (onCall), `repairScriptImport` (onCall) |
| `functions/src/script-import/pubsub-workers.ts` | Workers: `extractBatchWorker` (blocks), `extractMetadataWorker` (structure → props → soundCues) |
| Personal Pipeline | |
| `functions/src/script-import/process-personal.ts` | Orchestrator: `processPersonalScript` (onCall, blocks + structure) |
| Prompts | |
| `functions/prompts/*.prompt` | Extraction prompts (blocks, structure, props, sound cues, system instructions), plain text, no YAML frontmatter |
| `functions/src/index.ts` | Re-export hub: all functions are re-exported from here |
| Client | |
| `src/features/sync/ai-api.ts` | Client API: `subscribeToScriptJob`, `subscribeToLatestScriptJob`, `ScriptJobStatus` type |
| `src/features/show-structure/hooks/useScriptJobStatus.ts` | Status hook: subscribes to latest job, computes `failedPhases`, exposes `repairJob()` |
| `src/features/show-structure/components/ScriptJobStatusChip.tsx` | Toolbar chip: processing/partial/failed/completed visual states |
| `src/features/show-structure/components/ScriptRepairPopover.tsx` | Repair popover: phase checkboxes, error summaries, repair action |
| `src/contexts/JobProgressContext.tsx` | Lightweight "started" toast only; ongoing status delegated to sidebar/toolbar |

Tests

| Test File | Coverage |
| --- | --- |
| `functions/tests/ai-pipeline-integration.test.ts` | Integration test for the AI extraction pipeline |

Tunings

| Setting | Value | Reason |
| --- | --- | --- |
| Gemini Model | `gemini-2.5-pro` | Best balance of reasoning vs. speed for extraction tasks. |
| OCR Model | `mistral-ocr-2512` (Mistral OCR 3) | High-quality Markdown output from PDFs. |
| Project Batch Size | 7 pages (PubSub workers) | Each worker processes one batch independently. |
| Personal Batch Size | 5 pages (in-process) | Smaller batches for in-process concurrency. |
| Block Concurrency | Unlimited (PubSub) / 3 (in-process) | PubSub workers run in separate Cloud Function instances; in-process uses worker pool. |
| Metadata Concurrency | Sequential chain | Structure → (props + soundCues in parallel). Props/soundCues need scene IDs from structure. |
| Repair Block Concurrency | PubSub fan-out | Failed batches re-dispatched as independent messages. |
| Max Output Tokens | 65,536 (Blocks), 32,768 (Structure), 16,384 (Props/Sound) | Sized per extraction task. |
| Retry Attempts | 2 (withRetry) / 3 (PubSub batch retry) | `withRetry` reduced to 2 to prevent 30-minute hangs on 300s timeouts. PubSub retries are per-batch with exponential backoff. |
| Cache TTL | 60 mins | Sufficient for even long script processing. Shared across all PubSub workers. |
| `startOcrJob` Timeout | 300s (5 min) | Mistral OCR is synchronous but fast. |
| Worker/Repair Timeout | 540s (9 min) | Maximum for Cloud Functions. |
| `startOcrJob` Memory | 1 GiB | Sufficient for OCR + dispatch overhead. |
| Worker/Repair Memory | 2 GiB | Required for large script extraction. |

Known "Gotchas" & Fixes

Signed URL Permissions: Mistral OCR accesses PDFs via GCS signed URLs. The Cloud Functions service account must have iam.serviceAccounts.signBlob permission.
PubSub Ordering: Batch workers complete in arbitrary order. The final merge sorts by batch number before saving.
Transactional Completion: Both extractBatchWorker and extractMetadataWorker use Firestore transactions to safely detect "last completer" and trigger final status updates.
OCR Reuse: startOcrJob checks for cached Mistral OCR before running fresh OCR. Use forceReOcr: true to force re-processing.
Cache Lifecycle: For PubSub fan-out, the coordinator does NOT delete the cache — it's shared by all workers. Vertex AI TTL (1 hour) handles cleanup. For in-process extraction (extractFromPages without external cache), the function deletes its own cache.
responseSchema Enforcement: All Gemini calls use JSON Schema constraints via responseSchema, which removes the need for JSON repair and polymorphic parsing on the normal path. Backward-compatible parsing is retained in parseBlockResponse for safety.
Scene ID Propagation: Structure extraction chains into props/soundCues extraction. The structure worker formats the scene list and passes it to props/soundCues messages, ensuring correct sceneId assignment.
Prompt Hygiene: Structure prompt explicitly prevents hallucinated divisions; sound cue prompt preserves original cue numbers instead of generating sequential ones.
Unified Retry: generateWithCache delegates to withRetry for consistent 429/5xx handling with Retry-After header parsing, AbortSignal.timeout, and exponential backoff with jitter.
Auth Timing: useScriptJobStatus uses useSyncExternalStore via useAuthUid() to reactively track Firebase Auth. Without this, the hook can subscribe before auth resolves, receive a no-op unsubscribe, and never re-subscribe once the user is signed in.
Safety Deadline: repairScriptImport writes partial status 30s before the 540s Cloud Run timeout kills the function, preventing silent failures.
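
The retry behaviour listed under Unified Retry can be broken into three small pure helpers. This is a sketch, not the code in cache-utils.ts: the helper names, the "full jitter" strategy, and the base and cap values in the usage note are all assumptions.

```typescript
// Sketch only: helper names and the jitter strategy are assumptions.

// Retry on rate limiting (429) and server errors (5xx).
function isRetryableStatus(status: number): boolean {
  return status === 429 || (status >= 500 && status <= 599);
}

// Retry-After is either a number of seconds or an HTTP date.
function parseRetryAfterMs(header: string | null | undefined, nowMs: number): number | null {
  if (!header) return null;
  const trimmed = header.trim();
  if (trimmed === "") return null;
  const seconds = Number(trimmed);
  if (Number.isFinite(seconds) && seconds >= 0) return seconds * 1000;
  const dateMs = Date.parse(trimmed);
  return Number.isNaN(dateMs) ? null : Math.max(0, dateMs - nowMs);
}

// Exponential backoff with full jitter: a random delay in [0, min(cap, base * 2^attempt)).
function backoffDelayMs(
  attempt: number, // 0 for the first retry
  baseMs: number,
  capMs: number,
  random: () => number = Math.random,
): number {
  const ceiling = Math.min(capMs, baseMs * Math.pow(2, attempt));
  return Math.floor(random() * ceiling);
}
```

A caller would typically prefer the server's Retry-After value when present and fall back to backoffDelayMs otherwise, for example backoffDelayMs(attempt, 1000, 30000). Injecting random keeps the function deterministic in tests.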

7. Client-Side Status & Repair UI

Architecture

The status communication uses a Hybrid pattern:

Lightweight toast signals the start of the import (auto-closes after 3s via JobProgressContext).
Sidebar badge on the "Show Structure" tool icon provides cross-tab awareness.
Toolbar status chip within the "Show Structure" toolbar shows detailed status and enables repair actions.

Components

`useScriptJobStatus` (Hook)

Source: src/features/show-structure/hooks/useScriptJobStatus.ts
Auth Reactivity: Uses useAuthUid() (via useSyncExternalStore) to track auth state. The userId is included in the useEffect dependency array so the Firestore subscription re-establishes after login.
Subscription: Calls subscribeToLatestScriptJob(projectId, ...) to watch the most recent script_jobs/{jobId} document for the current project/user.
State Derivation:
- failedPhases: computed from phaseErrors keys on the job document.
- indicator: 'processing' (blue pulse), 'warning' (amber), 'success' (green), or null.
Actions:
- repairJob(phases): calls repairScriptImport Cloud Function with selected phases.
- dismiss(): resets local state without affecting Firestore.

Client-Side Fallback (Structure Derivation)

Trigger: useScriptBlocks.ts calls deriveStructureFromBlocks() + deriveCharactersFromBlocks() when classifiedStructure.json is absent. If classified structure exists but has no characters, only deriveCharactersFromBlocks() is called.
Quality: Adequate for simple shows. AI output is superior for ensemble detection, character notes, and complex multi-act structures. User can run repair to upgrade from fallback to AI-derived structure.
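
The fallback character derivation (deriveCharactersFromBlocks) can be illustrated with a short sketch. Upper-casing names, the slug-style id, and the block shape are assumptions added for the example; only the parenthetical stripping and per-scene assignment come from this document.

```typescript
// Sketch only: name casing, id format, and block shape are assumptions.
interface FallbackBlock {
  type: string;
  text: string;
  sceneId: string | null;
}

interface DerivedCharacter {
  id: string;
  name: string;
  sceneIds: string[];
}

// "HAMLET (V.O.)" -> "HAMLET"
function normalizeCharacterName(raw: string): string {
  return raw.replace(/\([^)]*\)/g, " ").replace(/\s+/g, " ").trim().toUpperCase();
}

function deriveCharactersFromBlocks(blocks: FallbackBlock[]): DerivedCharacter[] {
  const byName: Record<string, DerivedCharacter> = {};
  const result: DerivedCharacter[] = [];
  for (const block of blocks) {
    if (block.type !== "character_name") continue;
    const name = normalizeCharacterName(block.text);
    if (name === "") continue;
    let character = byName[name];
    if (!character) {
      character = { id: name.toLowerCase().replace(/[^a-z0-9]+/g, "-"), name, sceneIds: [] };
      byName[name] = character;
      result.push(character); // keeps first-appearance order
    }
    // Record each scene the character speaks in, once.
    if (block.sceneId !== null && character.sceneIds.indexOf(block.sceneId) < 0) {
      character.sceneIds.push(block.sceneId);
    }
  }
  return result;
}
```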

`ScriptJobStatusChip` (Toolbar)

Source: src/features/show-structure/components/ScriptJobStatusChip.tsx
States: Processing (spinner), Partial (amber warning), Failed (red error), Completed (green check → auto-fades after 10s).
Interaction: Clicking partial/failed opens the ScriptRepairPopover.
Rendering: Only renders when a job exists (returns null otherwise).

`ScriptRepairPopover`

Source: src/features/show-structure/components/ScriptRepairPopover.tsx
Content: Phase checkboxes (Structure, Props, Sound Cues, Blocks) with error summaries.
Pre-Selection: Failed phases are pre-checked; user can toggle any phase for selective repair.
Result Feedback: Shows repair progress, success, or error inline.
Dismiss: Closes via outside click, escape key, or dismiss button.
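
The pre-selection rule can be sketched as two small pure functions. The phase keys (structure, props, soundCues, blocks) and the string-valued phaseErrors map are assumptions about the job document, and the function names are hypothetical.

```typescript
// Sketch only: phase keys and the phaseErrors shape are assumptions.
type RepairPhase = "structure" | "props" | "soundCues" | "blocks";
type PhaseErrors = Partial<Record<RepairPhase, string>>;
type PhaseSelection = Record<RepairPhase, boolean>;

const REPAIR_PHASES: RepairPhase[] = ["structure", "props", "soundCues", "blocks"];

// failedPhases is simply the set of phases that have an error entry.
function failedPhasesFrom(phaseErrors: PhaseErrors | undefined): RepairPhase[] {
  if (!phaseErrors) return [];
  return REPAIR_PHASES.filter((phase) => phaseErrors[phase] !== undefined);
}

// Failed phases start checked; everything else starts unchecked.
function initialSelection(phaseErrors: PhaseErrors | undefined): PhaseSelection {
  const failed = failedPhasesFrom(phaseErrors);
  return {
    structure: failed.indexOf("structure") >= 0,
    props: failed.indexOf("props") >= 0,
    soundCues: failed.indexOf("soundCues") >= 0,
    blocks: failed.indexOf("blocks") >= 0,
  };
}

// The phases that would be passed to repairJob(phases).
function selectedPhases(selection: PhaseSelection): RepairPhase[] {
  return REPAIR_PHASES.filter((phase) => selection[phase]);
}
```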

`WorkspaceSidebar` Integration

Prop: toolStatusMap: Partial<Record<ToolId, StatusIndicator>>
Behavior: SidebarTool renders a colored dot when statusIndicator is set:
- processing → blue pulsing dot
- warning → amber dot
- success → green dot
Precedence: Status dot takes visual priority over numeric badges.
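
The mapping and precedence rule can be expressed as one pure function. SidebarBadge and resolveSidebarBadge are hypothetical names introduced for this sketch; the real SidebarTool component may structure this differently.

```typescript
// Sketch only: SidebarBadge and resolveSidebarBadge are hypothetical.
type StatusIndicator = "processing" | "warning" | "success";

type SidebarBadge =
  | { kind: "dot"; color: "blue" | "amber" | "green"; pulsing: boolean }
  | { kind: "count"; value: number }
  | { kind: "none" };

function resolveSidebarBadge(
  statusIndicator: StatusIndicator | null | undefined,
  count = 0,
): SidebarBadge {
  // A status dot always wins over a numeric badge.
  switch (statusIndicator) {
    case "processing":
      return { kind: "dot", color: "blue", pulsing: true };
    case "warning":
      return { kind: "dot", color: "amber", pulsing: false };
    case "success":
      return { kind: "dot", color: "green", pulsing: false };
    default:
      return count > 0 ? { kind: "count", value: count } : { kind: "none" };
  }
}
```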

`JobProgressContext` (Simplified)

Only fires the initial "Script analysis started..." toast (auto-closes after 3s).
All ongoing/terminal status communication is delegated to the sidebar badge + toolbar chip.
Still loads completed/partial results into the Zustand store and navigates to Show Structure.

8. Development Guide

How to Monitor

View Firebase Functions logs:

```bash
firebase functions:log --only startOcrJob
firebase functions:log --only extractBatchWorker
firebase functions:log --only extractMetadataWorker
firebase functions:log --only repairScriptImport
firebase functions:log --only processPersonalScript
```

How to Test Prompts

Prompts are standard text files in functions/prompts/.

extract_blocks.prompt: Source of truth for block extraction.
extract_structure.prompt: Source for Structure/Characters.
extract_props.prompt: Source for Props (uses template).
extract_sound_cues.prompt: Source for Sound Cues (uses template).
system_instructions_cache.prompt: System instructions cached with the script context.

Deployment

To deploy updates to the pipeline:

```bash
cd functions
npm run build   # Compiles TS and copies prompts to lib/
firebase deploy --only functions
```

CRITICAL: Always run npm run build first. The prompts are not TypeScript files, so the compiler does not emit them; the build script copies them into lib/ using cpx.

PubSub Topics: The extract-batch and extract-metadata topics must exist in the GCP project. They are auto-created on first publish if the service account has pubsub.topics.create permission.

Secrets: MISTRAL_API_KEY must be set in Firebase Functions secrets (firebase functions:secrets:set MISTRAL_API_KEY).

Last updated: March 22, 2026 (One-Line-Per-Block standard)
