Lewis C. Lin’s Newsletter

Deep Dives

O4: Design a Structured Data Extraction Pipeline for Enterprise LLM Applications

How enterprise systems transform messy documents into LLM-ready structured data.

Dec 20, 2025
∙ Paid

When a user asks “What are the P0 bugs mentioned in last week’s standup?”, an LLM needs more than raw text: it needs structured data it can filter, relate, and reason over. This system sits between enterprise sources (Jira, Confluence, Outlook) and LLM applications, transforming unstructured documents into queryable, contextualized knowledge. The challenge: process millions of heterogeneous documents, extract precise entities and relationships, and serve results in milliseconds, all while maintaining accuracy across 10,000 isolated customer environments.


Requirements

Functional:

  • Ingest from multiple sources (Jira, Confluence, Outlook, Word)

  • Classify document types (bug report, meeting notes, design doc, email)

  • Extract structured fields (entities, dates, relationships)

  • Index for fast retrieval (keyword + semantic search)

  • Provide query API for LLM applications

Non-Functional (CLT: Consistency, Latency, Throughput):

  • Consistency: Eventual consistency (5-10 min indexing lag acceptable)

  • Latency: Search <500ms p95

  • Throughput: Ingestion at 10K docs/min; scale to 100M docs per customer across 10K customers (multi-tenant)

Out of Scope:

  • Real-time streaming (<1 min latency)

  • Custom ML models per customer

  • Document versioning/history


How LLM Applications Use This System

User Query Example: A user asks: “What are the P0 bugs assigned to the backend team mentioned in last week’s standup?”

What the LLM needs:

  • Structured filters: priority=P0, team=backend, type=bug

  • Relationships: mentioned_in=[meeting-notes-123]

  • Temporal context: meeting_date >= last_week
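As a concrete sketch, the parsed intent above can be turned into a filter object. Field names like `meeting_date_gte` are illustrative, not a documented API:

```python
from datetime import date, timedelta

def build_filters(today: date) -> dict:
    """Turn the parsed query intent into structured filters.

    Field names here are hypothetical; a real deployment would
    match the search API's actual filter schema.
    """
    last_week_start = today - timedelta(days=7)  # temporal context: "last week"
    return {
        "priority": "P0",         # structured filter
        "team": "backend",        # structured filter
        "doc_type": "bug_report",
        "meeting_date_gte": last_week_start.isoformat(),
    }
```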

What our system provides:

{
  "doc_id": "jira-456",
  "doc_type": "bug_report",
  "title": "API timeout in checkout service",
  "entities": {
    "teams": ["backend", "payments"],
    "priority": "P0",
    "status": "in_progress"
  },
  "dates": {
    "created": "2024-11-28",
    "updated": "2024-12-02"
  },
  "relationships": {
    "mentioned_in": ["meeting-123"],
    "blocks": ["jira-457"]
  },
  "summary": "P0 bug causing checkout timeouts"
}

LLM retrieval flow:

  1. LLM parses query → identifies filters (P0, backend, last week)

  2. Calls our search API with structured filters + semantic query

  3. Gets ranked documents with precise metadata

  4. LLM generates answer with citations
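The four steps above can be sketched end to end. Here `parse_filters` is a toy rule-based stand-in for the LLM's query parsing, and the search API is passed in as a plain function, both assumptions for illustration:

```python
def parse_filters(query: str) -> dict:
    # Step 1 (toy stand-in): in production the LLM extracts these filters
    filters = {}
    if "P0" in query:
        filters["priority"] = "P0"
    if "backend" in query:
        filters["team"] = "backend"
    return filters

def retrieve(query: str, search_fn) -> dict:
    filters = parse_filters(query)                    # step 1: parse query
    docs = search_fn(query, filters)                  # step 2: call search API
    ranked = sorted(docs, key=lambda d: d["score"], reverse=True)  # step 3: rank
    return {
        "docs": ranked,
        "citations": [d["doc_id"] for d in ranked],   # step 4: cite in answer
    }
```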


Data Model

RawDocuments (S3)
- id (UUID)
- customer_id (partition key)
- source_type (jira|confluence|outlook|word)
- content_hash (SHA256, for change detection)
- content_uri (S3 path)

StructuredDocuments (PostgreSQL)
- id (UUID)
- customer_id (indexed)
- doc_type (bug_report|meeting_notes|design_doc|email)
- title (text)
- entities (JSONB: people, teams, priority, status)
- dates (JSONB: created, updated, due)
- relationships (JSONB: mentioned_in, blocks, related_to)

Embeddings (Vector DB - Pinecone)
- doc_id
- customer_id (namespace)
- embedding (1536-dim vector)

SearchIndex (Elasticsearch)
- doc_id
- customer_id (filtered on every query)
- doc_type, title, entities (for filtering)
- full_text (analyzed for keyword search)
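Because `customer_id` must be filtered on every query, tenant isolation is best enforced by construction: the query builder always injects the tenant filter so callers cannot omit it. A sketch of a hypothetical Elasticsearch query builder:

```python
def build_search_query(customer_id: str, text: str, filters: dict) -> dict:
    # Elasticsearch bool query; the customer_id term filter is mandatory
    # on every request so one tenant can never see another's documents.
    term_filters = [{"term": {"customer_id": customer_id}}]
    term_filters += [{"term": {k: v}} for k, v in filters.items()]
    return {
        "query": {
            "bool": {
                "must": [{"match": {"full_text": text}}],
                "filter": term_filters,
            }
        }
    }
```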

API Design
