O4: Design a Structured Data Extraction Pipeline for Enterprise LLM Applications
How enterprise systems transform messy documents into LLM-ready structured data.
When a user asks “What are the P0 bugs mentioned in last week’s standup?”, an LLM needs more than raw text—it needs structured data it can filter, relate, and reason over. This system sits between enterprise sources (Jira, Confluence, Outlook) and LLM applications, transforming unstructured documents into queryable, contextualized knowledge. The challenge: process millions of heterogeneous documents, extract precise entities and relationships, and serve results in milliseconds—all while maintaining accuracy across 10,000 isolated customer environments.
Requirements
Functional:
Ingest from multiple sources (Jira, Confluence, Outlook, Word)
Classify document types (bug report, meeting notes, design doc, email)
Extract structured fields (entities, dates, relationships)
Index for fast retrieval (keyword + semantic search)
Provide query API for LLM applications
Non-Functional (CLT):
Consistency: Eventual consistency (5-10 min lag acceptable)
Latency: search <500ms p95
Throughput: ingestion at 10K docs/min; scale to 100M docs per customer across 10K customers (multi-tenant)
Out of Scope:
Real-time streaming (<1 min latency)
Custom ML models per customer
Document versioning/history
How LLM Applications Use This System
User Query Example: User asks our system: “What are the P0 bugs assigned to the backend team mentioned in last week’s standup?”
What the LLM needs:
Structured filters: priority=P0, team=backend, type=bug
Relationships: mentioned_in=[meeting-notes-123]
Temporal context: meeting_date >= last_week
What our system provides:
{
  "doc_id": "jira-456",
  "doc_type": "bug_report",
  "title": "API timeout in checkout service",
  "entities": {
    "teams": ["backend", "payments"],
    "priority": "P0",
    "status": "in_progress"
  },
  "dates": {
    "created": "2024-11-28",
    "updated": "2024-12-02"
  },
  "relationships": {
    "mentioned_in": ["meeting-123"],
    "blocks": ["jira-457"]
  },
  "summary": "P0 bug causing checkout timeouts"
}
LLM retrieval flow:
LLM parses query → identifies filters (P0, backend, last week)
Calls our search API with structured filters + semantic query
Gets ranked documents with precise metadata
LLM generates answer with citations
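The retrieval flow above can be sketched end to end. This is a minimal illustration, not the real API: `SearchRequest`, `search_documents`, and the in-memory corpus are all hypothetical stand-ins for the search service, showing how structured filters narrow results before ranking.

```python
# Sketch of the retrieval flow: structured filters + semantic query → ranked docs.
# SearchRequest/search_documents are illustrative names, not a real client API.
from dataclasses import dataclass, field

@dataclass
class SearchRequest:
    customer_id: str
    semantic_query: str                           # embedded, matched against the vector index
    filters: dict = field(default_factory=dict)   # exact-match metadata filters

def search_documents(req: SearchRequest) -> list[dict]:
    """Stub for the search API: apply metadata filters to StructuredDocuments."""
    corpus = [
        {"doc_id": "jira-456", "doc_type": "bug_report",
         "entities": {"priority": "P0", "teams": ["backend", "payments"]},
         "relationships": {"mentioned_in": ["meeting-123"]}},
        {"doc_id": "jira-789", "doc_type": "bug_report",
         "entities": {"priority": "P2", "teams": ["frontend"]}},
    ]
    def matches(doc: dict) -> bool:
        ents = doc.get("entities", {})
        # A filter matches on an exact value, or membership in a list field.
        return all(
            ents.get(k) == v or (isinstance(ents.get(k), list) and v in ents[k])
            for k, v in req.filters.items()
        )
    return [d for d in corpus if matches(d)]

# Steps 1-2: the LLM's parsed filters become a structured search request.
req = SearchRequest(
    customer_id="acme",
    semantic_query="P0 bugs from last week's standup",
    filters={"priority": "P0", "teams": "backend"},
)
results = search_documents(req)
print([d["doc_id"] for d in results])  # ['jira-456']
```

The precise metadata on each result (step 3) is what lets the LLM cite `jira-456` directly in its answer rather than quoting raw text.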
Data Model
RawDocuments (S3)
- id (UUID)
- customer_id (partition key)
- source_type (jira|confluence|outlook|word)
- content_hash (SHA256, for change detection)
- content_uri (S3 path)
StructuredDocuments (PostgreSQL)
- id (UUID)
- customer_id (indexed)
- doc_type (bug_report|meeting_notes|design_doc|email)
- title (text)
- entities (JSONB: people, teams, priority, status)
- dates (JSONB: created, updated, due)
- relationships (JSONB: mentioned_in, blocks, related_to)
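Because `entities` is JSONB, structured filters from the LLM can translate directly into a containment query. A sketch of how that query might be built, using PostgreSQL's `@>` containment operator; the table and column names mirror the schema above, and `build_sql` is a hypothetical helper:

```python
# Translate an LLM's structured filters into a parameterized PostgreSQL query.
# entities @> '{"priority": "P0"}' matches rows whose JSONB contains that pair.
import json

def build_sql(customer_id: str, doc_type: str, entity_filters: dict) -> tuple[str, list]:
    sql = (
        "SELECT id, title, entities FROM structured_documents "
        "WHERE customer_id = %s AND doc_type = %s AND entities @> %s::jsonb"
    )
    # Parameterized values: tenant scope, type filter, JSONB containment payload.
    return sql, [customer_id, doc_type, json.dumps(entity_filters)]

sql, params = build_sql("acme", "bug_report", {"priority": "P0"})
```

A GIN index on `entities` keeps `@>` lookups fast; note the query is always scoped by `customer_id` first.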
Embeddings (Vector DB - Pinecone)
- doc_id
- customer_id (namespace)
- embedding (1536-dim vector)
SearchIndex (Elasticsearch)
- doc_id
- customer_id (filtered on every query)
- doc_type, title, entities (for filtering)
- full_text (analyzed for keyword search)
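Since `customer_id` must be filtered on every query, tenant isolation belongs in the query builder, not in caller discipline. A sketch of how a request to the search index could be assembled, following Elasticsearch's bool query DSL; the field names mirror the SearchIndex schema above:

```python
# Build a tenant-scoped Elasticsearch-style query: keyword relevance on
# full_text, plus non-scoring exact filters (customer_id is never optional).
def build_search_query(customer_id: str, text: str, filters: dict) -> dict:
    must_filters = [{"term": {"customer_id": customer_id}}]  # mandatory tenant scope
    must_filters += [{"term": {field: value}} for field, value in filters.items()]
    return {
        "query": {
            "bool": {
                "must": [{"match": {"full_text": text}}],  # scored keyword match
                "filter": must_filters,                    # exact, cached, non-scoring
            }
        }
    }

q = build_search_query("acme", "checkout timeout", {"doc_type": "bug_report"})
```

Putting the tenant term in `filter` (rather than `must`) keeps it out of relevance scoring and lets the index cache it across queries; the Pinecone side achieves the same isolation with per-customer namespaces.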
