The Production Gap
Every enterprise AI project we have worked on follows the same arc. A proof of concept gets built in two weeks, it impresses stakeholders, and then the question becomes: how do we put this in production? That is where the real engineering begins.
The demo works because demos use curated inputs, ignore error cases, and have no users depending on them. Production systems have to handle malformed inputs, API outages, token limits, cost overruns, and compliance requirements. This post covers the patterns we use to bridge that gap.
Pattern 1: Structured Output Enforcement
The single biggest source of production failures in LLM-powered applications is unstructured output. If your application expects JSON and the model returns a markdown code block with JSON inside it, your parser breaks.
The solution is not better prompting — it is output enforcement at the API level. OpenAI's function calling and structured outputs features, combined with a validation layer using Zod or Pydantic, give you a guarantee that either the model returns valid output matching your schema, or the call fails with a recoverable error.
// Example: Enforcing structured extraction with Zod
import OpenAI from 'openai';
import { z } from 'zod';
import { zodResponseFormat } from 'openai/helpers/zod';

const openai = new OpenAI();

const extractionSchema = z.object({
  companyName: z.string(),
  invoiceNumber: z.string(),
  lineItems: z.array(z.object({
    description: z.string(),
    amount: z.number()
  })),
  totalAmount: z.number()
});

const result = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [...], // your extraction prompt and document content
  response_format: zodResponseFormat(extractionSchema, 'invoice_extraction')
});
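With enforcement at the API level, the validation layer is a single parse on the way out. A minimal sketch continuing the example above, assuming the standard chat completions response shape:

// Validate the response against the same schema before it reaches application code
const raw = result.choices[0].message.content ?? '';
const invoice = extractionSchema.parse(JSON.parse(raw)); // throws a recoverable ZodError on any mismatch
console.log(invoice.lineItems.length, invoice.totalAmount);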
Pattern 2: Semantic Caching
LLM API calls are expensive. A typical GPT-4o call with a 2,000-token prompt costs roughly $0.01. If your application processes 50,000 similar documents per month, that is $500 in API costs before you account for output tokens.
Semantic caching works by embedding the incoming query, checking for similar queries in a vector store (Pinecone, pgvector, or Redis with RediSearch), and returning the cached response if the similarity score exceeds a threshold. Unlike exact-match caching, semantic caching handles paraphrased queries.
The implementation detail that most teams miss: cache invalidation. When you change your prompt template, all semantic cache entries for that prompt version are stale. Version your prompts and scope your cache keys to the prompt version.
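A minimal sketch of the lookup path, with the cache scoped to a prompt version as described above. The embed() helper and the db vector-store client are assumed stand-ins, not a specific library API:

// Semantic cache: nearest-neighbour lookup scoped to the current prompt version
declare function embed(text: string): Promise<number[]>; // e.g. an embeddings API call
declare const db: {
  nearestNeighbor(q: { embedding: number[]; scope: string; minSimilarity: number }): Promise<{ response: string } | null>;
  insert(row: { scope: string; embedding: number[]; query: string; response: string }): Promise<void>;
};

const PROMPT_VERSION = 'invoice-extraction-v3'; // bump this whenever the prompt template changes
const SIMILARITY_THRESHOLD = 0.92;

async function cachedCompletion(query: string, callModel: (q: string) => Promise<string>): Promise<string> {
  const embedding = await embed(query);
  const hit = await db.nearestNeighbor({ embedding, scope: PROMPT_VERSION, minSimilarity: SIMILARITY_THRESHOLD });
  if (hit) return hit.response; // a similar query was already answered under this prompt version

  const response = await callModel(query); // cache miss: pay for the real call
  await db.insert({ scope: PROMPT_VERSION, embedding, query, response });
  return response;
}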
Pattern 3: The RAG Pipeline with Source Grounding
Retrieval-Augmented Generation is now the standard pattern for enterprise knowledge bases. The basic pattern — chunk documents, embed them, retrieve relevant chunks on query, inject into prompt — is well understood. The production challenges are in the details.
Chunking strategy matters more than model choice. Fixed-size chunks lose semantic coherence at boundaries. Sentence-based chunking works better for prose but breaks down for structured documents like contracts. We typically use recursive character splitting with sentence boundary detection, with chunk sizes tuned to the document type.
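As a simplified illustration of the sentence-boundary half of that approach (not a full recursive splitter), chunks are packed from whole sentences up to a target size:

// Sentence-aware chunking: never cut a sentence in half, cap chunks at a target size
function chunkBySentence(text: string, maxChars = 1200): string[] {
  const sentences = text.split(/(?<=[.!?])\s+/); // naive boundary detection; tune per document type
  const chunks: string[] = [];
  let current = '';
  for (const sentence of sentences) {
    if (current && current.length + sentence.length + 1 > maxChars) {
      chunks.push(current);
      current = sentence;
    } else {
      current = current ? `${current} ${sentence}` : sentence;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}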
Retrieval ranking is not just cosine similarity. A hybrid retrieval approach combining vector search with BM25 keyword search consistently outperforms pure semantic search, particularly for queries that contain specific terms (product names, SKUs, contract clauses) that semantic embeddings can miss.
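One common way to merge the two result lists is reciprocal rank fusion; treat this as one workable sketch rather than the only option (k = 60 is the conventional constant):

// Reciprocal rank fusion: combine vector-search and BM25 rankings into a single ordering
function reciprocalRankFusion(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((docId, rank) => {
      scores.set(docId, (scores.get(docId) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return [...scores.entries()].sort((a, b) => b[1] - a[1]).map(([docId]) => docId);
}

// Usage: reciprocalRankFusion([vectorResults, keywordResults]).slice(0, 10)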
Source attribution is non-negotiable in enterprise contexts. If your RAG system tells a user something, it must be able to cite exactly which document and which section that information came from. This requires storing document metadata alongside embeddings and including citation logic in your response generation prompt.
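Concretely, that means each stored chunk carries the fields needed to reconstruct a citation. A sketch of the record shape (field names are illustrative, not a fixed schema):

// Every embedded chunk keeps enough metadata to cite the source document and section
interface ChunkRecord {
  id: string;
  embedding: number[];
  text: string;
  documentId: string;
  documentTitle: string;
  sectionHeading: string; // used to render "[Title, Section]" citations in answers
  pageNumber?: number;
}

The response generation prompt then instructs the model to answer only from the supplied chunks and to cite them by documentTitle and sectionHeading.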
Pattern 4: Agentic Tool Use with Guardrails
LLM agents — systems where the model can call external tools and act on the results — are powerful but dangerous in production environments. The key design principle: agents should have narrow authority.
Every tool the agent can call should be the minimum necessary to complete the task. An agent that can read from a database should not be able to write to it unless the task explicitly requires writes. An agent that can send emails should require human approval for emails to external addresses.
The implementation pattern we use (a sketch in code follows the list):
- Define tools with explicit schemas and descriptions.
- Implement a tool authorization layer that checks the calling context against allowed tool permissions.
- Log every tool call with inputs, outputs, and the reasoning step that triggered it.
- Implement a maximum step limit to prevent infinite loops.
- Add a human-in-the-loop checkpoint for any action that is irreversible (deleting records, sending notifications, making payments).
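A minimal sketch of the agent loop under those rules; the planner, authorization check, approval gate, and audit log are assumed stand-ins for whatever your stack provides:

// Agent loop with narrow authority: authorization, audit logging, a step cap, and human approval
interface ToolCall { name: string; args: Record<string, unknown>; reasoning: string; }

declare function nextToolCall(history: string[]): Promise<ToolCall | null>; // the model plans the next step
declare function isAuthorized(tool: string, context: { tenantId: string }): boolean;
declare function isIrreversible(tool: string): boolean; // deletes, notifications, payments
declare function requestHumanApproval(call: ToolCall): Promise<boolean>;
declare function executeTool(call: ToolCall): Promise<string>;
declare function auditLog(entry: object): Promise<void>;

const MAX_STEPS = 10; // hard cap against infinite tool-call loops

async function runAgent(task: string, context: { tenantId: string }): Promise<string[]> {
  const history: string[] = [task];
  for (let step = 0; step < MAX_STEPS; step++) {
    const call = await nextToolCall(history);
    if (!call) break; // the model signalled it is finished
    if (!isAuthorized(call.name, context)) {
      throw new Error(`Tool ${call.name} is not permitted in this context`);
    }
    if (isIrreversible(call.name) && !(await requestHumanApproval(call))) {
      history.push(`Skipped ${call.name}: human approval denied`);
      continue;
    }
    const output = await executeTool(call);
    await auditLog({ step, call, output }); // inputs, outputs, and the reasoning that triggered the call
    history.push(output);
  }
  return history;
}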
Pattern 5: Cost and Latency Monitoring
In a production LLM system, cost is an operational concern, not just a budget concern. A single poorly designed prompt that processes large documents can consume your monthly API budget in a day.
Every LLM call in your system should emit metrics: token usage (input and output separately), model used, latency, and whether the response was a cache hit or a miss. Set alerting thresholds on per-operation cost and total daily spend. Track cost per tenant in multi-tenant systems so you can bill accurately and spot tenants with unusual usage patterns.
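A minimal sketch of a wrapper that emits those metrics on every call; the metrics.emit() sink is an assumed stand-in for whatever telemetry backend you use:

// Wrap every chat completion so token usage, model, latency, and cache status are always recorded
import OpenAI from 'openai';

declare const metrics: { emit(name: string, fields: Record<string, unknown>): void };

const client = new OpenAI();

async function measuredCompletion(
  params: OpenAI.Chat.Completions.ChatCompletionCreateParamsNonStreaming,
  labels: { operation: string; tenantId: string; cacheHit: boolean }
) {
  const start = Date.now();
  const response = await client.chat.completions.create(params);
  metrics.emit('llm_call', {
    ...labels,
    model: params.model,
    inputTokens: response.usage?.prompt_tokens,     // input and output tracked separately
    outputTokens: response.usage?.completion_tokens,
    latencyMs: Date.now() - start,
  });
  return response;
}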
The Governance Question
Enterprise buyers will ask: how do we know what the AI decided and why? Your answer needs to be an audit trail, not an explanation. Every AI-assisted decision should log the prompt template version used, the retrieved context (for RAG), the model response, the structured output, and the downstream action taken. This is not optional in regulated industries — it is the difference between a sale and a lost deal.
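In practice the trail is a structured record written on every AI-assisted decision, before the downstream action executes. A sketch of the shape (field names are illustrative):

// One audit record per AI-assisted decision
interface AiAuditRecord {
  requestId: string;
  timestamp: string;
  promptTemplateVersion: string; // which prompt version produced this decision
  retrievedContext: string[];    // chunk IDs or excerpts injected for RAG calls
  model: string;
  rawResponse: string;
  structuredOutput: unknown;     // the validated, schema-conforming result
  downstreamAction: string;      // what the system did as a result
}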
