A Developer's Guide to Generative AI Integration in Production Apps

Beyond the Demo

Every developer has seen impressive AI demos, but shipping generative AI in production is a different challenge entirely. After integrating LLMs into three production applications over the past year, I have encountered a consistent set of problems that demos never surface: unpredictable latency, hallucinated data, cost overruns, and users who don't understand what the AI can and cannot do.

The gap between a working prototype and a reliable production feature is wider with LLMs than almost any other technology I have worked with. A demo can tolerate a 5-second response time, occasional nonsense outputs, and unlimited API spend. Production cannot.

Designing the Prompt Layer

The single most impactful architectural decision is treating prompts as versioned, testable artifacts rather than inline strings. We maintain a prompt registry where each prompt has a name, version number, template with typed variables, and a suite of test cases. Deployments can roll back to a previous prompt version without changing application code.

interface PromptConfig {
  version: number;
  model: string;
  maxTokens: number;
  temperature: number;
  template: string;
  testCases: Array<{
    input: Record<string, string>;
    expectedContains: string[];
    expectedNotContains?: string[];
  }>;
}

const promptRegistry: Record<string, PromptConfig> = {
  "summarize-notes": {
    version: 3,
    model: "gpt-4o",
    maxTokens: 500,
    temperature: 0.3,
    template: `You are a medical documentation assistant. Summarize the following clinical notes in exactly 3 bullet points.

Focus on: diagnosis, treatment plan, and follow-up actions.
Do NOT include patient identifiers or speculative information.

Notes:
{{notes}}

Output format:
- Diagnosis: ...
- Treatment: ...
- Follow-up: ...`,
    testCases: [
      {
        input: {
          notes: "Patient presents with persistent cough for 2 weeks...",
        },
        expectedContains: ["Diagnosis:", "Treatment:", "Follow-up:"],
        expectedNotContains: ["patient name", "SSN"],
      },
    ],
  },
};

function renderPrompt(name: string, variables: Record<string, string>): string {
  const config = promptRegistry[name];
  if (!config) throw new Error(`Unknown prompt: ${name}`);

  let rendered = config.template;
  for (const [key, value] of Object.entries(variables)) {
    // split/join replaces every occurrence of the placeholder and avoids
    // String.replace's special handling of "$" sequences in substituted values.
    rendered = rendered.split(`{{${key}}}`).join(value);
  }
  return rendered;
}

We run the test cases in CI against every prompt change. If a prompt modification causes test cases to fail, the deployment is blocked. This has prevented several regressions where a well-intentioned prompt tweak for one use case broke the output format for another.
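
A sketch of what that check can look like in a CI script (illustrative, not our exact harness):

import OpenAI from "openai";

// Hypothetical CI runner: render each registered prompt's test cases, call the
// model directly, and assert the output contains / omits the expected strings.
async function runPromptTests(): Promise<boolean> {
  const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment
  let allPassed = true;

  for (const [name, config] of Object.entries(promptRegistry)) {
    for (const testCase of config.testCases) {
      const completion = await openai.chat.completions.create({
        model: config.model,
        max_tokens: config.maxTokens,
        temperature: config.temperature,
        messages: [{ role: "user", content: renderPrompt(name, testCase.input) }],
      });
      const output = completion.choices[0]?.message?.content ?? "";

      const missing = testCase.expectedContains.filter((s) => !output.includes(s));
      const forbidden = (testCase.expectedNotContains ?? []).filter((s) => output.includes(s));
      if (missing.length > 0 || forbidden.length > 0) {
        console.error(`FAIL ${name} v${config.version}`, { missing, forbidden });
        allPassed = false;
      }
    }
  }
  return allPassed; // a false return blocks the deployment in CI
}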

Building a Resilient LLM Client

LLM APIs fail in ways that traditional APIs do not. Timeouts are longer, rate limits are stricter, and responses can be technically valid JSON but semantically nonsensical. We wrap every LLM call in a structured client with retry logic, circuit breaking, and response validation; the excerpt below focuses on the caching and circuit-breaker pieces.

import OpenAI from "openai";

interface LLMResponse {
  content: string;
  model: string;
  tokensUsed: { prompt: number; completion: number; total: number };
  latencyMs: number;
  cached: boolean;
}

class LLMClient {
  private openai: OpenAI;
  private cache: Map<string, { response: LLMResponse; expiry: number }>;
  private failureCount: number = 0;
  private circuitOpen: boolean = false;
  private circuitResetTime: number = 0;

  constructor(apiKey: string) {
    this.openai = new OpenAI({ apiKey });
    this.cache = new Map();
  }

  async complete(
    promptName: string,
    variables: Record<string, string>,
    options: { skipCache?: boolean; timeout?: number } = {}
  ): Promise<LLMResponse> {
    // Circuit breaker check
    if (this.circuitOpen) {
      if (Date.now() < this.circuitResetTime) {
        throw new Error("LLM service circuit breaker is open");
      }
      this.circuitOpen = false;
      this.failureCount = 0;
    }

    const config = promptRegistry[promptName];
    const prompt = renderPrompt(promptName, variables);

    // Check the response cache (exact match on prompt name + variables)
    const cacheKey = `${promptName}:${JSON.stringify(variables)}`;
    if (!options.skipCache) {
      const cached = this.cache.get(cacheKey);
      if (cached && cached.expiry > Date.now()) {
        return { ...cached.response, cached: true };
      }
      if (cached) this.cache.delete(cacheKey); // evict expired entries
    }

    const startTime = Date.now();

    try {
      const response = await this.openai.chat.completions.create(
        {
          model: config.model,
          max_tokens: config.maxTokens,
          temperature: config.temperature,
          messages: [{ role: "user", content: prompt }],
        },
        // Honor the caller's per-request timeout (milliseconds); the SDK default applies when unset.
        { timeout: options.timeout }
      );

      const result: LLMResponse = {
        content: response.choices[0]?.message?.content ?? "",
        model: config.model,
        tokensUsed: {
          prompt: response.usage?.prompt_tokens ?? 0,
          completion: response.usage?.completion_tokens ?? 0,
          total: response.usage?.total_tokens ?? 0,
        },
        latencyMs: Date.now() - startTime,
        cached: false,
      };

      // Cache successful responses for 10 minutes
      this.cache.set(cacheKey, {
        response: result,
        expiry: Date.now() + 10 * 60 * 1000,
      });

      this.failureCount = 0;
      return result;
    } catch (error) {
      this.failureCount++;
      if (this.failureCount >= 5) {
        this.circuitOpen = true;
        this.circuitResetTime = Date.now() + 30_000; // 30-second cooldown
      }
      throw error;
    }
  }
}

The circuit breaker has saved us from cascading failures twice in production. When OpenAI experienced elevated error rates, our application gracefully switched to the fallback path instead of queuing up hundreds of timed-out requests that would have consumed all available connections.

Managing Cost at Scale

Token costs add up quickly when you have thousands of users. We implemented a tiered approach: simple classification tasks use a smaller, cheaper model, while complex generation tasks use GPT-4o. A caching layer intercepts semantically similar requests using embedding-based similarity search, reducing redundant API calls by about 35%.
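
The similarity lookup itself is conceptually simple. A minimal sketch, assuming the openai SDK, an in-memory store, and an illustrative similarity threshold (none of these names come from our production code):

import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

interface SemanticCacheEntry {
  embedding: number[];
  response: string;
  expiry: number;
}

const semanticCache: SemanticCacheEntry[] = [];
const SIMILARITY_THRESHOLD = 0.95; // tune per use case

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

async function embed(text: string): Promise<number[]> {
  const res = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: text,
  });
  return res.data[0].embedding;
}

// Return a cached response for a semantically similar earlier query, if any.
async function findSemanticMatch(query: string): Promise<string | null> {
  const queryEmbedding = await embed(query);
  const now = Date.now();
  for (const entry of semanticCache) {
    if (entry.expiry < now) continue;
    if (cosineSimilarity(queryEmbedding, entry.embedding) >= SIMILARITY_THRESHOLD) {
      return entry.response;
    }
  }
  return null;
}

// Store a response so future similar queries can skip the completion call.
async function rememberResponse(query: string, response: string, ttlMs = 10 * 60 * 1000): Promise<void> {
  semanticCache.push({ embedding: await embed(query), response, expiry: Date.now() + ttlMs });
}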

The numbers are concrete. Before optimization, our monthly LLM spend for 3,000 daily active users was approximately $4,200. After implementing model tiering, caching, and prompt length optimization, we brought it down to $1,800. The breakdown:

  • Model tiering saved roughly 40%. Tasks like sentiment classification and category tagging moved from GPT-4o to GPT-4o-mini, which costs a fraction per token with no measurable quality difference for structured classification.
  • Semantic caching saved roughly 25%. Customer support applications generate many similar queries. Caching responses for semantically equivalent inputs eliminated redundant API calls.
  • Prompt optimization saved roughly 15%. We audited every prompt template and removed unnecessary instructions, redundant examples, and verbose system messages. Shorter prompts mean fewer input tokens.

// Cost tracking middleware that logs token usage per feature and tenant.
// MODEL_PRICING, db, getMonthlySpend, TENANT_BUDGET_THRESHOLD, and notifyTenantAdmin
// are application-level helpers defined elsewhere.
async function trackLLMCost(
  tenantId: string,
  feature: string,
  response: LLMResponse
): Promise<void> {
  const costPerInputToken = MODEL_PRICING[response.model]?.input ?? 0;
  const costPerOutputToken = MODEL_PRICING[response.model]?.output ?? 0;

  const totalCost =
    response.tokensUsed.prompt * costPerInputToken +
    response.tokensUsed.completion * costPerOutputToken;

  await db.query(
    `INSERT INTO llm_usage_log (tenant_id, feature, model, prompt_tokens, completion_tokens, cost_usd, latency_ms, cached, created_at)
     VALUES ($1, $2, $3, $4, $5, $6, $7, $8, NOW())`,
    [tenantId, feature, response.model, response.tokensUsed.prompt, response.tokensUsed.completion, totalCost, response.latencyMs, response.cached]
  );

  // Check if tenant is approaching their monthly budget
  const monthlySpend = await getMonthlySpend(tenantId);
  if (monthlySpend > TENANT_BUDGET_THRESHOLD) {
    await notifyTenantAdmin(tenantId, monthlySpend);
  }
}

Logging every API call with its cost has been invaluable for identifying optimization opportunities. We discovered that one feature accounted for 60% of our total spend because it was sending the entire conversation history as context on every request. Trimming the history to the last 10 messages cut that feature's cost by 70%.
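
The fix itself was a few lines. A simplified version, with an assumed message shape:

interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Keep the system message plus only the most recent turns as context.
// The 10-message window is the value mentioned above, not a universal constant.
function trimHistory(history: ChatMessage[], maxMessages = 10): ChatMessage[] {
  const system = history.filter((m) => m.role === "system");
  const turns = history.filter((m) => m.role !== "system");
  return [...system, ...turns.slice(-maxMessages)];
}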

Error Handling and Fallbacks

Every feature that uses AI has a graceful degradation path that works without it. This is a non-negotiable design principle. If the LLM API is down, slow, or returns garbage, the user should still be able to accomplish their task, just without the AI assistance.

For our note summarization feature, the fallback extracts the first three sentences of the notes as a basic summary. It is not as good as the LLM-generated version, but it gives the user something useful immediately. For the classification feature, the fallback presents the user with a manual category selector instead of auto-classifying.
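
A simplified version of that fallback:

// Non-AI fallback: use the first few sentences of the notes as a basic summary.
// The sentence splitter is deliberately naive; a production version would handle
// abbreviations and numbered lists.
function fallbackSummary(notes: string, sentenceCount = 3): string {
  const sentences = notes.match(/[^.!?]+[.!?]+/g) ?? [notes];
  return sentences.slice(0, sentenceCount).map((s) => s.trim()).join(" ");
}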

We also validate every LLM response against a schema before presenting it to the user. The model is instructed to return structured output, but it sometimes deviates. A Zod schema validates the response, and if validation fails, we retry once with a more explicit prompt before falling back.
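
A sketch of that validate-then-retry flow, using an illustrative schema and hypothetical prompt names:

import { z } from "zod";

// Illustrative schema for a feature that expects structured JSON output.
const classificationSchema = z.object({
  category: z.string(),
  confidence: z.number().min(0).max(1),
});

async function classifyWithValidation(
  llm: LLMClient,
  text: string
): Promise<z.infer<typeof classificationSchema> | null> {
  // First attempt uses the normal prompt; the retry uses a stricter variant.
  // Both prompt names are hypothetical registry entries.
  for (const promptName of ["classify-ticket", "classify-ticket-strict"]) {
    const response = await llm.complete(promptName, { text });
    try {
      const parsed = classificationSchema.safeParse(JSON.parse(response.content));
      if (parsed.success) return parsed.data;
    } catch {
      // Not valid JSON at all; fall through to the retry or the fallback.
    }
  }
  return null; // Caller falls back to the manual category selector.
}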

Common Pitfalls

Streaming without backpressure. Streaming LLM responses to the client improves perceived latency, but without backpressure handling, a slow client connection can cause memory to balloon on the server. We use Node.js readable streams with proper highWaterMark settings and abort the upstream LLM request if the client disconnects.
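
A minimal sketch of the abort-on-disconnect and backpressure handling, assuming an Express-style handler and the openai SDK's streaming API (route shape and model are illustrative):

import OpenAI from "openai";
import type { Request, Response } from "express";

const openai = new OpenAI();

async function streamCompletion(req: Request, res: Response): Promise<void> {
  const abort = new AbortController();
  // If the client goes away, abort the upstream request so we stop paying for tokens.
  req.on("close", () => abort.abort());

  const stream = await openai.chat.completions.create(
    {
      model: "gpt-4o-mini",
      stream: true,
      messages: [{ role: "user", content: String(req.query.q ?? "") }],
    },
    { signal: abort.signal }
  );

  res.setHeader("Content-Type", "text/plain; charset=utf-8");
  try {
    for await (const chunk of stream) {
      const text = chunk.choices[0]?.delta?.content ?? "";
      if (!text) continue;
      // res.write() returns false when the client's buffer is full;
      // wait for 'drain' instead of letting server memory balloon.
      if (!res.write(text)) {
        await new Promise<void>((resolve) => res.once("drain", resolve));
      }
    }
  } catch (err) {
    // An abort after client disconnect lands here; anything else is rethrown.
    if (!abort.signal.aborted) throw err;
  } finally {
    res.end();
  }
}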

Trusting the model's confidence. LLMs do not have calibrated confidence. A response that sounds authoritative can be completely fabricated. We never use LLM output for decisions that require factual accuracy without a verification step. For medical note summarization, every AI-generated summary includes a disclaimer and links back to the source notes.

Ignoring prompt injection. User input that gets interpolated into prompts can manipulate the model's behavior. We sanitize all user inputs before template interpolation, limit input length, and use system messages to instruct the model to ignore contradictory instructions in the user content.
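
A minimal sketch of the input hygiene step (the specific rules and length limit are illustrative):

const MAX_INPUT_LENGTH = 4_000; // characters; illustrative limit

function sanitizeUserInput(input: string): string {
  return input
    .slice(0, MAX_INPUT_LENGTH) // cap input length before it hits the template
    .replace(/\{\{|\}\}/g, "") // strip template delimiters so input cannot masquerade as a variable
    .replace(/[\u0000-\u0008\u000B\u000C\u000E-\u001F]/g, ""); // drop control characters
}

// Usage: sanitize before template interpolation, and keep guardrails in a system message.
// renderPrompt("summarize-notes", { notes: sanitizeUserInput(rawNotes) });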

Not setting token limits. Without explicit max_tokens, a model can generate thousands of tokens for a simple request. We set conservative limits on every prompt configuration and monitor for responses that consistently hit the limit, which usually indicates the prompt needs restructuring.

Underestimating latency variance. LLM response times can range from 500ms to 15 seconds for the same prompt depending on load. We set aggressive timeouts (8 seconds for user-facing features) and always show a loading state with a progress indicator. Users tolerate waiting when they can see something is happening, but they abandon the feature if it appears frozen.
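
In code, user-facing calls pass that ceiling through the client options (the 8-second figure matches the number above):

async function summarizeNotesForUser(llm: LLMClient, notes: string): Promise<string> {
  // Aggressive timeout for user-facing paths; the caller shows a progress indicator
  // while this resolves and switches to the non-AI fallback on timeout or error.
  const response = await llm.complete("summarize-notes", { notes }, { timeout: 8_000 });
  return response.content;
}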

Integrating generative AI into production is fundamentally an engineering discipline, not a prompt-writing exercise. The model is the easy part. The reliability layer, cost management, error handling, and user experience design around it determine whether the feature ships successfully or gets reverted after a week.
