
27 Mar 2026

Semantic Caching in Production

Repeated user intents can quietly inflate LLM cost and latency. Semantic caching helps, but production use comes with trade-offs.

[Diagram: a semantic caching workflow with vector search, a similarity threshold, and cache hit/miss paths]

If you’re running an LLM in production, you’ve probably seen this in your logs and in your bill:

Users ask the same question repeatedly, just phrased differently.

On something like an HR system, this can show up as:

  • “How do I file expenses?”
  • “Where do I file a reimbursement?”
  • “What’s the expense process?”

All roughly mean the same thing, but each one still triggers a fresh LLM call. That means paying for the same answer multiple times while adding unnecessary latency.

One pattern that helps is semantic caching.

The idea is simple: when a query comes in, embed it and compare the embedding to those of previous queries in a vector database (usually with cosine similarity). If the new query is similar enough to a cached one, return the cached response instead of calling the LLM.

A Redis cache hit can come back in around 20 to 30ms, compared to seconds for an LLM call. And you avoid paying for the same tokens again.

Finding the right similarity threshold takes iteration, but around 0.88 to 0.92 is a practical place to start.

What gets interesting is making this reliable in production. This is where semantic caching becomes less of a simple optimization and more of a trade-off system.

Not all queries are the same

Queries have different lifecycles.

  • “What’s the weather today?” can be stale in minutes.
  • “What is our sick leave policy?” might stay valid for days or months.

If you apply one TTL to everything, you’ll either serve stale answers or miss savings. Redis makes TTL easy. Choosing the right TTL policy is the hard part.
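One sketch of a per-class TTL policy, rather than a single global TTL. The categories and durations here are illustrative assumptions, not recommendations:

```python
# Map a query's category to a cache lifetime. These buckets and
# durations are illustrative; tune them against your own data.
TTL_BY_CATEGORY = {
    "realtime": 60,           # e.g. weather: stale in minutes
    "operational": 3600,      # e.g. ticket status: stale within hours
    "policy": 7 * 24 * 3600,  # e.g. sick leave policy: valid for days+
}

def ttl_for(category: str) -> int:
    # Conservative default for unclassified queries
    return TTL_BY_CATEGORY.get(category, 300)

# With redis-py, the TTL is passed in seconds via `ex` on SET:
#   r.set(cache_key, response, ex=ttl_for(category))
```

Classifying queries into these buckets is itself a design decision (rules, a small classifier, or metadata on the knowledge source), and that is where the real work lives.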

Personalization complicates everything

Some queries look identical on the surface but are not interchangeable.

  • “How many sick days do I have?” depends on the individual.
  • “What benefits do I get?” can vary by region or role.

Without proper cache scoping (for example user_id, org, role, region), you can return the wrong answer. In the worst case, you leak data.
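One way to enforce scoping is to bake the scope into the cache key itself, so that identical questions from different users or regions can never collide. The field names below are illustrative:

```python
import hashlib

def scoped_cache_key(query: str, *, org: str, role: str = "any",
                     region: str = "any", user_id: str = "shared") -> str:
    # Build a key that encodes the scope a response is valid for.
    # A personal query ("my sick days") should pass user_id; a role-
    # or region-specific one should pass role/region.
    scope = f"{org}:{role}:{region}:{user_id}"
    digest = hashlib.sha256(query.lower().encode()).hexdigest()[:16]
    return f"semcache:{scope}:{digest}"

# Same question, different user -> different cache entries:
k1 = scoped_cache_key("How many sick days do I have?", org="acme", user_id="u42")
k2 = scoped_cache_key("How many sick days do I have?", org="acme", user_id="u43")
```

In a vector store the same idea shows up as per-scope namespaces or metadata filters on the similarity search; either way, the point is that scope is part of the lookup, not an afterthought.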

Invalidation is where systems break

This part sounds easy until it isn’t.

Your knowledge source changes, but cache entries do not magically know that. So you can confidently return answers that are no longer correct.

You need active invalidation or refresh strategies:

  • data update triggers
  • source versioning
  • user feedback signals (like thumbs down)

Fine-grained keys and expirations help, but the invalidation logic is still on you.
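Source versioning, the second strategy above, can be sketched as follows: bake a version number into the cache key, and bump it whenever the underlying document changes. Old entries are never looked up again and simply age out via TTL. The source names here are hypothetical:

```python
# Version registry: bumped whenever the underlying source changes
# (e.g. from a data-update trigger or CMS webhook).
SOURCE_VERSIONS = {"hr_policies": 3}

def versioned_key(source: str, base_key: str) -> str:
    return f"{source}:v{SOURCE_VERSIONS[source]}:{base_key}"

old = versioned_key("hr_policies", "expense-process")
SOURCE_VERSIONS["hr_policies"] += 1  # document updated: bump the version
new = versioned_key("hr_policies", "expense-process")
# Lookups now use the new key and miss the stale entry.
```

The appeal of this pattern is that invalidation becomes implicit: you never have to enumerate and delete stale entries, only bump a counter.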

Hit rate is not the full story

It’s easy to optimize for hit rate because it’s visible. But it can mislead.

A cache hit on a short, cheap query barely moves the needle. A hit on a long, expensive prompt can save much more.

The better question is: “What did we actually save when we hit the cache?”

Tie cache metrics to token usage and latency, not just hit percentage.
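A minimal sketch of savings-weighted cache metrics: alongside the hit rate, record the tokens and latency each hit actually avoided. The numbers below are illustrative:

```python
from dataclasses import dataclass

@dataclass
class CacheStats:
    hits: int = 0
    misses: int = 0
    tokens_saved: int = 0
    latency_saved_ms: float = 0.0

    def record_hit(self, tokens: int, llm_latency_ms: float,
                   cache_latency_ms: float) -> None:
        # Credit the hit with what the skipped LLM call would have cost
        self.hits += 1
        self.tokens_saved += tokens
        self.latency_saved_ms += llm_latency_ms - cache_latency_ms

    def record_miss(self) -> None:
        self.misses += 1

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

stats = CacheStats()
stats.record_hit(tokens=1200, llm_latency_ms=2500, cache_latency_ms=25)
stats.record_miss()
# A 50% hit rate on its own hides that this one hit saved 1200 tokens
# and roughly 2.5 seconds.
```

Reporting `tokens_saved` and `latency_saved_ms` next to the hit rate is what makes the "what did we actually save?" question answerable.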

Extra challenges I keep running into

Multi-turn conversations

How do we cache responses that depend on conversation history and context? One useful approach is to cache a summary of the prior context.
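That approach can be sketched as keying the cache on a compact summary of the prior turns plus the new query, so two conversations that summarize to the same effective state can share an entry. How the summary is produced (an LLM summarizer, for instance) is out of scope here:

```python
import hashlib

def multiturn_key(context_summary: str, query: str) -> str:
    # Normalize so trivially different phrasings of the same state
    # map to the same key.
    blob = f"{context_summary.strip().lower()}|{query.strip().lower()}"
    return hashlib.sha256(blob.encode()).hexdigest()[:24]

# Different surface forms of the same conversational state collide
# on purpose:
a = multiturn_key("user asked about expense filing", "what is the limit?")
b = multiturn_key("User asked about expense filing ", "What is the limit?")
```

The trade-off is that cache effectiveness now depends on how stable and canonical your summaries are.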

Cold starts

A cache helps only after it has data. Pre-warming with common intents can make a noticeable difference early.
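Pre-warming can be as simple as replaying the most frequent intents from your logs into the cache before traffic arrives. The intents and answers below are illustrative, and the sketch assumes any cache exposing a `put(query, answer)` / `get(query)` API:

```python
# Illustrative common intents mined from logs; answers are placeholders.
COMMON_INTENTS = [
    ("How do I file expenses?", "Use the expenses portal."),
    ("What is our sick leave policy?", "See the HR handbook."),
]

def prewarm(cache) -> None:
    # Seed the cache so early requests can already hit.
    for query, answer in COMMON_INTENTS:
        cache.put(query, answer)

class DictCache:
    # Minimal stand-in for any cache with .put/.get
    def __init__(self):
        self._d = {}
    def put(self, q, a):
        self._d[q] = a
    def get(self, q):
        return self._d.get(q)

cache = DictCache()
prewarm(cache)
```

The offline work is in mining those intents; once you have them, the warm-up itself is a one-time batch write.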

One practical starting point

Profile spend first.

If there is little repeated intent, semantic caching may not help much. But when repetition exists, savings compound quickly.

How are you thinking about LLM cost optimization in real systems?