
How I cut AI inference costs by 90% with a semantic cache

2024 · 5 min read

Every LLM API call costs money and takes time. For most applications, a significant portion of those calls are asking functionally the same question — slightly reworded, but semantically identical. A user asking "how do I reset my password?" and another asking "what's the process for changing my password?" are asking the same thing. Why pay for two LLM calls?

This is the core idea behind Sentinel-AI's semantic cache.

How it works

When a request comes in, we generate an embedding — a vector representation of the prompt's meaning — and search a Redis Vector Store for similar embeddings we've seen before. If we find one above a 95% similarity threshold, we return the cached response immediately. No LLM call, no latency, a flat $0.01 charge to the tenant instead of whatever the model would have cost.

```java
// Simplified version of the cache check, using Spring AI's
// SearchRequest / VectorStore abstractions
SearchRequest searchRequest = SearchRequest.query(prompt)
    .withTopK(1)
    .withSimilarityThreshold(0.95);

List<Document> results = vectorStore.similaritySearch(searchRequest);

if (!results.isEmpty()) {
    return CacheHit.of(results.get(0).getContent());
}
```
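That snippet is only the read path. The write path is the other half: on a miss, call the model, then store the prompt's embedding next to the response so future paraphrases can hit. Here's a minimal in-memory sketch of both paths; it's a toy stand-in for the Redis vector store, and the embeddings would come from an embedding model in practice:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Toy in-memory semantic cache: brute-force cosine search over stored
// embeddings, standing in for the Redis vector store.
public class SemanticCache {
    record Entry(double[] embedding, String response) {}

    private final List<Entry> entries = new ArrayList<>();
    private final double threshold;

    public SemanticCache(double threshold) { this.threshold = threshold; }

    // Read path: return a cached response if any stored embedding
    // clears the similarity threshold.
    public Optional<String> lookup(double[] queryEmbedding) {
        return entries.stream()
            .filter(e -> cosine(e.embedding(), queryEmbedding) >= threshold)
            .findFirst()
            .map(Entry::response);
    }

    // Write path: after a miss and a real LLM call, store the pair.
    public void store(double[] embedding, String response) {
        entries.add(new Entry(embedding, response));
    }

    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }
}
```

A real vector store replaces the linear scan with an approximate nearest-neighbor index, but the contract is the same: similarity above the threshold means "treat these prompts as the same question."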

The 95% threshold

This number matters. Too high (99%+) and you miss obvious paraphrases — the cache barely helps. Too low (85%) and you start returning answers to questions the user didn't actually ask — correctness degrades. 95% is aggressive enough to catch real duplicates while staying conservative enough to preserve answer accuracy.
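To make the tradeoff concrete, here's a toy sweep over three assumed similarity scores — a close paraphrase, a loose paraphrase, and a related-but-different question. The scores are illustrative, not measured:

```java
import java.util.Arrays;

public class ThresholdSweep {
    // Count how many candidate scores clear a given threshold.
    public static long hits(double[] scores, double threshold) {
        return Arrays.stream(scores).filter(s -> s >= threshold).count();
    }

    public static void main(String[] args) {
        // Assumed scores: close paraphrase, loose paraphrase, different question.
        double[] scores = {0.97, 0.93, 0.88};
        System.out.println(hits(scores, 0.99)); // too strict: even the close paraphrase misses
        System.out.println(hits(scores, 0.95)); // catches the real duplicate only
        System.out.println(hits(scores, 0.85)); // too loose: the different question also "hits"
    }
}
```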

In practice, for applications with any conversational patterns or FAQ-style queries, cache hit rates of 30-40% are achievable. On a high-volume tenant paying $0.002/token for Claude Sonnet, that's a significant reduction.
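The arithmetic behind those numbers is worth separating into two figures: the per-hit saving (where the 90% in the title comes from) and the blended saving across all traffic. Using assumed numbers — a $0.10 model call, the flat $0.01 cached-response charge, and a 35% hit rate:

```java
public class CacheSavings {
    // Average cost per request once a fraction of traffic hits the cache.
    public static double blendedCostPerRequest(double modelCost, double cachedCost, double hitRate) {
        return hitRate * cachedCost + (1 - hitRate) * modelCost;
    }

    // Fraction of total spend saved versus no cache at all.
    public static double savingsFraction(double modelCost, double cachedCost, double hitRate) {
        return 1 - blendedCostPerRequest(modelCost, cachedCost, hitRate) / modelCost;
    }

    public static void main(String[] args) {
        double modelCost = 0.10, cachedCost = 0.01, hitRate = 0.35;
        // Each individual hit is 90% cheaper than the model call it replaces...
        System.out.printf("per-hit saving: %.0f%%%n", (1 - cachedCost / modelCost) * 100);
        // ...while the overall bill drops by about 31.5% at a 35% hit rate.
        System.out.printf("overall saving: %.1f%%%n", savingsFraction(modelCost, cachedCost, hitRate) * 100);
    }
}
```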

What semantic caching can't do

It doesn't help with genuinely unique prompts — creative writing, code generation for novel problems, anything where the user is asking something they've never asked before. It's most effective for support applications, documentation bots, and any domain with repeated question patterns.

The other limitation is staleness. If the underlying answer changes (a policy update, a product change), cached responses can serve stale information until the cache expires. Build in TTLs and invalidation hooks from the start.
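One way to sketch that: give each cached entry a write timestamp and a TTL, and have lookups treat expired entries as misses. (With Redis you would typically lean on key TTLs via `EXPIRE` instead; this is an illustrative stand-in.)

```java
import java.time.Duration;
import java.time.Instant;

// A cached entry that knows when it was written and how long it stays valid.
public class TtlEntry {
    private final String response;
    private final Instant storedAt;
    private final Duration ttl;

    public TtlEntry(String response, Instant storedAt, Duration ttl) {
        this.response = response;
        this.storedAt = storedAt;
        this.ttl = ttl;
    }

    public boolean isFresh(Instant now) {
        return now.isBefore(storedAt.plus(ttl));
    }

    // Returns the response only while fresh; null means "treat as a miss"
    // and fall through to the model.
    public String get(Instant now) {
        return isFresh(now) ? response : null;
    }
}
```

Invalidation hooks are the complement: when the underlying answer changes (a policy update, say), purge matching entries immediately rather than waiting for the TTL to run out.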

The broader pattern

Semantic caching is just the retrieval side of RAG applied to your own query history. Once you're generating embeddings for incoming prompts anyway, the cost of storing them in a vector store is trivial. The infrastructure is already there — you're just deciding whether to use it on the way in as well as the way out.

Written by Basit Tijani. Find me on GitHub or LinkedIn.