
Sentinel-AI

2024 – Present

Backend Engineer

A SaaS LLM gateway that acts as an intelligent proxy between client applications and three major AI providers. It classifies prompt complexity in real time, routes to the cheapest capable model, and serves repeated queries from a semantic vector cache — all while maintaining a precise per-token billing ledger per tenant.

KEY METRIC

Up to 90% reduction in AI inference costs via intelligent routing + semantic caching

Stack

Java 21 · Spring Boot · Spring AI · Redis Vector Store · PostgreSQL · Virtual Threads · OAuth2 · Docker

Overview

Sentinel-AI sits between your application and the LLM providers. Every request passes through a three-layer Spring AI CallAroundAdvisor chain: the SmartRouterAdvisor classifies prompt complexity and routes to Gemini (simple), DeepSeek (reasoning), or Claude (high-stakes); the SemanticCacheAdvisor checks a Redis Vector Store for semantically similar previous prompts at a 95% similarity threshold; and the BillingService intercepts token-usage metadata to calculate costs atomically.
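The routing step can be sketched in plain Java, independent of the Spring AI advisor API. This is a minimal, framework-free illustration of the classify-then-route idea; the keyword lists, tier names, and thresholds are illustrative assumptions, not the production classifier.

```java
import java.util.List;

// Simplified sketch of a SmartRouterAdvisor-style complexity classifier.
// Heuristics here (keyword hints, length cutoff) are hypothetical.
public class SmartRouter {

    enum Tier { SIMPLE, REASONING, HIGH_STAKES }

    private static final List<String> REASONING_HINTS =
            List.of("prove", "derive", "step by step", "debug");
    private static final List<String> HIGH_STAKES_HINTS =
            List.of("legal", "medical", "contract", "compliance");

    /** Classify a prompt and return the provider to route it to. */
    public static String route(String prompt) {
        String p = prompt.toLowerCase();
        Tier tier = Tier.SIMPLE;
        if (HIGH_STAKES_HINTS.stream().anyMatch(p::contains)) {
            tier = Tier.HIGH_STAKES;
        } else if (REASONING_HINTS.stream().anyMatch(p::contains) || p.length() > 2000) {
            tier = Tier.REASONING;
        }
        return switch (tier) {
            case SIMPLE      -> "gemini";    // cheapest capable model
            case REASONING   -> "deepseek";  // chain-of-thought workloads
            case HIGH_STAKES -> "claude";    // highest-quality tier
        };
    }
}
```

In the real chain the result would select the ChatModel the advisor delegates to before the cache and billing layers run.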

The gateway supports both synchronous and streaming (Flux) response modes. Java 21 Virtual Threads with ZGC provide the concurrency headroom to handle thousands of simultaneous tenant requests without thread exhaustion.
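The concurrency model above can be demonstrated with a self-contained Java 21 snippet: each blocking "provider call" gets its own virtual thread, so thousands of in-flight requests do not exhaust a platform-thread pool. The sleep is a stand-in for slow provider I/O; class and method names are illustrative.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: many concurrent blocking calls on Java 21 virtual threads.
public class VirtualThreadGateway {

    public static int handleConcurrently(int requests) {
        AtomicInteger completed = new AtomicInteger();
        try (ExecutorService pool = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < requests; i++) {
                pool.submit(() -> {
                    try {
                        Thread.sleep(5); // stand-in for blocking provider I/O
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                    completed.incrementAndGet();
                });
            }
        } // close() waits for all submitted tasks to finish
        return completed.get();
    }
}
```

Because virtual threads are cheap to park, blocking on the provider response is fine; no reactive rewrite of the call path is required for throughput alone.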

Architecture

Client App → API Gateway (TenantFilter) → SmartRouterAdvisor → SemanticCacheAdvisor → BillingAdvisor → Gemini (Simple) / DeepSeek (Reasoning) / Claude (High-Stakes), with the Redis Vector Store backing the cache and PostgreSQL holding the billing ledger.

System architecture overview

Technical Challenges

Thread Context Leakage Across Reactive Boundaries

Tenant context stored in a ThreadLocal leaks when execution crosses into Project Reactor's async Flux pipeline. I solved this by explicitly capturing the TenantContext before entering the reactive block and spawning a new Virtual Thread for post-stream billing — ensuring no cross-tenant data contamination.
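The fix reduces to one rule: read the ThreadLocal on the request thread and pass the value explicitly to any other thread. A minimal stand-in for the real TenantContext holder (names assumed for illustration):

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

// Sketch: capture tenant identity BEFORE crossing a thread boundary.
public class TenantContext {
    private static final ThreadLocal<String> CURRENT = new ThreadLocal<>();

    public static void set(String tenantId) { CURRENT.set(tenantId); }
    public static String get() { return CURRENT.get(); }
    public static void clear() { CURRENT.remove(); }

    /** Bills on another thread without relying on ThreadLocal propagation. */
    public static String billAfterStream(ExecutorService executor) {
        String tenantId = get(); // captured on the request thread
        Future<String> billed = executor.submit(() -> {
            // The ThreadLocal is empty on this worker thread; the
            // captured value travels explicitly via the closure.
            return "billed:" + tenantId;
        });
        try {
            return billed.get();
        } catch (InterruptedException | ExecutionException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("billing task failed", e);
        }
    }
}
```

The same capture-then-pass pattern applies whether the post-stream work runs on a new virtual thread or inside a Reactor `doFinally` callback.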

HikariCP Exhaustion on Long LLM Calls

Wrapping slow LLM calls (Claude Opus can take 20s) inside broad @Transactional boundaries held database connections hostage for the entire generation duration, exhausting the pool globally. Fixed by isolating transactions entirely to the post-generation ledger write phase.
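The shape of the fix can be shown without Spring by modeling the connection pool as a Semaphore: the slow generation happens before any permit is taken, so the permit is held only for the short ledger write. Names, timings, and the one-permit pool are illustrative assumptions.

```java
import java.util.concurrent.Semaphore;

// Sketch: scope the "connection" (a Semaphore permit) to the write phase only.
public class ScopedBilling {

    static void slowLlmCall() { // stand-in for a 20s Claude Opus generation
        try { Thread.sleep(50); } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Returns how long (ms) a pool permit was held for one request. */
    public static long scopedHoldMillis(Semaphore pool) {
        slowLlmCall();                 // generation phase: no connection held
        long start = System.nanoTime();
        pool.acquireUninterruptibly(); // borrow only for the ledger write
        try {
            // e.g. INSERT INTO billing_ledger (...) VALUES (...): short write
        } finally {
            pool.release();
        }
        return (System.nanoTime() - start) / 1_000_000;
    }
}
```

In Spring terms this means keeping the LLM call outside any `@Transactional` method and starting the transaction only in the ledger-write service.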

Billing Race Conditions Under Concurrency

Parallel requests hitting the same tenant account simultaneously caused balance anomalies with naive ORM read-modify-write patterns. Replaced with a direct atomic UPDATE decrement in SQL — the database handles the concurrency, not the application layer.
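The principle can be demonstrated in-memory: a single atomic decrement (analogous to `UPDATE accounts SET balance = balance - ? WHERE tenant_id = ?` in SQL) stays exact under parallel requests, where a read-then-write sequence would lose updates. The class is a stand-in, not the production billing code.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.stream.IntStream;

// Sketch: one indivisible decrement instead of ORM read-modify-write.
public class AtomicBilling {

    private final AtomicLong balance;

    public AtomicBilling(long initial) { balance = new AtomicLong(initial); }

    /** Atomic charge: the whole update is a single indivisible operation. */
    public long charge(long cost) { return balance.addAndGet(-cost); }

    public long balance() { return balance.get(); }

    /** Charges the same account from many threads; no updates are lost. */
    public static long chargeInParallel(long initial, int requests, long cost) {
        AtomicBilling account = new AtomicBilling(initial);
        IntStream.range(0, requests).parallel().forEach(i -> account.charge(cost));
        return account.balance();
    }
}
```

With the database, the same guarantee comes from row-level locking on the single UPDATE statement, so the application never sees an intermediate balance.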

View on GitHub ↗ · Live Demo ↗