
API Caching Strategies That Actually Work


Caching is one of those things that every developer knows they should do, but the specifics of how to do it well are surprisingly nuanced. After running MC Heads and serving millions of API requests, we have developed a layered caching strategy that keeps response times fast and infrastructure costs low. Here is what works in practice.

The Caching Pyramid

Think of caching in layers, like a pyramid. Each layer is faster but holds less data than the one below it.

At the top is the CDN edge cache — the fastest possible response, served from a server physically close to the user. Below that is your application-level cache — an in-memory store like Redis or a local SQLite database. At the base is the origin data source — the upstream API or database that always has the freshest data.

The goal is to serve as many requests as possible from the upper layers, so only a small fraction of traffic actually hits your origin.

Cache Invalidation: The Hard Problem

There is a famous quote in computer science, often attributed to Phil Karlton: "There are only two hard things in computer science: cache invalidation and naming things." It is a cliché because it is true.

The simplest invalidation strategy is time-based expiration (TTL). You set a time-to-live on each cache entry, and after that time passes, the entry is considered stale. This is what we use at MC Heads — skin data is cached for one hour. It is not perfect (a player who changes their skin has to wait up to an hour to see the update), but it is simple, predictable, and works well for our use case.
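The idea can be sketched in a few lines. This is an illustrative minimal TTL cache, not the MC Heads implementation; names and the in-process dict store are assumptions.

```python
import time

class TTLCache:
    """Minimal in-memory TTL cache: entries store (value, expires_at),
    and reads treat expired entries as misses."""

    def __init__(self, ttl_seconds=3600):
        self.ttl = ttl_seconds
        self.store = {}

    def get(self, key):
        entry = self.store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self.store[key]  # stale: evict and report a miss
            return None
        return value

    def set(self, key, value):
        self.store[key] = (value, time.monotonic() + self.ttl)
```

Note the use of a monotonic clock for expiry, so the cache is unaffected by wall-clock adjustments.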

Event-driven invalidation is more precise but more complex. If you have a webhook or message queue that tells you when data changes, you can invalidate specific cache entries immediately. This is ideal for systems where freshness matters more — think social media feeds or real-time dashboards.

Versioned keys are another approach. Instead of invalidating old entries, you change the cache key when the data changes. For example, you might include a version number or hash in the key: player:notch:v3:head:64. Old versions naturally expire, and you never serve stale data for the current version.
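As a sketch, here are two hypothetical key-building helpers following the player:notch:v3:head:64 layout above: one with an explicit version number, and one that derives the version from a content hash so the key changes automatically when the data does.

```python
import hashlib

def versioned_key(player, version, resource, size):
    """Explicit version number baked into the cache key."""
    return f"player:{player}:v{version}:{resource}:{size}"

def hashed_key(player, payload: bytes, resource, size):
    """Alternative: derive the version from a content hash, so any
    change to the underlying data produces a new key."""
    digest = hashlib.sha256(payload).hexdigest()[:8]
    return f"player:{player}:{digest}:{resource}:{size}"
```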

CDN Caching Done Right

CDN caching is the single highest-impact optimization for any public API. A CDN like Cloudflare, Fastly, or AWS CloudFront can serve cached responses from edge servers around the world, with latency measured in single-digit milliseconds.

The key to effective CDN caching is getting your Cache-Control headers right. Here is the pattern we use:

Cache-Control: public, max-age=3600, s-maxage=86400

This tells browsers to cache the response for 1 hour (max-age=3600) and CDN edge servers to cache it for 24 hours (s-maxage=86400). The distinction matters: you want browsers to check back relatively often (so users see updates within an hour), but you want the CDN to hold onto responses longer to reduce origin load.
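A framework-agnostic sketch of building these headers (the function name and dict shape are assumptions; the exact integration depends on your web framework):

```python
def cache_headers(browser_ttl=3600, cdn_ttl=86400):
    """Return response headers for a publicly cacheable API response:
    browsers revalidate after browser_ttl, edge caches after cdn_ttl."""
    return {
        "Cache-Control": f"public, max-age={browser_ttl}, s-maxage={cdn_ttl}",
        "Vary": "Accept",  # keep per-representation cache entries separate
    }
```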

Do not forget the Vary header. If your API returns different content based on request headers (like Accept for content negotiation), you need Vary: Accept to prevent the CDN from serving the wrong cached version.

Stale-while-revalidate is a powerful directive that many developers overlook:

Cache-Control: public, max-age=3600, stale-while-revalidate=86400

This tells the CDN: "After the max-age expires, you can still serve the stale response while fetching a fresh one in the background." The user gets an instant response (even if slightly stale), and the cache gets refreshed for the next request. This eliminates the latency spike that happens when a popular cache entry expires.

SQLite as a Cache Store

This might be controversial, but SQLite is an excellent cache store for single-server or moderate-scale applications. Here is why we chose it over Redis for MC Heads.

SQLite requires zero infrastructure. There is no separate process to manage, no network hop for cache reads, and no risk of connection pool exhaustion. For a project where simplicity matters, this is a huge win.

With WAL (Write-Ahead Logging) mode enabled, SQLite handles concurrent reads beautifully. Multiple request handlers can read from the cache simultaneously without blocking. Writes do take a lock, but cache writes are infrequent compared to reads, so this rarely causes contention.

The schema is simple. We use a single table with columns for the cache key, the cached value (stored as a blob for binary data like images, or text for JSON), and an expiration timestamp. An index on the expiration column makes cleanup queries fast.
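A sketch of that single-table design using Python's built-in sqlite3 module (table and function names are illustrative, not the MC Heads schema verbatim):

```python
import sqlite3
import time

def open_cache(path=":memory:"):
    db = sqlite3.connect(path)
    db.execute("PRAGMA journal_mode=WAL")  # concurrent readers don't block
    db.execute("""
        CREATE TABLE IF NOT EXISTS cache (
            key        TEXT PRIMARY KEY,
            value      BLOB NOT NULL,
            expires_at REAL NOT NULL
        )
    """)
    # Index on the expiration column keeps cleanup queries fast
    db.execute("CREATE INDEX IF NOT EXISTS idx_expires ON cache(expires_at)")
    return db

def cache_get(db, key):
    row = db.execute(
        "SELECT value FROM cache WHERE key = ? AND expires_at > ?",
        (key, time.time()),
    ).fetchone()
    return row[0] if row else None

def cache_set(db, key, value, ttl=3600):
    db.execute(
        "INSERT OR REPLACE INTO cache (key, value, expires_at) VALUES (?, ?, ?)",
        (key, value, time.time() + ttl),
    )
    db.commit()

def cache_evict_expired(db):
    db.execute("DELETE FROM cache WHERE expires_at <= ?", (time.time(),))
    db.commit()
```

Note that expiry is enforced at read time (the expires_at > ? predicate), with a periodic cleanup job handling physical deletion.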

The main limitation is that SQLite does not work well in distributed or multi-server environments. If you are running multiple application instances behind a load balancer, each instance has its own SQLite file with its own cache state. For our scale, running a single beefy server works fine. If we ever need to scale horizontally, we would add Redis as a shared cache layer.

Cache Warming

Cold caches are a real problem. When you deploy a new version of your application or restart your server, the cache is empty, and every request becomes a cache miss. This can overwhelm your upstream data sources if traffic is high.

Cache warming is the practice of pre-populating the cache before traffic arrives. For MC Heads, we maintain a list of the most frequently requested player names (the top 1,000 or so) and fetch their skins during startup. This means the most popular requests are served from cache immediately, even after a restart.

You can also warm caches gradually using a background job that processes requests at a controlled rate, avoiding the thundering herd problem that comes from trying to populate everything at once.

Preventing Cache Stampedes

A cache stampede (also called the thundering herd problem) happens when a popular cache entry expires and hundreds of concurrent requests all try to regenerate it simultaneously. Instead of one upstream request, you get hundreds — potentially overwhelming the upstream service.

The classic solution is lock-based recomputation. When a cache miss occurs, the first request acquires a lock and starts regenerating the cache entry. Subsequent requests for the same key either wait for the lock to release or get served a stale version of the data.
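A sketch of per-key locking (the class and its compute-callback interface are assumptions): on a miss, only the first thread recomputes; the rest block on the lock and then read the freshly cached value.

```python
import threading

class LockingCache:
    """On a miss, one thread regenerates the entry while
    concurrent requests for the same key wait for it."""

    def __init__(self, compute):
        self.compute = compute   # called once per missing key
        self.values = {}
        self.locks = {}
        self.meta = threading.Lock()

    def _lock_for(self, key):
        with self.meta:
            return self.locks.setdefault(key, threading.Lock())

    def get(self, key):
        if key in self.values:               # fast path: cache hit
            return self.values[key]
        with self._lock_for(key):
            if key not in self.values:       # re-check after acquiring the lock
                self.values[key] = self.compute(key)
            return self.values[key]
```

The re-check inside the lock is the important detail: a thread that waited for the lock finds the value already computed and skips the upstream call.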

A simpler approach that works well in practice is probabilistic early expiration. Instead of expiring all entries at exactly their TTL, you add a small random jitter. Some entries expire a few seconds early, some a few seconds late. This spreads the regeneration load over time instead of creating a spike.
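The jitter itself is a one-liner. In this sketch, each entry's effective TTL varies by up to ±10%, so entries written at the same moment expire at slightly different times:

```python
import random
import time

def jittered_expiry(ttl=3600, jitter=0.10):
    """Return an expiry timestamp with the TTL scaled by a random
    factor in [1 - jitter, 1 + jitter] to spread regeneration load."""
    factor = 1.0 + random.uniform(-jitter, jitter)
    return time.time() + ttl * factor
```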

We use a hybrid approach: probabilistic early expiration for most entries, with lock-based recomputation for the most expensive operations (like fetching and rendering a skin for the first time).

Monitoring Your Cache

A cache you do not monitor is a cache that will eventually cause problems. The key metrics to track are:

  • Hit rate: The percentage of requests served from cache. Ours sits around 85% at the application level and 95% including CDN hits.
  • Miss latency: How long cache misses take. If this spikes, your upstream source might be struggling.
  • Eviction rate: How many entries are being evicted before their TTL expires (because the cache is full). A high eviction rate means you need more cache capacity.
  • Memory or disk usage: Unbounded caches will eventually consume all available resources. Set limits and monitor them.
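Tracking the first of these is trivial to wire in. A minimal hit-rate counter sketch (in a real deployment you would export these counters to a metrics system such as Prometheus rather than keep them in-process):

```python
class CacheStats:
    """Count hits and misses and expose the running hit rate."""

    def __init__(self):
        self.hits = 0
        self.misses = 0

    def record(self, hit: bool):
        if hit:
            self.hits += 1
        else:
            self.misses += 1

    @property
    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```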

The Bottom Line

The best caching strategy is the simplest one that meets your performance requirements. Start with CDN caching and appropriate Cache-Control headers — that alone will handle the majority of your traffic. Add an application-level cache (SQLite, Redis, or even a plain in-memory Map) for data that the CDN cannot cache. And invest in monitoring so you know when your caching layer is actually working.

Do not over-engineer it. A TTL-based cache with proper CDN headers will outperform a complex invalidation system that nobody fully understands. Simplicity is a feature.