System Design Interview Distributed Cache
Today we design a distributed cache. We will start with some simple ideas and little by little evolve our design into a fully-fledged architecture. Approach, that we highly encourage you to use during a real interview. We also will discuss several important tips along the way. As usual, let's start with the problem statement. Let’s take a look at a typical setup, a web application backed by a data store. This data store may be a database or another web service. Client makes a call to the web application, which in turn makes a call to the data store and result is returned back to the client. There may be several issues with this setup. First, calls to the data store may take a long time to execute or may utilize a lot of system resources. It would be good to store at least some results of these calls in memory, so that these results are retrieved and returned back to the client much faster. And if the data store is down or experiences a performance degradation and calls to the data store start to fail, our web application may still process requests as usual, at least for some period of time. So, storing data in memory will help to address these issues. When client request comes, we first check the cache and try to retrieve information from memory. And only if data is unavailable or stale, we then make a call to the datastore. And why do we call it a distributed cache? Because amount of data is too large to be stored in memory of a single machine and we need to split the data and store it across several machines. Caches are everywhere nowadays.
Even on this channel, remember when we designed distributed message queue or notification service or rate limiter, all those designs relied on a cache of some sort. Let’s formalize requirements. We need to implement two main operations: put and get. Put stores object in the cache under some unique key and get retrieves object from the cache based on the key. To simplify things a bit, let’s consider both key and value to be strings. As for non-functional requirements, we want to design scalable, highly available and fast cache. High scalability will help to ensure our cache can handle increased number of put and get requests. And be able to handle increasing amount of data we may need to store in the cache. High availability will help to ensure that data in the cache is not lost during hardware failures and cache is accessible in case of network partitions. This will minimize number of cache misses and as a result number of calls to the datastore. And high performance is probably the number one requirement for the cache. The whole point of the cache is to be fast as it is called on every request. And here is a tip. During an interview, when we need to define requirements, it is relatively easy to come up with functional requirements. Or a set of operations system will support. These are one-two-three operations which clearly characterize the system.
Usually, something around sending data to the system and retrieving it, right? But non-functional requirements may not be easy to define. What do we do? If you need to design a distributed system, think about the following 3 requirements first: scalability, availability and performance. And if data persistence is important think of durability as well. These 4 will give you and your interviewer a lot of room for discussion. If you recollect a CAP theorem, availability requirement may be replaced with consistency. But let me keep it simple for now and not dive deep into this topic here. I will elaborate more on this in a separate video. And here is my second tip. Remember that interviewer is your friend and you both have the same goal. Your goal is to provide as many positive data points as possible. And the interviewer's goal is to collect as many data points as possible. In practice it means that you should start approaching any design problem with some small and simple steps. And evolve your solution with every next step. This is a win-win situation. You demonstrate progress, ability to deal with ambiguity and simplify things. While the interviewer gets all the necessary data points to bring them to the table and champion your case in the discussion with other interviewers from the interview loop. With this in mind, let's start with the local cache first. You will later see how distributed cache design evolves from the solution for the local cache. So, we start with a single server and need to implement a basic in-memory data store, that has limited capacity. Local cache implementation is an algorithmic problem. We need to come up with data structures and an algorithm for storing and retrieving data. What data structure comes to your mind when you hear about cache. Most likely the same that comes to my mind, which is a hash table. We add key-value pairs to the hash table and retrieve them in constant time. Simple, right? But only until the moment when we reach maximum hash table size and cannot add any more elements. We need to evict some old data from the hash table before new data can be added. What data should be evicted? It turns out there are several dozens of different approaches, so called eviction or replacement policies. One of the easiest to implement is the least recently used policy, when we discard the least recently used items first. But hash tables do not track which entry has been used recently. Meaning that we need one more data structure to keep track of what was used when.
Some kind of a queue, but with the constant time for add, update and delete operations. And I already hear you shouting: doubly linked list. And I have no other choice than to agree with you. By the way, least recently used cache implementation is a very popular coding interview question. It is marked as a hard problem on the Leetcode. But you will see that it is relatively easy to implement once you know what data structures to use. When GET operation is called, we first check if this item is in the cache (hash table). If item is not in the cache, we return null immediately. If item is found, we need to move it to the head of the list and return the item back to the caller. And why do we need to move the item to the list head? To preserve the order of use. Item at the head of the list is the most recently used. And item at the tail of the list is the least recently used. So, when cache is full and we need to free space for a new item, we remove an item at the tail of the list. When PUT operation is called, we also check if item is in the cache. And if found, we update the item value and move the item to the head of the list. In case item is not in the cache, we need to check if cache is full. If cache has capacity (not full), we simply add this item to the hash table and to the list (at the head position). If cache is full, we need to free some space first. We take an item at the tail and remove it from both the hash table and the list. Now we have space for a new element and we add it to both the hash table and the list. Now, let’s run a simple simulation that will help to further understand and then code the algorithm. Cache is initially empty, so we just keep adding new items to it. We always add new items to the head of the list. 4 items have been added and cache became full. Now, GET operation has been called for item B, and we need to move it to the head. Right after that item A has been accessed (also a GET call) and we move it to the head as well. And now, PUT operation has been called. For a new item E. We first delete the item that is at the list tail, because cache is full. And then add item E to both the hash table and the list. Easy, right? Then let’s code the least recently used cache.
First, let’s define a class, which instances will be stored in the cache. Class stores key, value and references to the previous and next nodes in the linked list. In the cache class we define references to the hash table (HashMap class in Java), cache size and references to the head and tail of the doubly liked list. Constructor accepts a single argument - cache size. In the GET method, we check if item is in the cache. And we move existing item to the head of the list by first deleting it from the current position and inserting it at the head position. PUT is split into two parts. If item already exists in the cache, we update item’s value and move item to the head. And if item is not present in the cache, we check if cache if full. If yes, the least recently used item is deleted. And we add new item to both the hash table and the list. I have omitted implementation for both deleting a node from the list and adding it to the head. Hopefully, these are easy routines you can implement on your own. And we are done with the local cache implementation. Let’s think how to make it distributed. We can start with a really straightforward idea, when we move the least recently used cache we just implemented to its own host. The benefit of this, we can now make each host to store only chunk of data, called shard. Because data is split across several hosts, we now can store much more data in memory. Service hosts know about all shards, and they forward put and get requests to a particular shard. The same idea, but a slightly different realization, is to use service hosts for the cache. We run cache as a separate process on a service host. And data is also split into shards. And similar to the first option, when service needs to make a call to the cache, it picks the shard that stores data and makes a call. Le’ts call these options as distributed cache cluster and co-located cache. And take a look at the benefits of each option. Dedicated cluster helps to isolate cache resources from the service resources. Both the cache and the service do not share memory and CPU anymore. And can scale on their own. Dedicated cluster can be used by multiple services. And we can utilize the same cluster across several microservices our team owns. And dedicated cluster also gives us flexibility in choosing hardware. We can choose hardware hosts with a lot of memory and high network bandwidth. Public clouds nowadays provide a variety of memory optimized hardware.
As for co-located cache, the biggest benefit is that we do not need a separate cluster. This helps to save on hardware cost and usually less operationally intensive than a separate cluster. And with co-location, both the service and the cache scale out at the same time. We just add more hosts to the service cluster when needed. Ok, we have implemented a least recently used cache and made it runnable as a separate process, we told cache clients to call the cache process using either TCP or UDP connection. But how do cache clients decide which cache shard to call? Let’s discuss a naive approach first. A MOD function. Based on the item key and some hash function we compute a hash. We divide this hash number by a number of available cache hosts. And take a remainder. We treat this remainder as an index in the array of cache hosts. For example, we have 3 cache hosts. And hash is equal to 8. 8 MOD 3 is 2, so the cache host with index 2 will be selected by the service to store this item in the cache and while retrieving the item from the cache. But what happens when we add a new cache host (or some host dies due to hardware failures)? The MOD function will start to produce completely different results. Service hosts will start choosing completely different cache hosts than they did previously, resulting in a high percentage of cache misses. This is rarely acceptable in production systems and usually used for testing purposes only. A much better option is to use consistent hashing. Consistent hashing is based on mapping each object to a point on a circle. We pick an arbitrary point on this circle and assign a 0 number to it. We move clockwise along the circle and assign values. We then take a list of cache hosts and calculate a hash for each host based on a host identifier, for example IP address or name. The hash value tells us where on the consistent hashing circle that host lives. And the reason we do all that, is that we want to assign a list of hash ranges each cache host owns. Specifically, each host will own all the cache items that live between this host and the nearest clockwise neighbor. It can be counter clockwise, not matter much. So, for a particular item, when we need to look up what cache host stores it, we calculate a hash and move backwards to identify the host. In this case, host 4 is storing the item. And what happens when we add new host to the cache cluster? Same as before, we calculate a hash for a new host, and this new host becomes responsible for its own range of keys on the circle. While its counter clockwise neighbor (host 4 in this case) becomes responsible for a smaller range.
In other words, host 6 took responsibility for a subset of what was formerly owned by host 4. And nothing has changed for all the other hosts. Which is exactly what we wanted, to minimize a number of keys we need to re-hash. Consistent hashing is much better than MOD hashing, as significantly smaller fraction of keys is re-hashed when new host is added or host is removed from the cache cluster. Now we know how cache cluster host is selected for both put and get, but who is actually doing this selection? On the service side, who is responsible for running all these hash calculations and routing requests to the selected cache host? Cache client is that component. It’s a small and lightweight library, that is integrated with the service code and is responsible for the cache host selection. Cache client knows about all cache servers. And all clients should have the same list. Otherwise, different clients will have their own view of the consistent hashing circle and the same key may be routed to different cache hosts. Client stores list of cache hosts in sorted order (for a fast host lookup) and binary search can be used to find a cache server that owns the key. Cache client talks to cache hosts using TCP or UDP protocol. And if cache host is unavailable, client proceeds as though it was a cache miss. As you may see, list of cache hosts is the most important knowledge for clients. And what we need to understand, is how this list is created, maintained and shared among all the clients. Let’s discuss several options. In the first option we store a list of cache hosts in a file and deploy this file to service hosts using some continuous deployment pipeline. We can use configuration management tools such as chef and puppet to deploy file to every service host. This is the simplest option. But not very flexible. Every time list changes we need to make a code change and deploy it out to every service host. What if we keep the file, but simplify the deployment process? Specifically, we may put the file to the shared storage and make service hosts poll for the file periodically. This is exactly what the second option is about. All service hosts try to retrieve the file from some common location, for example S3 storage service. To implement this option, we may introduce a daemon process that runs on each service host and polls data from the storage once a minute or several minutes. The drawback of this approach is that we still need to maintain the file manually. Make changes and deploy it to the shared storage every time cache host dies or new host is added. It would be great if we could somehow monitor cache server health and if something bad happens to the cache server, all service hosts are notified and stop sending any requests to the unavailable cache server. And if a new cache server is added, all service hosts are also notified and start sending requests to it. To implement this approach, we will need a new service, configuration service, whose purpose is to discover cache hosts and monitor their health. Each cache server registers itself with the configuration service and sends heartbeats to the configuration service periodically. As long as heartbeats come, server is keep registered in the system. If heartbeats stop coming, the configuration service unregisters a cache server that is no longer alive or inaccessible. And every cache client grabs the list of registered cache servers from the configuration service. The third option is the hardest from implementation standpoint and its operational cost is higher. But it helps to fully automate the list maintenance. And in a couple of minutes you will see one other benefit of using configuration service for a distributed cache. Let’s quickly summarize what we have discussed so far. To store more data in memory we partition data into shards. And put each shard on its own server. Every cache client knows about all cache shards. And cache clients use consistent hashing algorithm to pick a shard for storing and retrieving a particular cache key. Let’s recall non-functional requirements we defined in the beginning of our design. We wanted to build fast, highly scalable and available distributed cache. Have we built a highly performant cache? Yes. Least recently used cache implementation uses constant time operations. Cache client picks cache server in log n time, very fast. And connection between cache client and cache server is done over TCP or UDP, also fast. So, performance is there. But what about other two: scalability and availability. Scalability is also there. We can easily create more shards and have more data stored in memory. Although those of you who did data sharding in real systems know that common problem for shards is that some of them may become hot. Meaning that some shards process much more requests then their peers. Resulting in a bottleneck. And adding more cache servers may not be very effective. With consistent hashing in place, a new cache server will further split some shard into two smaller shards. But we do not want to split any shard, we need to split a very concrete one. And high availability is not there at all. If some shard dies or becomes unavailable due to a network partition, all cache data for that shard is lost and all requests to that shard will result in a cache miss, until keys are re-hashed. Can you think of a mechanism that will help us to improve availability and better deal with a hot shard problem? And such mechanism is a data replication. There are many different techniques for data replication. We can distinguish two categories of data replication protocols. The first category includes a set of probabilistic protocols like gossip, epidemic broadcast trees, bimodal multicast. These protocols tend to favor eventual consistency.
The second category includes consensus protocols such as 2 or 3 phase commit, paxos, raft, chain replication. These protocols tend to favor strong consistency. Let's keep things simple and use leader follower (also known as master-slave) replication. For each shard we will designate a master cache server and several read replicas. Replicas (or followers) try to be an exact copy of the master. Every time the connection between master and replica breaks, replica attempts to automatically reconnect to the master. And replicas live in different data centers, so that cache data is still available when one data center is down. All put calls go through the master node, while get calls are handled by both master node and all the replicas. And because calls to a cache shard are now spread across several nodes, it is much easier to deal with hot shards. We may scale out by adding more read replicas. And while talking about leaders, we need to mention how these leaders are elected. There are two options: we can rely on a separate component, let's call it a Configuration service or if we want to avoid a separate component, we can implement leader election in the cache cluster. Configuration service is responsible for monitoring of both leaders and followers and failover, if some leader is not working as expected, configuration service can promote follower to leader. And as we discussed before, configuration service is a source of authority for clients. Cache clients use configuration service to discover all cache servers. Configuration service is a distributed service by its nature. It usually consists of an odd number of nodes (to achieve quorum easier), nodes are located on machines that fail independently (so that configuration service remains available in case for example network partitions) and all nodes talk to each other using TCP protocol. Zookeeper is a good candidate for a configuration service, we can use it here. Redis also implemented Redis Sentinel for this purpose. Ok, by introducing data replication we are able to better deal with hot shard problem and also increased availability. Increased, but we did not actually achieve true high availability. Why? Because there are still points of failure. We do data replication asynchronously, to have a better performance. We do not want to wait until leader sever replicates data to all the followers. And if leader server got some data and failed before this data was replicated by any of the followers, data is lost. And this is actually an acceptable behavior in many real-life use cases, when we deal with cache. The first priority of the cache is to be fast, and if it loses data in some rare scenarios, it should not be a big deal. This is just a cache miss and we should design our service in a way that such failures are expected. Let's see what other topics may pop up during an interview. Distributed cache we built favors performance and availability over consistency. There are several things that lead to inconsistency. We replicate data asynchronously to have a better performance. So, a get call processed by the master node, may return a different result than a get call for the same key but processed by a read replica.
Another potential source of inconsistency is when clients have a different list of cache servers. Cache servers may go down and go up again, and it is possible that a client write values that no other clients can read. Yes, we can fix these issues. Introduce synchronous replication. And make sure all clients share a single view of the cache servers list. But this will increase latency and overall complexity of the system. I highly encourage you to discuss these tradeoffs with your interviewer when you get a chance. Least recently used algorithm evicts data from cache when cache is full. But if cache is not full, some items may sit there for a long time. And such items may become stale. To address this issue, we may introduce some metadata for a cache entry and include time-to-live attribute. There are two common approaches how expired items are cleaned up from cache. We can passively expire an item, when some client tries to access it, and the item is found to be expired. Or we can actively expire, when we create a maintenance thread that runs at regular intervals and removes expired items. As there may be billions of items in the cache, we cannot simply iterate over all cache items. Usually, some probabilistic algorithms are used, when several random items are tested with every run. Services that use distributed (or remote) cache, often use local cache as well. If data is not found in the local cache, call to the distributed cache is initiated. To make life of service teams easier, so they do not need to deal with both caches, we can implement a support for the local cache inside the cache client. So, when cache client instance is created, we also construct a local cache. This way we hide all the complexity behind a single component - cache client. We can utilize previously introduced LRU cache implementation as a local cache, or use well-known 3-rd party implementations, for example Guava cache. Caches are optimized for maximum performance, as well as simplicity. And not optimized for security. Caches are usually accessed by trusted clients inside trusted environments and we should not expose cache servers directly to the internet, if it is not absolutely required. For these reasons we should use a firewall to restrict access to cache server ports and ensure only approved clients can access the cache. Clients may also encrypt data before storing it in cache and decrypt it on the way out. But we should expect performance implications. Our cache has to be instrumented with metrics and logging. This is especially important if we launch our distributed cache as a service. Because so many service teams in the organization may use our cache, every time those services experience performance degradation, they will come to us as one of the potential sources of these degradations. And we should be able to answer their questions. What metrics we may want to emit: number of faults while calling the cache, latency, number of hits and misses, CPU and memory utilization on cache hosts, network I/O. With regards to logging we may capture the details of every request to the cache. The basic information like who and when accessed the cache, what was the key and return status code. Log entries should be small, but useful. As you have seen cache client has many responsibilities: maintain a list of cache servers, pick a shard to route a request to, handle a remote call and any potential failures, emit metrics. Ideally, client software should be very simple, dumb if you want. And we can simplify the cache client. One idea is to introduce a proxy, that will sit between cache clients and cache servers and will be responsible for picking a cache shard. Take a look at the twemproxy project, created by Twitter.
Another idea, is to make cache servers responsible for picking a shard. Client sends request to a random cache server and cache server applies consistent hashing (or some other partitioning algorithm) and redirects request to the shard that stores the data. This idea is utilized by Redis cluster. Consistent hashing algorithm is great. Simple and effective. But it has two major flaws: so called domino effect and the fact that cache servers do not split the circle evenly. Let’s clarify this. Domino effect may appear when cache server dies. And all of its load is transferred to the next server. This transfer might overload the next server, and then that server would fail, causing a chain reaction of failures. And to understand the second problem, remember how we placed cache servers on the circle. Some servers may reside close to each other and some may be far apart. Causing uneven distribution of keys among the cache servers. To deal with these problems, several modifications of the consistent hashing algorithm have been introduced. One simple idea is to add each server on the circle multiple times. You can also read about Jump Hash algorithm (a paper published by Google in 2014) or proportional hashing (algorithm used by Yahoo! Video Platform). Let's do a quick recap of what we have discussed. We started with a single host and implemented least recently used cache. Because local cache has limited capacity and does not scale, we decided to run our LRU cache as a standalone process. Either on its own host or on the service host. And we made each process responsible for its own part of the data set. We introduced a consistent hashing ring, a logical structure that helps to assign owners for ranges of cache keys. And we introduced a cache client, that is responsible to routing requests for each key to a specific shard that stores data for this key. We could have stopped right there. These simple ideas proved to be very effective in practice. Memcached, which is an open source high-performance distributed cache is built on top of these principles. But we went further, as we wanted to improve availability of our cache and read scalability. And introduced master-slave data replication. For monitoring leaders and read replicas and to provide failover support, we brought in a configuration service. Which is also used by cache clients for discovering cache servers. That is it for today. Thank you for watching this video. Please, do not hesitate to share your thoughts and ask questions in the comments section below. In the next video we will design a system for identifying heavy hitters (also known as the top k most frequent items), a popular system design interview question that takes many different forms. Such as: find the most frequent search queries on Google, or most viewed videos on Youtube, or most played songs on Spotify, or most shared or liked posts on Twitter, Facebook or Instagram. If you imaging scale of those services and a requirement to identify a list of top elements as close to real time as possible, you will see that the problem is hard.