Hitchhiker's guide to Distributed Systems: Part 1
Introduction
Distributed systems are complex, both in theory and practice, yet they have become essential in today's world to handle massive compute and data workloads. In this blog, we will embark on a journey from the simplest machines to fully-fledged distributed systems, exploring the necessity of key concepts such as replication, partitioning, load-balancing, and leader election. At each stage, we will discuss the limitations of the current state and introduce a system concept that addresses these challenges.
Let’s start
Imagine we had to build a simple blogging application that exposes REST-based APIs for users to create, modify, delete and view blogs. We will start with the simplest architecture possible.
Figure 1: Basic REST Application | Design State - 1
The architecture represented in Figure 1 is a good start for our service. There is a web application (Python Flask, Java Spring, Node.js etc.) and a database (MySQL, PostgreSQL etc.) running on the same physical server, either self-hosted or on cloud providers such as AWS EC2, Azure Virtual Machines etc. The web application exposes a REST-based API over HTTPS to which users can connect. On every request from a user, the web application queries the database, performs some additional computation on top of the data returned, and returns the result to the user.
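To make this concrete, here is a minimal sketch of what Design State - 1 could look like, assuming Python Flask with SQLite standing in for the database; the route names, schema and file path are illustrative choices, not part of the design itself.

```python
# Minimal sketch of Design State - 1: the app and its database share one machine.
import sqlite3
from flask import Flask, jsonify, request

app = Flask(__name__)
DB_PATH = "blogs.db"  # hypothetical local database file on the same server

def get_db():
    conn = sqlite3.connect(DB_PATH)
    conn.execute("CREATE TABLE IF NOT EXISTS blogs (id INTEGER PRIMARY KEY, title TEXT, body TEXT)")
    return conn

@app.route("/blogs", methods=["POST"])
def create_blog():
    payload = request.get_json()
    with get_db() as conn:  # commits on successful exit
        cur = conn.execute("INSERT INTO blogs (title, body) VALUES (?, ?)",
                           (payload["title"], payload["body"]))
    return jsonify({"id": cur.lastrowid}), 201

@app.route("/blogs/<int:blog_id>", methods=["GET"])
def view_blog(blog_id):
    row = get_db().execute("SELECT id, title, body FROM blogs WHERE id = ?",
                           (blog_id,)).fetchone()
    if row is None:
        return jsonify({"error": "not found"}), 404
    return jsonify({"id": row[0], "title": row[1], "body": row[2]})

if __name__ == "__main__":
    app.run()  # web application and "database" both live on this single host
```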
This architecture functions well for our use case when the number of users is limited. However, as our application gains traction, we will begin to encounter challenges. There are mainly four fundamental physical resources essential for computation: CPU, memory, network and disk. Given enough users accessing our application, we are bound to be constrained by at least one of these resources to begin with. Let's say that our application is performing some heavy computation, which is causing CPU utilization to exceed safe limits.
Before moving on to the solution to the previous problem, there is another issue with the architecture shown in Figure 1: it shares resources between the application and the database. This setup has its own problems, the major ones being:
Resource Contention: The application and database will compete for the same hardware resources, such as CPU, memory, and I/O bandwidth. This can lead to performance degradation and resource contention issues, especially when both components have high resource demands.
Fault isolation: If a bug in one component causes issues like memory or compute overload, it can result in the server being terminated. This termination would also impact the database, even though it is not the source of the problem.
The solution to the above problems is quite simple. It's “Compute and Data Separation”.
Figure 2: Application with compute and data segregation | Design State - 2
Improving the design of our application to the one represented in Figure 2 not only solves the above-mentioned problems but also provides a few other advantages:
The application and database each have their own independent physical server, which ensures that there is no resource contention.
If one of them fails due to a software/hardware error, the other component will continue to operate. (Although this doesn't really help in the current state of the application, since if one component is down the entire application is unusable. We will come back to why this is required later on.)
It's quite possible that our application server has different resource needs compared to the database server. The application server might require more CPU and memory, while the database server might need more disk space and network I/O bandwidth. This data-compute separation allows us to independently choose the type of server needed for each use case.
Another thing to note in this improvement is that we also segregated the system into stateless and stateful components. In simple terms, a stateless component needs no information about previous transactions to process the current request, while a stateful component does. In our architecture, the application server is the stateless component, since it can process each request independently. The database server is not stateless, since it stores the state from all its previous transactions, and that information is essential to process new requests.
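As a rough illustration of the distinction, here is a tiny sketch; the handler and counter below are invented for this example and are not components of our blog service.

```python
# Stateless: the output depends only on the inputs of the current request,
# so any application server (including a brand-new one) can handle it.
def render_blog(request_params, db_row):
    return {"title": db_row["title"], "body": db_row["body"]}

# Stateful: the output depends on what has accumulated from earlier requests,
# which is why the database server cannot simply be replaced or duplicated.
class ViewCounter:
    def __init__(self):
        self.counts = {}  # state built up across previous requests

    def record_view(self, blog_id):
        self.counts[blog_id] = self.counts.get(blog_id, 0) + 1
        return self.counts[blog_id]
```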
This is a good improvement over the previous design, but it still has some caveats:
Since both our application server and database server run on only one server each, if either of them goes down, our entire application becomes unavailable.
The number of transactions per second (TPS) this architecture can handle is still limited by the physical limitations of a single server.
Since our application layer is compute-intensive, let's assume our application has enough users to cause CPU utilization to exceed safe limits. To overcome this, we need to scale our application layer. Two kinds of scaling can be done:
Vertical scaling (scale-up): Here we increase the resources within the same physical machine. For example, if the current machine has 128GB RAM, we can increase it to 256GB, 512GB or 1TB, until the availability of larger machines limits us.
Horizontal scaling (scale-out): At some point, if we keep scaling vertically, we will either run out of larger machines or it will become too expensive to get a bigger one. Then we will have to scale horizontally, i.e. add more machines rather than increasing their size.
It is almost always more efficient to scale vertically before scaling the service horizontally. Having 4 machines with 256GB RAM is more efficient in most cases than having 256 machines with 4GB RAM, because more machines means more overhead. For example, the OS takes up part of each server's resources just to keep the machine active and working, and that overhead across 256 instances is much larger than across 4 instances.
In our case, let's assume that we have already scaled up to the largest easily available instance type, but our users have grown beyond its capacity. Hence, horizontal scaling needs to be done, as represented in the diagram below.
Figure 3: Horizontally scaled compute layer with N instances of application server | Design State - 3
Since the application layer is a stateless component, we can theoretically scale it infinitely, i.e. we can have N instances of the service running on N physical machines. But here we face a small problem: if you remember, our application server exposes a REST API to which users connect. If there are N instances, to which server should the domain direct the traffic? To address this problem, we introduce another layer called the Load Balancer. Load balancers have other advantages apart from solving this problem, which can be referred to here.
A Load Balancer is a reverse proxy layer that sits between the users and the application servers. It receives user requests and forwards them to one of the application servers based on the configured load balancing algorithm (a minimal round-robin sketch follows the list below). This setup offers the following advantages:
If one of the application servers is terminated, our application will still be active and running.
If we see a huge rise in traffic, we can scale the application layer by adding hosts to the compute layer without any additional changes (assuming that our data layer can handle the traffic).
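Here is a minimal sketch of the round-robin idea, assuming a hypothetical pool of three application servers; a real load balancer would also handle health checks, connection handling and TLS termination.

```python
import itertools

# Hypothetical pool of application servers sitting behind the load balancer.
BACKENDS = ["app-server-1:8000", "app-server-2:8000", "app-server-3:8000"]
_round_robin = itertools.cycle(BACKENDS)

def pick_backend():
    """Round-robin: each new request goes to the next server in the pool."""
    return next(_round_robin)

def handle_request(http_request):
    backend = pick_backend()
    # A real load balancer would open a connection to `backend`, forward the
    # request and stream the response back; here we only return the choice.
    return backend
```

Adding capacity is then just a matter of adding another entry to the pool; users keep talking to the same domain, which resolves to the load balancer.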
Now our architecture is in a fairly good state. However, there are still a few problems that need to be addressed:
Since our database is software running on a single physical machine, it acts as a single point of failure, i.e. if our database goes down, our entire application will not be functional.
At some point, with enough users even the biggest single machine hosting the database will not be able to handle the traffic.
Here's where the hard part begins. As we've seen, scaling the stateless system was pretty straightforward. However, the complexity arises when scaling the stateful components.
Let's start simple and handle the 2 problems mentioned above:
Figure 4: Data layer with master and 2 replica database servers
Now, what we have done to our data layer is add 2 replica servers. This means they will have exactly the same data as our primary or master database server. This solves 2 problems:
If the primary/master server becomes unavailable, we have an exact copy of the data on a backup server/replica that can be used.
In most real-world scenarios, workloads are read-heavy, meaning we read from the database more often than we write to it. We can use the replica databases to handle the reads while our master database handles the writes; a small routing sketch follows this list.
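Here is how such read/write routing could look inside the application layer, with made-up endpoint names; in practice a database driver or proxy usually takes care of this.

```python
import random

# Hypothetical connection endpoints; real code would hold actual DB clients.
PRIMARY = "db-primary:5432"
REPLICAS = ["db-replica-1:5432", "db-replica-2:5432"]

def route(query):
    """Send writes to the master/primary and spread reads across replicas."""
    is_write = query.strip().lower().startswith(("insert", "update", "delete"))
    return PRIMARY if is_write else random.choice(REPLICAS)

print(route("SELECT * FROM blogs WHERE id = 7"))         # -> one of the replicas
print(route("INSERT INTO blogs (title) VALUES ('hi')"))  # -> db-primary:5432
```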
Still, there are a lot of nuances in this design, such as:
What does the write workflow look like?
Does the write go to all 3 servers synchronously or asynchronously? If asynchronously, how do we guarantee that our application, when reading from a replica server, always sees the latest data?
What exactly happens when the master node goes down? How is the new master decided?
Let's answer these questions.
What does the write workflow look like? Let's look at the 2 extreme scenarios.
Scenario 1: The user receives the write acknowledgement only after the data has been written to all the replicas.
Figure 5: Synchronous replication database
Scenario 2: The user gets the acknowledgement as soon as the data is written to the master, and the data is later replicated asynchronously to the replica databases.
Figure 6: Asynchronous replication database
Scenario 1 guarantees that all the database servers will always have the latest copy of the data and that the state of the data across all 3 servers is consistent.
Scenario 2 responds to the user as soon as the write to the master succeeds, without waiting for the data to reach the replica database servers. This offers two benefits over Scenario 1: a) it is obviously faster, and b) in Scenario 1, if even one of the replica servers becomes unavailable, the write cannot succeed, whereas Scenario 2 continues to accept writes even if both replica servers are down.
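To make the two scenarios concrete, here is a toy model of synchronous versus asynchronous replication; the Primary and Replica classes and their apply method are invented purely for illustration.

```python
import threading

class Replica:
    def __init__(self):
        self.log = []

    def apply(self, record):
        self.log.append(record)

class Primary:
    def __init__(self, replicas):
        self.replicas = replicas
        self.log = []

    def write_sync(self, record):
        """Scenario 1: acknowledge only after every replica has the record."""
        self.log.append(record)
        for replica in self.replicas:
            replica.apply(record)   # if any replica is unreachable, this fails
        return "ack"                # user waits until all copies are consistent

    def write_async(self, record):
        """Scenario 2: acknowledge immediately, replicate in the background."""
        self.log.append(record)
        for replica in self.replicas:
            threading.Thread(target=replica.apply, args=(record,)).start()
        return "ack"                # replicas may briefly lag behind the master
```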
The above illustrates the well-known CAP theorem (Consistency, Availability, Partition tolerance). In theory, it states that you can achieve only two of the three guarantees. In practice, however, it is a choice between consistency and availability alongside partition tolerance: we cannot build a system that chooses consistency and availability while giving up partition tolerance, since network failures are bound to happen in real-world scenarios.
The decision between consistency and availability is not a binary choice. Rather, it presents a range of possibilities and trade-offs. It can be seen as a spectrum, as represented below:
Figure 7: Consistency - Availability spectrum
The service can choose the balance between consistency and availability anywhere on the spectrum, based on the requirements. In practical situations, we don't want our systems to be at the extremes, as shown in the scenarios above. Instead, we prefer something in the middle.
Let's consider an example where we have a master node and 5 replica nodes. In this case, we want to achieve a quorum for read and write operations. The protocol uses two configurable values: R, the number of nodes involved in a read operation, and W, the number of nodes involved in a write operation. These values must satisfy the condition R + W > N, where N is the total number of nodes in the system. With this setup, we can guarantee that the data read is consistent. The values of R and W also determine the latency of reads and writes; depending on the use case, they can be tuned for either a fast-read or a fast-write service.
Now, summarizing the above with respect to the design of our system: we have a master data node with 2 replica nodes. Let's say we have chosen both W and R to be 2, satisfying W + R > N (2 + 2 > 3). A write request would then be replicated to at least 1 node apart from the master, and a read request would read from at least 2 different nodes before returning the data to the user.
Figure 8: Sample Write Flow | Design State - 4
Figure 9: Sample Read Flow | Design State - 4
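Below is a simplified sketch of the quorum mechanics for N = 3, W = 2, R = 2, where each copy stores a (value, version) pair; the version counter is a stand-in for whatever ordering mechanism a real system would use.

```python
N, W, R = 3, 2, 2                  # must satisfy R + W > N
nodes = [{} for _ in range(N)]     # each node maps key -> (value, version)

def quorum_write(key, value, version):
    """Acknowledge the write once at least W nodes have stored it."""
    acks = 0
    for node in nodes:
        node[key] = (value, version)
        acks += 1
        if acks >= W:
            return True            # remaining nodes are skipped in this toy
    return False

def quorum_read(key):
    """Read from R nodes and return the value with the highest version."""
    # Because R + W > N, these R nodes always overlap the W nodes written above.
    responses = [node[key] for node in nodes[:R] if key in node]
    if not responses:
        return None
    return max(responses, key=lambda vv: vv[1])[0]

quorum_write("blog:42", "hello world", version=1)
print(quorum_read("blog:42"))      # -> "hello world"
```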
Our application can now handle a lot more requests than in Design State - 3, but it still has limits. Before starting Design State - 4, we had assumed that our service was read-heavy. But let's say the product team added a new feature which made it write-heavy, while we still have only 1 node handling all the write traffic. How do we scale this further? Partitioning.
Partitioning the data
Partitioning the data means dividing it into multiple parts which can be independently processed on different servers.
Figure 10: Representation of data partitioning
If we consider the bar in Figure 10 as the complete data, the colored sections represent the different partitions, which can be handled by different servers. But how do we decide which partition a piece of data belongs to?
The simplest strategy is to use key mod N, where N is the number of servers. If we have 3 servers and keys ranging from 1 to 10, the data would be divided into [1, 4, 7, 10], [2, 5, 8] and [3, 6, 9]. This approach divides the data effectively if we always have a fixed set of nodes, but in practice the node count may not remain constant.
We could have a system where traffic patterns are uneven: when load is high, more servers need to be added, and when load is low, servers need to be removed to keep costs in check. On top of this, a data node might unexpectedly go down, which also alters the server count. The mod N approach doesn't handle this well. If the server count increases to 4 with the same data range, the partitioning becomes [1, 5, 9], [2, 6, 10], [3, 7] and [4, 8]. Comparing this with the previous partitioning, almost the entire dataset has been shuffled. This is the rehashing problem. Data shuffling is a very expensive and time-consuming operation; it may also cause application downtime, leading to a bad user experience. How do we solve this?
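The reshuffling is easy to see in a few lines of code, using the same keys 1 to 10 as above.

```python
keys = list(range(1, 11))

def partitions(num_servers):
    """Assign each key to a server using key mod N."""
    assignment = {s: [] for s in range(num_servers)}
    for key in keys:
        assignment[key % num_servers].append(key)
    return assignment

before = partitions(3)  # {0: [3, 6, 9], 1: [1, 4, 7, 10], 2: [2, 5, 8]}
after = partitions(4)   # {0: [4, 8], 1: [1, 5, 9], 2: [2, 6, 10], 3: [3, 7]}

# Count how many keys land on a different server after adding one node.
moved = sum(1 for k in keys if k % 3 != k % 4)
print(moved)  # 8 of the 10 keys move: this is the rehashing problem
```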
This is where the consistent hashing algorithm helps us minimize data movement. Consistent hashing places both keys and servers on a hash ring, so that adding or removing a server only relocates the keys that fall between the affected server and its neighbour on the ring, rather than reshuffling everything.
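Here is a bare-bones sketch of a consistent hashing ring, assuming MD5 as the hash function and no virtual nodes; real implementations add virtual nodes per server to spread the load more evenly.

```python
import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, servers):
        # The ring is a sorted list of (hash, server) positions.
        self.ring = sorted((self._hash(s), s) for s in servers)

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(str(value).encode()).hexdigest(), 16)

    def server_for(self, key):
        """Walk clockwise from the key's position to the next server."""
        idx = bisect.bisect(self.ring, (self._hash(key),)) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
print(ring.server_for("blog:42"))
# Adding "node-d" would only move the keys that fall between node-d and its
# predecessor on the ring; every other key stays on its current server.
```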
Now that we have figured out how the data can be divided, let's see how the system looks with partitioning.
Figure 11: Data layer with partitioning
Above is a representation of the architecture once we have multiple data partitions deployed. For simplicity, let's assume that the consistent hashing logic resides in the application layer: it knows which partition to look in for a particular key and has visibility into all the partition servers.
Another thing to note from the diagram is that each partition has its own replica servers, which adds redundancy to the system.
Summary
In this article, we explored the journey of building a simple application, progressing from a single-server setup to a distributed architecture.
This provides only a high-level view of the overall journey without diving into the details of each concept. In the coming blogs, I will go deeper into each of the high-level concepts mentioned here.
What next?
For people who want to explore further, I have 2 book recommendations:
Designing Data-Intensive Applications: gives a high-level overview of every concept without going into too much detail.
Patterns of Distributed Systems: provides an in-depth view along with implementation details for each of the concepts.