DynamoDB: Exploring its architecture in brief


Disclaimer: The content of this blog is based on information obtained from publicly available research papers as well as other publicly accessible sources. Most of the information in the article is derived or directly consumed from the following papers:

  1. Dynamo: Amazon’s highly available key-value store

  2. Amazon DynamoDB: A Scalable, Predictably Performant, and Fully Managed NoSQL Database Service

Introduction

Amazon DynamoDB is a fully managed, elastic, NoSQL cloud database which guarantees single-digit-millisecond response times and five nines (99.999%) availability at any scale. During Prime Day 2023, DynamoDB requests peaked at 126 million requests per second.

In this blog we examine the internal workings and architecture of Amazon DynamoDB that enable such fast response times and high availability.

What is DynamoDB?

  • DynamoDB is primarily a NoSQL key-value database built to serve OLTP use-cases. It provides a very simple interface to interact with:

    • putItem(key, value)

    • getItem(key)

    • updateItem(key, value)

    • deleteItem(key)

  • It also provides ACID capabilities, where multiple queries/operations can be performed as a single transaction.

  • An item of up to 400 KB can be stored against a partition key. The partition key needs to be unique across the table. However, if a sort key is defined, multiple items can share the same partition key, but each partition key and sort key combination must be unique.

  • DynamoDB also has the concept of Local Secondary Indexes and Global Secondary Indexes, which mirror the secondary indexes of the SQL world. These secondary indexes in DynamoDB provide the capability to query on attributes other than the main table's partition key and sort key.

  • DynamoDB supports both strongly consistent and eventually consistent reads.
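The interface above can be sketched as a tiny in-memory table. This is purely illustrative: the class, method names, and size check are assumptions that mimic the API shape described above, not any real SDK or DynamoDB internals.

```python
# Minimal in-memory sketch of the key-value interface described above.
MAX_ITEM_SIZE = 400 * 1024  # the 400 KB item size limit


class MiniTable:
    def __init__(self, has_sort_key=False):
        self.has_sort_key = has_sort_key
        self.items = {}  # uniqueness key -> value

    def _key(self, key):
        # With a sort key, uniqueness is on (partition key, sort key);
        # otherwise the partition key alone must be unique.
        if self.has_sort_key:
            pk, sk = key
            return (pk, sk)
        return (key,)

    def put_item(self, key, value):
        if len(repr(value).encode()) > MAX_ITEM_SIZE:
            raise ValueError("item exceeds 400 KB limit")
        self.items[self._key(key)] = value

    def get_item(self, key):
        return self.items.get(self._key(key))

    def update_item(self, key, value):
        k = self._key(key)
        if k not in self.items:
            raise KeyError("item not found")
        self.items[k] = value

    def delete_item(self, key):
        self.items.pop(self._key(key), None)


# A table with a sort key allows many items under one partition key.
orders = MiniTable(has_sort_key=True)
orders.put_item(("user#1", "order#100"), {"total": 25})
orders.put_item(("user#1", "order#101"), {"total": 40})
```

Note how both items share the partition key `user#1` but remain distinct because their sort keys differ.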

Data Storage and Replication

  • While there is no publicly available information on which storage engine DynamoDB uses internally, the original Dynamo had a pluggable storage engine with Berkeley DB and MySQL in use. However, the DynamoDB paper specifies that data is stored as a B-tree on disk rather than a Log-Structured Merge Tree.

  • It wouldn't be possible to store all the data on a single physical node/server. Even if the data did fit on one physical server, the table's read and write operations would be limited by the I/O and network capacity of that server. Hence DynamoDB divides a single table's data into multiple partitions, which are stored on different nodes. DynamoDB uses a variation of consistent hashing on the partition key to map each item to the partition where it is stored. Each partition of the table hosts a unique block of data.

  • Each partition of the table is replicated and distributed across multiple AZs to achieve high availability. The replicas of each partition form a replication group, or cluster.

  • DynamoDB uses a single-leader replication strategy, while the remaining nodes of the cluster take up the follower role. It internally uses the Multi-Paxos algorithm to elect the leader among the replicas.

  • Only the leader of the partition's replication group can serve write requests. The follower replicas can serve only eventually consistent reads, while only the leader node can serve strongly consistent reads.

  • On receiving a write request, the leader node generates a Write-Ahead Log entry and sends it to the peer replicas. To maintain strong consistency, the write is acknowledged to the application only after a quorum of nodes in the replication group have persisted it.

  • The nodes in the replication group consist of a Write-Ahead Log (WAL) and a B-tree on disk to store the key-value data. The WAL is used for point-in-time recovery as well.

  • The replication group also contains log replicas, which store only the WAL and not the B-tree. This is done to increase availability and durability.
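The partitioning and quorum-write steps above can be sketched in a few lines. The partition count, replica count, MD5-based bucketing, and class names here are illustrative assumptions standing in for DynamoDB's internal scheme.

```python
# Sketch of hash-based partition routing plus quorum-acknowledged writes,
# following the replication description above.
import hashlib

NUM_PARTITIONS = 4
REPLICAS_PER_GROUP = 3
WRITE_QUORUM = 2  # acks required before responding to the client


def partition_for(partition_key: str) -> int:
    # DynamoDB hashes the partition key to choose a partition; a plain
    # MD5 bucket stands in for its internal consistent-hashing variant.
    digest = hashlib.md5(partition_key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS


class Replica:
    def __init__(self):
        self.wal = []    # write-ahead log entries
        self.btree = {}  # stand-in for the on-disk B-tree

    def apply(self, key, value) -> bool:
        self.wal.append((key, value))  # log first, then apply
        self.btree[key] = value
        return True


class ReplicationGroup:
    """Leader plus followers for one partition."""

    def __init__(self):
        self.replicas = [Replica() for _ in range(REPLICAS_PER_GROUP)]
        self.leader = self.replicas[0]  # elected via Multi-Paxos in reality

    def write(self, key, value) -> bool:
        # The leader ships the WAL entry to peers; the write succeeds
        # once a quorum has persisted it.
        acks = sum(1 for r in self.replicas if r.apply(key, value))
        return acks >= WRITE_QUORUM


groups = [ReplicationGroup() for _ in range(NUM_PARTITIONS)]
p = partition_for("user#1")
ok = groups[p].write("user#1", {"name": "alice"})
```

In this toy version every replica acknowledges instantly; in practice acks arrive asynchronously, which is why the quorum threshold matters.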

Throughput and Rate-limiting

  • One of the key features of DynamoDB is its elasticity and auto-scaling. There is no theoretical limit to the size of a table. Irrespective of table size, a constant response time is guaranteed without any limitations on the table's throughput.

  • DynamoDB provides 2 modes of throughput provisioning:

    • Provisioned mode - For predictable traffic patterns.

    • On-Demand mode - For unpredictable traffic patterns.

  • The underlying architecture to support throughput for provisioned-mode tables is simpler compared to on-demand tables. The table is split into partitions across the replication groups, and a distributed rate-limiting algorithm is used to rate-limit client calls.

  • A storage node of a replication group can host multiple partitions from different tables. It must be ensured that the total throughput capacity of the partitions stored on a node is less than the capacity of that storage node.

  • The solution for on-demand tables, however, is much more complicated, as there is no predefined limit on the max throughput the table might experience.

  • It could happen that partitions from different tables residing on the same storage node simultaneously request throughput that exceeds the physical limits of the storage node.

  • For example: let's assume there are two partitions, P1 and P2, belonging to different tables but residing on a single storage node with a limit of 5000 TPS. Now, say both of these tables are in on-demand mode. It could happen that both tables request more than 5000 TPS at the same time.

  • To handle this scenario, DynamoDB moves a partition from one node to another when the increased throughput demand exceeds what the current storage node can serve. Such partition migration is also done when the table size exceeds the storage limit of the node.

  • The auto-scaling feature in DynamoDB is reactive rather than proactive, i.e. at any given point in time, the max throughput capacity of the table is 2x the past peak. If the table is serving 100 TPS to 500 TPS, with 500 TPS being the previous peak, the current max throughput capacity of the table would be 1000 TPS. If the 1000 TPS limit is hit, the capacity is increased to 2000 TPS. However, DynamoDB does not guarantee the absence of throttling if the client exceeds double the previous traffic peak within 30 minutes.
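Two pieces of this section lend themselves to a short sketch: a token bucket of the kind a distributed rate limiter would run per partition, and the reactive "2x past peak" ceiling for on-demand tables. The capacities, names, and refill policy below are illustrative assumptions, not DynamoDB's actual admission-control code.

```python
# Sketch of per-partition rate limiting (token bucket) and the reactive
# on-demand ceiling described above.
import time


class TokenBucket:
    def __init__(self, capacity_per_sec: float):
        self.rate = capacity_per_sec
        self.tokens = capacity_per_sec  # start with one second's budget
        self.last = time.monotonic()

    def try_consume(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at one second's worth.
        self.tokens = min(self.rate, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # this request would be throttled


def on_demand_ceiling(past_peak_tps: float) -> float:
    # Reactive scaling: the table can absorb up to double its previous peak.
    return 2 * past_peak_tps


# A 5 TPS partition fired at 10 requests in a burst admits only 5.
bucket = TokenBucket(capacity_per_sec=5.0)
granted = sum(bucket.try_consume() for _ in range(10))
```

With the blog's own numbers, `on_demand_ceiling(500)` yields the 1000 TPS ceiling described above.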

Availability and Consistency

  • DynamoDB supports both eventual and strong consistency for reads. By default, all reads are eventually consistent; however, if required, the strongly-consistent-read flag can be set to true to get a strongly consistent read.

  • To support strong consistency, writes to the replication group need a quorum to succeed, i.e. at least 2 of the 3 nodes in the replication group must accept the write. If one of the storage nodes is unavailable, the write can also go to a log node, which stores only the WAL.
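The read-routing rule above can be sketched as follows: strongly consistent reads always go to the leader, while eventually consistent reads may land on a follower that has not yet applied the latest write. The classes and the deliberately lagging followers are illustrative assumptions.

```python
# Sketch of read routing for the two consistency modes described above.
import random


class StorageNode:
    def __init__(self, name: str):
        self.name = name
        self.data = {}


class Partition:
    def __init__(self):
        self.leader = StorageNode("leader")
        self.followers = [StorageNode("follower-1"), StorageNode("follower-2")]

    def write(self, key, value):
        # Simplified: the leader applies the write immediately while
        # followers lag behind (replication is asynchronous here).
        self.leader.data[key] = value

    def read(self, key, consistent_read: bool = False):
        if consistent_read:
            node = self.leader  # strongly consistent: always the leader
        else:
            # Eventually consistent: any replica, which may be stale.
            node = random.choice([self.leader] + self.followers)
        return node.data.get(key)


p = Partition()
p.write("k", "v")
strong = p.read("k", consistent_read=True)  # always "v"
eventual = p.read("k")                      # "v", or None from a stale follower
```

The gap between `strong` and `eventual` here is exactly the trade-off the flag exposes: a guaranteed-fresh read from the leader versus a possibly stale but cheaper read from any replica.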