Amazon Dynamo is a highly available key-value storage system that is fully scalable, decentralized, and tolerant to network partitions. Nodes are symmetric in functionality, heterogeneous in performance, and provide single-hop routing without requiring any manual partitioning or configuration. Dynamo focuses on providing high availability through replication, using a loose quorum protocol in which non-replica nodes can be used to deliver updates to temporarily unavailable nodes. It implements an eventual consistency model facilitated by gossiping, hinted handoff, and an anti-entropy synchronization protocol that uses Merkle trees to limit the amount of data transferred.

Context and purpose

Amazon Dynamo is a decentralized, highly available key-value storage system. Key-value stores are a class of database management systems that do not rely on the traditional relational data model but on one where querying is limited to storing and retrieving data by a primary key. Traditional database management systems typically provide the ACID properties (Atomicity, Consistency, Isolation, Durability). These relational database management systems focus on consistency at the cost of availability, whereas Amazon Dynamo focuses on availability at the cost of consistency in order to provide the so-called "always on" experience.

Amazon Dynamo is used in architectures that require key-value storage, typically with high availability requirements, while giving the engineer control over the trade-offs between availability, consistency, and performance. This is done by tuning the parameters $N$ (the number of nodes that hold a replica), $R$ (the minimum number of nodes that must participate in a successful read operation), and $W$ (the minimum number of nodes that must acknowledge a successful write operation). Amazon Dynamo does not provide any means of data integrity or security, so it should be used in trusted infrastructures within a single administrative domain.

Architecture

Amazon Dynamo is distributed and can be horizontally scaled with minimal impact. Data is partitioned and replicated over $N$ nodes using a variant of consistent hashing.

Each node has the same set of responsibilities and there is no concept of a so-called master or super node. This simplifies maintenance and allows nodes to be provisioned without manual partitioning or configuration. There is, however, a concept of seed nodes: nodes known to the whole network, statically configured or discovered via external services, which are used to avoid logical partitioning of the network.

Amazon Dynamo employs the concept of so-called virtual nodes to achieve a proportional load distribution and to solve the problems of non-uniform data and load distribution caused by basic consistent hashing. Each physical node is responsible for a number of virtual nodes proportional to that node's capabilities.
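
As a minimal sketch of this idea, the following Python fragment maps keys and virtual nodes onto a hash ring; the token naming scheme and node names are illustrative assumptions rather than Dynamo's actual implementation (the paper does use MD5 to place keys on a 128-bit ring):

    import bisect
    import hashlib

    def ring_hash(value: str) -> int:
        # Map a string onto the hash ring (the paper uses MD5 for a 128-bit space).
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    class Ring:
        def __init__(self):
            self.tokens = []   # sorted virtual node positions on the ring
            self.owner = {}    # token -> physical node

        def add_node(self, node: str, num_vnodes: int):
            # More capable machines are assigned more virtual nodes (tokens).
            for i in range(num_vnodes):
                token = ring_hash(f"{node}#vnode{i}")
                bisect.insort(self.tokens, token)
                self.owner[token] = node

        def coordinator(self, key: str) -> str:
            # A key is assigned to the first virtual node encountered when
            # walking the ring clockwise from the key's hash position.
            pos = bisect.bisect(self.tokens, ring_hash(key)) % len(self.tokens)
            return self.owner[self.tokens[pos]]

    ring = Ring()
    ring.add_node("node-a", num_vnodes=8)    # smaller machine
    ring.add_node("node-b", num_vnodes=16)   # twice the capacity, twice the tokens
    print(ring.coordinator("customer:42"))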

New nodes announce which virtual nodes they are going to represent and exchange their membership with the network via a gossip-based protocol. As a result, the existing nodes currently responsible for those virtual nodes offer their keys to the new node for take-over.

Amazon Dynamo can be characterized as a single-hop DHT since it avoids routing requests through multiple nodes in order to provide low latencies. Requests enter the system via a load balancer which uniformly distributes them over the available nodes. Every node maintains sufficient information to derive the physical nodes responsible for storing a particular key (the top of the preference list) and can therefore route the client request directly to the appropriate node.
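
Building on the ring sketch above, deriving a preference list could look as follows; skipping virtual nodes that map to an already selected physical node is an assumption about how distinct replicas are chosen:

    def preference_list(ring: Ring, key: str, n: int) -> list:
        # Walk the ring clockwise from the key's position and collect the
        # first n distinct physical nodes; these hold the key's replicas.
        start = bisect.bisect(ring.tokens, ring_hash(key))
        nodes = []
        for i in range(len(ring.tokens)):
            node = ring.owner[ring.tokens[(start + i) % len(ring.tokens)]]
            if node not in nodes:
                nodes.append(node)
            if len(nodes) == n:
                break
        return nodes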

The node handling the read or write request on behalf of the client is called the coordinator node. It is responsible for contacting the $R$ or $W$ replicas. Any node can coordinate a read request, while a write request must be coordinated by one of the replica nodes in the top of the preference list, because the coordinator of a write is responsible for creating the new version, which requires access to the data. The fact that any node can coordinate a read request, and that in an ideal scenario any of the $N$ healthy nodes in the preference list can coordinate a write request, helps in achieving a uniform load distribution.
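
Read coordination under these rules could be sketched as follows, reusing the preference list above; send_get and reconcile are hypothetical placeholders for the node-to-node request and the version reconciliation, not Dynamo's actual interfaces:

    def coordinate_read(ring: Ring, key: str, n: int, r: int):
        # Contact replicas from the top of the preference list until R of
        # them have answered; unresponsive nodes are simply skipped.
        replies = []
        for node in preference_list(ring, key, n):
            reply = send_get(node, key)   # hypothetical RPC to a replica
            if reply is not None:
                replies.append(reply)
            if len(replies) >= r:
                break
        # Causally unrelated versions are returned together so that the
        # client (or a simple server-side policy) can reconcile them.
        return reconcile(replies)         # hypothetical conflict handling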

If a contacted node is offline, the coordinator node uses the next node in the preference list, maintaining a record of the unhealthy node until it recovers. Offline nodes are only detected when they fail to respond to the coordinator node. There is no globally consistent view of failures, since node failures are likely to be temporary, while persistent membership changes are communicated explicitly.

Dynamo attempts to provide an "always writeable" experience, which required moving the conflict resolution process to read operations so that write operations always succeed. This means that multiple versions of an object can exist in the system at the same time.

Each object has a context containing metadata such as the version and vector clock information. A context is returned by each read operation and is required by each write operation to specify which version is being updated. Dynamo uses vector clocks to capture the causality between multiple versions and to determine whether a new version can replace the previous one or whether conflict resolution is required. Conflict resolution can be performed by the client or by the server, although resolution by the server is limited to simple policies such as last write wins.
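
A minimal sketch of the vector clock comparison, assuming clocks are represented as dictionaries of (node, counter) pairs:

    def descends(a: dict, b: dict) -> bool:
        # Version a causally descends from b if every counter in b is
        # present in a with a value that is at least as large.
        return all(a.get(node, 0) >= count for node, count in b.items())

    v1 = {"node-a": 1}
    v2 = {"node-a": 2}                  # supersedes v1: no conflict
    v3 = {"node-a": 1, "node-b": 1}     # concurrent with v2: conflict

    print(descends(v2, v1))                      # True: v2 simply replaces v1
    print(descends(v2, v3) or descends(v3, v2))  # False: reconciliation needed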

If Dynamo cannot automatically reconcile multiple versions, then the next read operation will return all versions deemed causally unrelated. Writing back using this context will reconcile the conflict.

Consistency

Dynamo implements a so-called eventual consistency model (Vogels 2008): a model where replicas eventually become consistent with the latest update through background propagation. Often $W < N$, meaning not all replica nodes need to acknowledge an update for it to succeed. As a result the probability that an update fails decreases, so availability increases while consistency decreases. This also benefits performance, since the duration of a request depends on the slowest node that must respond.
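
As an illustration of this effect, assume (purely for the sake of the example) that each replica is independently available with probability $p = 0.99$ and that $N = 3$. Requiring all replicas to acknowledge a write ($W = 3$) succeeds with probability $p^3 \approx 0.970$, whereas $W = 2$ only requires two out of three replicas and succeeds with probability $p^3 + 3p^2(1-p) \approx 0.9997$.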

The coordinator node is responsible for propagating the update to the remaining $N - 1$ replica nodes (minus one since the coordinator of a write operation is itself a replica node) in order to achieve consistency. Since a write usually follows a read, the coordinator of a write operation is typically chosen as the replica node that replied fastest to that preceding read operation, which increases the chance of "read-your-writes" consistency.

The size of the preference list is actually $M > N$, meaning not all $M$ nodes in the list hold a replica. This ensures that $W$ nodes can still acknowledge a write even when fewer than $W$ replica nodes are available ($N_{available} < W$), which benefits availability. If a replica node is unavailable, the coordinator node sends the update, together with additional metadata identifying the intended recipient, to one of the remaining nodes in the preference list that does not hold a replica. This replacement non-replica node periodically checks whether the target replica node has recovered and hands off the update once it becomes available again. This mechanism is called hinted handoff.
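
Hinted handoff on the write path could be sketched as follows, again reusing the preference list; send_put is a hypothetical RPC and the hint is simply the name of the replica the fallback node stands in for:

    def coordinate_write(ring: Ring, key: str, value, n: int, w: int) -> bool:
        # Contact the N replica nodes; for each one that is down, forward the
        # write to the next non-replica node in the extended preference list,
        # together with a hint naming the replica it temporarily replaces.
        extended = preference_list(ring, key, len(set(ring.owner.values())))
        replicas, fallbacks = extended[:n], extended[n:]
        acks = 0
        for replica in replicas:
            if send_put(replica, key, value, hint=None):       # hypothetical RPC
                acks += 1
            elif fallbacks and send_put(fallbacks.pop(0), key, value, hint=replica):
                acks += 1    # hinted handoff: the fallback delivers it later
        return acks >= w     # the write succeeds once W nodes have acknowledged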

A standard Dynamo configuration for $(N, R, W)$ is $(3, 2, 2)$. Note that $R + W > N$, meaning that under a strict quorum protocol the system would achieve strong consistency. Dynamo, however, does not implement a strict quorum protocol but a loose ("sloppy") quorum protocol, because non-replica nodes can acknowledge a write under the assumption that it is handed off to the replica node later.
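
A small illustration of why $R + W > N$ would guarantee overlap under a strict quorum; the check below simply enumerates every possible read and write set for the $(3, 2, 2)$ configuration:

    from itertools import combinations

    N, R, W = 3, 2, 2
    nodes = range(N)

    # Under a strict quorum every W-node write set and every R-node read set
    # share at least one node, so a read always observes the latest write.
    overlap = all(set(ws) & set(rs)
                  for ws in combinations(nodes, W)
                  for rs in combinations(nodes, R))
    print(overlap)  # True for (3, 2, 2), since R + W > N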

The problem then arises that this replacement non-replica node may itself become unavailable before handing off the update, with the result that the update is lost and the replicas become inconsistent. To solve this problem, Amazon Dynamo employs an anti-entropy (replica synchronization) protocol that actively synchronizes the replicas.

Performance

The ability of nodes to directly determine the coordinator node limits the number of hops, and thus the time a request is forwarded within the system, which decreases latency. A downside is that the routing table grows linearly with the number of nodes, so when many nodes participate in the network this table can become significantly large.

The fact that the coordinator of a write operation is typically the replica node that replied fastest to the preceding read operation plays well with the buffering mechanism, since the object is likely to still be present in that node's buffer or cache.

Tasks such as replica synchronization and hinted handoff are performed in the background in order to minimize their impact on critical foreground operations.

The anti-entropy protocol uses Merkle trees (trees in which each parent is a hash of its children) to detect inconsistencies while limiting data transfers. Two replicas can easily be checked for synchronization by comparing the roots of their trees: if the roots are equal, then their children must also be equal; otherwise, the children are exchanged and compared pairwise in order to determine which data is out of sync.
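
A condensed sketch of that comparison, assuming a binary Merkle tree stored as a dictionary in heap layout (children of position i at 2i and 2i+1); the layout and the SHA-256 hash are illustrative assumptions:

    import hashlib

    def node_hash(left: str, right: str) -> str:
        # Each parent is the hash of its children's hashes.
        return hashlib.sha256((left + right).encode()).hexdigest()

    def diff(tree_a: dict, tree_b: dict, node: int = 1, out=None) -> list:
        # Equal hashes mean the whole subtree is in sync and is skipped,
        # which is what limits the amount of data exchanged.
        out = [] if out is None else out
        if tree_a.get(node) == tree_b.get(node):
            return out
        if 2 * node not in tree_a:   # leaf: this key range is out of sync
            out.append(node)
            return out
        diff(tree_a, tree_b, 2 * node, out)
        diff(tree_a, tree_b, 2 * node + 1, out)
        return out

    # Leaves hold hashes over key ranges; parents combine their children:
    a = {2: "h1", 3: "h2"}; a[1] = node_hash(a[2], a[3])
    b = {2: "h1", 3: "hX"}; b[1] = node_hash(b[2], b[3])
    print(diff(a, b))   # [3]: only the second key range needs to be exchanged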

Availability

Amazon Dynamo offers engineers the means to configure the $N$, $R$, and $W$ parameters. In order to achieve the desired availability, $R$ and $W$ are typically chosen smaller than $N$. The $W - 1$ acknowledgments (minus one since the coordinator node itself is a replica node) do not necessarily have to originate from replica nodes, which again benefits availability at the cost of consistency.

DeCandia et al. [2007a] do not provide any formal proof of availability. They do, however, state that applications received a successful response for 99.9995% of requests without any data loss, although no source for this figure is given.


DeCandia, Giuseppe, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels. 2007a. "Dynamo: Amazon's highly available key-value store." SIGOPS Oper. Syst. Rev. 41 (6): 205--20. http://doi.acm.org/10.1145/1294261.1294281.