Distributed Data Guide

This article covers the basics of some modern distributed services and databases. It is by no means comprehensive.

Distributed systems generally fall into two main categories:

Shared Disk/Data: Nodes in these systems share data resources with each other. Because nodes depend on each other to satisfy requests, scaling can become complicated. Shared systems can also suffer downtime caused by a single point of failure (SPOF). An example of such a system is a DB cluster with a single master node and no standby.
Shared Nothing: Each update request is satisfied by a single node, for the most part independently of other nodes. Thus a cluster can potentially scale like a blob of nodes, allowing for easier growth and maintenance. This design also eliminates the SPOF. Examples of shared-nothing systems are the OpenStack cloud platform and ScyllaDB.
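
To make the shared-nothing idea concrete, here is a minimal sketch in which each key is owned by exactly one node, so a read or write touches only that node's private storage. The `Node` and `ShardedCluster` classes and the node names are hypothetical, invented for illustration; they are not taken from OpenStack or ScyllaDB.

```python
import hashlib


class Node:
    """Hypothetical node that owns a private slice of the data."""

    def __init__(self, name):
        self.name = name
        self.store = {}  # local storage, never shared with other nodes


class ShardedCluster:
    """Routes every key to exactly one owning node (illustrative only)."""

    def __init__(self, node_names):
        if not node_names:
            raise ValueError("at least one node is required")
        self.nodes = [Node(name) for name in node_names]

    def owner(self, key):
        # A stable hash ensures the same key always maps to the same node.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(self.nodes)
        return self.nodes[index]

    def put(self, key, value):
        # Only the owning node is involved in satisfying the update.
        self.owner(key).store[key] = value

    def get(self, key):
        return self.owner(key).store.get(key)


if __name__ == "__main__":
    cluster = ShardedCluster(["node-a", "node-b", "node-c"])
    cluster.put("user:42", {"name": "Ada"})
    print(cluster.owner("user:42").name, cluster.get("user:42"))
```

Simple modulo placement is used here only for brevity: adding or removing a node would remap most keys. Real systems typically use consistent hashing or a similar scheme so that only a fraction of the data moves.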

Distributed Storage

A distributed storage system is one that has distributed resources for resilience and performance.

In particular we will discuss distributed storage technologies:

These tools are used to enhance and enable data distribution:

HAProxy (https://haproxy.org): Load balancer that distributes load across clusters
Pingora (https://github.com/cloudflare/pingora): A framework for network proxy, balancing, and observability
etcd (https://etcd.io): A key-value store for cluster-wide data consistency
ZooKeeper (https://zookeeper.apache.org): Another key-value store for cluster-wide data consistency and coordination
Apache Hadoop (https://hadoop.apache.org): A software framework that distributes the processing of large data sets across clusters
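
As a rough illustration of what distributing load to a cluster means for a load balancer, the sketch below hands out backends in round-robin order. It is not how HAProxy or Pingora is implemented; the `RoundRobinBalancer` class and the backend addresses are hypothetical.

```python
import itertools


class RoundRobinBalancer:
    """Hypothetical balancer that rotates through backends in order."""

    def __init__(self, backends):
        backends = list(backends)
        if not backends:
            raise ValueError("at least one backend is required")
        # cycle() repeats the backend list endlessly in the same order.
        self._cycle = itertools.cycle(backends)

    def next_backend(self):
        # Each call returns the next backend, wrapping around at the end.
        return next(self._cycle)


if __name__ == "__main__":
    balancer = RoundRobinBalancer(["10.0.0.1:8080", "10.0.0.2:8080"])
    for request_id in range(4):
        print(request_id, "->", balancer.next_backend())
```

Round robin assumes that backends are roughly equal in capacity and that requests cost about the same; production balancers also offer strategies such as least-connections and perform health checks, which this sketch omits.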

Additional Information