Hadoop
Apache Hadoop is an open-source framework for scalable distributed computing. It uses HDFS for storage and MapReduce for processing, enabling scalable, fault-tolerant data analytics across various data types. It has its own APIs for accessing and manipulating data.
Hadoop excels in
Data lakes
Data warehousing and big data analytics via SQL
Large-scale batch processing (e.g. map reduce) of structured and unstructured data.
Stream Processing
Machine learning and data mining pipelines
Hadoop is complex, so most organizations use automation tool to deploy it. Hadoop requires similar human and hardware resources as Ceph and MinIO.
Hadoop Services
Hadoop Distributed File System (HDFS): This is the data storage layer that Hadoop uses to split the data up into blocks and distribute those to multiple nodes in order to get better performance.
MapReduce: The processing layer of Hadoop, which is a programming model for parallel processing of large datasets. MapReduce breaks tasks into smaller sub-tasks (Map) and processes them in parallel.
Hadoop Components
Hadoop clusters have these major components:
ResourceManager: manages resources across the cluster, scheduling and task allocation
NameNode: master server of HDFS, manages metadata and namespaces of HDFS.
DataNodes: worker nodes that store data in the HDFS
NodeManager: manages compute nodes and resources for applications within the cluster
In addition, a real cluster would have these components as well:
Secondary NameNode: performs periodic checkpointing of the NameNode
JobHistory Server: keeps history of completed jobs and their status
ZooKeeper: maintains static configuration data between the distributed components
![digraph G {
graph [fontsize=10 fontname="Verdana" compound=true];
node [shape=rectangle fontsize=10 fontname="Verdana"];
rank=TB;
subgraph Cluster{
label = "Hadoop Cluster";
subgraph RackA {
node [shape=record, color=blue];
label = "Rack_A";
l1 [label = "{
Switch \r| \
NameNode \r| \
node-1 \r| \
node-2 \r| \
node-3 \r| \
node-4 \r|
...
} \
"; nojustify=true; style=filled; fillcolor="lightblue"];
}
subgraph RackB {
node [shape=record];
label = "Rack_B";
l2 [label = "{
Switch \r| \
Standby NameNode \r| \
node-1 \r| \
node-2 \r| \
node-3 \r| \
node-4 \r|
...
} \
"; nojustify=true; style=filled; fillcolor="lightblue"];
}
subgraph RackC {
node [shape=record];
label = "Rack_C";
l3 [label = "{
Switch \r| \
Resource Manager \r| \
node-1 \r| \
node-2 \r| \
node-3 \r| \
node-4 \r|
...
} \
"; nojustify=true; style=filled; fillcolor="lightblue"];
}
node [style=filled; fillcolor="pink"];
SwitchFabrick -> l1:n;
SwitchFabrick -> l2:n;
SwitchFabrick -> l3:n;
}
}](../../../_images/graphviz-048d46ee70c5215e01796fbd8fa48719261df662.png)
Idealized Diagram of a Hadoop Cluster
See https://en.wikipedia.org/wiki/Hadoop for more details.