Hadoop

Apache Hadoop is an open-source framework for scalable distributed computing. It uses HDFS for storage and MapReduce for processing, enabling scalable, fault-tolerant data analytics across various data types. It has its own APIs for accessing and manipulating data.

Hadoop excels in

  • Data lakes

  • Data warehousing and big data analytics via SQL

  • Large-scale batch processing (e.g. map reduce) of structured and unstructured data.

  • Stream Processing

  • Machine learning and data mining pipelines

Hadoop is complex, so most organizations use automation tool to deploy it. Hadoop requires similar human and hardware resources as Ceph and MinIO.

Hadoop Services

  • Hadoop Distributed File System (HDFS): This is the data storage layer that Hadoop uses to split the data up into blocks and distribute those to multiple nodes in order to get better performance.

  • MapReduce: The processing layer of Hadoop, which is a programming model for parallel processing of large datasets. MapReduce breaks tasks into smaller sub-tasks (Map) and processes them in parallel.

Hadoop Components

Hadoop clusters have these major components:

  • ResourceManager: manages resources across the cluster, scheduling and task allocation

  • NameNode: master server of HDFS, manages metadata and namespaces of HDFS.

  • DataNodes: worker nodes that store data in the HDFS

  • NodeManager: manages compute nodes and resources for applications within the cluster

In addition, a real cluster would have these components as well:

  • Secondary NameNode: performs periodic checkpointing of the NameNode

  • JobHistory Server: keeps history of completed jobs and their status

  • ZooKeeper: maintains static configuration data between the distributed components

digraph G {

   graph [fontsize=10 fontname="Verdana" compound=true];
   node [shape=rectangle fontsize=10 fontname="Verdana"];
   rank=TB;


   subgraph Cluster{
   label = "Hadoop Cluster";

      subgraph RackA {
          node [shape=record, color=blue];
          label = "Rack_A";
          l1 [label = "{
                         Switch \r| \
                         NameNode \r| \
                         node-1 \r| \
                         node-2 \r| \
                         node-3 \r| \
                         node-4 \r|
                         ...
                         } \
                        "; nojustify=true; style=filled; fillcolor="lightblue"];
      }

      subgraph RackB {
          node [shape=record];
          label = "Rack_B";
          l2 [label = "{
                         Switch \r| \
                         Standby NameNode \r| \
                         node-1 \r| \
                         node-2 \r| \
                         node-3 \r| \
                         node-4 \r|
                         ...
                         } \
                        "; nojustify=true; style=filled; fillcolor="lightblue"];
      }

      subgraph RackC {
          node [shape=record];
          label = "Rack_C";
          l3 [label = "{
                         Switch \r| \
                         Resource Manager \r| \
                         node-1 \r| \
                         node-2 \r| \
                         node-3 \r| \
                         node-4 \r|
                         ...
                         } \
                        "; nojustify=true; style=filled; fillcolor="lightblue"];
      }
      node [style=filled; fillcolor="pink"];
      SwitchFabrick -> l1:n;
      SwitchFabrick -> l2:n;
      SwitchFabrick -> l3:n;
   }
}

Idealized Diagram of a Hadoop Cluster

See https://en.wikipedia.org/wiki/Hadoop for more details.