🌍 Hadoop Operating Modes

Overview of Hadoop Operating Modes

  1. Local Runtime Mode
  2. Pseudo-Distributed Operating Mode
  3. Fully Distributed Operating Mode
  4. High Availability (HA) Operating Mode

Local Runtime Mode

Definition

Local mode, also known as Standalone mode, runs Hadoop as a single Java process without any daemons. It is primarily used for development and testing purposes.

Configuration

  • Environment Requirements:
    • OS: Windows or Linux x64
    • JDK: jdk1.8.0_241
    • Hadoop: 3.x
  • Configuration File:
    • hadoop-env.sh: Configure the JAVA_HOME environment variable.
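
As a minimal sketch of the single setting local mode needs, the script below appends JAVA_HOME to a hadoop-env.sh in a scratch directory. The JDK path is an assumption for illustration, and the scratch directory stands in for the real configuration directory.

```shell
#!/bin/sh
# Sketch only: writes to a scratch directory, not to a live Hadoop installation.
CONF_DIR="$(mktemp -d)"                 # stand-in for $HADOOP_HOME/etc/hadoop
JDK_PATH="/usr/lib/jvm/jdk1.8.0_241"    # hypothetical install path; adjust locally

# hadoop-env.sh is a shell fragment sourced by the Hadoop launcher scripts.
printf 'export JAVA_HOME=%s\n' "$JDK_PATH" >> "$CONF_DIR/hadoop-env.sh"
cat "$CONF_DIR/hadoop-env.sh"

# With Hadoop on the PATH, local mode needs no daemons; a job reads and
# writes the local filesystem directly, for example:
#   hadoop jar "$HADOOP_HOME"/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
#     wordcount ./input ./output
```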

Pseudo-Distributed Operating Mode

Definition

In Pseudo-Distributed mode, Hadoop runs on a single node with each daemon running in a separate Java process to simulate a fully distributed environment. This mode is useful for experimental learning.

Processes

The following five daemons run in this mode:

  • NameNode: Manages the filesystem namespace.
  • DataNode: Stores and serves the data blocks.
  • SecondaryNameNode: Assists the NameNode by periodically merging the edit log into a checkpoint of the namespace image; it is not a standby NameNode.
  • ResourceManager: Manages cluster resources in YARN.
  • NodeManager: Manages and monitors resources on individual nodes.

Configuration

  • HDFS Daemon Process:
    • Use start-dfs.sh to start HDFS services.
  • YARN Daemon Process:
    • Use start-yarn.sh to start YARN services.
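
After both start scripts have run, jps is the usual way to confirm which daemons are up. The helper below is a rough sketch that reports which of the five daemons are absent from jps-style output. The function name and the sample output are made up for illustration.

```shell
#!/bin/sh
# Sketch: report which of the five expected daemons are absent from jps-style output.
EXPECTED="NameNode DataNode SecondaryNameNode ResourceManager NodeManager"

missing_daemons() {
  # $1: text in the "<pid> <ClassName>" format that jps prints
  for daemon in $EXPECTED; do
    # Compare the second field exactly so NameNode does not match SecondaryNameNode.
    printf '%s\n' "$1" | awk -v d="$daemon" '$2 == d { found = 1 } END { exit !found }' \
      || printf '%s\n' "$daemon"
  done
}

# On a real node, after start-dfs.sh and start-yarn.sh: missing_daemons "$(jps)"
# Hypothetical sample output in which YARN has not been started yet:
SAMPLE="4821 NameNode
4955 DataNode
5170 SecondaryNameNode
6012 Jps"
missing_daemons "$SAMPLE"
```

For the sample, the function prints ResourceManager and NodeManager, the two daemons that start-yarn.sh would bring up.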

Configuration Files

| Configuration File | Key Setting | Value | Description |
| --- | --- | --- | --- |
| core-site.xml | fs.defaultFS | hdfs://localhost:8020 | NameNode RPC address used for remote communication |
| hdfs-site.xml | dfs.replication | 1 | Number of copies kept of each data block |
| mapred-site.xml | mapreduce.framework.name | yarn | Framework that runs MapReduce jobs (the value is lowercase) |
| yarn-site.xml | yarn.resourcemanager.hostname | localhost | ResourceManager communication address |
| hadoop-env.sh | JAVA_HOME | Path to the JDK | Location of Java on your system |
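
The settings in the table above can be sketched as generated files. The script writes the four site files into a scratch directory for review rather than into a live installation. The write_site helper is hypothetical, and the framework value is written in lowercase, as Hadoop expects.

```shell
#!/bin/sh
# Sketch: generate the four site files in a scratch directory for review.
CONF_DIR="$(mktemp -d)"   # copy into $HADOOP_HOME/etc/hadoop once checked

write_site() {
  # $1: file name, $2: property name, $3: property value
  cat > "$CONF_DIR/$1" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>$2</name>
    <value>$3</value>
  </property>
</configuration>
EOF
}

write_site core-site.xml   fs.defaultFS                  hdfs://localhost:8020
write_site hdfs-site.xml   dfs.replication               1
write_site mapred-site.xml mapreduce.framework.name      yarn
write_site yarn-site.xml   yarn.resourcemanager.hostname localhost

ls "$CONF_DIR"
```

A working MapReduce-on-YARN setup typically also sets yarn.nodemanager.aux-services to mapreduce_shuffle in yarn-site.xml, which the table does not list.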

Fully Distributed Operating Mode

Definition

In Fully Distributed mode, Hadoop runs on a cluster of machines, and each daemon runs in its own Java process on the server node assigned to it. This mode is used for experimental verification and for commissioning enterprise clusters.

Processes

The same five daemons as in pseudo-distributed mode are present, now spread across the cluster nodes:

  • NameNode
  • DataNode
  • SecondaryNameNode
  • ResourceManager
  • NodeManager

Configuration

| Configuration File | Key Setting | Value | Description |
| --- | --- | --- | --- |
| core-site.xml | fs.defaultFS | hdfs://<Master_hostname>:8020 | RPC address of the master NameNode |
| hdfs-site.xml | dfs.replication | 2 | Number of copies kept of each data block |
| mapred-site.xml | mapreduce.framework.name | yarn | Framework that runs MapReduce jobs (the value is lowercase) |
| yarn-site.xml | yarn.resourcemanager.hostname | <hostname> | ResourceManager communication address |
| workers (named slaves in Hadoop 2.x) | DataNode addresses | One hostname or IP address per line | Worker nodes that run the DataNode daemon |
| hadoop-env.sh | JAVA_HOME | Path to the JDK | Location of Java on your system |
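
A rough sketch of the two cluster-specific pieces follows: the worker list and the shared fs.defaultFS. In Hadoop 3.x the list lives in a file named workers (slaves in 2.x). All hostnames here are hypothetical, and the files go to a scratch directory.

```shell
#!/bin/sh
# Sketch with hypothetical hostnames; writes to a scratch directory only.
CONF_DIR="$(mktemp -d)"
MASTER="master"              # hypothetical NameNode/ResourceManager host
WORKERS="worker1 worker2"    # hypothetical DataNode/NodeManager hosts

# workers: one host per line; the start scripts launch daemons on each host over SSH.
for host in $WORKERS; do
  printf '%s\n' "$host"
done > "$CONF_DIR/workers"

# Every node gets the same fs.defaultFS so clients and DataNodes find the NameNode.
cat > "$CONF_DIR/core-site.xml" <<EOF
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://$MASTER:8020</value>
  </property>
</configuration>
EOF

cat "$CONF_DIR/workers"
```

The same configuration directory is normally copied unchanged to every node, which keeps the cluster consistent.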

High Availability (HA) Operating Mode

Definition

The High Availability (HA) cluster was introduced in Hadoop 2.x to remove the NameNode single point of failure present in Hadoop 1.x.

Explanation

  • HDFS Architecture: Follows a Master/Slave topology with the NameNode as the master daemon.
  • Active/Passive Configuration:
    • Two NameNodes:
      • Active NameNode: Handles all client requests.
      • Standby/Passive NameNode: Acts as a backup that can take over from the active NameNode.

Functionality

  • If the Active NameNode fails, the Standby NameNode takes over, which reduces downtime.
  • Both NameNodes must stay synchronized to maintain consistency:
    • Metadata Synchronization: Both NameNodes should have identical metadata for fast failover.
    • Single Active NameNode: Only one NameNode may be active at a time, to prevent conflicting updates from the two nodes.

🗂️ High Availability (HA) Architecture in HDFS

Split-Brain Scenario

“A split-brain scenario occurs when a cluster gets divided into smaller clusters, each one believing it is the only active cluster, which can lead to data corruption.”

Fencing

  • Fencing is a process that ensures only one NameNode remains active at any given time to prevent split-brain scenarios.

Implementation of HA Architecture

In the HDFS HA architecture, two NameNodes run at the same time (one active, one standby), and their state is kept in sync by one of the following methods:

Using Quorum Journal Nodes

  • JournalNodes are a group of nodes that facilitate synchronization between the active and standby NameNodes.
  • The active NameNode writes its EditLog entries to the JournalNodes, while the standby NameNode continuously reads these changes and applies them to its own namespace.
  • During a failover, the standby NameNode ensures that it has read the latest metadata from the JournalNodes before becoming the active NameNode.

Architecture Illustration

| Component | Description |
| --- | --- |
| Active NameNode | Updates EditLogs in JournalNodes. |
| Standby NameNode | Reads changes from JournalNodes and applies them to its namespace. |
| JournalNodes | Facilitate synchronization and provide fault tolerance. |
| DataNodes | Send heartbeats and block location information to both NameNodes for fast failover. |
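
The QJM arrangement is expressed through a handful of hdfs-site.xml properties. The sketch below writes them to a scratch file. The property names are the standard HDFS HA ones, while the nameservice ID (mycluster), the NameNode IDs, and every hostname are assumptions for illustration.

```shell
#!/bin/sh
# Sketch: QJM-based HA properties for hdfs-site.xml, written to a scratch file.
# The nameservice ID and all hostnames are hypothetical.
NS="mycluster"
OUT="$(mktemp)"

prop() {  # print one <property> element: $1 = name, $2 = value
  printf '  <property>\n    <name>%s</name>\n    <value>%s</value>\n  </property>\n' "$1" "$2"
}

{
  echo '<configuration>'
  prop dfs.nameservices "$NS"
  prop "dfs.ha.namenodes.$NS" "nn1,nn2"
  prop "dfs.namenode.rpc-address.$NS.nn1" "nn-host1:8020"
  prop "dfs.namenode.rpc-address.$NS.nn2" "nn-host2:8020"
  # Both NameNodes point at the same JournalNode quorum for the shared EditLog.
  prop dfs.namenode.shared.edits.dir "qjournal://jn1:8485;jn2:8485;jn3:8485/$NS"
  prop "dfs.client.failover.proxy.provider.$NS" \
       "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
  echo '</configuration>'
} > "$OUT"

cat "$OUT"
```

Three JournalNodes are used here because an edit counts as written once a majority of them acknowledge it, so an odd number tolerates the loss of one node.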

Using Shared Storage

  • The active NameNode logs modifications in its namespace to an EditLog in shared storage, which the standby NameNode reads and applies.
  • In a failover situation, the standby updates its metadata from the shared storage before taking over as the active NameNode.

Fencing Mechanisms

  • At least one fencing method must be configured to avoid split-brain scenarios, which may include:
    • Killing the NameNode’s process.
    • Revoking access to the shared storage directory.
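
Fencing methods are listed in hdfs-site.xml under dfs.ha.fencing.methods. The fragment below is a sketch of one possible setup. The built-in sshfence method logs in to the previously active NameNode and kills its process, and the SSH key path is an assumption.

```shell
#!/bin/sh
# Sketch: fencing properties for hdfs-site.xml; the key path is hypothetical.
OUT="$(mktemp)"
cat > "$OUT" <<'EOF'
<property>
  <name>dfs.ha.fencing.methods</name>
  <!-- Methods are tried in order, one per line, until one succeeds. -->
  <value>sshfence
shell(/bin/true)</value>
</property>
<property>
  <name>dfs.ha.fencing.ssh.private-key-files</name>
  <value>/home/hadoop/.ssh/id_rsa</value>
</property>
EOF
cat "$OUT"
```

The trailing shell(/bin/true) entry always reports success, so failover still proceeds when the old host cannot be reached over SSH. Whether that trade-off is acceptable depends on the deployment. A custom shell(...) script is also where revoking access to a shared storage directory would go.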

Hadoop Access Control

Hadoop access control operates at two levels: system level and scheduler level.

Service Level Authorization

  • This system-level control determines which users may connect to which Hadoop services. It is checked first, before file permissions and queue permissions.

<property>
  <name>hadoop.security.authorization</name>
  <value>true</value>
</property>

Privilege Management Level

  • hadoop.security.authorization=true: Enables Service Level Authorization. If set to false (the default), no service-level checks are performed, so any user can connect to any service.

Service Level Authorization Properties

  • There are nine configurable properties, defined in hadoop-policy.xml, that specify access rights for users or user groups.
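
As an illustration of what one of these ACL properties looks like, the sketch below writes a single hadoop-policy.xml entry to a scratch file. security.client.protocol.acl governs which users may talk to HDFS as clients. The user and group names are made up.

```shell
#!/bin/sh
# Sketch: one ACL entry for hadoop-policy.xml; user and group names are hypothetical.
OUT="$(mktemp)"
cat > "$OUT" <<'EOF'
<property>
  <name>security.client.protocol.acl</name>
  <!-- Format: comma-separated users, one space, comma-separated groups. -->
  <value>alice,bob analysts</value>
</property>
EOF
cat "$OUT"

# After editing the live file, the NameNode can reload ACLs without a restart:
#   hdfs dfsadmin -refreshServiceAcl
```

A value of * grants access to everyone, which is the usual default for these properties.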

Adding DataNodes

To address capacity issues, you can add a DataNode by following these steps:

| Step | Description |
| --- | --- |
| Add hostname | Edit the hosts file to add the new DataNode’s hostname and IP address. |
| Copy Hadoop installation | Use scp to copy the Hadoop installation files to the new DataNode. |
| Start new node | Execute hadoop-daemon.sh start datanode in the Hadoop sbin directory (in Hadoop 3.x, hdfs --daemon start datanode is the preferred form). |
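
The three steps can be sketched as a dry run that prints the commands instead of executing them. The hostname, IP address, and install path are assumptions; the path is inferred from the exclude-file location used later in these notes.

```shell
#!/bin/sh
# Sketch: print (not execute) the commands for adding a DataNode.
# Host, IP, and install path are hypothetical.
NEW_HOST="worker3"
NEW_IP="192.168.1.13"
HADOOP_DIR="/opt/niit/hadoop"

plan_add_datanode() {
  # Step 1: make the new host resolvable by name.
  echo "echo '$NEW_IP $NEW_HOST' >> /etc/hosts"
  # Step 2: copy the installation into the same parent directory on the new host.
  echo "scp -r $HADOOP_DIR $NEW_HOST:$(dirname "$HADOOP_DIR")/"
  # Step 3: start the DataNode daemon on the new host.
  echo "ssh $NEW_HOST $HADOOP_DIR/sbin/hadoop-daemon.sh start datanode"
}

plan_add_datanode
```

Adding the new hostname to the workers file as well means later start-dfs.sh runs will include the node.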

Load Balancing

It is not advisable to stop a DataNode directly with hadoop-daemon.sh stop datanode: its blocks become under-replicated, and any block with no other replica goes missing from HDFS.

Proper Procedure for Removing a DataNode

  1. Edit hdfs-site.xml to point dfs.hosts.exclude at an exclude file:

    <property>
      <name>dfs.hosts.exclude</name>
      <value>/opt/niit/hadoop/conf/exclude</value>
      <description>Names a file that contains a list of hosts not permitted to connect to the namenode.</description>
    </property>
  2. Edit exclude file:

    • Add nodes to be deleted, one per line.
  3. Dynamic refresh node:

    • Execute sbin/refresh-namenodes.sh (or hdfs dfsadmin -refreshNodes), then wait until the node is reported as Decommissioned before shutting it down.
  4. Achieving Load Balancing:

    • Use the command sbin/start-balancer.sh -threshold 10.
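
The removal procedure can be sketched as follows. The script builds an exclude file in a scratch location and prints the cluster commands rather than running them. The hostnames are hypothetical, and the hdfs subcommands shown are the direct equivalents of the wrapper scripts.

```shell
#!/bin/sh
# Sketch: build an exclude file in a scratch location and print the follow-up commands.
EXCLUDE_FILE="$(mktemp)"            # stand-in for /opt/niit/hadoop/conf/exclude
NODES_TO_REMOVE="worker3 worker4"   # hypothetical hostnames

# One host per line, as the NameNode expects.
for node in $NODES_TO_REMOVE; do
  printf '%s\n' "$node"
done > "$EXCLUDE_FILE"
cat "$EXCLUDE_FILE"

# Commands to run on the cluster afterwards (printed here, not executed):
echo "hdfs dfsadmin -refreshNodes    # NameNode re-reads dfs.hosts.exclude"
echo "hdfs dfsadmin -report          # wait for 'Decommissioned' on each node"
echo "hdfs balancer -threshold 10    # then even out the remaining nodes"
```

Decommissioning lets the NameNode re-replicate the node's blocks elsewhere before the node goes away, which is what avoids missing blocks.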

Factors Considered During Balancing

  • Only blocks that have not already been moved during the current balancing run are candidates.
  • A block is moved only to a target machine that does not already hold a replica of it.
  • The distribution of a block’s replicas across racks must remain unchanged.
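
The -threshold value is a percentage: a DataNode counts as balanced when its disk utilization is within that many percentage points of the cluster-wide utilization. The function below is a simplified sketch of that comparison. The three-way labels and the sample figures are illustrative, not the balancer's own output.

```shell
#!/bin/sh
# Sketch: classify a DataNode against the balancer threshold (percentage points).
node_state() {
  # $1: node utilization %, $2: cluster utilization %, $3: threshold
  awk -v n="$1" -v c="$2" -v t="$3" 'BEGIN {
    if (n > c + t)      print "over-utilized"
    else if (n < c - t) print "under-utilized"
    else                print "balanced"
  }'
}

# Hypothetical figures: cluster at 50% overall, threshold 10.
node_state 72 50 10
node_state 55 50 10
node_state 31 50 10
```

With these figures the three calls print over-utilized, balanced, and under-utilized. A smaller threshold gives a more even cluster at the cost of moving more data.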

Hadoop Ecosystem

The Hadoop Ecosystem is a platform that addresses big data challenges and includes various services for data ingestion, storage, analysis, and maintenance.

Core Components of Hadoop

  1. HDFS: Hadoop Distributed File System.
  2. YARN: Yet Another Resource Negotiator.
  3. MapReduce: Programming model for processing large data sets.
  4. Common: Shared utilities and libraries.

Big Data Technology Architecture

The V’s of Big Data (volume, variety, velocity, and veracity) affect how data is collected, monitored, stored, analyzed, and reported.

Real-Time Computing in Hadoop

  • Real-time computing capabilities are enhanced by Apache Storm, which supports use cases such as real-time analysis and distributed RPC.