🌍 Hadoop Operating Modes
Overview of Hadoop Operating Modes
- Local Runtime Mode
- Pseudo-Distributed Operating Mode
- Fully Distributed Operating Mode
- High Availability (HA) Operating Mode
Local Runtime Mode
Definition
Local mode, also known as Standalone mode, runs Hadoop as a single Java process without any daemons and uses the local filesystem instead of HDFS. It is primarily used for development and testing.
Configuration
- Environment Requirements:
  - OS: Windows or Linux x64
  - JDK: jdk1.8.0_241
  - Hadoop: 3.x
- Configuration File:
  - hadoop-env.sh: Configure the JAVA_HOME environment variable.
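As a minimal sketch of this step, the snippet below appends a JAVA_HOME export to a hadoop-env.sh file. The JDK path and the scratch directory are assumptions made for illustration; on a real installation the file usually sits under $HADOOP_HOME/etc/hadoop, and the path should point at your own JDK.

```shell
# Scratch directory standing in for the real Hadoop configuration directory
# (assumption: usually $HADOOP_HOME/etc/hadoop on an actual install).
conf_dir="$(mktemp -d)"
env_file="$conf_dir/hadoop-env.sh"

# Hypothetical JDK location; replace it with the path of your own JDK.
java_home="/usr/lib/jvm/jdk1.8.0_241"

# Append the export so Hadoop's scripts can locate Java.
printf 'export JAVA_HOME=%s\n' "$java_home" >> "$env_file"

# Show the resulting line.
grep '^export JAVA_HOME=' "$env_file"
```

In local mode this is typically the only setting that has to be touched, since no daemons or site files are involved.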
Pseudo-Distributed Operating Mode
Definition
In Pseudo-Distributed mode, Hadoop runs on a single node, with each daemon running in its own Java process to simulate a fully distributed environment. This mode is useful for learning and experimentation.
Processes
The following five processes are active in this mode:
- NameNode: Manages the filesystem namespace.
- DataNode: Stores and serves data blocks.
- SecondaryNameNode: Assists the NameNode by periodically merging the EditLog into the FsImage (checkpointing); it is not a standby NameNode.
- ResourceManager: Manages resources in YARN.
- NodeManager: Manages and monitors resources on individual nodes.
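One way to confirm that all five daemons are up is to compare the output of the JDK's jps tool against the list above. The helper below is a sketch: the function name missing_daemons is made up for this example, and it only assumes that jps prints one "<pid> <ClassName>" pair per line.

```shell
# Daemons the text lists for pseudo-distributed mode.
EXPECTED_DAEMONS="NameNode DataNode SecondaryNameNode ResourceManager NodeManager"

# missing_daemons: reads jps-style output ("<pid> <ClassName>") on stdin and
# prints every expected daemon that does not appear in it.
missing_daemons() {
    running="$(awk '{ print $2 }')"
    for daemon in $EXPECTED_DAEMONS; do
        # -x matches whole lines, so NameNode does not match SecondaryNameNode.
        printf '%s\n' "$running" | grep -qx "$daemon" || printf '%s\n' "$daemon"
    done
}
```

On a machine with Hadoop running, `jps | missing_daemons` prints nothing when every daemon is present, and otherwise prints the names that are missing.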
Configuration
- HDFS Daemon Process:
  - Use start-dfs.sh to start HDFS services.
- YARN Daemon Process:
  - Use start-yarn.sh to start YARN services.
Configuration Files
| Configuration File | Key Setting | Value (Description) |
|---|---|---|
| core-site.xml | fs.defaultFS | hdfs://localhost:8020 (NameNode RPC address) |
| hdfs-site.xml | dfs.replication | 1 (number of replicas per data block) |
| mapred-site.xml | mapreduce.framework.name | yarn (run MapReduce on YARN) |
| yarn-site.xml | yarn.resourcemanager.hostname | localhost (ResourceManager host) |
| hadoop-env.sh | JAVA_HOME | Path to the Java installation on your system |
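The table can be turned into actual files with a few lines of shell. The sketch below writes each setting into its own *-site.xml inside a scratch directory; the helper name write_site_xml and the scratch directory are assumptions for illustration, and a real yarn-site.xml or mapred-site.xml normally carries additional properties beyond the single one shown here.

```shell
# write_site_xml FILE NAME VALUE: write a Hadoop site file holding one property.
write_site_xml() {
    cat > "$1" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>$2</name>
    <value>$3</value>
  </property>
</configuration>
EOF
}

# Scratch directory standing in for $HADOOP_HOME/etc/hadoop, so that running
# this sketch never overwrites a real configuration.
conf_dir="$(mktemp -d)"

write_site_xml "$conf_dir/core-site.xml"   fs.defaultFS                  hdfs://localhost:8020
write_site_xml "$conf_dir/hdfs-site.xml"   dfs.replication               1
write_site_xml "$conf_dir/mapred-site.xml" mapreduce.framework.name      yarn
write_site_xml "$conf_dir/yarn-site.xml"   yarn.resourcemanager.hostname localhost

ls "$conf_dir"
```

A replication factor of 1 only makes sense here because a single DataNode cannot hold more than one replica of a block.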
Fully Distributed Operating Mode
Definition
In Fully Distributed mode, Hadoop runs on a cluster of multiple server nodes, with each daemon running in its own Java process on the node assigned to it. This mode is used for experimental verification and for production (enterprise) deployments.
Processes
The same five processes as in pseudo-distributed mode are present, now spread across the cluster's nodes:
- NameNode
- DataNode
- SecondaryNameNode
- ResourceManager
- NodeManager
Configuration
| Configuration File | Key Setting | Value (Description) |
|---|---|---|
| core-site.xml | fs.defaultFS | hdfs://<Master_hostname>:8020 (master NameNode RPC address) |
| hdfs-site.xml | dfs.replication | 2 (number of replicas per data block) |
| mapred-site.xml | mapreduce.framework.name | yarn (run MapReduce on YARN) |
| yarn-site.xml | yarn.resourcemanager.hostname | <hostname> (ResourceManager host) |
| workers (named slaves in Hadoop 2.x) | DataNode addresses | Hostnames or IP addresses of the worker nodes, one per line |
| hadoop-env.sh | JAVA_HOME | Path to the Java installation on your system |
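The cluster-specific parts of this table are the master address and the list of worker nodes. The sketch below builds both in a scratch directory and prints, without running them, the copy commands that would push the configuration to each worker. All hostnames and the remote_dir path are hypothetical.

```shell
# Scratch directory standing in for $HADOOP_HOME/etc/hadoop.
conf_dir="$(mktemp -d)"

# Hypothetical hostnames, used only for illustration.
master_host="master"
worker_hosts="worker1 worker2"

# workers file: one worker hostname (or IP address) per line.
for host in $worker_hosts; do
    printf '%s\n' "$host"
done > "$conf_dir/workers"

# fs.defaultFS now points at the master instead of localhost.
default_fs="hdfs://$master_host:8020"
echo "fs.defaultFS = $default_fs"

# Print (rather than run) the copy command for each worker, so the sketch has
# no side effects; remote_dir is an assumed install location.
remote_dir="/opt/hadoop/etc/hadoop"
while read -r host; do
    printf 'scp -r %s/. %s:%s/\n' "$conf_dir" "$host" "$remote_dir"
done < "$conf_dir/workers"
```

Every node needs the same configuration files, which is why the copy step is repeated per worker.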
High Availability (HA) Operating Mode
Definition
The High Availability cluster was introduced in Hadoop 2.x to remove the NameNode single point of failure present in Hadoop 1.x.
Explanation
HDFS Architecture: Follows a Master/Slave topology with the NameNode as the master daemon.
Active/Passive Configuration:
- Two NameNodes:
  - Active NameNode: Handles all client requests.
  - Standby/Passive NameNode: Acts as a hot backup, ready to take over.
Functionality
- If the Active NameNode fails, the Standby NameNode can take over, thus reducing downtime.
- Both NameNodes are kept synchronized to maintain consistency:
  - Metadata Synchronization: Both NameNodes should have identical metadata for fast failover.
  - Single Active NameNode: Only one NameNode may be active at a time, to prevent the two from making conflicting changes.
🗂️ High Availability (HA) Architecture in HDFS
Split-Brain Scenario
“A split-brain scenario occurs when a cluster gets divided into smaller clusters, each one believing it is the only active cluster, which can lead to data corruption.”
Fencing
- Fencing is a process that ensures only one NameNode remains active at any given time to prevent split-brain scenarios.
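The "exactly one active NameNode" rule can be expressed as a small check over the states reported by the two NameNodes. The function below is a sketch with a made-up name (ha_health); it assumes the states are the strings "active" and "standby", which is what `hdfs haadmin -getServiceState <id>` prints.

```shell
# ha_health STATE...: classify a set of NameNode states.
#   exactly one "active"   -> ok
#   no "active" at all     -> no-active (the namespace is unavailable)
#   more than one "active" -> split-brain
ha_health() {
    active=0
    for state in "$@"; do
        if [ "$state" = "active" ]; then
            active=$((active + 1))
        fi
    done
    case "$active" in
        1) echo "ok" ;;
        0) echo "no-active" ;;
        *) echo "split-brain" ;;
    esac
}
```

On a live HA cluster it could be fed real values, for example `ha_health "$(hdfs haadmin -getServiceState nn1)" "$(hdfs haadmin -getServiceState nn2)"`, where nn1 and nn2 are hypothetical NameNode IDs taken from the cluster's own configuration.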
Implementation of HA Architecture
In HDFS HA Architecture, two NameNodes run at the same time (one active, one standby), and their metadata is kept synchronized through one of the following methods:
Using Quorum Journal Nodes
- JournalNodes are a group of nodes that facilitate synchronization between the active and standby NameNodes.
- The active NameNode writes its edits to the EditLogs on the JournalNodes, while the standby NameNode continuously reads these changes and applies them to its own namespace.
- During a failover, the standby NameNode makes sure it has read all the edits from the JournalNodes before becoming the active NameNode.
Architecture Illustration
| Component | Description |
|---|---|
| Active NameNode | Updates EditLogs in JournalNodes. |
| Standby NameNode | Reads changes from JournalNodes and applies them to its namespace. |
| JournalNodes | Facilitate synchronization and provide fault tolerance. |
| DataNodes | Send heartbeats and block reports (block location information) to both NameNodes for fast failover. |
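With the Quorum Journal Manager, the NameNodes find the JournalNodes through a qjournal:// URI set in dfs.namenode.shared.edits.dir, and an edit counts as written once a majority of JournalNodes has it. The sketch below builds such a URI and computes how many JournalNode failures a group of N can tolerate, (N - 1) / 2. The function names, hostnames, and nameservice name are assumptions; 8485 is the default JournalNode RPC port.

```shell
# qjournal_uri NAMESERVICE HOST...: build the value for
# dfs.namenode.shared.edits.dir from a list of JournalNode hosts.
qjournal_uri() {
    nameservice="$1"
    shift
    hosts=""
    for host in "$@"; do
        # Join host:port pairs with ";" (no leading separator on the first one).
        hosts="${hosts:+$hosts;}$host:8485"
    done
    printf 'qjournal://%s/%s\n' "$hosts" "$nameservice"
}

# tolerated_failures N: JournalNode failures that still leave a majority.
tolerated_failures() {
    echo $(( ($1 - 1) / 2 ))
}

qjournal_uri mycluster jn1 jn2 jn3
tolerated_failures 3
```

This is why JournalNodes are usually deployed in odd numbers of three or more: going from three to four nodes does not increase the number of failures that can be survived.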
Using Shared Storage
- The active NameNode logs modifications in its namespace to an EditLog in shared storage, which the standby NameNode reads and applies.
- In a failover situation, the standby updates its metadata from the shared storage before taking over as the active NameNode.
Fencing Mechanisms
- At least one fencing method must be configured to avoid split-brain scenarios. Options include:
  - Killing the NameNode’s process.
  - Revoking access to the shared storage directory.
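Fencing methods are listed in dfs.ha.fencing.methods in hdfs-site.xml, one per line, and are tried in order until one succeeds. The sketch below writes such a fragment to a scratch file. It uses the built-in sshfence method (which logs in to the previously active NameNode and kills its process) followed by a shell(...) fallback; the private key path is hypothetical, and revoking storage access would need a site-specific script in place of /bin/true.

```shell
# Scratch location; on a real cluster these properties go inside
# <configuration> in hdfs-site.xml.
conf_dir="$(mktemp -d)"
fragment="$conf_dir/fencing-fragment.xml"

cat > "$fragment" <<'EOF'
<!-- Methods are tried in order until one reports success. -->
<property>
  <name>dfs.ha.fencing.methods</name>
  <value>sshfence
shell(/bin/true)</value>
</property>
<!-- Key used by sshfence; this path is a hypothetical example. -->
<property>
  <name>dfs.ha.fencing.ssh.private-key-files</name>
  <value>/home/hadoop/.ssh/id_rsa</value>
</property>
EOF

cat "$fragment"
```

Note that shell(/bin/true) always reports success without fencing anything, so it is only a reasonable fallback where the storage layer itself allows a single writer, as the Quorum Journal Manager does.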
Hadoop Access Control
Hadoop access control operates at two levels: system level and scheduler level.
Service Level Authorization
- This system-level control determines which users may connect to which Hadoop services. It is checked before file permissions and queue permissions.
Privilege Management Level
- hadoop.security.authorization=true: Enables Service Level Authorization. If set to false, no service-level authorization checks are performed, so any user can connect to any service.
Service Level Authorization Properties
- There are nine configurable properties (defined in hadoop-policy.xml) specifying access rights for users or user groups.
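As a sketch of how the two pieces fit together, the snippet below writes the switch that goes into core-site.xml and one example ACL that goes into hadoop-policy.xml. The user and group names are hypothetical. An ACL value is a comma-separated list of users, a space, then a comma-separated list of groups; a value of * allows everyone.

```shell
# Scratch directory; on a real cluster the fragments belong inside
# <configuration> in core-site.xml and hadoop-policy.xml respectively.
conf_dir="$(mktemp -d)"

# Turn service level authorization on (core-site.xml).
cat > "$conf_dir/core-site-fragment.xml" <<'EOF'
<property>
  <name>hadoop.security.authorization</name>
  <value>true</value>
</property>
EOF

# Restrict which users and groups may talk to HDFS as clients
# (hadoop-policy.xml). "alice", "bob" and "hadoopusers" are made-up names.
cat > "$conf_dir/hadoop-policy-fragment.xml" <<'EOF'
<property>
  <name>security.client.protocol.acl</name>
  <value>alice,bob hadoopusers</value>
</property>
EOF

cat "$conf_dir/core-site-fragment.xml" "$conf_dir/hadoop-policy-fragment.xml"
```

After hadoop-policy.xml changes on a running cluster, `hdfs dfsadmin -refreshServiceAcl` asks the NameNode to reload the ACLs without a restart.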
Adding DataNodes
To address capacity issues, you can add a DataNode by following these steps:
| Step | Description |
|---|---|
| Add hostname | Edit the hosts file to add the new DataNode’s hostname and IP address. |
| Copy Hadoop installation | Use scp to copy the Hadoop installation files to the new DataNode. |
| Start new node | On the new node, execute hadoop-daemon.sh start datanode in the Hadoop sbin directory (in Hadoop 3.x, hdfs --daemon start datanode is the current equivalent). |
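The first step is easy to get wrong by adding the same host twice. The sketch below shows an idempotent way to add the entry, working on a scratch file instead of the real /etc/hosts; the function name, IP address, and hostname are hypothetical. The remaining two steps are shown as comments because they need a real cluster.

```shell
# Scratch file standing in for /etc/hosts.
hosts_file="$(mktemp)"

# add_host_entry FILE IP HOSTNAME: append "IP<TAB>HOSTNAME" unless a line
# whose second field is HOSTNAME already exists.
add_host_entry() {
    if ! awk -v h="$3" '$2 == h { found = 1 } END { exit !found }' "$1"; then
        printf '%s\t%s\n' "$2" "$3" >> "$1"
    fi
}

add_host_entry "$hosts_file" 192.168.1.13 datanode3
add_host_entry "$hosts_file" 192.168.1.13 datanode3   # second call is a no-op

cat "$hosts_file"

# Remaining steps from the table, to be run against a real cluster:
#   scp -r <hadoop install dir> datanode3:<same path>
#   hadoop-daemon.sh start datanode      (on the new node)
```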
Load Balancing
It is not advisable to remove a DataNode by simply stopping it with hadoop-daemon.sh stop datanode, as this can cause missing blocks in HDFS. Decommission the node first so that its blocks are re-replicated elsewhere.
Proper Procedure for Removing a DataNode
Edit hdfs-site.xml (the same entry can also be placed in core-site.xml, but dfs.hosts.exclude is an HDFS property and normally lives in hdfs-site.xml):

    <property>
      <name>dfs.hosts.exclude</name>
      <value>/opt/niit/hadoop/conf/exclude</value>
      <description>Names a file that contains a list of hosts not permitted to connect to the namenode.</description>
    </property>

Edit exclude file:
- Add nodes to be removed, one per line.
Dynamic refresh node:
- Execute sbin/refresh-namenodes.sh.
Achieving Load Balancing:
- Use the command sbin/start-balancer.sh -threshold 10. A threshold of 10 means each DataNode's disk utilization may differ from the cluster average by at most 10 percentage points.
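The exclude file itself is plain text with one host per line. The sketch below creates one in a scratch location and counts its entries; the hostnames are hypothetical, and the commands that need a running cluster are left as comments.

```shell
# Scratch file standing in for the path configured in dfs.hosts.exclude
# (/opt/niit/hadoop/conf/exclude in the example above).
exclude_file="$(mktemp)"

# Hosts to decommission (hypothetical names), one per line.
printf '%s\n' datanode3 datanode4 > "$exclude_file"

# On a live cluster the NameNode is then told to re-read the file, e.g.:
#   hdfs dfsadmin -refreshNodes
# and decommissioning progress can be followed in the output of:
#   hdfs dfsadmin -report

# Count how many hosts are queued for decommissioning (non-empty lines).
pending="$(grep -c . "$exclude_file")"
echo "hosts listed for decommissioning: $pending"
```

A node should only be shut down once the NameNode reports it as decommissioned, since that is the point at which its blocks have been copied to other nodes.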
Factors Considered During Balancing
- Only blocks that have not already been moved during the current balancing run are candidates.
- A block is moved only to a machine that does not already hold a replica of it.
- Block distribution across racks must remain unchanged.
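The balancer's stopping condition can be illustrated with a small check: a DataNode counts as balanced when its disk utilization is within the threshold of the cluster-wide average. The function below is a simplified sketch using whole-number percentages and a made-up name (within_threshold); the real balancer works with fractional utilization figures.

```shell
# within_threshold NODE_PCT CLUSTER_PCT THRESHOLD
# Succeeds (exit status 0) when |NODE_PCT - CLUSTER_PCT| <= THRESHOLD.
within_threshold() {
    diff=$(( $1 - $2 ))
    if [ "$diff" -lt 0 ]; then
        diff=$(( 0 - diff ))
    fi
    [ "$diff" -le "$3" ]
}

# Example: cluster average 50% used, threshold 10.
for node_pct in 42 55 75; do
    if within_threshold "$node_pct" 50 10; then
        echo "node at ${node_pct}%: balanced"
    else
        echo "node at ${node_pct}%: needs rebalancing"
    fi
done
```

A smaller threshold gives a more even cluster but makes the balancer move more data and run longer.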
Hadoop Ecosystem
The Hadoop Ecosystem is a platform that addresses big data challenges and includes various services for data ingestion, storage, analysis, and maintenance.
Core Components of Hadoop
- HDFS: Hadoop Distributed File System.
- YARN: Yet Another Resource Negotiator.
- MapReduce: Programming model for processing large data sets.
- Common: Shared utilities and libraries.
Big Data Technology Architecture
The V’s of Big Data (volume, variety, velocity, and veracity) shape how data is collected, monitored, stored, analyzed, and reported.
Real-Time Computing in Hadoop
- Real-time computing capabilities are provided by Apache Storm, which supports use cases such as real-time analytics and distributed RPC.