Hadoop Composition and Structure

🗂️ Introduction to Hadoop

“Hadoop is a distributed system infrastructure developed under the Apache Software Foundation and originally designed by Doug Cutting, inspired by the MapReduce programming model and the Google File System (GFS) published by Google.”

Core Architecture of Hadoop

The core architecture consists of:
- MapReduce programming model
- HDFS (Hadoop Distributed File System)

Learning Objectives

- Components of Hadoop
- Hadoop HDFS architecture
- Hadoop YARN architecture
- Hadoop MapReduce architecture

🖥️ Hadoop Architecture

Master-Slave Topology

Master Node: Assigns tasks and manages cluster resources.
Slave Nodes: Perform the actual computation and store the actual data.

Node Type	Function
Master Node	Assigns tasks, manages resources
Slave Nodes	Perform computations, store actual data

🔍 Hadoop Components

Hadoop HDFS: Stores data as blocks across the slave machines.
Hadoop YARN: Manages and schedules resources in the Hadoop cluster.
Hadoop MapReduce: Processes data in parallel across the cluster.

📂 Hadoop HDFS Architecture

Overview of HDFS

HDFS: The primary storage layer of the Hadoop ecosystem.
Function: High-throughput data access and horizontal scalability.

Core Components of HDFS

NameNode (Master Node):
- Holds the file system metadata (namespace and block locations) in RAM and persists the namespace to disk.
Secondary NameNode:
- Periodically merges the NameNode's edit log into its metadata snapshot (a checkpoint) and keeps a copy. It is a helper, not a standby NameNode.
DataNode (Slave Node):
- Stores actual data as blocks.

Component	Role
NameNode	Centralizes metadata storage
Secondary NameNode	Checkpoints NameNode metadata
DataNode	Stores data blocks
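
One way to picture how the NameNode's on-disk metadata and the Secondary NameNode fit together: the NameNode persists a snapshot of the namespace plus a log of later changes, and the Secondary NameNode periodically merges the two into a fresh checkpoint. The sketch below is a toy model in Python, not Hadoop code. The `checkpoint` function, the operation tuples, and the block IDs are hypothetical names invented for illustration.

```python
# Toy model of NameNode checkpointing (illustrative only; not Hadoop's real data structures).

def checkpoint(fsimage, edit_log):
    """Replay the logged operations on a copy of the snapshot and return the merged snapshot."""
    merged = dict(fsimage)  # maps path -> list of block IDs
    for op, path, blocks in edit_log:
        if op == "create":
            merged[path] = list(blocks)
        elif op == "delete":
            merged.pop(path, None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return merged


# Snapshot written at the last checkpoint.
fsimage = {"/logs/a.txt": ["blk_1", "blk_2"]}

# Changes recorded since that snapshot.
edit_log = [
    ("create", "/logs/b.txt", ["blk_3"]),
    ("delete", "/logs/a.txt", None),
]

new_fsimage = checkpoint(fsimage, edit_log)
print(new_fsimage)  # {'/logs/b.txt': ['blk_3']}
```

Because the merge runs on a separate node, the NameNode does not have to replay a long edit log by itself when it restarts.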

⚙️ Client Responsibilities

Client:

Requests services from the cluster on behalf of the user.
- File Splitting: Divides files into blocks when uploading them to HDFS.
- Interacts with NameNode: Gets file location information.
- Interacts with DataNodes: Reads or writes data.

🗃️ HDFS Blocks

Block Size: Default is 128 MB (in Hadoop 2.x and later).
Storage: Each file is divided into blocks, and the blocks are stored across the slave machines (DataNodes).
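
To make the block arithmetic concrete, here is a minimal sketch that splits a file size into block sizes using the 128 MB default. The function name `split_into_blocks` is hypothetical; the point it illustrates is that the last block only takes up as much space as the remaining data.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # the 128 MB default, in bytes


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes (in bytes) of the blocks needed for a file of `file_size` bytes."""
    if file_size < 0:
        raise ValueError("file_size must be non-negative")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)  # the final block is smaller than block_size
    return sizes


MB = 1024 * 1024
sizes = split_into_blocks(300 * MB)
print(len(sizes), [s // MB for s in sizes])  # 3 [128, 128, 44]
```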

🔄 Replication Management

Fault Tolerance in HDFS

Replication Factor: Default is 3; determines how many copies of each block are stored.
Rules for Replication:
- No two replicas of the same block are placed on the same DataNode.
- For rack-aware clusters, the replicas of a block should not all be on the same rack.

Scenario	Outcome
DataNode crashes	Data accessed from remaining replicas
All replicas on same rack	Avoided by the rack-aware placement policy when more than one rack is available
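
The two placement rules and the crash scenario can be checked with a few lines of Python. This is a simplified sketch, not the actual HDFS placement policy. The names `placement_ok`, `readable_after_failure`, and the `dn1`/`rack1` labels are assumptions made up for the example.

```python
def placement_ok(replica_nodes, rack_of):
    """Check one block's replica placement against the two rules above."""
    # Rule 1: no two replicas of the block on the same DataNode.
    if len(set(replica_nodes)) != len(replica_nodes):
        return False
    # Rule 2: if the cluster has several racks, replicas must not all share one rack.
    racks_in_cluster = set(rack_of.values())
    racks_used = {rack_of[node] for node in replica_nodes}
    if len(racks_in_cluster) > 1 and len(replica_nodes) > 1:
        return len(racks_used) > 1
    return True


def readable_after_failure(replica_nodes, failed_nodes):
    """A block stays readable as long as one replica is on a live DataNode."""
    return any(node not in failed_nodes for node in replica_nodes)


rack_of = {"dn1": "rack1", "dn2": "rack1", "dn3": "rack2", "dn4": "rack2"}

print(placement_ok(["dn1", "dn3", "dn4"], rack_of))  # True
print(placement_ok(["dn1", "dn1", "dn3"], rack_of))  # False: same DataNode twice
print(placement_ok(["dn1", "dn2"], rack_of))         # False: both on rack1
print(readable_after_failure(["dn1", "dn3", "dn4"], {"dn1"}))  # True
```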

✅ Advantages and Disadvantages of HDFS

Advantages

Fault-Tolerant: Multiple copies ensure data availability.
Handles Big Data: Suitable for gigabytes to petabytes of data.
Streaming Data Access: Write-once, read-many access with consistently high throughput.
Cost-Effective: Can be built on low-cost commodity machines.

Disadvantages

No Low-Latency Access: Not suitable for millisecond-level access.
Small Files Storage: Inefficient for many small files, because the NameNode keeps metadata for every file and block in memory.
Concurrent Writing Limitations: Only one writer per file at a time; data can be appended but not modified at random positions.
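
A back-of-the-envelope estimate shows why small files are costly. The sketch assumes roughly 150 bytes of NameNode memory per namespace object (one per file plus one per block), which is a commonly quoted rule of thumb and not an exact figure. It also assumes the 128 MB block size. The function names are hypothetical.

```python
BYTES_PER_OBJECT = 150           # assumed rule of thumb, not a measured value
BLOCK_SIZE = 128 * 1024 ** 2     # 128 MB


def namenode_objects(num_files, file_size):
    """Count namespace objects: one per file plus one per block of each file."""
    blocks_per_file = max(1, -(-file_size // BLOCK_SIZE))  # ceiling division
    return num_files * (1 + blocks_per_file)


def estimated_heap_bytes(num_files, file_size):
    return namenode_objects(num_files, file_size) * BYTES_PER_OBJECT


MB = 1024 ** 2
# The same 1 TB of data stored two ways.
large = estimated_heap_bytes(8_192, 128 * MB)    # 8,192 files of 128 MB
small = estimated_heap_bytes(1_048_576, 1 * MB)  # 1,048,576 files of 1 MB

print(f"large files: {large / MB:.2f} MB of NameNode memory")
print(f"small files: {small / MB:.2f} MB of NameNode memory")
print(f"ratio: {small // large}x")  # ratio: 128x
```

Under these assumptions the same amount of data needs about 128 times more NameNode memory when it is stored as 1 MB files.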

🔄 Hadoop YARN Overview

Introduction to YARN

YARN (Yet Another Resource Negotiator): Resource management layer introduced in Hadoop 2.0.
Function: Manages resource allocation and job scheduling.

Transition from Hadoop v1.0 to v2.0

In v1.0 (MapReduce Version 1):
- Job Tracker managed both processing and resource management.
- Scalability bottleneck due to a single Job Tracker.
In v2.0:
- YARN separates resource management from MapReduce processing.
- Enables non-MapReduce jobs to run within the Hadoop framework.

Version	Components	Notes
v1.0	Job Tracker, Task Trackers	Scalability bottleneck
v2.0	ResourceManager, NodeManager, ApplicationMaster	Improved resource and job handling

🗂️ YARN Architecture

Overview of YARN

YARN (Yet Another Resource Negotiator) is designed to separate resource management and job scheduling/monitoring functions into distinct daemons. This architecture enhances resource utilization and application performance.

Core Components of YARN

Component	Description
NodeManager (NM)	Runs on each slave node, monitors the resources its containers use (CPU, memory, disk, and network), and sends heartbeats to the ResourceManager.
ApplicationMaster (AM)	Manages the resource needs of a single application, negotiates with the Scheduler for the necessary resources, and works with NodeManagers to execute and monitor tasks.
Container	A bundle of resources such as RAM, CPU, and network bandwidth on a single node. The Scheduler grants containers, giving an application the right to use a specific amount of those resources.
ResourceManager (RM)	Manages resource allocation across the cluster and tracks resource availability. It consists of two main components: Scheduler and Application Manager.

ResourceManager Components

Scheduler: Allocates resources to the running applications based on their requirements. It does not monitor application status and does not restart tasks that fail.
Application Manager: Accepts job submissions from clients, negotiates the first container for each application's ApplicationMaster, and restarts the ApplicationMaster if it fails. It is a part of the ResourceManager and should not be confused with the per-application ApplicationMaster.

Application Submission Steps in Hadoop YARN

1. Submit the job.
2. Retrieve the Application ID.
3. Create the Application Submission Context.
4. a) Start the container launch. b) Launch the ApplicationMaster.
5. Allocate resources.
6. a) Create the container. b) Launch the container.
7. Execute the application code.

Application Workflow in Hadoop YARN

The steps involved in the application workflow are as follows:

1. Client submits an application.
2. ResourceManager allocates a container to start the ApplicationMaster.
3. ApplicationMaster registers with ResourceManager.
4. ApplicationMaster requests containers from ResourceManager.
5. ApplicationMaster instructs NodeManager to launch containers.
6. Application code is executed within the container.
7. Client monitors application status via ResourceManager/ApplicationMaster.
8. ApplicationMaster unregisters from ResourceManager.
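
The workflow above can be traced with a small simulation. This is a toy model under stated assumptions, not the YARN API: the `ResourceManager` class, its methods, and `run_application` are hypothetical names, containers are counted as plain integers, and client-side status monitoring is left out.

```python
# Toy walk-through of the YARN application workflow (illustrative; not the YARN API).

class ResourceManager:
    def __init__(self, total_containers):
        self.free = total_containers  # containers not currently in use
        self.registered = set()       # IDs of applications with a registered master
        self._next_id = 1

    def submit(self):
        """Accept an application and reserve one container for its master process."""
        if self.allocate(1) == 0:
            raise RuntimeError("no container available to start the application")
        app_id = f"application_{self._next_id:04d}"
        self._next_id += 1
        return app_id

    def allocate(self, wanted):
        """Grant up to `wanted` containers, limited by what is free."""
        granted = min(wanted, self.free)
        self.free -= granted
        return granted

    def release(self, count):
        self.free += count


def run_application(rm, tasks, work):
    events = []
    app_id = rm.submit()                     # client submits; master container is allocated
    events.append("submitted")
    rm.registered.add(app_id)                # master registers with the ResourceManager
    events.append("registered")
    results, pending = [], list(tasks)
    while pending:
        granted = rm.allocate(len(pending))  # master requests containers
        if granted == 0:
            raise RuntimeError("cluster has no free containers")
        batch, pending = pending[:granted], pending[granted:]
        results.extend(work(task) for task in batch)  # code runs inside the containers
        rm.release(granted)
        events.append(f"ran {len(batch)} task(s)")
    rm.registered.discard(app_id)            # master unregisters
    rm.release(1)                            # master's own container is returned
    events.append("unregistered")
    return results, events


rm = ResourceManager(total_containers=3)
results, events = run_application(rm, ["a", "b", "c"], str.upper)
print(results)  # ['A', 'B', 'C']
print(events)   # ['submitted', 'registered', 'ran 2 task(s)', 'ran 1 task(s)', 'unregistered']
```

With three containers in total, one is taken by the application's master, so the three tasks run in two rounds.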

🗃️ MapReduce Framework

MapReduce is the processing layer of Hadoop, designed for processing large volumes of data in parallel by dividing tasks into independent units.

Traditional Parallel Processing Challenges

Critical Path Problem: Delays in any machine affect the entire job.
Reliability Issues: Machine failures during processing can result in data loss.
Equal Split Issue: Difficulty in dividing data into equal parts for distribution across machines.
Single Point of Failure: If one machine fails to provide output, overall results become unattainable.
Result Aggregation: Need for a method to combine results from multiple machines.

Advantages of MapReduce

Efficient Data Processing: Processes data in parallel across many machines.
Fault Tolerance: Failed tasks are re-executed on other nodes, so one machine failure does not fail the whole job.

MapReduce Structure

Phase	Description
Mapper	Processes data blocks across nodes, accepting key-value pairs as input and producing intermediate key-value pairs as output.
Shuffle and Sort	Transfers the mappers' output to the reducers, sorts it by key, and groups all values that share the same key.
Reducer	Receives each key with its grouped values, aggregates the intermediate data into fewer tuples, and writes the final output.

Example of MapReduce Execution

To count word occurrences in text files:

Map Phase: Each mapper processes a data block and emits a key-value pair (word, 1) for every word it reads.
Shuffle and Sort Phase: Groups the pairs by word, so all counts for the same word arrive at the same reducer.
Reduce Phase: Sums the counts for each word to produce the final output.
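
The three phases can be imitated in plain Python on a single machine. This sketch only mirrors the data flow; it does not use the Hadoop API, and the function names and sample text are made up for illustration.

```python
from collections import defaultdict


def map_phase(block):
    """Emit a (word, 1) pair for every word in one block of text."""
    return [(word.lower(), 1) for word in block.split()]


def shuffle_and_sort(pairs):
    """Group values by key and return the groups sorted by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())


def reduce_phase(key, values):
    """Sum the counts collected for one word."""
    return key, sum(values)


# Each string stands in for one data block handled by a separate mapper.
blocks = ["deer bear river", "car car river", "deer car bear"]

intermediate = [pair for block in blocks for pair in map_phase(block)]
grouped = shuffle_and_sort(intermediate)
counts = dict(reduce_phase(key, values) for key, values in grouped)

print(grouped[1])  # ('car', [1, 1, 1])
print(counts)      # {'bear': 2, 'car': 3, 'deer': 2, 'river': 2}
```

In a real cluster the mappers and reducers run on different nodes, and the shuffle moves the intermediate pairs across the network.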

🧩 Components of Hadoop

Key Modules of Hadoop

Hadoop 1.x: Consists of HDFS and MapReduce modules, with MapReduce also handling resource management.
Hadoop 2.x: Includes Common, HDFS, MapReduce, and YARN modules; moving resource management into YARN improves scalability and resource utilization.

HDFS Architecture Key Points

NameNode
DataNode
Secondary NameNode

YARN Architecture Key Points

ResourceManager
NodeManager
ApplicationMaster
Container

MapReduce Architecture Key Points

Mapper
Shuffle and Sort
Reducer