Hadoop Composition and Structure
Hadoop 的组成和结构
🗂️ Introduction to Hadoop
🗂️ Hadoop 简介
“Hadoop is a distributed system infrastructure developed by the Apache Software Foundation and originally designed by Doug Cutting, inspired by the MapReduce and Google File System (GFS) papers published by Google.”
“Hadoop 是一个由 Apache 软件基金会开发的分布式系统基础设施,最初由 Doug Cutting 设计,其灵感来源于 Google 发表的 MapReduce 和 Google 文件系统 (GFS) 论文。”
Core Architecture of Hadoop
Hadoop 的核心架构
- The core architecture consists of:
- MapReduce programming model
- HDFS (Hadoop Distributed File System)
- 核心架构包括:
- MapReduce 编程模型
- HDFS (Hadoop 分布式文件系统)
Learning Objectives
学习目标
- Components of Hadoop
- Hadoop 的组件
- Hadoop HDFS architecture
- Hadoop HDFS 架构
- Hadoop YARN architecture
- Hadoop YARN 架构
- Hadoop MapReduce architecture
- Hadoop MapReduce 架构
🖥️ Hadoop Architecture
🖥️ Hadoop 架构
Master-Slave Topology
主从拓扑结构
- Master Node: Assigns tasks and manages resources.
- Slave Nodes: Perform actual computing and store real data.
- 主节点 (Master Node): 分配任务并管理资源。
- 从节点 (Slave Nodes): 执行实际计算并存储真实数据。
| Node Type | Function |
|---|---|
| Master Node | Assigns tasks, manages resources |
| Slave Nodes | Perform computations, store actual data |
| 节点类型 | 功能 |
|---|---|
| 主节点 (Master Node) | 分配任务,管理资源 |
| 从节点 (Slave Nodes) | 执行计算,存储实际数据 |
🔍 Hadoop Components
🔍 Hadoop 组件
- Hadoop HDFS: Stores data across slave machines.
- Hadoop HDFS: 在从机器上存储数据。
- Hadoop YARN: Manages resources in the Hadoop cluster.
- Hadoop YARN: 管理 Hadoop 集群中的资源。
- Hadoop MapReduce: Processes data in a distributed fashion.
- Hadoop MapReduce: 以分布式方式处理数据。
📂 Hadoop HDFS Architecture
📂 Hadoop HDFS 架构
Overview of HDFS
HDFS 概述
- HDFS: The primary storage layer of the Hadoop ecosystem.
- Function: High-throughput access to large datasets, with capacity that scales by adding machines.
- HDFS: Hadoop 生态系统中的主要存储层。
- 功能: 对大型数据集的高吞吐量访问,并可通过增加机器来扩展容量。
Core Components of HDFS
HDFS 的核心组件
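- NameNode (Master Node):
  - Keeps file system metadata in RAM for fast lookups and persists the namespace to disk as a snapshot (fsimage) plus an edit log.
- Secondary NameNode:
  - Periodically merges the NameNode's edit log into the fsimage (a checkpoint) and keeps a copy of the result. It is a helper, not a hot standby for the NameNode.
- DataNode (Slave Node):
  - Stores actual data as blocks.
- 名称节点 (NameNode) (主节点):
  - 将文件系统元数据保存在 RAM 中以便快速查找,并以快照 (fsimage) 加编辑日志 (edit log) 的形式将命名空间持久化到磁盘。
- 第二名称节点 (Secondary NameNode):
  - 定期将名称节点的编辑日志合并到 fsimage 中 (即检查点),并保存结果的副本。它是辅助节点,而不是名称节点的热备。
- 数据节点 (DataNode) (从节点):
  - 以块的形式存储实际数据。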
| Component | Role |
|---|---|
| NameNode | Centralizes metadata storage |
| Secondary NameNode | Periodically checkpoints metadata (merges the edit log into the fsimage); not a standby |
| DataNode | Stores data blocks |
| 组件 | 角色 |
|---|---|
| 名称节点 (NameNode) | 集中存储元数据 |
| 第二名称节点 (Secondary NameNode) | 定期对元数据做检查点 (将编辑日志合并到 fsimage);不是备用节点 |
| 数据节点 (DataNode) | 存储数据块 |
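The copies of metadata held by the Secondary NameNode come from periodic checkpoints, in which the logged changes are folded into the last snapshot. The following is a minimal sketch of that merge, assuming a toy namespace stored as a dictionary; the function names, the tuple layout of an edit, and the sample paths are all hypothetical and do not reflect the real on-disk formats.

```python
# Toy model of NameNode metadata: a snapshot (fsimage) plus an append-only
# edit log. All names and data layouts here are illustrative assumptions.

def apply_edit(namespace, edit):
    """Apply one logged operation to a namespace dict (path -> block ids)."""
    op, path, blocks = edit
    if op == "create":
        namespace[path] = list(blocks)
    elif op == "delete":
        namespace.pop(path, None)
    return namespace


def checkpoint(fsimage, edit_log):
    """Merge the edit log into a copy of the fsimage.

    Returns the new fsimage and an empty edit log.
    """
    new_image = {path: list(blocks) for path, blocks in fsimage.items()}
    for edit in edit_log:
        apply_edit(new_image, edit)
    return new_image, []


fsimage = {"/logs/a.txt": ["blk_1"]}
edit_log = [
    ("create", "/logs/b.txt", ["blk_2", "blk_3"]),
    ("delete", "/logs/a.txt", []),
]
fsimage, edit_log = checkpoint(fsimage, edit_log)
print(fsimage)   # {'/logs/b.txt': ['blk_2', 'blk_3']}
print(edit_log)  # []
```

Keeping the edit log short in this way is what lets a restarting NameNode rebuild its in-memory state quickly.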
⚙️ Client Responsibilities
⚙️ 客户端职责
Client:
Acts as the entry point through which users and applications request HDFS services.
- File Splitting: Uploads files to HDFS by dividing them into blocks.
- Interacts with NameNode: Gets file location information.
- Interacts with DataNodes: Reads or writes data.
客户端 (Client):
充当用户和应用程序请求 HDFS 服务的入口。
- 文件切分 (File Splitting): 通过将文件分成块上传到 HDFS。
- 与名称节点交互 (Interacts with NameNode): 获取文件位置信息。
- 与数据节点交互 (Interacts with DataNodes): 读取或写入数据。
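The division of labour on a read can be sketched in a few lines: the client asks the NameNode only for locations, then pulls the bytes from DataNodes directly. This is an in-memory illustration, not the HDFS client API; the classes, method names, and the sample file are hypothetical.

```python
class NameNode:
    """Holds only metadata: which blocks make up a file and where they live."""

    def __init__(self):
        self.block_map = {}  # path -> list of (block_id, [datanode names])

    def get_block_locations(self, path):
        return self.block_map[path]


class DataNode:
    """Holds the actual block contents."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block_id -> bytes

    def read_block(self, block_id):
        return self.blocks[block_id]


def read_file(path, namenode, datanodes):
    """Client-side read: look up block locations, then fetch each block."""
    data = b""
    for block_id, locations in namenode.get_block_locations(path):
        # Simplification: always use the first listed replica.
        data += datanodes[locations[0]].read_block(block_id)
    return data


nn = NameNode()
dn1, dn2 = DataNode("dn1"), DataNode("dn2")
dn1.blocks["blk_1"] = b"hello "
dn2.blocks["blk_2"] = b"world"
nn.block_map["/greeting.txt"] = [("blk_1", ["dn1"]), ("blk_2", ["dn2"])]

print(read_file("/greeting.txt", nn, {"dn1": dn1, "dn2": dn2}))  # b'hello world'
```

Note that no file data passes through the NameNode, which is why a single metadata server can front a large cluster.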
🗃️ HDFS Blocks
🗃️ HDFS 数据块
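- Block Size: Default is 128 MB in Hadoop 2.x and later (64 MB in 1.x); configurable through the dfs.blocksize property.
- Storage: Data divided into blocks and stored across slave machines.
- 数据块大小 (Block Size): 在 Hadoop 2.x 及更高版本中默认为 128 MB (1.x 中为 64 MB);可通过 dfs.blocksize 属性配置。
- 存储 (Storage): 数据被分成块并存储在从机器上。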
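As a quick worked example of the block arithmetic, the sketch below splits a file size into block sizes using the 128 MB default. The function name and the 300 MB sample file are assumptions made for illustration.

```python
MB = 1024 * 1024
BLOCK_SIZE = 128 * MB  # default block size stated above


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the size in bytes of each block for a file of file_size bytes."""
    if file_size < 0:
        raise ValueError("file_size must be non-negative")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block holds only the leftover bytes
    return sizes


sizes = split_into_blocks(300 * MB)
print([size // MB for size in sizes])  # [128, 128, 44]
```

A 300 MB file therefore occupies three blocks, and the final block takes up 44 MB rather than a full 128 MB.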
🔄 Replication Management
🔄 复制管理
Fault Tolerance in HDFS
HDFS 中的容错
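- Replication Factor: Default is 3; determines how many copies of each block are stored.
- Rules for Replication:
  - No two replicas of the same block on the same DataNode.
  - For rack-aware clusters, the replicas of a block must not all sit on one rack; the default policy keeps one replica on one rack and the other two on different nodes of a second rack.
- 复制因子 (Replication Factor): 默认为 3;决定每个数据块存储多少个副本。
- 复制规则 (Rules for Replication):
  - 同一个数据节点上不能有同一数据块的两个副本。
  - 对于机架感知的集群,一个数据块的所有副本不能都位于同一机架上;默认策略将一个副本放在一个机架上,另外两个副本放在第二个机架的不同节点上。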
| Scenario | Outcome |
|---|---|
| DataNode crashes | Data is read from the remaining replicas |
| All replicas on same rack | Avoided by the default rack-aware placement policy, so losing one rack does not lose the block |
| 场景 | 结果 |
|---|---|
| 数据节点崩溃 (DataNode crashes) | 从剩余副本读取数据 |
| 所有副本在同一机架上 (All replicas on same rack) | 默认的机架感知放置策略会避免这种情况,因此丢失一个机架不会丢失数据块 |
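The placement rules can be made concrete with a small sketch for a replication factor of 3. It assumes the commonly described default policy (first replica on the writer's node, second on a node in a different rack, third on another node in that same remote rack) and picks nodes deterministically, whereas a real cluster chooses among eligible nodes. The function names and the two-rack topology are hypothetical.

```python
def rack_lookup(topology):
    """Build a node -> rack mapping from a rack -> [nodes] topology."""
    return {node: rack for rack, nodes in topology.items() for node in nodes}


def place_replicas(writer, topology):
    """Pick three DataNodes for one block (simplified, deterministic).

    Raises StopIteration if no other rack has at least two nodes.
    """
    local_rack = rack_lookup(topology)[writer]
    remote_rack = next(
        rack for rack in sorted(topology)
        if rack != local_rack and len(topology[rack]) >= 2
    )
    second, third = sorted(topology[remote_rack])[:2]
    return [writer, second, third]


def satisfies_rules(replicas, topology):
    """Check both rules: distinct DataNodes, and more than one rack used."""
    rack_of = rack_lookup(topology)
    distinct_nodes = len(set(replicas)) == len(replicas)
    several_racks = len({rack_of[node] for node in replicas}) > 1
    return distinct_nodes and several_racks


topology = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"]}
replicas = place_replicas("dn1", topology)
print(replicas)                                            # ['dn1', 'dn3', 'dn4']
print(satisfies_rules(replicas, topology))                 # True
print(satisfies_rules(["dn1", "dn2", "dn1"], topology))    # False
```

With this layout the block survives the loss of any single DataNode and of either rack.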
✅ Advantages and Disadvantages of HDFS
✅ HDFS 的优缺点
Advantages
优点
- Fault-Tolerant: Multiple copies ensure data availability.
- Handles Big Data: Suitable for gigabytes to petabytes of data.
- Streaming Data Access: Built for write-once, read-many workloads that read large files sequentially at high throughput.
- Cost-Effective: Can be built on low-cost machines.
- 容错性 (Fault-Tolerant): 多个副本确保数据可用性。
- 处理大数据 (Handles Big Data): 适用于从 GB 到 PB 的数据。
- 流式数据访问 (Streaming Data Access): 面向一次写入、多次读取的工作负载,以高吞吐量顺序读取大文件。
- 成本效益高 (Cost-Effective): 可以在低成本机器上构建。
Disadvantages
缺点
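- Low-Latency Access: Not suitable for millisecond-level access; HDFS is tuned for throughput rather than latency.
- Small Files Storage: Inefficient for many small files, because the NameNode keeps metadata for every file and block in memory.
- Concurrent Writing Limitations: A file can have only one writer at a time, and data can only be appended; random modifications are not supported.
- 低延迟访问 (Low-Latency Access): 不适合毫秒级访问;HDFS 针对吞吐量而非延迟进行优化。
- 小文件存储 (Small Files Storage): 不适合存储大量小文件,因为名称节点需要在内存中保存每个文件和数据块的元数据。
- 并发写入限制 (Concurrent Writing Limitations): 一个文件同一时间只能有一个写入者,且数据只能追加;不支持随机修改。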
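The small-files cost is easiest to see with rough arithmetic. The sketch below assumes a frequently quoted rule of thumb of about 150 bytes of NameNode memory per file or block object; that figure is an assumption for illustration, not a measured value, and the file counts are made up.

```python
MB = 1024 * 1024
BLOCK_SIZE = 128 * MB
BYTES_PER_OBJECT = 150  # assumed rule of thumb, not a measured value


def namenode_memory(num_files, file_size):
    """Estimate NameNode memory in bytes for num_files files of file_size bytes."""
    blocks_per_file = max(1, -(-file_size // BLOCK_SIZE))  # ceiling division
    objects = num_files * (1 + blocks_per_file)  # one file entry plus its blocks
    return objects * BYTES_PER_OBJECT


# The same total volume of data stored two ways.
small = namenode_memory(10_000_000, 1 * MB)            # ten million 1 MB files
large = namenode_memory(10_000_000 // 128, 128 * MB)   # 78,125 files of 128 MB

print(small)           # 3000000000
print(large)           # 23437500
print(small // large)  # 128
```

Under these assumptions the small-file layout needs roughly 3 GB of NameNode memory, against about 23 MB when the same data is packed into block-sized files.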
🔄 Hadoop YARN Overview
🔄 Hadoop YARN 概述
Introduction to YARN
YARN 简介
- YARN (Yet Another Resource Negotiator): Resource management layer introduced in Hadoop 2.0.
- Function: Manages resource allocation and job scheduling.
- YARN (另一种资源协调者): Hadoop 2.0 中引入的资源管理层。
- 功能: 管理资源分配和作业调度。
Transition from Hadoop v1.0 to v2.0
从 Hadoop v1.0 到 v2.0 的过渡
- In v1.0 (MapReduce Version 1):
- Job Tracker managed both processing and resource management.
- Scalability bottleneck due to a single Job Tracker.
- 在 v1.0 (MapReduce 版本 1) 中:
- Job Tracker 同时管理处理和资源管理。
- 由于单个 Job Tracker 导致可扩展性瓶颈。
- In v2.0:
- YARN separates resource management from MapReduce processing.
- Enables non-MapReduce jobs to run within the Hadoop framework.
- 在 v2.0 中:
- YARN 将资源管理与 MapReduce 处理分离。
- 使非 MapReduce 作业能够在 Hadoop 框架内运行。
| Version | Components | Scalability |
|---|---|---|
| v1.0 | Job Tracker, Task Trackers | Bottleneck: a single Job Tracker handles both processing and resource management |
| v2.0 | ResourceManager, NodeManager, ApplicationMaster | Improved: resource management is separated from per-application job handling |
| 版本 | 组件 | 可扩展性 |
|---|---|---|
| v1.0 | Job Tracker, Task Trackers | 瓶颈:单个 Job Tracker 同时负责处理和资源管理 |
| v2.0 | ResourceManager, NodeManager, ApplicationMaster | 改进:资源管理与各应用程序的作业处理相分离 |
🗂️ YARN Architecture
🗂️ YARN 架构
Overview of YARN
YARN 概述
YARN (Yet Another Resource Negotiator) is designed to separate resource management and job scheduling/monitoring functions into distinct daemons. This architecture enhances resource utilization and application performance.
YARN (另一种资源协调者) 旨在将资源管理和作业调度/监控功能分离到不同的守护进程中。这种架构提高了资源利用率和应用程序性能。
Core Components of YARN
YARN 的核心组件
| Component | Description |
|---|---|
| NodeManager (NM) | Monitors resource usage by containers and sends heartbeats to the ResourceManager. Resources include CPU, memory, disk, and network. |
| ApplicationMaster (AM) | Manages the resource needs of individual applications and interacts with the scheduler to acquire the necessary resources. Connects with NodeManager to execute and monitor tasks. |
| Container | A bundle of resources such as RAM, CPU, and network bandwidth on a single node. The scheduler grants containers, and each one gives an application the right to use a specific amount of resources. |
| ResourceManager (RM) | Manages resource allocation across the cluster and tracks resource availability. It consists of two main components: Scheduler and ApplicationsManager. |
| 组件 | 描述 |
|---|---|
| 节点管理器 (NodeManager, NM) | 监控容器的资源使用情况,并向 ResourceManager 发送心跳。资源包括 CPU、内存、磁盘和网络。 |
| 应用程序主管 (ApplicationMaster, AM) | 管理单个应用程序的资源需求,并与调度器交互以获取必要的资源。连接 NodeManager 以执行和监控任务。 |
| 容器 (Container) | 单个节点上的一组资源,如 RAM、CPU 和网络带宽。容器由调度器授予,每个容器赋予应用程序使用特定资源量的权限。 |
| 资源管理器 (ResourceManager, RM) | 管理整个集群的资源分配并跟踪资源可用性。它由两个主要组件组成:调度器 (Scheduler) 和 应用程序管理器 (ApplicationsManager)。 |
ResourceManager Components
ResourceManager 组件
- Scheduler: Allocates resources to the running applications based on their requirements. It is a pure scheduler: it does not monitor application status, and it does not restart tasks that fail because of application or hardware errors.
- ApplicationsManager: Accepts job submissions from clients, negotiates the first container for each application's ApplicationMaster, and restarts the ApplicationMaster if it fails.
- 调度器 (Scheduler):根据正在运行的应用程序的需求为其分配资源。它是纯粹的调度器:不监控应用程序状态,也不会重启因应用程序或硬件错误而失败的任务。
- 应用程序管理器 (ApplicationsManager):接受来自客户端的作业提交,为每个应用程序的 ApplicationMaster 协商第一个容器,并在 ApplicationMaster 发生故障时将其重启。
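To illustrate what "allocating resources based on requirements" means, here is a minimal sketch of a first-in, first-out allocation loop. It models only memory, treats each request as one container, and uses hypothetical node names and sizes; it is not YARN's scheduler code, only an example of one simple policy.

```python
from collections import deque


def schedule_fifo(requests, free_memory):
    """Grant container requests strictly in arrival order.

    requests: list of (app_id, memory_mb) in submission order.
    free_memory: dict of node -> free memory in MB (updated in place).
    Returns (granted, pending), where granted holds (app_id, node, memory_mb).
    """
    queue = deque(requests)
    granted = []
    while queue:
        app_id, memory_mb = queue[0]
        node = next(
            (n for n in sorted(free_memory) if free_memory[n] >= memory_mb),
            None,
        )
        if node is None:
            break  # strict FIFO: the head request blocks everything behind it
        free_memory[node] -= memory_mb
        granted.append((app_id, node, memory_mb))
        queue.popleft()
    return granted, list(queue)


free = {"node1": 4096, "node2": 2048}
requests = [("app1", 3072), ("app2", 2048), ("app3", 2048), ("app4", 512)]
granted, pending = schedule_fifo(requests, free)

print(granted)  # [('app1', 'node1', 3072), ('app2', 'node2', 2048)]
print(pending)  # [('app3', 2048), ('app4', 512)]
print(free)     # {'node1': 1024, 'node2': 0}
```

The function only hands out capacity; it never checks whether a granted container later succeeds, which mirrors the scheduler's limited role. The small `app4` request waiting behind `app3` shows why policies other than plain FIFO exist.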
Application Submission Steps in Hadoop YARN
Hadoop YARN 中的应用程序提交流程
- Step 1: Submit the job.
- 步骤 1: 提交作业。
- Step 2: Retrieve the Application ID.
- 步骤 2: 检索应用程序 ID。
- Step 3: Create the Application Submission Context.
- 步骤 3: 创建应用程序提交上下文。
- Step 4: a) Start the container launch; b) launch the ApplicationMaster.
- 步骤 4: a) 开始启动容器; b) 启动应用程序主管 (ApplicationMaster)。
- Step 5: Allocate resources.
- 步骤 5: 分配资源。
- Step 6: a) Create the container; b) launch the container.
- 步骤 6: a) 创建容器; b) 启动容器。
- Step 7: Execute the application code.
- 步骤 7: 执行应用程序代码。
Application Workflow in Hadoop YARN
Hadoop YARN 中的应用程序工作流
The steps involved in the application workflow are as follows:
应用程序工作流中涉及的步骤如下:
- Client submits an application.
- 客户端提交应用程序。
- ResourceManager allocates a container to start the ApplicationMaster.
- ResourceManager 分配一个容器以启动 ApplicationMaster。
- ApplicationMaster registers with ResourceManager.
- ApplicationMaster 向 ResourceManager 注册。
- ApplicationMaster requests containers from ResourceManager.
- ApplicationMaster 向 ResourceManager 请求容器。
- ApplicationMaster instructs NodeManager to launch containers.
- ApplicationMaster 指示 NodeManager 启动容器。
- Application code is executed within the container.
- 应用程序代码在容器内执行。
- Client monitors application status via ResourceManager/ApplicationMaster.
- 客户端通过 ResourceManager/ApplicationMaster 监控应用程序状态。
- ApplicationMaster unregisters from ResourceManager.
- ApplicationMaster 从 ResourceManager 注销。
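The order of these interactions can be traced with a small in-memory simulation, in which the per-application master (the ApplicationMaster) drives the middle steps. The classes below only append to a log and run plain Python callables as "tasks"; their method names are hypothetical and are not the YARN client API. The client's status polling is left out for brevity.

```python
class ResourceManager:
    def __init__(self, log):
        self.log = log
        self.running = set()
        self.next_container = 1

    def submit(self, app_id):
        # The client submits; the RM starts an ApplicationMaster for the app.
        self.log.append("client submits " + app_id)
        self.log.append("RM allocates container for AM")
        return ApplicationMaster(app_id, self)

    def register(self, app_id):
        self.running.add(app_id)
        self.log.append("AM registers with RM")

    def allocate(self, count):
        self.log.append("AM requests %d containers" % count)
        ids = list(range(self.next_container, self.next_container + count))
        self.next_container += count
        return ids

    def unregister(self, app_id):
        self.running.discard(app_id)
        self.log.append("AM unregisters from RM")


class NodeManager:
    def __init__(self, log):
        self.log = log

    def launch(self, container_id, task):
        self.log.append("NM launches container %d" % container_id)
        return task()  # application code runs inside the container


class ApplicationMaster:
    def __init__(self, app_id, rm):
        self.app_id = app_id
        self.rm = rm

    def run(self, tasks, node_manager):
        self.rm.register(self.app_id)
        containers = self.rm.allocate(len(tasks))
        results = [node_manager.launch(c, t) for c, t in zip(containers, tasks)]
        self.rm.unregister(self.app_id)
        return results


log = []
rm = ResourceManager(log)
am = rm.submit("app_0001")
results = am.run([lambda: 2 + 2, lambda: "done"], NodeManager(log))

print(results)  # [4, 'done']
for line in log:
    print(line)
```

The printed log lists the submission, the ApplicationMaster's registration and container request, two container launches, and the final unregistration, in that order.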
🗃️ MapReduce Framework
🗃️ MapReduce 框架
MapReduce is the processing layer of Hadoop, designed for processing large volumes of data in parallel by dividing tasks into independent units.
MapReduce 是 Hadoop 的处理层,旨在通过将任务划分为独立的单元来并行处理大量数据。
Traditional Parallel Processing Challenges
传统并行处理的挑战
- Critical Path Problem: Delays in any machine affect the entire job.
- Reliability Issues: Machine failures during processing can result in data loss.
- Equal Split Issue: Difficulty in dividing data into equal parts for distribution across machines.
- Single Point of Failure: If one machine fails to provide output, overall results become unattainable.
- Result Aggregation: Need for a method to combine results from multiple machines.
- 关键路径问题 (Critical Path Problem):任何机器的延迟都会影响整个作业。
- 可靠性问题 (Reliability Issues):处理过程中的机器故障可能导致数据丢失。
- 平均分配问题 (Equal Split Issue):难以将数据平均分配到多台机器上。
- 单点故障 (Single Point of Failure):如果一台机器无法提供输出,则无法获得总体结果。
- 结果聚合 (Result Aggregation):需要一种方法来合并来自多台机器的结果。
Advantages of MapReduce
MapReduce 的优势
- Efficient Data Processing: Processes data in parallel.
- Fault Tolerance: Handles reliability issues effectively.
- 高效数据处理 (Efficient Data Processing):并行处理数据。
- 容错性 (Fault Tolerance):有效处理可靠性问题。
MapReduce Structure
MapReduce 结构
| Phase | Description |
|---|---|
| Mapper | Processes input splits (typically one per data block) on the nodes that hold them, accepting key-value pairs as input and emitting intermediate key-value pairs. |
| Shuffle and Sort | Transfers mapper output to the reducers, sorts it by key, and groups all values that share a key. Duplicate values are kept, not removed. |
| Reducer | Receives each key together with its list of values, aggregates the intermediate data into fewer tuples, and writes the final output. |
| 阶段 | 描述 |
|---|---|
| 映射器 (Mapper) | 在数据所在的节点上处理输入分片 (通常每个数据块对应一个分片),接受键值对作为输入,并输出中间键值对。 |
| 混洗和排序 (Shuffle and Sort) | 将映射器的输出传输给化简器,按键排序,并将具有相同键的所有值分为一组。重复的值会被保留,而不是删除。 |
| 化简器 (Reducer) | 接收每个键及其值列表,将中间数据聚合成更少的元组,并写入最终输出。 |
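The three phases fit in one short function when everything runs in a single process. The sketch below is a local illustration of the data flow, not Hadoop's API; `run_mapreduce` and the temperature readings are hypothetical.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Run map, shuffle-and-sort, and reduce over an in-memory list."""
    # Map: each input record yields zero or more (key, value) pairs.
    intermediate = []
    for record in records:
        intermediate.extend(mapper(record))

    # Shuffle and sort: group every value under its key, then order the keys.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce: one call per key, with all of that key's values.
    return [reducer(key, groups[key]) for key in sorted(groups)]


# Hypothetical input: (city, temperature) readings.
readings = [("oslo", 3), ("cairo", 31), ("oslo", 7), ("cairo", 28)]

result = run_mapreduce(
    readings,
    mapper=lambda reading: [(reading[0], reading[1])],
    reducer=lambda city, temps: (city, max(temps)),
)
print(result)  # [('cairo', 31), ('oslo', 7)]
```

Swapping in a different `mapper` and `reducer` changes the computation without touching the grouping logic, which is the core appeal of the model.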
Example of MapReduce Execution
MapReduce 执行示例
To count word occurrences in text files:
统计文本文件中单词出现的次数:
- Map Phase: Each mapper processes a data block and emits a key-value pair (word, 1) for every word it reads.
- 映射阶段 (Map Phase):每个映射器处理一个数据块,并为读到的每个单词输出一个键值对 (word, 1)。
- Shuffle and Sort Phase: Groups the pairs by word, so each word arrives at a reducer with the list of its counts.
- 混洗和排序阶段 (Shuffle and Sort Phase):按单词对键值对分组,使每个单词连同其计数列表一起到达化简器。
- Reduce Phase: Sums the counts for each word to produce the final output.
- 化简阶段 (Reduce Phase):对每个单词的计数求和以获得最终输出。
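The word count can be walked through phase by phase with plain Python. The three strings below stand in for three data blocks and are made-up sample input; the function names are likewise illustrative.

```python
from itertools import groupby


def map_phase(block):
    """Emit (word, 1) for every word in one block of text."""
    return [(word.lower(), 1) for word in block.split()]


def shuffle_and_sort(pairs):
    """Sort pairs by word and collect the counts belonging to each word."""
    ordered = sorted(pairs, key=lambda pair: pair[0])
    return [
        (word, [count for _, count in group])
        for word, group in groupby(ordered, key=lambda pair: pair[0])
    ]


def reduce_phase(grouped):
    """Sum the counts of each word."""
    return {word: sum(counts) for word, counts in grouped}


blocks = ["deer bear river", "car car river", "deer car bear"]

mapped = [pair for block in blocks for pair in map_phase(block)]
grouped = shuffle_and_sort(mapped)
print(grouped)
# [('bear', [1, 1]), ('car', [1, 1, 1]), ('deer', [1, 1]), ('river', [1, 1])]

print(reduce_phase(grouped))
# {'bear': 2, 'car': 3, 'deer': 2, 'river': 2}
```

The intermediate output shows that shuffle and sort only regroups the pairs; the actual addition happens in the reduce phase.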
🧩 Components of Hadoop
🧩 Hadoop 的组件
Key Modules of Hadoop
Hadoop 的关键模块
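- Hadoop 1.x: Consists of the Common, HDFS, and MapReduce modules; MapReduce handles both data processing and cluster resource management.
- Hadoop 2.x: Includes the Common, HDFS, MapReduce, and YARN modules; moving resource management into YARN improves scalability and resource utilization.
- Hadoop 1.x:由 Common、HDFS 和 MapReduce 模块组成;MapReduce 同时负责数据处理和集群资源管理。
- Hadoop 2.x:包括 Common、HDFS、MapReduce 和 YARN 模块;将资源管理移交给 YARN 提高了可扩展性和资源利用率。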
HDFS Architecture Key Points
HDFS 架构要点
- NameNode
- 名称节点 (NameNode)
- DataNode
- 数据节点 (DataNode)
- SecondaryNameNode
- 第二名称节点 (SecondaryNameNode)
YARN Architecture Key Points
YARN 架构要点
- ResourceManager
- 资源管理器 (ResourceManager)
- NodeManager
- 节点管理器 (NodeManager)
- Application Master
- 应用程序主管 (Application Master)
- Container
- 容器 (Container)
MapReduce Architecture Key Points
MapReduce 架构要点
- Mapper
- 映射器 (Mapper)
- Shuffle and Sort
- 混洗和排序 (Shuffle and Sort)
- Reducer
- 化简器 (Reducer)