🗄️ HBase and MapReduce

🗄️ HBase 和 MapReduce

🌐 Overview of MapReduce

🌐 MapReduce 概述

MapReduce is a programming framework designed to efficiently process large amounts of data across multiple machines.
MapReduce 是一个为在多台机器上高效处理大量数据而设计的编程框架。

It achieves scalability and performance by employing a divide-and-conquer approach where data is split into smaller chunks, processed in parallel, and then consolidated.
它通过采用分而治之的方法来实现可伸缩性和性能，即将数据分割成小块，并行处理，然后进行整合。

Key Purposes of MapReduce

MapReduce 的主要目的

Scalable Processing: Increases performance linearly with the addition of physical machines.
可扩展处理：随着物理机器的增加，性能呈线性增长。
Data Splitting: Operates on data from a distributed file system.
数据分割：在分布式文件系统的数据上运行。
Consolidation: Built-in mechanisms for combining processed data.
整合：内置了合并已处理数据的机制。

🎯 Objectives of the Chapter

🎯 本章目标

MapReduce API
MapReduce API
HBase, MapReduce, and the CLASSPATH
HBase、MapReduce 和 CLASSPATH
MapReduce Scan Caching
MapReduce 扫描缓存
Bundled HBase MapReduce Jobs
捆绑的 HBase MapReduce 作业
HBase as a MapReduce Job Data Source and Sink
HBase 作为 MapReduce 作业的数据源和数据汇
HBase MapReduce Examples
HBase MapReduce 示例

🛠️ MapReduce Process

🛠️ MapReduce 流程

Two Phases of MapReduce

MapReduce 的两个阶段

Map Phase:
Map 阶段：
- Splits input data and maps it to key-value pairs.
- 分割输入数据并将其映射为键值对。
Reduce Phase:
Reduce 阶段：
- Aggregates the mapped data to produce a final output.
- 聚合映射后的数据以产生最终输出。

Example: Word Counting Task

示例：单词计数任务

Jack is tasked with counting the occurrences of words in a novel. Instead of doing it alone, he can divide the task among 26 people, each taking a page:
杰克的任务是统计一本小说中单词出现的次数。他可以将任务分配给 26 个人，每人负责一页，而不是独自完成：

Map Phase: Each person processes their page and records each word on separate sheets.
Map 阶段：每个人处理自己的一页，并将每个单词记录在单独的纸上。
Reduce Phase: Users sort their sheets into boxes by the first letter of each word and then tally the counts.
Reduce 阶段：用户按每个单词的首字母将纸张分类放入盒子，然后统计总数。

This task illustrates fault tolerance in MapReduce, as if a person cannot complete their task, another can take their place.
这个任务展示了 MapReduce 的容错性，因为如果一个人无法完成任务，另一个人可以接替。

🌍 Real-World Applications

🌍 真实世界应用

Social Networking: Analyzing user activities to suggest potential friends.
社交网络：分析用户活动以推荐可能的朋友。
Booking Websites: Customizing offerings based on user historical data.
预订网站：根据用户历史数据定制产品。
Industrial Facilities: Using sensor data to optimize maintenance schedules.
工业设施：利用传感器数据优化维护计划。

⚙️ Traditional vs. MapReduce Processing

⚙️ 传统处理 vs. MapReduce 处理

In traditional distributed processing, the following challenges arise:
在传统的分布式处理中，会出现以下挑战：

Challenge挑战	Description描述
关键路径问题	一台机器的延迟会影响整个作业的持续时间。
Reliability Problem	Failures in machines handling data can complicate processing.
可靠性问题	处理数据的机器发生故障会使处理复杂化。
Equal Split Issue	Difficulty in evenly distributing data among machines to prevent overload or underutilization.
平均分割问题	难以在机器间均匀分配数据，以防止过载或利用不足。
Single Split Failure	Failure of a single machine to deliver results can halt the overall calculation.
单一切片失败	单个机器未能交付结果可能导致整个计算停止。
Aggregation Challenge	Need for a mechanism to aggregate results from multiple machines.
聚合挑战	需要一种机制来聚合来自多台机器的结果。

Advantages of MapReduce

MapReduce 的优势

Handles Fault Tolerance: Automatically manages machine failures.
处理容错：自动管理机器故障。
Flexibility: Allows developers to focus on code logic without worrying about system design issues.
灵活性：允许开发人员专注于代码逻辑，而无需担心系统设计问题。

📊 MapReduce Framework

📊 MapReduce 框架

Apache Hadoop MapReduce is a specific implementation for processing large datasets in parallel across a Hadoop cluster. The job configuration includes:
Apache Hadoop MapReduce 是一个用于在 Hadoop 集群中并行处理大型数据集的特定实现。作业配置包括：

Input and output key-value pairs
输入和输出的键值对
Map and reduce functions
Map 和 Reduce 函数
Storage locations for final results
最终结果的存储位置

Job Execution

作业执行

Each MapReduce job consists of a map phase and a reduce phase (though the reduce phase can be omitted).
每个 MapReduce 作业都包含一个 map 阶段和一个 reduce 阶段（尽管 reduce 阶段可以省略）。
The map tasks produce intermediate key-value pairs that the reduce tasks use as input.
Map 任务产生中间键值对，Reduce 任务将其用作输入。

Key-Value Pair Processing

键值对处理

Input: Set of key-value pairs
输入：一组键值对
Intermediate Output: Produced by map tasks
中间输出：由 map 任务产生
Final Output: Produced by reduce tasks
最终输出：由 reduce 任务产生

Example of Key-Value Processing

键值对处理示例

Input Key-Value Pairs (KV1): (string, string)
输入键值对 (KV1)：(字符串, 字符串)
Intermediate Output (KV2): (string, integer)
中间输出 (KV2)：(字符串, 整数)
Final Output (KV3): (integer, string)
最终输出 (KV3)：(整数, 字符串)

The keys do not need to be unique in the map output. A shuffle step sorts values by key for the reduce tasks.
在 map 输出中，键无需唯一。一个 shuffle 步骤会按键对值进行排序，以供 reduce 任务使用。

🔧 Job Configuration Properties

🔧 作业配置属性

Control various characteristics of MapReduce jobs via configuration properties, including:
通过配置属性控制 MapReduce 作业的各种特性，包括：

Input and output key-value pair types
输入和输出键值对的类型
Map and reduce functions
Map 和 Reduce 函数
Storage locations for final results
最终结果的存储位置

Input Data Management

输入数据管理

MapReduce input data is divided into splits and further into key-value pairs, enabling effective data processing.
MapReduce 输入数据被分成切片，并进一步分为键值对，从而实现有效的数据处理。

📝 Example: HBase Processing

📝 示例：HBase 处理

In HBase, a (line number, text) key-value pair is generated for each line of an input document. The map function produces (word, count) pairs for each word, and the reduce phase aggregates these to produce final counts, writing results to HDFS. The output format can be customized but generally separates keys and values with a tab and records with newlines.
在 HBase 中，会为输入文档的每一行生成一个（行号，文本）的键值对。Map 函数为每个单词生成（单词，计数）对，而 Reduce 阶段则聚合这些对以产生最终计数，并将结果写入 HDFS。输出格式可以自定义，但通常用制表符分隔键和值，用换行符分隔记录。

📊 MapReduce Overview

📊 MapReduce 概述

📄 Word Count Example

📄 单词计数示例

In the context of MapReduce (MR), a typical task might involve counting the occurrences of words in a text file.
在 MapReduce (MR) 的上下文中，一个典型的任务可能涉及计算文本文件中单词的出现次数。

Data Flow Process

数据流过程

Input Phase: The text file is split into input data.
输入阶段：文本文件被分割成输入数据。
Mapping Phase:
映射阶段：
- Three mappers process the input data in parallel.
- 三个映射器（Mapper）并行处理输入数据。
- Each mapper generates key-value pairs as output for each row of input.
- 每个映射器为每一行输入生成键值对作为输出。
Shuffling and Sorting Phase:
洗牌和排序阶段：
- The data is shuffled and sorted, grouping the same words together.
- 数据被洗牌和排序，将相同的单词分组在一起。
Reducing Phase:
归约阶段：
- The reducers combine each key-value pair and write the results to HDFS (Hadoop Distributed File System).
- 归约器（Reducer）合并每个键值对，并将结果写入 HDFS（Hadoop 分布式文件系统）。

Example Output

输出示例

The final output of the MapReduce job might look like this:
MapReduce 作业的最终输出可能如下所示：

Word单词	Count计数
CAR	3
CAT	2
DOG	2
RAT	2

🔍 Life Cycle of a MapReduce Job

🔍 MapReduce 作业的生命周期

Job Submission Process

作业提交过程

Job Client:
作业客户端：
- Prepares and submits the job to the Job Tracker.
- 准备作业并将其提交给作业跟踪器（Job Tracker）。
- Validates job configuration and generates input splits.
- 验证作业配置并生成输入分片。
Job Tracker:
作业跟踪器：
- Schedules and distributes map and reduce tasks among Task Trackers.
- 在任务跟踪器（Task Tracker）之间调度和分发 map 和 reduce 任务。
- Monitors job status and task completion.
- 监控作业状态和任务完成情况。
Task Tracker:
任务跟踪器：
- Manages tasks for worker nodes.
- 管理工作节点（worker node）的任务。
- Executes map or reduce tasks and reports status to the Job Tracker.
- 执行 map 或 reduce 任务，并向作业跟踪器报告状态。

Phases in a MapReduce Job

MapReduce 作业中的阶段

Phase阶段	Description描述
Input Split	The input is divided into fixed-size pieces called input splits, consumed by a single map.
输入分片	输入被分割成固定大小的片段，称为输入分片，由单个 map 任务消耗。
Mapping	Data from each split is processed to produce output values, typically as <word, frequency> pairs.
映射	处理每个分片的数据以产生输出值，通常是 <单词, 频率> 对。
Shuffling	Consolidates relevant records from the mapping output, grouping the same words with their frequency.
洗牌	整合映射输出中的相关记录，将相同的单词及其频率分组。
Reducing	Aggregates output values from shuffling to produce a summary of the dataset.
归约	聚合来自洗牌阶段的输出值，以生成数据集的摘要。

⚙️ Map Task and Reduce Task

⚙️ Map 任务和 Reduce 任务

Map Task

Map 任务

Purpose: To process each input split.
目的：处理每个输入分片。
Steps:
步骤：
- Fetches input data locally.
- 在本地获取输入数据。
- Applies the map function to create key-value pairs.
- 应用 map 函数创建键值对。
- Sorts and aggregates results.
- 对结果进行排序和聚合。
- Communicates progress to the Task Tracker.
- 向任务跟踪器（Task Tracker）通报进度。

Reduce Task

Reduce 任务

Purpose: To aggregate results from the mapping phase.
目的：聚合来自映射阶段的结果。
Steps:
步骤：
- Fetches map results locally.
- 在本地获取 map 结果。
- Sorts and merges results into a single set of (key, value-list) pairs.
- 将结果排序并合并成一个 (键, 值列表) 对的集合。
- Executes the reduce function on each pair.
- 对每对执行 reduce 函数。
- Saves final results to the output destination.
- 将最终结果保存到输出目的地。

🔑 Advantages of MapReduce

🔑 MapReduce 的优势

Parallel Processing:
并行处理：
- Jobs are divided among multiple nodes, allowing simultaneous work, which significantly reduces processing time.
- 作业被分配到多个节点上，允许同时工作，从而显著减少处理时间。
Data Locality:
数据本地性：
- MapReduce processes data where it resides, reducing the need to move large data sets across the network, thereby improving performance and avoiding bottlenecks.
- MapReduce 在数据所在的位置处理数据，减少了跨网络移动大量数据集的需求，从而提高了性能并避免了瓶颈。

Issues Overcome by MapReduce

MapReduce 克服的问题

Cost of Moving Data: Reduces network performance issues by processing data locally.
数据移动成本：通过本地处理数据减少网络性能问题。
Processing Time: Eliminates delays caused by single-unit processing.
处理时间：消除了单单元处理造成的延迟。
Master Node Burden: Distributes workload to prevent failure of a single node.
主节点负担：分配工作负载以防止单个节点故障。

These concepts collectively illustrate the power and efficiency of the MapReduce framework in handling large-scale data processing tasks.
这些概念共同说明了 MapReduce 框架在处理大规模数据处理任务方面的强大功能和效率。

🗂️ HBase and MapReduce

🗂️ HBase 和 MapReduce

Advantages of Data Processing with MapReduce

使用 MapReduce 进行数据处理的优势

Cost-Effectiveness:
成本效益：
- Moving the processing unit closer to the data reduces costs associated with data transmission.
- 将处理单元移近数据，减少了与数据传输相关的成本。
Reduced Processing Time:
减少处理时间：
- All nodes work with their part of the data in parallel, leading to faster processing.
- 所有节点并行处理各自的数据部分，从而加快了处理速度。
Balanced Workload:
均衡的工作负载：
- Each node processes only a part of the data, preventing any single node from becoming overburdened.
- 每个节点只处理一部分数据，防止任何单个节点变得过载。

Configuring MapReduce with HBase

使用 HBase 配置 MapReduce

To allow MapReduce jobs to access HBase, follow these steps:
要允许 MapReduce 作业访问 HBase，请按以下步骤操作：

Add Configuration File:
添加配置文件：
- Place hbase-site.xml in the $HADOOP_HOME/conf/ directory.
- 将 hbase-site.xml 放置在 $HADOOP_HOME/conf/ 目录中。
Add HBase JARs:
添加 HBase JAR 文件：
- Include HBase JAR files in the $HADOOP_HOME/lib/ directory.
- 将 HBase JAR 文件包含在 $HADOOP_HOME/lib/ 目录中。
Environment Variables:
环境变量：
- Alternatively, edit $HADOOP_HOME/conf/hadoop-env.sh to include HBase references in the HADOOP_CLASSPATH.
- 或者，编辑 $HADOOP_HOME/conf/hadoop-env.sh 以在 HADOOP_CLASSPATH 中包含 HBase 引用。
- Note: This approach is not recommended as it may pollute the Hadoop installation.
- 注意：不推荐使用此方法，因为它可能会污染 Hadoop 安装。
Restart Required:
需要重启：
- Restart the Hadoop cluster to utilize the HBase data.
- 重启 Hadoop 集群以利用 HBase 数据。

Example Command to Run HBase Row Counter

运行 HBase 行计数器的示例命令

1 2	$ HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-server-VERSION.jar rowcounter usertable

Ensure to use the correct version of the HBase JAR for your system.
确保为您的系统使用正确版本的 HBase JAR。

Common Errors and Fixes

常见错误及修复

Class Not Found Error

类未找到错误

If you encounter:
如果您遇到：

1	java.lang.ClassNotFoundException: org.apache.hadoop.hbase.mapreduce.RowCounter$RowCounterMapper

Modify the command to use JARs from the target/ directory within the build environment:
修改命令以使用构建环境中 target/ 目录下的 JAR 文件：

1
2

$ HADOOP_CLASSPATH=${HBASE_HOME}/hbase-server/target/hbase-server-VERSION-SNAPSHOT.jar:`${HBASE_HOME}/bin/hbase classpath`
${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-server/target/hbase-server-VERSION-SNAPSHOT.jar rowcounter usertable

IllegalAccessError

非法访问错误

For errors like:
对于类似以下的错误：

1	java.lang.IllegalAccessError: class com.google.protobuf.ZeroCopyLiteralByteString cannot access its superclass com.google.protobuf.LiteralByteString

Ensure that hbase-protocol.jar is included in Hadoop’s classpath.
确保 hbase-protocol.jar 已包含在 Hadoop 的类路径中。

Job Launching Commands

作业启动命令

Here are commands to satisfy the new classloader requirements:
以下是满足新类加载器要求的命令：

1
2
3

$ HADOOP_CLASSPATH=/path/to/hbase-protocol.jar:/path/to/hbase/conf hadoop jar MyJob.jar MyJobMainClass
$ HADOOP_CLASSPATH=$(hbase mapredcp):/path/to/hbase/conf hadoop jar MyJob.jar MyJobMainClass
$ HADOOP_CLASSPATH=$(hbase classpath) hadoop jar MyJob.jar MyJobMainClass

MapReduce APIs Overview

MapReduce API 概述

Mapper Class

Mapper 类

Purpose: Maps input key-value pairs to intermediate key-value pairs.
目的：将输入键值对映射到中间键值对。
Methods:
方法：

Method Signature方法签名	Description描述
`void cleanup(Context context)`	Called once at the end of the task.
`void cleanup(Context context)`	在任务结束时调用一次。
`void map(KEYIN key, VALUEIN value, Context context)`	Called once for each key-value in the input split.
`void map(KEYIN key, VALUEIN value, Context context)`	对输入分片中的每个键值对调用一次。
`void run(Context context)`	Override to control the execution of the Mapper.
`void run(Context context)`	重写以控制 Mapper 的执行。
`void setup(Context context)`	Called once at the beginning of the task.
`void setup(Context context)`	在任务开始时调用一次。

Reducer Class

Reducer 类

Purpose: Reduces the set of intermediate values.
目的：归约中间值的集合。
Methods:
方法：

Method Signature方法签名	Description描述
`void cleanup(Context context)`	Called once at the end of the task.
`void cleanup(Context context)`	在任务结束时调用一次。
`void map(KEYIN key, Iterable<VALUEIN> values, Context context)`	Called once for each key.
`void map(KEYIN key, Iterable<VALUEIN> values, Context context)`	对每个键调用一次。
`void run(Context context)`	Used to control the tasks of the Reducer.
`void run(Context context)`	用于控制 Reducer 的任务。
`void setup(Context context)`	Called once at the beginning of the task.
`void setup(Context context)`	在任务开始时调用一次。

Job Class

Job 类

Purpose: Configures and submits the job.
目的：配置并提交作业。
Methods:
方法：

Method Signature方法签名	Description描述
`Counters getCounters()`	Gets the counters for the job.
`Counters getCounters()`	获取作业的计数器。
`long getFinishTime()`	Gets the finish time for the job.
`long getFinishTime()`	获取作业的完成时间。
`Job getInstance()`	Generates a new Job without any cluster.
`Job getInstance()`	生成一个没有任何集群的新作业。
`Job getInstance(Configuration conf)`	Generates a new Job with provided configuration.
`Job getInstance(Configuration conf)`	使用提供的配置生成一个新作业。
`String getJobFile()`	Gets the path of the submitted job configuration.
`String getJobFile()`	获取已提交作业配置的路径。
`String getJobName()`	Gets the user-specified job name.
`String getJobName()`	获取用户指定的作业名称。
`JobPriority getPriority()`	Gets the scheduling function of the job.
`JobPriority getPriority()`	获取作业的调度函数。
`void setJarByClass(Class<?> c)`	Sets the jar by providing the class name with .class extension.
`void setJarByClass(Class<?> c)`	通过提供带 .class 扩展名的类名来设置 jar 文件。
`void setJobName(String name)`	Sets the user-specified job name.
`void setJobName(String name)`	设置用户指定的作业名称。
`void setMapOutputKeyClass(Class<?> class)`	Sets the key class for the map output data.
`void setMapOutputKeyClass(Class<?> class)`	设置 map 输出数据的键类。
`void setMapOutputValueClass(Class<?> class)`	Sets the value class for the map output data.
`void setMapOutputValueClass(Class<?> class)`	设置 map 输出数据的值类。
`void setMapperClass(Class<? extends Mapper> class)`	Sets the Mapper for the job.
`void setMapperClass(Class<? extends Mapper> class)`	设置作业的 Mapper。
`void setNumReduceTasks(int tasks)`	Sets the number of reduce tasks for the job.
`void setNumReduceTasks(int tasks)`	设置作业的 reduce 任务数量。
`void setReducerClass(Class<? extends Reducer> class)`	Sets the Reducer for the job.
`void setReducerClass(Class<? extends Reducer> class)`	设置作业的 Reducer。

🗄️ HBase and MapReduce

🗄️ HBase 和 MapReduce

Getting Counters for the Job

获取作业的计数器

Method: `getCounters()`

方法：`getCounters()`

Purpose: Used to retrieve the counters for the job.
目的：用于检索作业的计数器。
Exception: Throws IllegalStateException.
异常：抛出 IllegalStateException。

Scan Caching in HBase

HBase 中的扫描缓存

Overview

概述

The TableMapReduceUtil restores the option to set scanner caching on the Scan object.
TableMapReduceUtil 恢复了在 Scan 对象上设置扫描器缓存的选项。
This functionality was affected by a bug in HBase versions 0.95, but it has been fixed in versions 0.98.5 and 0.96.3.
该功能在 HBase 0.95 版本中受到一个 bug 的影响，但在 0.98.5 和 0.96.3 版本中已修复。

Caching Priority Order

缓存优先级顺序

Caching settings defined on the Scan object.
在 Scan 对象上定义的缓存设置。
Settings specified via the configuration option hbase.client.scanner.caching (can be set in hbase-site.xml or using TableMapReduceUtil.setScannerCaching()).
通过配置选项 hbase.client.scanner.caching 指定的设置（可在 hbase-site.xml 中设置或使用 TableMapReduceUtil.setScannerCaching() 设置）。
Default value: HConstants.DEFAULT_HBASE_CLIENT_SCANNER_CACHING, which is set to 100.
默认值：HConstants.DEFAULT_HBASE_CLIENT_SCANNER_CACHING，其值设置为 100。

Optimization Considerations

优化注意事项

Balancing Act: Optimizing caching settings involves balancing:
平衡操作：优化缓存设置涉及平衡以下几点：
- Client wait time for results.
- 客户端等待结果的时间。
- Number of result sets needed.
- 所需结果集的数量。
Metaphor: Think of the scan as a shovel:
比喻：将扫描想象成一把铲子：
- A bigger cache is like a bigger shovel, allowing for more efficient digging.
- 更大的缓存就像一把更大的铲子，可以更高效地挖掘。
- A smaller cache requires more “shoveling” to fill the bucket, resulting in more requests.
- 较小的缓存需要更多的“铲土”来填满桶，从而导致更多的请求。

HBase Scan Class

HBase Scan 类

Class Hierarchy

类层次结构

java.lang.Object 
    org.apache.hadoop.hbase.client.Operation 
        org.apache.hadoop.hbase.client.OperationWithAttributes 
            org.apache.hadoop.hbase.client.Query 
                org.apache.hadoop.hbase.client.Scan

The Scan class is used to perform scan operations.
Scan 类用于执行扫描操作。
All operations are similar to Get, with the distinction that it can define optional startRow and stopRow.
所有操作都与 Get 类似，区别在于它可以定义可选的 startRow 和 stopRow。

Creating a Scan Instance

创建 Scan 实例

To retrieve all columns from all rows, instantiate Scan with no constraints.
要检索所有行的所有列，实例化不带任何约束的 Scan。
To constrain:
要进行约束：
- Use addFamily for specific column families.
- 对特定列族使用 addFamily。
- Use addColumn for specific columns.
- 对特定列使用 addColumn。
- Use setTimeRange for specific timestamp ranges.
- 对特定时间戳范围使用 setTimeRange。
- Use setTimestamp for a specific timestamp.
- 对特定时间戳使用 setTimestamp。
- Use readVersions(int) to limit versions of each column.
- 使用 readVersions(int) 限制每列的版本数。
- Use setBatch to control maximum values returned for each call to next().
- 使用 setBatch 控制每次调用 next() 返回的最大值数量。
- Use setLimit(int) to limit the number of rows returned.
- 使用 setLimit(int) 限制返回的行数。

Important Notes

重要说明

Attributes in Scan instances are updated as they run, which could affect cloning or reusing instances.
Scan 实例中的属性在运行时会更新，这可能会影响克隆或重用实例。

Bundled HBase MapReduce Jobs

捆绑的 HBase MapReduce 作业

Running Jobs

运行作业

To run bundled MapReduce jobs, execute:
要运行捆绑的 MapReduce 作业，请执行：

1	$ ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-server-VERSION.jar <program_name>

Valid Program Names:
有效的程序名称：
- copytable: Export a table from local to peer cluster.
- copytable：将表从本地导出到对等集群。
- completebulkload: Complete a bulk data load.
- completebulkload：完成批量数据加载。
- export: Write table data to HDFS.
- export：将表数据写入 HDFS。
- import: Import data from an export.
- import：从导出文件导入数据。
- rowcounter: Count rows in an HBase table.
- rowcounter：计算 HBase 表中的行数。

Example Command

示例命令

1	$ ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-server-VERSION.jar rowcounter myTable

HBase as a Data Source in MapReduce

HBase 作为 MapReduce 的数据源

Input Format

输入格式

The MapReduce application reads from log files in HDFS.
MapReduce 应用程序从 HDFS 中的日志文件读取数据。
The schema describes the data as [k1,v1] tuples, where:
模式将数据描述为 [k1,v1] 元组，其中：
- k1 is the line number.
- k1 是行号。
- v1 is the line content.
- v1 是行内容。
Code Example:
代码示例：

Configuration conf = new Configuration(); 
Job job = new Job(conf, "TimeSpent"); 
job.setInputFormatClass(TextInputFormat.class); 
job.setOutputFormatClass(TextOutputFormat.class);

Data Types:
数据类型：
- LongWritable for line number.
- LongWritable 用于行号。
- Text for line content.
- Text 用于行内容。

Mapping Over Data in HBase

在 HBase 中对数据进行映射

Use the Scan class to define the row-range for the MapReduce job.
使用 Scan 类为 MapReduce 作业定义行范围。
Example Scan Instance:
扫描实例示例：

1 2	Scan scan = new Scan(); scan.addColumn(Bytes.toBytes("twits"), Bytes.toBytes("twit"));

Mapper Implementation

Mapper 实现

The mapper consumes data as [rowkey:scan result] pairs.
mapper 以 [行键:扫描结果] 对的形式消费数据。
Use ImmutableBytesWritable and Result as types.
使用 ImmutableBytesWritable 和 Result 作为类型。
Map Method Example:
Map 方法示例：

1
2
3

protected void map (ImmutableBytesWritable rowkey, Result result, Context context) { 
  ...
}

Initializing the Job

初始化作业

Use TableMapReduceUtil to configure the job:
使用 TableMapReduceUtil 配置作业：

1	TableMapReduceUtil.initTableMapperJob("twits", scan, Map.class, ImmutableBytesWritable.class, Result.class, job);

This ties the job configuration with HBase-specific input formats.
这将作业配置与 HBase 特定的输入格式绑定在一起。

HBase as a Data Sink

HBase 作为数据汇

HBase Sink Class

HBase Sink 类

Class Definition:
类定义：

1	public class HBaseSink extends AbstractSink implements Configurable {

Reads events from a channel and writes them to HBase.
从通道读取事件并将其写入 HBase。
Configuration is pulled from the first hbase-site.xml in the classpath.
配置从类路径中的第一个 hbase-site.xml 文件中提取。

Mandatory Parameters

强制参数

Table: Name of the target table in HBase.
表：HBase 中目标表的名称。
Column-Family: Column family to write to.
列族：要写入的列族。

Transaction Commit Conditions

事务提交条件

Commits when:
在以下情况提交：
- Write buffer size is reached.
- 达到写入缓冲区大小时。
- Number of events in the current transaction reaches the batch size.
- 当前事务中的事件数量达到批处理大小时。

📊 MapReduce Word Count Program

📊 MapReduce 单词计数程序

Overview of the MapReduce Program

MapReduce 程序概述

The MapReduce program can be fundamentally divided into three parts:
MapReduce 程序基本上可以分为三个部分：

Mapper Phase Code
Mapper 阶段代码
Reducer Phase Code
Reducer 阶段代码
Driver Code
驱动程序代码

Mapper Phase Code

Mapper 阶段代码

Mapper Class

Mapper 类

public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            value.set(tokenizer.nextToken());
            context.write(value, new IntWritable(1));
        }
    }
}

Input and Output of the Mapper

Mapper 的输入和输出

Input:
输入：
- Key: Offset of each line in the text file (type: LongWritable)
- 键：文本文件中每行的偏移量（类型：LongWritable）
- Value: Each individual line (type: Text)
- 值：每一行（类型：Text）
Output:
输出：
- Key: Tokenized words (type: Text)
- 键：分词后的单词（类型：Text）
- Value: Hardcoded value of 1 (type: IntWritable)
- 值：硬编码的值 1（类型：IntWritable）

Mapper Input Mapper 输入	Mapper Output Mapper 输出
Offset: 0	“DOG” → 1
偏移量: 0	“DOG” → 1
Line: “DOG CAT RAT”	“CAT” → 1
行: “DOG CAT RAT”	“CAT” → 1
	“RAT” → 1
	“RAT” → 1

Reducer Phase Code

Reducer 阶段代码

Reducer Class

Reducer 类

public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable x : values) {
            sum += x.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

Input and Output of the Reducer

Reducer 的输入和输出

Input:
输入：
- Key: Unique words generated after the sorting and shuffling phase (type: Text)
- 键：排序和洗牌阶段后生成的唯一单词（类型：Text）
- Value: List of integers corresponding to each key (type: IntWritable)
- 值：与每个键对应的整数列表（类型：IntWritable）
Output:
输出：
- Key: All unique words present in the input text file (type: Text)
- 键：输入文本文件中出现的所有唯一单词（类型：Text）
- Value: Number of occurrences of each unique word (type: IntWritable)
- 值：每个唯一单词的出现次数（类型：IntWritable）

Reducer Input Reducer 输入	Reducer Output Reducer 输出
Key: “CAR”	“CAR” → 3
键: “CAR”	“CAR” → 3
Values: [1, 1, 1]
值: [1, 1, 1]
Key: “CAT”	“CAT” → 2
键: “CAT”	“CAT” → 2
Values: [1, 1]
值: [1, 1]

Driver Code

驱动程序代码

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "My Word Count Program");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    Path outputPath = new Path(args[1]);
    
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    outputPath.getFileSystem(conf).delete(outputPath);
    
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}

Key Components of the Driver Code

驱动程序代码的关键组件

Configuration: Sets the configuration for the MapReduce job.
配置：为 MapReduce 作业设置配置。
Job Setup: Defines the job name, mapper, reducer, input/output classes, and paths.
作业设置：定义作业名称、mapper、reducer、输入/输出类和路径。
Execution: The job is executed, and the application will exit based on the job’s success or failure.
执行：执行作业，应用程序将根据作业的成功或失败退出。

Command to Run MapReduce Code

运行 MapReduce 代码的命令

To execute the MapReduce code, use the following command:
要执行 MapReduce 代码，请使用以下命令：

1	hadoop jar hadoop-mapreduce-example.jar WordCount /sample/input /sample/output

Important Concepts in MapReduce

MapReduce 中的重要概念

MapReduce Phases

MapReduce 阶段

Input Splits
输入分片
Input
输入
Mapping
映射
Shuffling
洗牌
Reducing
归约
Final Output
最终输出

Common Questions

常见问题

Which class maps the input key-value pairs to a set of intermediate key-value pairs?
哪个类将输入键值对映射到一组中间键值对？
- Answer: Mapper Class
- 答案：Mapper 类
Which class reduces the set of intermediate values?
哪个类对中间值的集合进行归约？
- Answer: Reducer Class
- 答案：Reducer 类
Which class is used to configure the job and submits it?
哪个类用于配置作业并提交它？
- Answer: Job Class
- 答案：Job 类
Correct sequence of phases in MapReduce?
MapReduce 中各阶段的正确顺序是？
- Answer: Input Splits, Input, Mapping, Shuffling, Reducer, Final Output
- 答案：输入分片、输入、映射、洗牌、归约、最终输出

Note on HBase and MapReduce Integration

关于 HBase 和 MapReduce 集成的说明

HBase does not guarantee atomic commits on multiple rows leading to potential duplicates if a failure occurs during batch writes.
HBase 不保证对多行的原子提交，如果在批量写入期间发生故障，可能会导致重复数据。
The serializer in the pipeline is responsible for handling duplicates.
流水线中的序列化器负责处理重复数据。

🗂️ MapReduce Overview

🗂️ MapReduce 概述

🛠️ MapReduce API

The MapReduce API defines classes and methods used for programming in the MapReduce framework. Key components include:
MapReduce API 定义了用于在 MapReduce 框架中编程的类和方法。关键组件包括：

Mapper Class:
Mapper 类：
- The role of the Mapper class is to map input key-value pairs to a set of intermediate key-value pairs.
- Mapper 类 的作用是将输入键值对映射到一组中间键值对。
- It transforms input records into intermediate records.
- 它将输入记录转换为中间记录。
Reducer Class:
Reducer 类：
- The Reducer class is responsible for reducing the set of intermediate values into a final output.
- Reducer 类 负责将中间值的集合归约为最终输出。
Job Class:
Job 类：
- The Job class configures the job, submits it, controls execution, and queries the state.
- Job 类 配置作业、提交作业、控制执行并查询状态。
- Once the job is submitted, invoking the set method throws an IllegalStateException.
- 作业提交后，调用 set 方法会抛出 IllegalStateException。

📊 MapReduce Phases

📊 MapReduce 阶段

MapReduce is a software framework and programming model used for processing vast amounts of data. It operates in two main phases:
MapReduce 是一个用于处理海量数据的软件框架和编程模型。它主要分两个阶段运行：

Map Phase:
Map 阶段：
- Processes input data and produces intermediate key-value pairs.
- 处理输入数据并产生中间键值对。
Reduce Phase:
Reduce 阶段：
- Takes intermediate values to produce final output.
- 接收中间值以产生最终输出。

Working Process of MapReduce

MapReduce 的工作流程

The MapReduce process consists of four execution phases:
MapReduce 过程包括四个执行阶段：

Input Splits: The input data is divided into manageable pieces.
输入分片：输入数据被分成可管理的小块。
Mapping: Each input split is processed by the Mapper to generate intermediate key-value pairs.
映射：每个输入分片由 Mapper 处理，以生成中间键值对。
Shuffling: The intermediate data is shuffled to group by key, preparing it for the Reducer.
洗牌：对中间数据进行洗牌，按键分组，为 Reducer 做准备。
Reducing: The Reducer processes the grouped data to produce the final output.
归约：Reducer 处理分组后的数据以产生最终输出。

🗂️ MapReduce Architecture

🗂️ MapReduce 架构

Phase阶段	Description描述
Input Splits	Divides the job into manageable tasks.
输入分片	将作业划分为可管理的任务。
Mapping	Transforms input data into intermediate key-value pairs.
映射	将输入数据转换为中间键值对。
Shuffling	Groups intermediate data by key for the Reducer.
洗牌	按键对中间数据进行分组，以供 Reducer 使用。
Reducing	Final processing to produce output key-value pairs.
归约	进行最终处理以产生输出键值对。

🔍 Organizing Work in MapReduce

🔍 在 MapReduce 中组织工作

Hadoop divides jobs into tasks, categorized as:
Hadoop 将作业分为任务，分类如下：

Map Tasks: Includes Input Splits and Mapping.
Map 任务：包括输入分片和映射。
Reduce Tasks: Includes Shuffling and Reducing.
Reduce 任务：包括洗牌和归约。

🗄️ MapReduce Data Handling

🗄️ MapReduce 数据处理

Scan Caching: Refers to the number of rows cached before returning results to clients.
扫描缓存：指在向客户端返回结果之前缓存的行数。

Bundled HBase MapReduce Jobs

捆绑的 HBase MapReduce 作业

The HBase JAR serves as a Driver for various bundled MapReduce jobs.
HBase JAR 作为各种捆绑的 MapReduce 作业的驱动程序。

🔗 HBase Integration with MapReduce

🔗 HBase 与 MapReduce 集成

HBase as a Data Source: For instance, in a MapReduce application, lines are read from log files stored in HDFS, acting as the data source.
HBase 作为数据源：例如，在 MapReduce 应用程序中，从存储在 HDFS 中的日志文件中读取行，这些日志文件充当数据源。
HBase as a Sink: HBase can also serve as a sink, reading events from a channel and writing them to HBase.
HBase 作为数据汇：HBase 也可以作为数据汇，从通道读取事件并将其写入 HBase。