🗄️ HBase and MapReduce

🌐 Overview of MapReduce
MapReduce is a programming framework designed to efficiently process large amounts of data across multiple machines. It achieves scalability and performance through a divide-and-conquer approach: data is split into smaller chunks, processed in parallel, and the partial results are then consolidated.
Key Purposes of MapReduce

- Scalable Processing: Performance grows roughly linearly as physical machines are added.
- Data Splitting: Operates on data stored in a distributed file system.
- Consolidation: Provides built-in mechanisms for combining processed data.
🎯 Objectives of the Chapter

- MapReduce API
- HBase, MapReduce, and the CLASSPATH
- MapReduce Scan Caching
- Bundled HBase MapReduce Jobs
- HBase as a MapReduce Job Data Source and Sink
- HBase MapReduce Examples
🛠️ MapReduce Process

Two Phases of MapReduce

- Map Phase: Splits input data and maps it to key-value pairs.
- Reduce Phase: Aggregates the mapped data to produce the final output.
Example: Word Counting Task

Jack is tasked with counting the occurrences of words in a novel. Instead of doing it alone, he can divide the task among 26 people, each taking a page:

- Map Phase: Each person processes their page and records each word on a separate sheet.
- Reduce Phase: The workers sort their sheets into boxes by the first letter of each word and then tally the counts.

This task also illustrates fault tolerance in MapReduce: if one person cannot complete their task, another can take their place.
🌍 Real-World Applications

- Social Networking: Analyzing user activities to suggest potential friends.
- Booking Websites: Customizing offerings based on users' historical data.
- Industrial Facilities: Using sensor data to optimize maintenance schedules.
⚙️ Traditional vs. MapReduce Processing

In traditional distributed processing, the following challenges arise:

| Challenge | Description |
|---|---|
| Critical Path Problem | A delay on one machine extends the duration of the entire job. |
| Reliability Problem | Failures in machines handling data can complicate processing. |
| Equal Split Issue | Difficulty in evenly distributing data among machines to prevent overload or underutilization. |
| Single Split Failure | Failure of a single machine to deliver results can halt the overall calculation. |
| Aggregation Challenge | Need for a mechanism to aggregate results from multiple machines. |
Advantages of MapReduce

- Handles Fault Tolerance: Automatically manages machine failures.
- Flexibility: Lets developers focus on code logic without worrying about system design issues.
📊 MapReduce Framework

Apache Hadoop MapReduce is a specific implementation of the model for processing large datasets in parallel across a Hadoop cluster. The job configuration includes:

- Input and output key-value pairs
- Map and reduce functions
- Storage locations for the final results

Job Execution

- Each MapReduce job consists of a map phase and a reduce phase (the reduce phase can be omitted).
- The map tasks produce intermediate key-value pairs that the reduce tasks use as input.
Key-Value Pair Processing

- Input: Set of key-value pairs
- Intermediate Output: Produced by map tasks
- Final Output: Produced by reduce tasks

Example of Key-Value Processing

- Input Key-Value Pairs (KV1): (string, string)
- Intermediate Output (KV2): (string, integer)
- Final Output (KV3): (integer, string)

The keys do not need to be unique in the map output. A shuffle step sorts values by key before they reach the reduce tasks.
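These three stages correspond directly to the generic type parameters of Hadoop's Mapper and Reducer classes. The sketch below is purely illustrative (the class names are hypothetical, not from the original text) and only shows where the KV1/KV2/KV3 types above land in code:

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// KV1 (string, string) -> KV2 (string, integer): Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
class KV1ToKV2Mapper extends Mapper<Text, Text, Text, IntWritable> {
    // map() would emit the intermediate (string, integer) pairs here.
}

// KV2 (string, integer) -> KV3 (integer, string): Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
class KV2ToKV3Reducer extends Reducer<Text, IntWritable, IntWritable, Text> {
    // reduce() would aggregate the values per key and emit the final (integer, string) pairs.
}
```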
🔧 Job Configuration Properties

Various characteristics of a MapReduce job are controlled through configuration properties, including:

- Input and output key-value pair types
- Map and reduce functions
- Storage locations for the final results

Input Data Management

MapReduce input data is divided into splits and further into key-value pairs, enabling effective parallel processing.
📝 Example: HBase Processing

In HBase, a (line number, text) key-value pair is generated for each line of an input document. The map function produces (word, count) pairs for each word, and the reduce phase aggregates these into final counts, writing the results to HDFS. The output format can be customized, but by default it separates keys and values with a tab and records with newlines.
📊 MapReduce Overview

📄 Word Count Example

In the context of MapReduce (MR), a typical task might involve counting the occurrences of words in a text file.

Data Flow Process

- Input Phase: The text file is split into input data.
- Mapping Phase: Three mappers process the input data in parallel, and each mapper generates key-value pairs as output for each row of input.
- Shuffling and Sorting Phase: The data is shuffled and sorted, grouping occurrences of the same word together.
- Reducing Phase: The reducers combine the values for each key and write the results to HDFS (the Hadoop Distributed File System).
Example Output

The final output of the MapReduce job might look like this:

| Word | Count |
|---|---|
| CAR | 3 |
| CAT | 2 |
| DOG | 2 |
| RAT | 2 |
🔍 Life Cycle of a MapReduce Job

Job Submission Process

Job Client:

- Prepares and submits the job to the Job Tracker.
- Validates the job configuration and generates the input splits.

Job Tracker:

- Schedules and distributes map and reduce tasks among Task Trackers.
- Monitors job status and task completion.

Task Tracker:

- Manages tasks on worker nodes.
- Executes map or reduce tasks and reports status to the Job Tracker.

Phases in a MapReduce Job

| Phase | Description |
|---|---|
| Input Split | The input is divided into fixed-size pieces called input splits, each consumed by a single map task. |
| Mapping | Data from each split is processed to produce output values, typically as <word, frequency> pairs. |
| Shuffling | Consolidates the relevant records from the mapping output, grouping each word with its frequencies. |
| Reducing | Aggregates the output values from shuffling to produce a summary of the dataset. |
⚙️ Map Task and Reduce Task

Map Task

- Purpose: Process each input split.
- Steps:
  - Fetch the input data locally.
  - Apply the map function to create key-value pairs.
  - Sort and aggregate the results.
  - Report progress to the Task Tracker.

Reduce Task

- Purpose: Aggregate the results of the mapping phase.
- Steps:
  - Fetch the map results locally.
  - Sort and merge the results into a single set of (key, value-list) pairs.
  - Execute the reduce function on each pair.
  - Save the final results to the output destination.
🔑 Advantages of MapReduce

Parallel Processing:

- Jobs are divided among multiple nodes that work simultaneously, which significantly reduces processing time.

Data Locality:

- MapReduce processes data where it resides, reducing the need to move large datasets across the network, thereby improving performance and avoiding bottlenecks.

Issues Overcome by MapReduce

- Cost of Moving Data: Processing data locally avoids network performance problems.
- Processing Time: Eliminates the delays of processing everything on a single unit.
- Master Node Burden: Distributes the workload so that no single node becomes a point of failure.

These concepts collectively illustrate the power and efficiency of the MapReduce framework for large-scale data processing tasks.
🗂️ HBase and MapReduce

Advantages of Data Processing with MapReduce

Cost-Effectiveness:

- Moving the processing unit closer to the data reduces the costs associated with data transmission.

Reduced Processing Time:

- All nodes work on their part of the data in parallel, leading to faster processing.

Balanced Workload:

- Each node processes only part of the data, preventing any single node from becoming overburdened.
Configuring MapReduce with HBase

To allow MapReduce jobs to access HBase, follow these steps:

Add Configuration File:

- Place `hbase-site.xml` in the `$HADOOP_HOME/conf/` directory.

Add HBase JARs:

- Include the HBase JAR files in the `$HADOOP_HOME/lib/` directory.

Environment Variables:

- Alternatively, edit `$HADOOP_HOME/conf/hadoop-env.sh` to include the HBase references in `HADOOP_CLASSPATH`.
- Note: This approach is not recommended, as it pollutes the Hadoop installation.

Restart Required:

- Restart the Hadoop cluster so that the HBase data can be used.
Example Command to Run HBase Row Counter

Only the classpath assignment survived in the original snippet; the full command follows the bundled-job pattern shown later in this chapter:

```bash
$ HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` \
  ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-server-VERSION.jar rowcounter myTable
```

- Ensure you use the correct version of the HBase JAR for your system.
Common Errors and Fixes

Class Not Found Error

If you encounter:

```
java.lang.ClassNotFoundException: org.apache.hadoop.hbase.mapreduce.RowCounter$RowCounterMapper
```

modify the command to use the JARs from the target/ directory within the build environment, for example:

```bash
$ HADOOP_CLASSPATH=${HBASE_HOME}/hbase-server/target/hbase-server-VERSION-SNAPSHOT.jar:`${HBASE_HOME}/bin/hbase classpath` \
  ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-server/target/hbase-server-VERSION-SNAPSHOT.jar rowcounter myTable
```

IllegalAccessError

For errors like:

```
java.lang.IllegalAccessError: class com.google.protobuf.ZeroCopyLiteralByteString cannot access its superclass com.google.protobuf.LiteralByteString
```

ensure that `hbase-protocol.jar` is included in Hadoop's classpath.
Job Launching Commands

Here is a command that satisfies the new classloader requirements:

```bash
$ HADOOP_CLASSPATH=/path/to/hbase-protocol.jar:/path/to/hbase/conf hadoop jar MyJob.jar MyJobMainClass
```
MapReduce APIs Overview

Mapper Class

- Purpose: Maps input key-value pairs to a set of intermediate key-value pairs.
- Methods (a usage sketch follows the table):

| Method Signature | Description |
|---|---|
| `void cleanup(Context context)` | Called once at the end of the task. |
| `void map(KEYIN key, VALUEIN value, Context context)` | Called once for each key-value pair in the input split. |
| `void run(Context context)` | Override to control the execution of the Mapper. |
| `void setup(Context context)` | Called once at the beginning of the task. |
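To make these hooks concrete, here is a minimal, hypothetical Mapper that overrides `setup`, `map`, and `cleanup` (the class and key names are illustrative, not from the original text):

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LineLengthMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final Text word = new Text("line.length");  // reused output key

    @Override
    protected void setup(Context context) {
        // Called once before any map() call; e.g., read job configuration here.
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Called once per input key-value pair: emit the length of each line.
        context.write(word, new IntWritable(value.getLength()));
    }

    @Override
    protected void cleanup(Context context) {
        // Called once after the last map() call; release resources here.
    }
}
```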
Reducer Class

- Purpose: Reduces the set of intermediate values.
- Methods (a usage sketch follows the table):

| Method Signature | Description |
|---|---|
| `void cleanup(Context context)` | Called once at the end of the task. |
| `void reduce(KEYIN key, Iterable<VALUEIN> values, Context context)` | Called once for each key. |
| `void run(Context context)` | Override to control the execution of the Reducer. |
| `void setup(Context context)` | Called once at the beginning of the task. |
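A matching, minimal Reducer sketch (again with illustrative names) that sums the integer values seen for each key:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Called once per key, with all of its shuffled values.
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
```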
Job Class

- Purpose: Configures and submits the job.
- Methods (a configuration sketch follows the table):

| Method Signature | Description |
|---|---|
| `Counters getCounters()` | Gets the counters for the job. |
| `long getFinishTime()` | Gets the finish time of the job. |
| `Job getInstance()` | Creates a new Job with no particular cluster. |
| `Job getInstance(Configuration conf)` | Creates a new Job with the provided configuration. |
| `String getJobFile()` | Gets the path of the submitted job configuration. |
| `String getJobName()` | Gets the user-specified job name. |
| `JobPriority getPriority()` | Gets the scheduling priority of the job. |
| `void setJarByClass(Class<?> c)` | Sets the job's JAR by finding the JAR that contains the given class. |
| `void setJobName(String name)` | Sets the user-specified job name. |
| `void setMapOutputKeyClass(Class<?> class)` | Sets the key class for the map output data. |
| `void setMapOutputValueClass(Class<?> class)` | Sets the value class for the map output data. |
| `void setMapperClass(Class<? extends Mapper> class)` | Sets the Mapper for the job. |
| `void setNumReduceTasks(int tasks)` | Sets the number of reduce tasks for the job. |
| `void setReducerClass(Class<? extends Reducer> class)` | Sets the Reducer for the job. |
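As a brief sketch of how these methods fit together (the class names reuse the illustrative Mapper and Reducer above; this is not the chapter's word-count driver, which appears later):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class JobSetupSketch {
    public static Job configure() throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);            // new Job with the provided configuration
        job.setJobName("line lengths");             // user-specified job name
        job.setJarByClass(JobSetupSketch.class);    // locate the JAR containing this class
        job.setMapperClass(LineLengthMapper.class); // Mapper sketched above
        job.setReducerClass(SumReducer.class);      // Reducer sketched above
        job.setMapOutputKeyClass(Text.class);       // key class of the map output
        job.setMapOutputValueClass(IntWritable.class);
        job.setNumReduceTasks(1);
        return job;
    }
}
```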
🗄️ HBase and MapReduce

Getting Counters for the Job

Method: `getCounters()`

- Purpose: Used to retrieve the counters for the job.
- Exception: Throws `IllegalStateException`.
Scan Caching in HBase

Overview

- `TableMapReduceUtil` restores the option to set scanner caching on the `Scan` object.
- This functionality was broken by a bug in HBase 0.95 and has been fixed in 0.98.5 and 0.96.3.

Caching Priority Order

1. Caching settings defined on the `Scan` object.
2. Settings specified via the configuration option `hbase.client.scanner.caching` (which can be set in `hbase-site.xml` or via `TableMapReduceUtil.setScannerCaching()`; see the sketch after this list).
3. The default value, `HConstants.DEFAULT_HBASE_CLIENT_SCANNER_CACHING`, which is set to 100.
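The first two priority levels can be set explicitly, as sketched below; `setCaching` and `setScannerCaching` are standard HBase client and `TableMapReduceUtil` methods, though exact signatures vary somewhat across versions:

```java
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.mapreduce.Job;

public class ScanCachingSketch {
    static Scan scanWithCaching() {
        Scan scan = new Scan();
        scan.setCaching(500);  // highest priority: set directly on the Scan object
        return scan;
    }

    static void jobLevelCaching(Job job) {
        // Lower priority: sets hbase.client.scanner.caching in the job configuration.
        TableMapReduceUtil.setScannerCaching(job, 500);
    }
}
```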
Optimization Considerations

- Balancing Act: Optimizing the caching setting is a balance between:
  - the time the client waits for results, and
  - the number of result sets that must be fetched.
- Metaphor: Think of the scan as a shovel:
  - A bigger cache is like a bigger shovel, allowing for more efficient digging.
  - A smaller cache requires more "shoveling" to fill the bucket, resulting in more requests.
HBase Scan Class

Class Hierarchy

```
java.lang.Object
  org.apache.hadoop.hbase.client.Operation
    org.apache.hadoop.hbase.client.OperationWithAttributes
      org.apache.hadoop.hbase.client.Query
        org.apache.hadoop.hbase.client.Scan
```

- The `Scan` class is used to perform scan operations.
- All operations are similar to `Get`, with the distinction that a scan can define an optional `startRow` and `stopRow`.

Creating a Scan Instance

- To retrieve all columns from all rows, instantiate `Scan` with no constraints.
- To constrain the scan (a code sketch follows this list):
  - Use `addFamily` for specific column families.
  - Use `addColumn` for specific columns.
  - Use `setTimeRange` for a specific range of timestamps.
  - Use `setTimestamp` for a specific timestamp.
  - Use `readVersions(int)` to limit the number of versions of each column.
  - Use `setBatch` to control the maximum number of values returned by each call to `next()`.
  - Use `setLimit(int)` to limit the number of rows returned.
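A short sketch combining several of these constraints; the table, family, and column names are illustrative, and `setTimeRange` declares `IOException`, hence the throws clause:

```java
import java.io.IOException;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanSketch {
    static Scan constrainedScan() throws IOException {
        Scan scan = new Scan();
        scan.addFamily(Bytes.toBytes("info"));                        // one column family
        scan.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name")); // one specific column
        scan.setTimeRange(0L, System.currentTimeMillis());            // timestamp range
        scan.readVersions(1);                                         // only the latest version
        scan.setLimit(100);                                           // at most 100 rows
        return scan;
    }
}
```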
Important Notes

- Attributes of a `Scan` instance are updated as the scan runs, which can affect cloning or reusing instances.
Bundled HBase MapReduce Jobs

Running Jobs

To run the bundled MapReduce jobs, execute:

```bash
$ ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-server-VERSION.jar <program_name>
```

Valid Program Names:

- `copytable`: Export a table from the local cluster to a peer cluster.
- `completebulkload`: Complete a bulk data load.
- `export`: Write table data to HDFS.
- `import`: Import data written by a previous export.
- `rowcounter`: Count rows in an HBase table.

Example Command

```bash
$ ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-server-VERSION.jar rowcounter myTable
```
HBase as a Data Source in MapReduce

Input Format

- The MapReduce application reads from log files in HDFS.
- The schema describes the data as [k1,v1] tuples, where:
  - `k1` is the line number.
  - `v1` is the line content.
- Code Example:

```java
Configuration conf = new Configuration();
```
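The captured snippet stops after the first line. A plausible continuation, assuming a plain-text input configured with `TextInputFormat` (the job name and input path are illustrative), might look like this:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class LogInputSketch {
    static Job configure() throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "log-reader");
        // TextInputFormat feeds the mapper (line offset, line content) pairs,
        // i.e. the [k1,v1] = (LongWritable, Text) tuples described above.
        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path("/logs")); // hypothetical input dir
        return job;
    }
}
```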
- Data Types:
  - `LongWritable` for the line number.
  - `Text` for the line content.
Mapping Over Data in HBase

- Use the `Scan` class to define the row range for the MapReduce job.
- Example Scan Instance:

```java
Scan scan = new Scan();
```
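Only the first line of this example remains. A plausible continuation, narrowing the scan to a single column of the "twits" table referenced below (family and qualifier names are illustrative), might be:

```java
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class TwitsScanSketch {
    static Scan twitsScan() {
        Scan scan = new Scan();
        // Restrict the job to the one column it actually reads (illustrative names).
        scan.addColumn(Bytes.toBytes("twits"), Bytes.toBytes("twit"));
        scan.setCaching(500);        // a larger "shovel", per the scan-caching section
        scan.setCacheBlocks(false);  // block caching is usually disabled for MR scans
        return scan;
    }
}
```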
Mapper Implementation

- The mapper consumes data as [rowkey : scan result] pairs.
- Use `ImmutableBytesWritable` and `Result` as the input types.
- Map Method Example:

```java
protected void map(ImmutableBytesWritable rowkey, Result result, Context context) {
```
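The method body is missing from the captured text. A hedged completion, written as a `TableMapper` that simply forwards each row so that its output types match the `initTableMapperJob(...)` call shown below, might read:

```java
import java.io.IOException;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;

public class TwitsMapper extends TableMapper<ImmutableBytesWritable, Result> {
    @Override
    protected void map(ImmutableBytesWritable rowkey, Result result, Context context)
            throws IOException, InterruptedException {
        // Forward each (rowkey, scan result) pair unchanged; the output key and
        // value classes match those passed to initTableMapperJob(...).
        context.write(rowkey, result);
    }
}
```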
Initializing the Job

- Use `TableMapReduceUtil` to configure the job:

```java
TableMapReduceUtil.initTableMapperJob("twits", scan, Map.class, ImmutableBytesWritable.class, Result.class, job);
```

- This ties the job configuration to the HBase-specific input formats.
HBase as a Data Sink

HBase Sink Class

- Class Definition:

```java
public class HBaseSink extends AbstractSink implements Configurable {
```

- Reads events from a channel and writes them to HBase.
- The configuration is pulled from the first `hbase-site.xml` found in the classpath.

Mandatory Parameters

- Table: Name of the target table in HBase.
- Column Family: The column family to write to.

Transaction Commit Conditions

A transaction is committed when:

- the write buffer size is reached, or
- the number of events in the current transaction reaches the batch size.
📊 MapReduce Word Count Program

Overview of the MapReduce Program

A MapReduce program can be fundamentally divided into three parts:

- Mapper Phase Code
- Reducer Phase Code
- Driver Code

Mapper Phase Code

Mapper Class

```java
public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
```
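The class body was truncated in the captured text. The classic word-count mapper, as in the standard Hadoop tutorial, fills it in roughly like this (assuming the usual imports: `java.util.StringTokenizer`, `org.apache.hadoop.io.*`, and `org.apache.hadoop.mapreduce.Mapper`):

```java
public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Tokenize the line and emit (word, 1) for every token.
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, ONE);
        }
    }
}
```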
Input and Output of the Mapper

Input:

- Key: Offset of each line in the text file (type: `LongWritable`)
- Value: Each individual line (type: `Text`)

Output:

- Key: Tokenized words (type: `Text`)
- Value: Hardcoded value of 1 (type: `IntWritable`)

| Mapper Input | Mapper Output |
|---|---|
| Offset: 0 | "DOG" → 1 |
| Line: "DOG CAT RAT" | "CAT" → 1 |
| | "RAT" → 1 |
Reducer Phase Code

Reducer Class

```java
public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
```
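Again the body is missing; the canonical completion sums the 1s emitted for each word (assuming the usual `org.apache.hadoop.io` and `org.apache.hadoop.mapreduce` imports):

```java
public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum all the 1s collected for this word during shuffling.
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
```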
Input and Output of the Reducer

Input:

- Key: Unique words generated after the sorting and shuffling phase (type: `Text`)
- Value: List of integers corresponding to each key (type: `IntWritable`)

Output:

- Key: Each unique word present in the input text file (type: `Text`)
- Value: Number of occurrences of each unique word (type: `IntWritable`)

| Reducer Input | Reducer Output |
|---|---|
| Key: "CAR", Values: [1, 1, 1] | "CAR" → 3 |
| Key: "CAT", Values: [1, 1] | "CAT" → 2 |
Driver Code

```java
public static void main(String[] args) throws Exception {
```
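The captured driver is likewise cut off after its first line. A conventional word-count driver consistent with the Map and Reduce classes above is sketched below; the enclosing class name `WordCount` is assumed from the run command, and the usual `Configuration`, `Job`, `Path`, `FileInputFormat`, and `FileOutputFormat` imports are taken as given:

```java
public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "WordCount");
    job.setJarByClass(WordCount.class);          // assumed enclosing class

    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. /sample/input
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // e.g. /sample/output

    // Exit according to the job's success or failure, as described below.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}
```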
Key Components of the Driver Code

- Configuration: Sets the configuration for the MapReduce job.
- Job Setup: Defines the job name, mapper, reducer, input/output classes, and paths.
- Execution: The job is executed, and the application exits based on the job's success or failure.

Command to Run MapReduce Code

To execute the MapReduce code, use the following command:

```bash
hadoop jar hadoop-mapreduce-example.jar WordCount /sample/input /sample/output
```
Important Concepts in MapReduce

MapReduce Phases

- Input Splits
- Input
- Mapping
- Shuffling
- Reducing
- Final Output

Common Questions

- Which class maps the input key-value pairs to a set of intermediate key-value pairs?
  - Answer: The Mapper class
- Which class reduces the set of intermediate values?
  - Answer: The Reducer class
- Which class is used to configure the job and submit it?
  - Answer: The Job class
- What is the correct sequence of phases in MapReduce?
  - Answer: Input Splits, Input, Mapping, Shuffling, Reducing, Final Output

Note on HBase and MapReduce Integration

- HBase does not guarantee atomic commits across multiple rows, which can lead to duplicates if a failure occurs during batch writes.
- The serializer in the pipeline is responsible for handling such duplicates.
🗂️ MapReduce Overview

🛠️ MapReduce API

The MapReduce API defines the classes and methods used for programming in the MapReduce framework. Key components include:

Mapper Class:

- The role of the Mapper class is to map input key-value pairs to a set of intermediate key-value pairs.
- It transforms input records into intermediate records.

Reducer Class:

- The Reducer class is responsible for reducing the set of intermediate values into a final output.

Job Class:

- The Job class configures the job, submits it, controls execution, and queries the state.
- Once the job has been submitted, invoking a set method throws an `IllegalStateException`.

📊 MapReduce Phases

MapReduce is a software framework and programming model used for processing vast amounts of data. It operates in two main phases:

Map Phase:

- Processes the input data and produces intermediate key-value pairs.

Reduce Phase:

- Takes the intermediate values and produces the final output.
Working Process of MapReduce

The MapReduce process consists of four execution phases:

- Input Splits: The input data is divided into manageable pieces.
- Mapping: Each input split is processed by the Mapper to generate intermediate key-value pairs.
- Shuffling: The intermediate data is shuffled and grouped by key, preparing it for the Reducer.
- Reducing: The Reducer processes the grouped data to produce the final output.

🗂️ MapReduce Architecture

| Phase | Description |
|---|---|
| Input Splits | Divides the job into manageable tasks. |
| Mapping | Transforms input data into intermediate key-value pairs. |
| Shuffling | Groups intermediate data by key for the Reducer. |
| Reducing | Final processing to produce the output key-value pairs. |
🔍 Organizing Work in MapReduce

Hadoop divides jobs into tasks, categorized as:

- Map Tasks: Cover Input Splits and Mapping.
- Reduce Tasks: Cover Shuffling and Reducing.

🗄️ MapReduce Data Handling

- Scan Caching: Refers to the number of rows cached before results are returned to clients.

Bundled HBase MapReduce Jobs

The HBase JAR serves as a Driver for the various bundled MapReduce jobs.

🔗 HBase Integration with MapReduce

- HBase as a Data Source: For instance, a MapReduce application can read lines from log files stored in HDFS, which act as the data source.
- HBase as a Sink: HBase can also serve as a sink, with events read from a channel and written to HBase.