🗺️ MapReduce Design Patterns

What is a MapReduce Design Pattern?

“A MapReduce design pattern is a template for solving a common and general data manipulation problem with MapReduce.”

  • General Approach: It is not specific to a domain such as text processing or graph analysis.
  • Purpose: To utilize tried and true design principles to build better software.
  • Focus: Writing MapReduce programs in various situations.

Key Topics Covered

  • Overview of MapReduce Design Patterns
  • Maximum and Minimum Pattern
  • Filter Pattern
  • Top N Pattern
  • Distinct Pattern
  • Data Organization Pattern
  • Join Pattern

Importance of MapReduce Design Patterns

  • Templates for Solutions: Design patterns serve as reusable templates for solving specific problems, saving time across different domains.
  • Control and Management: They provide techniques for controlling execution and managing data flow.

Constraints of the MapReduce Paradigm

The programmer has limited control over:

  • Where a mapper or reducer runs.
  • When a mapper or reducer begins or finishes.
  • Which input key-value pairs are processed by a specific mapper.
  • Which intermediate key-value pairs are processed by a specific reducer.

Why Use MapReduce Design Patterns?

  • Complex Data Structures: Ability to construct complex data structures as key/value pairs.
  • Partial Results Management: Store and communicate partial results using initialization (setup) and termination (cleanup) code in mappers and reducers.
  • State Preservation: Maintain state in mappers and reducers across multiple input or intermediate keys.
  • Control over Sorting and Partitioning: Manage the sort order of intermediate keys and control how the key space is partitioned across reducers.

Design Patterns Overview

Definition of a Design Pattern

  • Provide a high-level aggregate view of datasets when visual inspection is not feasible.
  • Group similar data together for operations such as calculating statistics, indexing, and counting.
  • Apply patterns to new datasets to quickly understand their important elements.

Examples of Design Patterns

  • Website Traffic Analysis: Number of hits per hour per location on a website.
  • Blog Comments: Average length of comments per user.
  • Salaries: Top ten salaries per profession, by region.

Summarization Patterns

Summarization patterns are widely used to group similar data and perform operations such as:

  • Calculating the minimum, maximum, count, average, median, or standard deviation.
  • Building indexes or simple counts keyed on specific fields.

Example Operations

  • Calculate total revenue by country.
  • Calculate the average login frequency of users.
  • Find minimum and maximum user counts by state.

WordCount Example

  • The WordCount program is often the first program written by beginners in MapReduce, serving as an introduction to the framework.
  • Patterns from WordCount can be used for various applications like counting populations or total crimes.
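The word-count idea can be sketched without a cluster: a map step emits (word, 1) pairs, the shuffle groups them by key, and a reduce step sums the counts per key. A minimal plain-Java sketch of that flow (class and method names are illustrative, not from any Hadoop program):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class WordCountSketch {
    // "Map" step: emit a (word, 1) pair for every word in every line
    static List<Map.Entry<String, Integer>> map(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    pairs.add(Map.entry(word, 1));
                }
            }
        }
        return pairs;
    }

    // "Reduce" step: the shuffle groups pairs by key; here we simply sum per key
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("to be or not to be", "be quick");
        System.out.println(reduce(map(lines))); // {be=3, not=1, or=1, quick=1, to=2}
    }
}
```

Counting populations or crimes follows the same shape: only the key emitted by the map step changes.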

Minimum and Maximum Calculation

  • After mapping, the reducer iterates through the grouped values for each key to find the minimum and maximum.

Writables

  • Custom writables help avoid the issues that come with delimiter-based splitting of data in reducers.
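The MaxAndMinValue type used in the code below is such a custom writable, but its definition is not shown in these notes. As a hedged sketch of the contract it would follow: Hadoop's Writable interface requires a write(DataOutput) and a readFields(DataInput) that mirror each other field by field, which is exactly why no delimiter parsing is ever needed. The field names here are assumptions inferred from how the later code uses the class, and plain Java streams stand in for Hadoop's DataOutput/DataInput:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Sketch of the serialization contract a custom writable like MaxAndMinValue follows.
// Field names (min, max, total) are assumptions based on the surrounding code.
public class MaxAndMinSketch {
    long min;
    long max;
    int total;

    MaxAndMinSketch(long min, long max, int total) {
        this.min = min;
        this.max = max;
        this.total = total;
    }

    // Corresponds to Writable.write(DataOutput): write fields in a fixed order
    void write(DataOutputStream out) throws IOException {
        out.writeLong(min);
        out.writeLong(max);
        out.writeInt(total);
    }

    // Corresponds to Writable.readFields(DataInput): read fields back in the same order
    static MaxAndMinSketch readFields(DataInputStream in) throws IOException {
        return new MaxAndMinSketch(in.readLong(), in.readLong(), in.readInt());
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        new MaxAndMinSketch(-3, 99, 7).write(new DataOutputStream(bytes));
        MaxAndMinSketch copy = readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
        System.out.println(copy.min + " " + copy.max + " " + copy.total); // -3 99 7
    }
}
```

Because the byte layout is fixed, a value such as "-3" can never be confused with a field separator, which is the failure mode delimiter-based encodings suffer from.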

SQL Example

To find maximum and minimum values:

SELECT max(num), min(num), count(*) FROM table GROUP BY condition;

Max, Min, and Count - Data Flow

Map Stage

  • Input Key: LongWritable (byte offset of the line within the file).
  • Input Value: Text (the line's value).
  • Output Key: MaxAndMinValue (serialized object).
  • Output Value: NullWritable (acts as a placeholder).

Map Task Implementation

static class MapTask extends Mapper<LongWritable, Text, MaxAndMinValue, NullWritable> {
    // The running minimum starts at the largest possible value, and vice versa,
    // so the first record always replaces both
    long min = Long.MAX_VALUE;
    long max = Long.MIN_VALUE;
    int total = 0;

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Emit a single record holding this mapper's local min, max, and count
        MaxAndMinValue value = new MaxAndMinValue(min, max, total);
        context.write(value, NullWritable.get());
    }

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        long currentValue;
        try {
            currentValue = Long.parseLong(value.toString());
        } catch (NumberFormatException e) {
            return; // Skip malformed records
        }
        if (currentValue < min) {
            min = currentValue;
        }
        if (currentValue > max) {
            max = currentValue;
        }
        total++;
    }
}

Reduce Stage

  • The reducer merges the partial results output by the map phase into a single global result.

static class ReduceTask extends Reducer<MaxAndMinValue, NullWritable, NullWritable, MaxAndMinValue> {
    long min = Long.MAX_VALUE;
    long max = Long.MIN_VALUE;
    int total = 0;

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        MaxAndMinValue value = new MaxAndMinValue(min, max, total);
        context.write(NullWritable.get(), value);
    }

    @Override
    protected void reduce(MaxAndMinValue key, Iterable<NullWritable> values, Context context) throws IOException, InterruptedException {
        // Fold each mapper's local min/max/count into the global result
        if (key.getMin().get() < min) {
            min = key.getMin().get();
        }
        if (key.getMax().get() > max) {
            max = key.getMax().get();
        }
        total += key.getTotal().get();
    }
}

Main Function Implementation

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
        System.err.println("Usage: MaxAndMinValue <in> <out>");
        System.exit(2);
    }
    Job job = Job.getInstance(conf);
    job.setJobName("MaxAndMinValue");
    job.setJarByClass(MaxAndMinJob.class);
    job.setMapperClass(MapTask.class);
    job.setReducerClass(ReduceTask.class);
    job.setMapOutputKeyClass(MaxAndMinValue.class);
    job.setMapOutputValueClass(NullWritable.class);
    job.setOutputKeyClass(NullWritable.class);
    job.setOutputValueClass(MaxAndMinValue.class);
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}

📊 MapReduce Design Patterns

🌐 Overview of MapReduce

MapReduce is a programming model for processing large data sets with a distributed algorithm on a cluster.

It consists of two primary tasks: the Map task, which processes input data and generates intermediate key-value pairs, and the Reduce task, which merges the intermediate values based on their keys.

🗂️ Filtering Pattern

Definition

The filtering pattern is used to retain data that meets specific conditions and discard data that does not. It is particularly useful in scenarios involving data cleansing.

Characteristics

  • No Reduction Needed: The filtering pattern requires no aggregation, since it solely filters records.
  • Mapper Functionality: Each input record is processed by the mapper, which keeps or drops it based on a specified condition.
  • Output: The output key-value pair is the same as the input, so less data flows between the map and reduce stages.

SQL Equivalent

SELECT * FROM table WHERE digits <= 6;

Example Implementation

protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
    // Keep only values made up of the digits 0-6, with an optional leading minus sign
    boolean matches = value.toString().matches("-?[0-6]+");
    if (matches) {
        context.write(NullWritable.get(), value);
    }
}
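The matching condition can be tested in isolation, without Hadoop. This sketch assumes the intent is to keep optionally-negative numbers whose digits are all between 0 and 6; note that a pattern containing `||` has an empty alternative and would match every string, which is why a single `-?[0-6]+` alternative is used here:

```java
public class FilterRegexDemo {
    // True when the record consists only of the digits 0-6, with an optional leading minus sign
    static boolean keep(String record) {
        return record.matches("-?[0-6]+");
    }

    public static void main(String[] args) {
        System.out.println(keep("123456")); // true
        System.out.println(keep("-406"));   // true
        System.out.println(keep("127"));    // false: contains the digit 7
        System.out.println(keep(""));       // false: at least one digit is required
    }
}
```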

🌟 Bloom Filtering

Definition

Bloom filtering is an advanced filtering technique that applies a probabilistic membership test to each record.

Conditions of Use

  • The data can be divided into records.
  • Features extracted from the records serve as hot spot values.
  • A pre-defined set of hot spots exists.
  • The application can tolerate some false positives.

Implementation Steps

  1. Training Hotspot Dataset: Place the training dataset in HDFS and build the Bloom filter from it.
  2. Loading Bloom Filter: Load the Bloom filter from the distributed cache during map task execution.
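To make the mechanism concrete, here is a minimal standalone sketch of a Bloom filter: k hash functions set bits when training on the hot spot values, and the same hashes test bits when filtering, with false positives possible but no false negatives. In a real job one would typically use Hadoop's ready-made org.apache.hadoop.util.bloom.BloomFilter rather than hand-rolling one; the hashing below is deliberately simplistic:

```java
import java.util.BitSet;

// Minimal Bloom filter sketch: k hash functions set/check bits in a shared bit vector.
public class BloomFilterSketch {
    private final BitSet bits;
    private final int size;
    private final int hashes;

    BloomFilterSketch(int size, int hashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashes = hashes;
    }

    // Derive several bit positions from hashCode(); good enough for a sketch
    private int bitFor(String item, int seed) {
        int h = item.hashCode() * (seed * 2 + 1);
        return Math.floorMod(h, size);
    }

    // "Training": record a hot spot value by setting its k bits
    void add(String item) {
        for (int i = 0; i < hashes; i++) {
            bits.set(bitFor(item, i));
        }
    }

    // May return true for items never added (false positive),
    // but never returns false for an item that was added
    boolean mightContain(String item) {
        for (int i = 0; i < hashes; i++) {
            if (!bits.get(bitFor(item, i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        BloomFilterSketch filter = new BloomFilterSketch(1024, 3);
        filter.add("hotspot-user-42");
        System.out.println(filter.mightContain("hotspot-user-42")); // true
    }
}
```

In the MapReduce version, add() runs once against the training dataset, the resulting bit vector is shipped through the distributed cache, and mightContain() is the test each mapper applies to its records.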

📈 Top N Design Pattern

Purpose

The Top N design pattern is used to extract a limited number of top records based on some criteria from a larger dataset.

Processing Flow

Mapper

  • Uses a TreeMap to maintain the top K records.
  • Removes the lowest record when the size exceeds K.
public static class TopNMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
    // Sorted by the parsed value; note that duplicate values overwrite each other
    TreeMap<Integer, String> treeMap = new TreeMap<Integer, String>();

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        try {
            int k = Integer.parseInt(value.toString());
            treeMap.put(k, value.toString());
            if (treeMap.size() > 10) {
                treeMap.remove(treeMap.firstKey()); // Evict the current smallest entry
            }
        } catch (NumberFormatException e) {
            return; // Skip malformed records
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Emit this mapper's local top 10 so the single reducer can merge them
        for (String record : treeMap.values()) {
            context.write(NullWritable.get(), new Text(record));
        }
    }
}

Reducer

  • Collects values and maintains top records similarly to the mapper.
public static class TopNReducer extends Reducer<NullWritable, Text, NullWritable, Text> {
    TreeMap<Integer, String> treeMap = new TreeMap<Integer, String>();

    @Override
    protected void reduce(NullWritable key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        for (Text value : values) {
            int k = Integer.parseInt(value.toString());
            treeMap.put(k, value.toString());
            if (treeMap.size() > 10) {
                treeMap.remove(treeMap.firstKey()); // Keep only the 10 largest
            }
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Write the global top 10, largest first
        for (String record : treeMap.descendingMap().values()) {
            context.write(NullWritable.get(), new Text(record));
        }
    }
}
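The trimming logic shared by the mapper and reducer can be exercised on its own in plain Java. One caveat worth noting: because the TreeMap is keyed on the parsed integer, duplicate values overwrite each other, so this keeps the top N distinct values:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

public class TopNDemo {
    // Same trimming logic as the TopN mapper/reducer: keep only the n largest keys
    static List<Integer> topN(List<Integer> numbers, int n) {
        TreeMap<Integer, String> treeMap = new TreeMap<>();
        for (int k : numbers) {
            treeMap.put(k, Integer.toString(k));
            if (treeMap.size() > n) {
                treeMap.remove(treeMap.firstKey()); // Evict the current smallest
            }
        }
        return new ArrayList<>(treeMap.descendingKeySet()); // Largest first
    }

    public static void main(String[] args) {
        System.out.println(topN(List.of(5, 1, 9, 3, 7, 2), 3)); // [9, 7, 5]
    }
}
```

Because every mapper only forwards its local top N, the single reducer merges at most N records per mapper instead of the whole dataset.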

📋 Distinct Pattern

Purpose

The distinct pattern is applied to remove duplicates from the dataset.

SQL Equivalent

SELECT DISTINCT * FROM table;

Implementation

  • Mapper:

    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        context.write(value, NullWritable.get()); // The record itself becomes the key, so duplicates collapse into one group
    }

  • Reducer:

    protected void reduce(Text key, Iterable<NullWritable> values, Context context) throws IOException, InterruptedException {
        context.write(NullWritable.get(), key); // Each distinct record arrives exactly once as a key
    }
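Why this works can be shown without a cluster: the shuffle phase groups identical map-output keys together, so writing each reduce-side key once yields the distinct set. A plain-Java simulation, where a sorted set stands in for the shuffle's grouped key space:

```java
import java.util.List;
import java.util.TreeSet;

public class DistinctDemo {
    // Emitting each record as a key and writing each grouped key once removes duplicates;
    // the TreeSet plays the role of the shuffle's sorted, grouped key space.
    static List<String> distinct(List<String> records) {
        return List.copyOf(new TreeSet<>(records));
    }

    public static void main(String[] args) {
        System.out.println(distinct(List.of("b", "a", "b", "c", "a"))); // [a, b, c]
    }
}
```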

🔗 Join Pattern

Purpose

The join pattern combines records from two or more datasets based on a common key.

Implementation Steps

  1. Define Input Paths:

     MultipleInputs.addInputPath(job, new Path("D:\\c7_input_files\\users.txt"), TextInputFormat.class, UserJoinCommentMapper.class);
     MultipleInputs.addInputPath(job, new Path("D:\\c7_input_files\\comment.txt"), TextInputFormat.class, UserJoinCommentMapper.class);

  2. Output Path:

     FileOutputFormat.setOutputPath(job, new Path("D:\\c7_output_files\\user-out\\"));

Usage Scenarios

  • Inner joins, where both sides have matching records for a key.
  • Outer joins, where one side may have no matching records; records from the non-empty side are still emitted.

📊 Data Organization Pattern

Definition

Data organization refers to reorganizing data based on specific criteria, such as classification by gender.

Implementation

  • Customize a partitioner class that routes data to reducers according to the chosen conditions.
  • Ensure the number of reducers matches the number of partitions defined in the partitioner.
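As a sketch of the routing decision such a partitioner makes, the gender example can be reduced to a pure function. In a real job this logic would live in a subclass of org.apache.hadoop.mapreduce.Partitioner overriding getPartition(key, value, numPartitions); the class name, field, and partition numbers below are illustrative:

```java
public class GenderPartitionerSketch {
    // Mirrors the shape of Partitioner.getPartition: route "male" records
    // to reducer 0 and everything else to reducer 1
    static int getPartition(String gender, int numPartitions) {
        if (numPartitions < 2) {
            return 0; // Only one reducer configured: everything goes to it
        }
        return "male".equalsIgnoreCase(gender) ? 0 : 1;
    }

    public static void main(String[] args) {
        System.out.println(getPartition("male", 2));   // 0
        System.out.println(getPartition("female", 2)); // 1
    }
}
```

This is why the reducer count must match the partition count: if the job ran with one reducer, a partition number of 1 would have no reducer to receive it.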

Example Use Case

Join two datasets based on a foreign key, combining user information with behavior records.

📂 MapReduce Design Patterns

🗂️ Introduction to MapReduce

Definition of MapReduce Design Patterns

“MapReduce design pattern is a template that uses MapReduce to solve conventional data processing problems. The pattern is not limited to specific areas.”

Components of MapReduce

  • Mapper: Processes input data and produces intermediate key-value pairs.
  • Reducer: Aggregates the intermediate key-value pairs to produce the final output.

🗺️ Mapper Function

The Mapper function processes data by reading input data and producing key-value pairs based on certain conditions.

Mapper Code Example

protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, Text>.Context context) throws IOException, InterruptedException {
    // Get the file path where each row of data is located
    String path = context.getInputSplit().toString();

    // Records from the users file have the form: userId,username
    if (path.contains("users")) {
        String[] split = value.toString().split(",");
        String userId = split[0];
        String username = split[1];
        context.write(new Text(userId), new Text("u#" + username)); // Tag user records with "u#"
    }

    // Records from the comment file have the form: commentId,commentInfo,userId
    if (path.contains("comment")) {
        String[] split = value.toString().split(",");
        String commentId = split[0];
        String commentInfo = split[1];
        String userId = split[2];
        context.write(new Text(userId), new Text("c#" + commentId + "," + commentInfo)); // Tag comment records with "c#"
    }
}

🔄 Reducer Function

The Reducer function processes the intermediate key-value pairs produced by the Mapper.

Reducer Code Example

protected void reduce(Text key, Iterable<Text> values, Reducer<Text, Text, NullWritable, Text>.Context context) throws IOException, InterruptedException {
    String username = null;
    List<String> comments = new ArrayList<String>(); // Comments belonging to the same user

    for (Text value : values) {
        if (value.toString().startsWith("u#")) {
            username = value.toString().substring(2); // Extract the username
        }

        if (value.toString().startsWith("c#")) {
            String comment = value.toString().substring(2);
            comments.add(comment); // Collect this user's comments
        }
    }

    for (String comment : comments) {
        context.write(NullWritable.get(), new Text(comment + "," + username)); // Emit the joined record
    }
}
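The tagged reduce-side join above can be simulated end-to-end in plain Java: the map step tags records by source, a sorted map stands in for the shuffle, and the reduce step pairs each user's comments with the username. All data and names below are made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

public class ReduceSideJoinDemo {
    // users: "userId,username"; comments: "commentId,commentInfo,userId"
    public static List<String> join(List<String> users, List<String> comments) {
        // "Map" + shuffle: group tagged values by userId, as the framework would
        TreeMap<String, List<String>> shuffled = new TreeMap<>();
        for (String user : users) {
            String[] split = user.split(",");
            shuffled.computeIfAbsent(split[0], k -> new ArrayList<>()).add("u#" + split[1]);
        }
        for (String comment : comments) {
            String[] split = comment.split(",");
            shuffled.computeIfAbsent(split[2], k -> new ArrayList<>()).add("c#" + split[0] + "," + split[1]);
        }

        // "Reduce": for each userId, attach the username to every comment
        List<String> output = new ArrayList<>();
        for (List<String> values : shuffled.values()) {
            String username = null;
            List<String> userComments = new ArrayList<>();
            for (String value : values) {
                if (value.startsWith("u#")) {
                    username = value.substring(2);
                }
                if (value.startsWith("c#")) {
                    userComments.add(value.substring(2));
                }
            }
            for (String c : userComments) {
                output.add(c + "," + username);
            }
        }
        return output;
    }

    public static void main(String[] args) {
        List<String> users = List.of("1,alice", "2,bob");
        List<String> comments = List.of("10,nice post,1", "11,thanks,2");
        System.out.println(join(users, comments)); // [10,nice post,alice, 11,thanks,bob]
    }
}
```

The "u#"/"c#" tags exist because a reducer receives all values for a key in one undifferentiated iterable; the tag is the only way to tell which source each value came from.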

🛠️ Output Data

The output of the MapReduce process consists of the final key-value pairs generated by the Reducer.

📊 MapReduce Design Patterns Overview

Design Pattern Type          Description
Maximum and Minimum          Identify max/min values.
Filter Pattern               Filter data based on conditions.
Top N Pattern                Retrieve top N results from data.
Distinct Pattern             Identify unique entries in the dataset.
Data Organization Pattern    Organize data for efficient processing.
Join Pattern                 Combine data from different sources.

“A design pattern is a common template for solving problems. Understanding design patterns can help solve problems quickly. However, it cannot solve 100% of the problems in practice.”