🖥️ RPC Protocol & Performance Optimization

📜 RPC Definition

“Remote Procedure Calls allow one computer program to call subroutines on another computer remotely, without worrying about the underlying network communication details.”

  • RPC is critical in distributed network communications.
  • It is used for inter-process interactions in Hadoop, such as between the NameNode and DataNodes and between the JobTracker and TaskTrackers.
  • The RPC protocol relies on transport protocols, such as TCP or UDP, to facilitate communication.

🚀 RPC Features

| Feature | Description |
| --- | --- |
| Transparency | Remotely invoking a program on another machine is similar to calling a local method. |
| High Performance | The RPC server can concurrently process multiple requests from clients. |

🛠️ RPC Procedure

Client/Server Model

  • Client: The requestor that sends a call message with parameters to the server.
  • Server: The service provider that processes the request and sends back a reply.

Call Process Steps

  1. The client sends a call message to the server and waits for a reply.
  2. When the call arrives, the server processes it, computes the result, and sends back a reply.
  3. The client receives the reply message and continues execution based on the result.

RPC Layers

  • Serialization Layer: Uses serializable classes or custom Writable types for communication (a minimal Writable sketch follows this list).
  • Function Call Layer: Implements function calls through dynamic proxies and the Java reflection mechanism.
  • Network Transport Layer: Employs TCP/IP-based socket mechanisms.
  • Server-side Framework Layer: Uses Java NIO and an event-driven I/O model to enhance concurrency.
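
The serialization layer above can be made concrete with a minimal custom Writable. This is an illustrative sketch against the standard org.apache.hadoop.io.Writable interface; the class and field names (CallRecord, method, argCount) are hypothetical, not Hadoop's own protocol types.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

// A minimal custom Writable carrying two illustrative fields.
public class CallRecord implements Writable {
    private String method; // hypothetical method name
    private int argCount;  // hypothetical argument count

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(method);   // serialize fields in a fixed order
        out.writeInt(argCount);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        method = in.readUTF();  // deserialize in the same order
        argCount = in.readInt();
    }
}
```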

🔄 Dynamic Proxies

Dynamic proxies allow access to another object while hiding the concrete details of the actual object.

Java’s built-in dynamic proxies currently support only interfaces, not concrete classes.

🏗️ Hadoop RPC Design Techniques

| Technique | Description |
| --- | --- |
| Dynamic proxy | Processes client information sent to the server. |
| Reflection | Facilitates dynamic loading of classes. |
| Non-blocking asynchronous I/O (NIO) | Used for communication between RPC clients and servers. |
| Serialization | Handles data transmission and storage between nodes. |

Hadoop RPC Logic Diagram Overview

  • Client: Initiates requests.
  • Listener Thread: Listens for client connections and creates connection objects.
  • Reader Thread Pool: Reads requests from the connections and places the calls on the call queue.
  • Handler Thread Pool: Takes calls from the call queue, executes them, and returns the results (a reactor-style sketch of this threading model follows this list).
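
The listener/reader/handler split can be sketched with plain Java NIO. This is not Hadoop's actual server code, only a minimal reactor-style illustration; the port, buffer size, and pool size are arbitrary.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Illustrative reactor-style server: one listener/reader loop plus a handler pool.
public class MiniRpcServer {
    public static void main(String[] args) throws IOException {
        ExecutorService handlers = Executors.newFixedThreadPool(5); // "handler" thread pool
        Selector selector = Selector.open();
        ServerSocketChannel listener = ServerSocketChannel.open();
        listener.bind(new InetSocketAddress(9000)); // illustrative port
        listener.configureBlocking(false);
        listener.register(selector, SelectionKey.OP_ACCEPT); // "listener" role

        while (true) {
            selector.select();
            for (SelectionKey key : selector.selectedKeys()) {
                if (key.isAcceptable()) {
                    SocketChannel conn = listener.accept(); // new connection object
                    conn.configureBlocking(false);
                    conn.register(selector, SelectionKey.OP_READ);
                } else if (key.isReadable()) {              // "reader" role
                    SocketChannel conn = (SocketChannel) key.channel();
                    ByteBuffer buf = ByteBuffer.allocate(1024);
                    if (conn.read(buf) > 0) {
                        buf.flip();
                        handlers.submit(() -> process(buf)); // enqueue the call
                    }
                }
            }
            selector.selectedKeys().clear();
        }
    }

    static void process(ByteBuffer request) {
        // A real server would deserialize the call, invoke it, and write a response.
    }
}
```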

📝 Hadoop RPC Usage Method

Interfaces

Hadoop RPC provides two key interfaces:

  • public static <T> ProtocolProxy<T> getProxy(...): Constructs a client proxy object to send RPC requests.
  • public static Server RPC.Builder(Configuration).build(): Constructs a server object to handle requests.

Steps for Using Hadoop RPC

  1. RPC Protocol Definition: Define the communication interface and its methods.

```java
public interface ClientProtocol extends org.apache.hadoop.ipc.VersionedProtocol {
    public static final long versionID = 1L;

    String echo(String value) throws IOException;
    int add(int v1, int v2) throws IOException;
}
```

  2. Implementing the RPC Protocol: Implement the protocol interface in a Java class.

```java
public static class ClientProtocolImpl implements ClientProtocol {
    public long getProtocolVersion(String protocol, long clientVersion) {
        return ClientProtocol.versionID;
    }

    public ProtocolSignature getProtocolSignature(String protocol, long clientVersion, int hashcode) {
        return new ProtocolSignature(ClientProtocol.versionID, null);
    }

    public String echo(String value) throws IOException {
        return value;
    }

    public int add(int v1, int v2) throws IOException {
        return v1 + v2;
    }
}
```

  3. Constructing and Starting the RPC Server:

```java
Server server = new RPC.Builder(conf)
    .setProtocol(ClientProtocol.class)
    .setInstance(new ClientProtocolImpl())
    .setBindAddress(ADDRESS)
    .setPort(16020)    // a valid TCP port; the original 160201 exceeds the 0-65535 range
    .setNumHandlers(5)
    .build();
server.start();
```

  4. Constructing the RPC Client and Sending Requests:

```java
InetSocketAddress addr = new InetSocketAddress(ADDRESS, 16020);
ClientProtocol proxy = (ClientProtocol) RPC.getProxy(
        ClientProtocol.class, ClientProtocol.versionID, addr, conf);
int result = proxy.add(5, 6);              // returns 11
String echoResult = proxy.echo("result");  // returns "result"
```

Important Note

Ensure that the versionID field of the RPC client matches that of the server; otherwise, the server will not respond to the client’s requests.

🕸️ Dynamic Proxy in RPC Mechanism

RPC Communication

“Hadoop uses the RPC mechanism to communicate with nodes in the network, and a dynamic proxy mechanism is used to implement some special processing in RPC.”

What is RPC?

  • RPC (Remote Procedure Call) allows a program to cause a procedure to execute in another address space as if it were a local procedure call.
  • It simplifies communication between different nodes in a network.

Dynamic Proxy Mechanism

Definition

“Dynamic proxy is a mechanism for forwarding requests for special processing; here, ‘dynamic’ refers to ‘runtime’.”

Characteristics

  • It allows for special processing of requests, such as:
    • Restricting direct access to certain classes.
    • Implementing additional functionality during method calls.

Key Components

  • Proxy Class: Used to generate proxy instances.
  • InvocationHandler Interface: Contains the method that defines how method calls on proxy instances are handled.

Important Classes and Interfaces

```java
public class Proxy implements java.io.Serializable {
    public static Class<?> getProxyClass(ClassLoader loader, Class<?>... interfaces);
    public static Object newProxyInstance(ClassLoader loader, Class<?>[] interfaces, InvocationHandler h);
}

public interface InvocationHandler {
    public Object invoke(Object proxy, Method method, Object[] args) throws Throwable;
}
```

Methods Explained

| Method | Description |
| --- | --- |
| getProxyClass | Generates a proxy class from the provided interfaces to “impersonate” real objects. |
| newProxyInstance | Creates a proxy object with an associated InvocationHandler that processes requests. |
| invoke | Handles method calls on proxy instances; each call on a proxy is dispatched to the associated InvocationHandler. |

Example of Dynamic Proxy Creation

  1. Define an Interface:

```java
public interface Hello {
    void sayHello(String to);
    void print(String p);
}
```

  2. Implement the Interface:

```java
public class HelloImpl implements Hello {
    @Override
    public void sayHello(String to) {
        System.out.println("sayHello-to:" + to);
    }

    @Override
    public void print(String p) {
        System.out.println("print:" + p);
    }
}
```

  3. Create an InvocationHandler:

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Method;

public class HelloHandler implements InvocationHandler {
    private Object obj;

    public HelloHandler(Object obj) {
        this.obj = obj;
    }

    @Override
    public Object invoke(Object proxy, Method method, Object[] args) throws Throwable {
        // Forward the call to the wrapped object.
        return method.invoke(obj, args);
    }
}
```

  4. Use the Proxy:

```java
import java.lang.reflect.Proxy;

public class ProxyTest {
    public static void main(String[] args) {
        Hello impl = new HelloImpl();
        HelloHandler handler = new HelloHandler(impl);
        Hello hello = (Hello) Proxy.newProxyInstance(
                impl.getClass().getClassLoader(),
                impl.getClass().getInterfaces(),
                handler);
        hello.sayHello("hello proxy"); // prints "sayHello-to:hello proxy"
    }
}
```

Advantages of Dynamic Proxy

  • Access Control: Provides access to another object while hiding the details of the actual object.
  • Special Processing: Allows additional processing of requests, enabling flexibility in handling method calls (a logging sketch follows).
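
As an illustration of the “special processing” advantage, the handler from the earlier example can wrap every call with timing output. This LoggingHandler is a hypothetical variant, not part of the original example.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Method;

// Illustrative: adds timing output around every proxied call.
public class LoggingHandler implements InvocationHandler {
    private final Object target;

    public LoggingHandler(Object target) {
        this.target = target;
    }

    @Override
    public Object invoke(Object proxy, Method method, Object[] args) throws Throwable {
        long start = System.nanoTime();
        try {
            return method.invoke(target, args); // delegate to the real object
        } finally {
            System.out.println(method.getName() + " took "
                    + (System.nanoTime() - start) / 1_000 + " µs");
        }
    }
}
```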

📊 Hadoop Performance Optimization

Configuration for File Optimization

Key Properties in Configuration Files

| Configuration File | Property | Default Value | Description |
| --- | --- | --- | --- |
| hdfs-site.xml | dfs.block.size | 128 MB | Can be increased to 256 MB. |
| hdfs-site.xml | dfs.namenode.handler.count | 10 | Number of handler threads for the NameNode server. |
| hdfs-site.xml | dfs.datanode.handler.count | 3 | Number of handler threads on each DataNode. |
| hdfs-site.xml | dfs.datanode.max.xcievers | 256 | Maximum number of files a DataNode serves simultaneously. |
| hdfs-site.xml | dfs.safemode.threshold.pct | 0.999f | Percentage of blocks that must be available before leaving safe mode. |
| hdfs-site.xml | dfs.permissions.enabled | true | Whether permission checks are enabled. |
| core-site.xml | hadoop.tmp.dir | /tmp | Default temporary file directory. |
| core-site.xml | fs.defaultFS | NameNode URI | URI of the default filesystem. |
| mapred-site.xml | io.sort.factor | 10 | Maximum number of streams merged at once while sorting. |
| mapred-site.xml | mapreduce.jobhistory.address | 0.0.0.0:10020 | IP address and port of the job history server. |
| mapred-site.xml | mapreduce.framework.name | local | Framework used for MapReduce jobs (local, classic, or yarn). |
| yarn-site.xml | yarn.resourcemanager.recovery.enabled | true | Enables automatic recovery of the ResourceManager after failure. |
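
To make the table concrete, an hdfs-site.xml fragment might set a few of these properties as follows; the values shown are illustrative tuning choices, not recommendations from the text.

```xml
<!-- hdfs-site.xml: illustrative tuning values only -->
<configuration>
  <property>
    <name>dfs.block.size</name>
    <value>268435456</value> <!-- 256 MB, expressed in bytes -->
  </property>
  <property>
    <name>dfs.namenode.handler.count</name>
    <value>20</value> <!-- more NameNode RPC handler threads -->
  </property>
  <property>
    <name>dfs.datanode.max.xcievers</name>
    <value>4096</value> <!-- raise the concurrent-file ceiling -->
  </property>
</configuration>
```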

Hadoop Archive (HAR)

“Hadoop Archive is a facility that packs small files into one compact HDFS block to avoid memory wastage of NameNode.”

Purpose

  • Prevents excessive memory usage by consolidating multiple small files into a single archive.

HAR Syntax

```bash
hadoop archive -archiveName NAME -p <parent path> <src>* <dest>
```

Example

  • Archive all small files under the directory /niit/data into /niit/outputdir/data.har:

```bash
hadoop archive -archiveName data.har -p /niit/data/ /niit/outputdir/
```

🗂️ HAR File Archive

Accessing HAR Files

HAR (Hadoop Archive) files can be accessed through two formats:

  • Format 1: har://scheme-hostname:port/archivepath/fileinarchive
  • Format 2: har:///archivepath/fileinarchive
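
For example, assuming the archive created above, a NameNode reachable at namenode:9000, and a hypothetical archived file part-0.txt, the two formats address the same file as follows:

```bash
# Format 1: underlying filesystem named explicitly
hadoop fs -cat har://hdfs-namenode:9000/niit/outputdir/data.har/part-0.txt
# Format 2: resolved against the default filesystem
hadoop fs -cat har:///niit/outputdir/data.har/part-0.txt
```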

To view the files in the HAR file archive, use the following command:

```bash
hadoop fs -ls har:///niit/outputdir/data.har
```

Important Points About HAR Files

| Point | Description |
| --- | --- |
| File Deletion | After archiving small files, the original files are not deleted automatically; users must delete them manually. |
| Creating HAR Files | Creating a HAR file runs a MapReduce job, so a running Hadoop cluster is required to execute the command. |

Limitations of HAR Files

  • Creating a HAR file duplicates the original files, so it requires disk space equal to their total size; after creation, the originals can be deleted to free up space.
  • To add or remove files from an archive, the archive must be re-created.
  • Using HAR files as MapReduce input still requires numerous map tasks, which can be inefficient.

📂 Input Using Large Files

JVM Management in Hadoop

  • By default, Hadoop launches a new JVM for each map or reduce task, running tasks in parallel.
  • However, for lightweight tasks that run for only a few seconds, the JVM startup process presents significant overhead.
  • Hadoop therefore allows JVM reuse, which runs mappers/reducers serially in one JVM instead of in parallel JVMs.
  • This setting applies only to tasks in the same job; different jobs always use separate JVMs.

Enabling JVM Reuse

To configure Hadoop to reuse JVMs for mappers/reducers:

  1. Modify the configuration file located at $HADOOP_HOME/etc/hadoop/mapred-site.xml:

```xml
<property>
    <name>mapred.job.reuse.jvm.num.tasks</name>
    <value>-1</value>
</property>
```

  • Default value is 1; setting it to -1 allows unlimited reuse.

Disadvantages of JVM Reuse

  • Task Slot Occupation: Turning on JVM reuse occupies the task slot until the task completes.
  • If some reduce tasks take significantly longer, their reserved slots remain idle and cannot be used by other jobs.
  • JVM reuse does not work for tasks belonging to different jobs; those require separate JVMs.

🔍 RPC Protocol & Performance Optimization

Features of RPC

  1. Transparency: Remote calls appear like local method calls.
  2. High Performance: The RPC server can handle multiple client requests concurrently.
  3. Remote Subroutine Calls: Allows one program to call another’s subroutines without needing to manage network communication details.
  4. Fault-Tolerance: Supports fault-tolerance mechanisms.

Design Requirements of RPC

| Layer | Description |
| --- | --- |
| Data Layer | Only supports object transfer of the String type. |
| Function Call Layer | Implements function calls through dynamic proxies and Java reflection. |
| Network Transport Layer | Utilizes a socket mechanism based on TCP/IP. |
| Server-side Framework Layer | Uses Java NIO and an event-driven I/O model to enhance concurrent processing capabilities. |

Configuration Optimization Choices

| Option | Description |
| --- | --- |
| dfs.block.size | Adjusts the size of blocks in HDFS. |
| dfs.namenode.handler.count | Sets the number of RPC handlers for the NameNode. |
| dfs.datanode.max.xcievers | Controls the maximum number of concurrent connections to a DataNode. |
| dfs.datanode.data.dir | Specifies the directories for DataNode data storage. |

Using Large Files for Optimization

  1. MapReduce Fragmentation: Each input fragment starts its own Map task, so too many small files can overwhelm resources.
  2. Merging Small Files: Small files can be merged into one large file using the SequenceFile format (a sketch follows this list).
  3. Size Control: Ensure the merged file size allows efficient Map processing without overloading single Map tasks.
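
A rough sketch of merging small files into one SequenceFile, assuming the standard org.apache.hadoop.io.SequenceFile writer API; the paths and the key/value scheme (file name as Text, raw bytes as BytesWritable) are illustrative choices.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Merge every small file in a directory into one SequenceFile:
// key = original file name, value = raw file contents.
public class SmallFileMerger {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path inputDir = new Path("/niit/data");      // illustrative source directory
        Path output = new Path("/niit/merged.seq");  // illustrative output file

        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(output),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {
            for (FileStatus status : fs.listStatus(inputDir)) {
                if (status.isFile()) {
                    byte[] contents = new byte[(int) status.getLen()]; // small files only
                    try (FSDataInputStream in = fs.open(status.getPath())) {
                        IOUtils.readFully(in, contents, 0, contents.length);
                    }
                    writer.append(new Text(status.getPath().getName()),
                                  new BytesWritable(contents));
                }
            }
        }
    }
}
```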

JVM Reuse Misconceptions

  1. Default Task Count: By default, a JVM executes 1 task.
  2. Modifying Task Count: The number of sequential tasks can be changed via mapred.job.reuse.jvm.num.tasks.
  3. Parallel Execution Misunderstanding: Tasks A and B cannot be executed in parallel on a single reused JVM; they run sequentially.
  4. Slot Occupation: Activating JVM reuse will occupy the task slot until the task completes.