📦Hadoop Serialization and Compression
Serialization
“Serialization is the process that converts any object state or data into a series of bits which can be easily stored in memory or file formats.”
Key Concepts
- Container: A data structure or object used to store data.
- Serialization: Converting a container into a byte stream for transfer over a network.
- Deserialization: The reverse process, converting a byte stream back into a container.
Considerations in Serialization
- Data Style: Choose between nested and flat data styles.
- Flat Style: Common formats include text files and CSV files.
- Nested Style: Formats include YAML and JSON.
Common Serialization Formats
| Format | Description |
|---|---|
| YAML | Human-readable data serialization format. |
| JSON | Lightweight data interchange format that’s easy to read and write. |
| CSV | Comma-separated values for tabular data. |
| XML | Markup language that defines rules for encoding documents. |
Deserialization
“Deserialization is the process of constructing a data structure or object from a series of bytes.”
Functionality
- Serialization and deserialization work together to transform data objects to/from a portable format.
- They enable saving and recreating the state of objects across different locations.
Data Serialization in Distributed Systems
Importance
Serialization allows for efficient data transfer across different systems and environments, particularly in distributed systems where data may be stored in different locations.
Benefits
- Structure: Helps avoid reading incomplete or incorrectly classified data.
- Portability: Facilitates data transfer across various systems and languages.
- Versioning: Allows applying version numbers for lifecycle management.
Use Cases
- Adding key/value objects to maps.
- Processing entries within maps.
- Sending messages across systems.
Data Serialization in Big Data
“Big data systems often include technologies/data that are described as ‘schema less.’”
Advantages of Serialization in Big Data
- Structure: Imposes schema or criteria on data structures.
- Portability: Ensures uniformity for data from different sources.
- Versioning: Manages changes in data over time.
📦Hadoop Serialization Process
Remote Procedure Calls (RPCs)
- In Hadoop, communication between components occurs via RPCs.
- The caller process serializes the function name and its arguments into a byte stream before sending it.
Writable Interface
Serialization and deserialization in Hadoop are performed using the Writable interface; a minimal implementation sketch follows the method list below.
- Methods:
  - void write(DataOutput out): Serializes the object.
  - void readFields(DataInput in): Deserializes the object.
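A minimal sketch of a class implementing Writable; the PointWritable class and its two int fields are hypothetical and not part of the lecture material:

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Hypothetical custom Writable holding two int coordinates.
public class PointWritable implements Writable {
    private int x;
    private int y;

    public PointWritable() { }                 // no-arg constructor needed for deserialization

    public PointWritable(int x, int y) {
        this.x = x;
        this.y = y;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(x);                       // serialize fields in a fixed order
        out.writeInt(y);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        x = in.readInt();                      // read the fields back in the same order
        y = in.readInt();
    }
}
```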
WritableComparable Interface
- Inherits from the Writable interface.
- Facilitates serialization, deserialization, and comparison of values.
Examples of WritableComparable Wrappers
| Wrapper Class | Data Type |
|---|---|
| IntWritable | Wraps a Java int |
| BooleanWritable | Wraps a boolean |
| VIntWritable | Variable-length integer |
| LongWritable | Wraps a long integer |
| VLongWritable | Variable-length long integer |
Serialization Example in Hadoop
Code Snippet
The following method serializes a Writable type as a stream of bytes:
```java
public static String serializeToByteString(Writable writable) throws IOException {
    // ...
}
```
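A minimal sketch of one way such a method can be implemented; rendering the serialized bytes as a hex string is an assumption, not necessarily what the lecture used:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class WritableSerializer {

    // Serializes a Writable into a byte array and renders the bytes as hex.
    public static String serializeToByteString(Writable writable) throws IOException {
        ByteArrayOutputStream bytesOut = new ByteArrayOutputStream();
        DataOutputStream dataOut = new DataOutputStream(bytesOut);
        writable.write(dataOut);                           // the Writable serializes itself
        dataOut.close();

        StringBuilder sb = new StringBuilder();
        for (byte b : bytesOut.toByteArray()) {
            sb.append(String.format("%02x ", b & 0xff));   // one hex pair per serialized byte
        }
        return sb.toString().trim();
    }
}
```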
Main Method Example
```java
public static void main(String[] args) throws IOException {
    // ...
}
```
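A sketch of what this main method might do, reusing the WritableSerializer sketch above to compare a fixed-length IntWritable with a variable-length VIntWritable for the same value:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.VIntWritable;

public class SerializationDemo {

    public static void main(String[] args) throws IOException {
        // IntWritable always serializes to four bytes: 00 00 00 64
        System.out.println("IntWritable(100):  "
                + WritableSerializer.serializeToByteString(new IntWritable(100)));

        // VIntWritable uses variable-length encoding: the value 100 fits in one byte (64)
        System.out.println("VIntWritable(100): "
                + WritableSerializer.serializeToByteString(new VIntWritable(100)));
    }
}
```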
Output Explanation
- The IntWritable class uses a fixed length of four bytes to represent an integer.
- The VIntWritable class uses variable-length encoding, making it more efficient for smaller integers.
Hadoop Serialization
VIntWritable and LongWritable
- VIntWritable: The number of bytes it uses depends on the value of the payload. For example, for the number 100, VIntWritable uses only a single byte.
- LongWritable and VLongWritable: A similar difference exists between the serialized values of LongWritable and VLongWritable.
Text as Writable
“Text is a Writable version of the String type. It represents a collection of UTF-8 characters.”
- Unlike Java’s immutable String class, the Text class in Hadoop is mutable, as the short example below illustrates.
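A brief illustration of this mutability; the example and values are hypothetical:

```java
import org.apache.hadoop.io.Text;

public class TextReuseDemo {
    public static void main(String[] args) {
        // A single Text instance can be reused: set() replaces its contents in place,
        // whereas a Java String would require allocating a new object.
        Text text = new Text("hadoop");
        System.out.println(text + " -> " + text.getLength() + " bytes");

        text.set("serialization");               // mutate the same object
        System.out.println(text + " -> " + text.getLength() + " bytes");
    }
}
```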
Why Writable Interface?
A question arises: Why does Hadoop use the Writable interface and not rely on Java serialization?
Java Serialization Example
To illustrate Java serialization, we use the following method:
```java
public static String javaSerializeToByteString(Object o) throws IOException {
    // ...
}
```
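A minimal sketch of one possible implementation, mirroring the Writable version above but using the standard ObjectOutputStream; the hex rendering of the bytes is again an assumption:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;

public class JavaSerializer {

    // Serializes a Serializable object with plain Java serialization
    // and renders the resulting bytes as hex.
    public static String javaSerializeToByteString(Object o) throws IOException {
        ByteArrayOutputStream bytesOut = new ByteArrayOutputStream();
        ObjectOutputStream objectOut = new ObjectOutputStream(bytesOut);
        objectOut.writeObject(o);                  // writes class metadata plus the value
        objectOut.close();

        StringBuilder sb = new StringBuilder();
        for (byte b : bytesOut.toByteArray()) {
            sb.append(String.format("%02x ", b & 0xff));
        }
        return sb.toString().trim();
    }
}
```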
Java Serialization Output
Here’s an example of how to serialize integers:
```java
public static void main(String[] args) throws IOException {
    // ...
}
```
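A sketch of how such a main method could serialize an integer with plain Java serialization, reusing the JavaSerializer sketch above; the resulting byte stream is noticeably longer than the four bytes produced by IntWritable:

```java
import java.io.IOException;

public class JavaSerializationDemo {

    public static void main(String[] args) throws IOException {
        // Java serialization tags the value with class metadata, so even a boxed
        // Integer produces a much longer byte stream than Hadoop's IntWritable.
        System.out.println("Integer(100): "
                + JavaSerializer.javaSerializeToByteString(Integer.valueOf(100)));
    }
}
```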
Comparison: Hadoop vs. Java Serialization
| Aspect | Hadoop Writable | Java Serialization |
|---|---|---|
| Size of Serialized Value | Smaller | Larger |
| Class Metadata | No added class-related metadata | Tags every serialized value with metadata |
| Learning Curve | Steeper for newcomers | Easier for those familiar with Java |
| Dependency | Locked into Java programming | Also tied to the Java platform |
Record IO and Avro
- Record IO: Introduced within Hadoop, it featured a record definition language and a compiler for Writable classes. This feature has been deprecated.
- Avro: Suggested as the alternative for serialization in Hadoop.
📈Data Compression in Hadoop
Importance of Compression
In Hadoop, large files are typically stored in HDFS. Reducing file size helps decrease both storage requirements and network data transfer.
Compression Stages in Hadoop
- Input Files: Compressing input files reduces storage space in HDFS. The files are decompressed automatically during MapReduce processing.
- Map Output: Compressing intermediate map output reduces data transfer to the reducer node.
- Output Files: Compressing the output of MapReduce jobs also conserves space. A configuration sketch covering these stages follows this list.
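A sketch of how these stages can be enabled in a MapReduce driver; the specific codec choices (Snappy for map output, gzip for job output) are illustrative assumptions, and Snappy requires the native library to be available:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressionConfigDemo {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Compress intermediate map output to cut shuffle traffic to the reducers.
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec",
                SnappyCodec.class, CompressionCodec.class);

        Job job = Job.getInstance(conf, "compression demo");

        // Compress the final job output files written to HDFS.
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);

        // Compressed input files (e.g. .gz) are detected by their extension and
        // decompressed automatically by the input format, so no setting is needed here.
        // Mapper, reducer, and input/output paths are omitted from this sketch.
    }
}
```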
Hadoop Compression Formats
| Format | Algorithm | Compression Ratio | Speed | File Extension | Splittable |
|---|---|---|---|---|---|
| gzip | DEFLATE | Very high | Faster | .gz | No |
| LZO | LZO | Moderate | Quick | .lzo | Yes, if indexed |
| snappy | Snappy | Moderate | Quick | .snappy | No |
| bzip2 | bzip2 | Highest | Slow | .bz2 | Yes |
Format Descriptions
- gzip: Based on the DEFLATE algorithm. Suitable for files under 130M.
- LZO: Decompresses twice as fast as gzip; ideal for larger files (>200M).
- Snappy: Aims for speed over maximum compression; useful for large map outputs.
- bzip2: High compression ratio; suitable when compression rate matters more than speed.
Example: Using CompressionCodecFactory
```java
CompressionCodecFactory factory = new CompressionCodecFactory(conf);
```
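A fuller sketch of how CompressionCodecFactory is commonly used to pick a codec from a file’s extension and decompress the file; the command-line argument and output naming are assumptions:

```java
import java.io.InputStream;
import java.io.OutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class DecompressFileDemo {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Infer the codec from the file extension (e.g. ".gz" -> GzipCodec).
        Path inputPath = new Path(args[0]);
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);
        CompressionCodec codec = factory.getCodec(inputPath);
        if (codec == null) {
            System.err.println("No codec found for " + inputPath);
            return;
        }

        // Strip the compression suffix to form the output file name.
        String outputName = CompressionCodecFactory.removeSuffix(
                inputPath.toString(), codec.getDefaultExtension());

        try (InputStream in = codec.createInputStream(fs.open(inputPath));
             OutputStream out = fs.create(new Path(outputName))) {
            IOUtils.copyBytes(in, out, conf);   // decompress while copying
        }
    }
}
```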
📁Handling Small Files in HDFS
Hadoop’s HDFS and MapReduce are designed for large data files. Small files can be inefficient as each occupies a block (default size is 128M).
Solution for Small Files
Use SequenceFile and MapFile as containers for unified storage.
- SequenceFile: A binary file that serializes <key, value> pairs directly into the file.
Key Points
- Key: Arbitrary Writable
- Value: Arbitrary Writable
This guide provides a comprehensive overview of Hadoop serialization and compression, enabling students to understand the key concepts and implementations discussed in the lecture.
🗃️Sequence Files and Compression in Hadoop
Sequence File Overview
“Sequence files in Hadoop are a flat file consisting of binary key-value pairs.”
Key Characteristics
- Unsorted Keys: The keys in a sequence file are not sorted.
- Record Structure: Each record consists of:
  - Record Length
  - Key Length
  - Key
  - Value
- Sync Markers: These markers are used to identify boundaries within the file, allowing for efficient reading of blocks.
Compression of Sequence Files
Sequence files support three types of compression:
| Compression Type | Description |
|---|---|
| NONE | Records are not compressed. |
| RECORD | Only the value in each record is compressed. |
| BLOCK | All records in a block are compressed. |
Importance of Sync Markers
- Sync markers enable the record reader to navigate between blocks seamlessly, which is critical for processing in a big data environment.
Block-Level Compression
Efficiency: Block-level compression is preferred over record-level compression because it can exploit similarity between adjacent records.
Structure: A block-level compressed sequence file contains:
- Number of records (uncompressed)
- Compressed key lengths
- Compressed keys
- Compressed value lengths
- Actual compressed values
Advantages of Sequence File Format
- Compression Customization: Supports both record-based and block-based compression, with block-level compression being more efficient.
- Localization Task Support: Files can be split, which improves data locality for MapReduce tasks.
- Ease of Use: The Hadoop framework provides an API for reading and writing sequence files, which simplifies changes to business logic.
Creating a Sequence File
Steps to Create a Sequence File:
- Set Up Configuration: Initialize the configuration settings.
- Obtain File System: Access the HDFS or local file system.
- Set Output Path: Define where the sequence file will be stored.
- Create Writer: Use SequenceFile.createWriter to create the writer object.
- Append Data: Use SequenceFile.Writer.append to add records.
- Finalize: Close the writer after finishing data writing.
Example Code
```java
public class CreateSequenceFile {
    // ...
}
```
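A sketch of what CreateSequenceFile might look like, following the steps above; the output path, key/value classes, and sample records are illustrative assumptions (the older FileSystem-based createWriter signature is used here):

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class CreateSequenceFile {

    public static void main(String[] args) throws IOException {
        // 1. Set up configuration and 2. obtain the file system.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // 3. Set the output path for the sequence file.
        Path outputPath = new Path("/tmp/demo.seq");

        // 4. Create the writer with the key and value classes.
        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, outputPath, IntWritable.class, Text.class);
        try {
            // 5. Append a few key-value records.
            for (int i = 0; i < 5; i++) {
                writer.append(new IntWritable(i), new Text("record-" + i));
            }
        } finally {
            // 6. Finalize: close the writer.
            writer.close();
        }
    }
}
```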
Reading a Sequence File
Steps to Read a Sequence File:
- Set Up Configuration: Initialize the configuration settings.
- Set Reading Path: Define the path of the sequence file to read.
- Create Reader: Use SequenceFile.Reader to read the file.
- Iterate Over Records: Use a loop to read key-value pairs from the sequence file.
Example Code
```java
public class ReadSequenceFile {
    // ...
}
```
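A sketch of what ReadSequenceFile might look like; the path and key/value classes are assumed to match the writing example above:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class ReadSequenceFile {

    public static void main(String[] args) throws IOException {
        // 1. Set up configuration and 2. the path of the file to read.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path inputPath = new Path("/tmp/demo.seq");

        // 3. Create the reader.
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, inputPath, conf);
        try {
            // 4. Iterate over key-value pairs until next() returns false.
            IntWritable key = new IntWritable();
            Text value = new Text();
            while (reader.next(key, value)) {
                System.out.println(key + "\t" + value);
            }
        } finally {
            reader.close();
        }
    }
}
```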
MapFile Class
Overview: MapFile files are essentially sorted SequenceFiles that maintain key-value pairs in key order.
Structure: Each MapFile consists of:
- A data file for storing key-value pairs.
- An index file for storing keys and their corresponding offsets.
Example Code for Creating a MapFile
```java
Configuration conf = new Configuration();
```
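A sketch of creating a MapFile, starting from the configuration line above; the directory name, key/value classes, and sample entries are illustrative, and keys must be appended in sorted order:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class CreateMapFile {

    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // A MapFile is a directory holding a sorted data file plus an index file.
        String mapFileDir = "/tmp/demo.map";

        MapFile.Writer writer = new MapFile.Writer(
                conf, fs, mapFileDir, Text.class, IntWritable.class);
        try {
            // Keys must be appended in ascending order; the index records
            // every Nth key together with its offset in the data file.
            writer.append(new Text("apple"), new IntWritable(1));
            writer.append(new Text("banana"), new IntWritable(2));
            writer.append(new Text("cherry"), new IntWritable(3));
        } finally {
            writer.close();
        }
    }
}
```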