📦Hadoop Serialization and Compression
Serialization
“Serialization is the process that converts any object state or data into a series of bits which can be easily stored in memory or file formats.”
Key Concepts
- Container: A data structure or object used to store data.
- Serialization: Converting a container into a byte stream for transfer over a network.
- Deserialization: The reverse process, converting a byte stream back into a container.
Considerations in Serialization
- Data Style: Choose between nested and flat data styles.
- Flat Style: Common formats include text files and CSV files.
- Nested Style: Formats include YAML and JSON.
Common Serialization Formats
| Format | Description |
|---|---|
| YAML | Human-readable data serialization format. |
| JSON | Lightweight data interchange format that’s easy to read and write. |
| CSV | Comma-separated values for tabular data. |
| XML | Markup language that defines rules for encoding documents. |
Deserialization
“Deserialization is the process of constructing a data structure or object from a series of bytes.”
Functionality
- Serialization and deserialization work together to transform data objects to/from a portable format.
- They enable saving and recreating the state of objects across different locations.
Data Serialization in Distributed Systems
Importance
Serialization allows for efficient data transfer across different systems and environments, particularly in distributed systems where data may be stored in different locations.
Benefits
- Structure: Helps avoid reading incomplete or incorrectly classified data.
- Portability: Facilitates data transfer across various systems and languages.
- Versioning: Allows applying version numbers for lifecycle management.
Use Cases
- Adding key/value objects to maps.
- Processing entries within maps.
- Sending messages across systems.
Data Serialization in Big Data
“Big data systems often include technologies/data that are described as ‘schema less.’”
Advantages of Serialization in Big Data
- Structure: Imposes schema or criteria on data structures.
- Portability: Ensures uniformity for data from different sources.
- Versioning: Manages changes in data over time.
📦Hadoop Serialization Process
Remote Procedure Calls (RPCs)
- In Hadoop, communication between components occurs via RPCs.
- The caller process serializes the function name and its arguments into a byte stream before sending it.
Writable Interface
Serialization and deserialization in Hadoop are performed using the Writable interface; a minimal implementation sketch follows the method list below.
- Methods:
  - void write(DataOutput out): Serializes the object.
  - void readFields(DataInput in): Deserializes the object.
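A minimal sketch of a class implementing Writable; the PointWritable class and its two int fields are hypothetical and not part of the lecture material:

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Hypothetical custom Writable holding two int coordinates.
public class PointWritable implements Writable {
    private int x;
    private int y;

    public PointWritable() { }                 // no-arg constructor needed for deserialization

    public PointWritable(int x, int y) {
        this.x = x;
        this.y = y;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(x);                       // serialize fields in a fixed order
        out.writeInt(y);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        x = in.readInt();                      // read the fields back in the same order
        y = in.readInt();
    }
}
```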
WritableComparable Interface
- Inherits from the Writable interface.
- Facilitates serialization, deserialization, and comparison of values.
Examples of WritableComparable Wrappers
| Wrapper Class | Data Type |
|---|---|
| IntWritable | Wraps a Java int |
| BooleanWritable | Wraps a boolean |
| VIntWritable | Variable-length integer |
| LongWritable | Wraps a long integer |
| VLongWritable | Variable-length long integer |
Serialization Example in Hadoop
Code Snippet
The following method serializes a Writable type as a stream of bytes:
```java
public static String serializeToByteString(Writable writable) throws IOException {
    // ...
}
```
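A minimal sketch of one way such a method can be implemented; rendering the serialized bytes as a hex string is an assumption, not necessarily what the lecture used:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class WritableSerializer {

    // Serializes a Writable into a byte array and renders the bytes as hex.
    public static String serializeToByteString(Writable writable) throws IOException {
        ByteArrayOutputStream bytesOut = new ByteArrayOutputStream();
        DataOutputStream dataOut = new DataOutputStream(bytesOut);
        writable.write(dataOut);                           // the Writable serializes itself
        dataOut.close();

        StringBuilder sb = new StringBuilder();
        for (byte b : bytesOut.toByteArray()) {
            sb.append(String.format("%02x ", b & 0xff));   // one hex pair per serialized byte
        }
        return sb.toString().trim();
    }
}
```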
Main Method Example
```java
public static void main(String[] args) throws IOException {
    // ...
}
```
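A sketch of what this main method might do, reusing the WritableSerializer sketch above to compare a fixed-length IntWritable with a variable-length VIntWritable for the same value:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.VIntWritable;

public class SerializationDemo {

    public static void main(String[] args) throws IOException {
        // IntWritable always serializes to four bytes: 00 00 00 64
        System.out.println("IntWritable(100):  "
                + WritableSerializer.serializeToByteString(new IntWritable(100)));

        // VIntWritable uses variable-length encoding: the value 100 fits in one byte (64)
        System.out.println("VIntWritable(100): "
                + WritableSerializer.serializeToByteString(new VIntWritable(100)));
    }
}
```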
Output Explanation
- The IntWritable class uses a fixed length of four bytes to represent an integer.
- The VIntWritable class uses variable-length encoding, making it more efficient for smaller integers.
Hadoop Serialization
VIntWritable and LongWritable
- VIntWritable: The number of bytes it uses depends on the value of the payload. For example, for the number 100, VIntWritable uses only a single byte.
- LongWritable and VLongWritable: A similar difference exists between the serialized values of LongWritable and VLongWritable.
Text as Writable
“Text is a Writable version of the String type. It represents a collection of UTF-8 characters.”
- Unlike Java’s immutable String class, the Text class in Hadoop is mutable, as the short example below illustrates.
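A brief illustration of this mutability; the example and values are hypothetical:

```java
import org.apache.hadoop.io.Text;

public class TextReuseDemo {
    public static void main(String[] args) {
        // A single Text instance can be reused: set() replaces its contents in place,
        // whereas a Java String would require allocating a new object.
        Text text = new Text("hadoop");
        System.out.println(text + " -> " + text.getLength() + " bytes");

        text.set("serialization");               // mutate the same object
        System.out.println(text + " -> " + text.getLength() + " bytes");
    }
}
```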
Why Writable Interface?
A question arises: Why does Hadoop use the Writable interface and not rely on Java serialization?
Java Serialization Example
To illustrate Java serialization, we use the following method:
```java
public static String javaSerializeToByteString(Object o) throws IOException {
    // ...
}
```
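A minimal sketch of one possible implementation, mirroring the Writable version above but using the standard ObjectOutputStream; the hex rendering of the bytes is again an assumption:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;

public class JavaSerializer {

    // Serializes a Serializable object with plain Java serialization
    // and renders the resulting bytes as hex.
    public static String javaSerializeToByteString(Object o) throws IOException {
        ByteArrayOutputStream bytesOut = new ByteArrayOutputStream();
        ObjectOutputStream objectOut = new ObjectOutputStream(bytesOut);
        objectOut.writeObject(o);                  // writes class metadata plus the value
        objectOut.close();

        StringBuilder sb = new StringBuilder();
        for (byte b : bytesOut.toByteArray()) {
            sb.append(String.format("%02x ", b & 0xff));
        }
        return sb.toString().trim();
    }
}
```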
Java Serialization Output
Here’s an example of how to serialize integers:
```java
public static void main(String[] args) throws IOException {
    // ...
}
```
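A sketch of how such a main method could serialize an integer with plain Java serialization, reusing the JavaSerializer sketch above; the resulting byte stream is noticeably longer than the four bytes produced by IntWritable:

```java
import java.io.IOException;

public class JavaSerializationDemo {

    public static void main(String[] args) throws IOException {
        // Java serialization tags the value with class metadata, so even a boxed
        // Integer produces a much longer byte stream than Hadoop's IntWritable.
        System.out.println("Integer(100): "
                + JavaSerializer.javaSerializeToByteString(Integer.valueOf(100)));
    }
}
```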
Comparison: Hadoop vs. Java Serialization
| Aspect | Hadoop Writable | Java Serialization |
|---|---|---|
| Size of Serialized Value | Smaller | Larger |
| Class Metadata | No added class-related metadata | Tags every serialized value with metadata |
| Learning Curve | Steeper for newcomers | Easier for those familiar with Java |
| Dependency | Locked into Java programming | Also tied to the Java platform |
Record IO and Avro
- Record IO: Introduced within Hadoop, it featured a record definition language and a compiler for Writable classes. This feature has been deprecated.
- Avro: Suggested as the alternative for serialization in Hadoop.
📈Data Compression in Hadoop
Importance of Compression
In Hadoop, large files are typically stored in HDFS. Reducing file size helps decrease both storage requirements and network data transfer.
Compression Stages in Hadoop
- Input Files: Compressing input files reduces storage space in HDFS. The files are decompressed automatically during MapReduce processing.
- Map Output: Compressing intermediate map output reduces data transfer to the reducer node.
- Output Files: Compressing the output of MapReduce jobs also conserves space. A configuration sketch covering these stages follows this list.
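A sketch of how these stages can be enabled in a MapReduce driver; the specific codec choices (Snappy for map output, gzip for job output) are illustrative assumptions, and Snappy requires the native library to be available:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressionConfigDemo {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Compress intermediate map output to cut shuffle traffic to the reducers.
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec",
                SnappyCodec.class, CompressionCodec.class);

        Job job = Job.getInstance(conf, "compression demo");

        // Compress the final job output files written to HDFS.
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);

        // Compressed input files (e.g. .gz) are detected by their extension and
        // decompressed automatically by the input format, so no setting is needed here.
        // Mapper, reducer, and input/output paths are omitted from this sketch.
    }
}
```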
Hadoop Compression Formats
| Format | Algorithm | Compression Ratio | Speed | File Extension | Splittable |
|---|---|---|---|---|---|
| gzip | DEFLATE | Very high | Faster | .gz | No |
| LZO | LZO | Moderate | Quick | .lzo | Yes, if indexed |
| snappy | Snappy | Moderate | Quick | .snappy | No |
| bzip2 | bzip2 | Highest | Slow | .bz2 | Yes |
Format Descriptions
- gzip: Based on the DEFLATE algorithm. Suitable for files under 130M.
- LZO: Decompresses twice as fast as gzip; ideal for larger files (>200M).
- Snappy: Aims for speed over maximum compression; useful for large map outputs.
- bzip2: High compression ratio; suitable when compression rate matters more than speed.
Example: Using CompressionCodecFactory
```java
CompressionCodecFactory factory = new CompressionCodecFactory(conf);
```
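A fuller sketch of how CompressionCodecFactory is commonly used to pick a codec from a file’s extension and decompress the file; the command-line argument and output naming are assumptions:

```java
import java.io.InputStream;
import java.io.OutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class DecompressFileDemo {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Infer the codec from the file extension (e.g. ".gz" -> GzipCodec).
        Path inputPath = new Path(args[0]);
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);
        CompressionCodec codec = factory.getCodec(inputPath);
        if (codec == null) {
            System.err.println("No codec found for " + inputPath);
            return;
        }

        // Strip the compression suffix to form the output file name.
        String outputName = CompressionCodecFactory.removeSuffix(
                inputPath.toString(), codec.getDefaultExtension());

        try (InputStream in = codec.createInputStream(fs.open(inputPath));
             OutputStream out = fs.create(new Path(outputName))) {
            IOUtils.copyBytes(in, out, conf);   // decompress while copying
        }
    }
}
```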
📁Handling Small Files in HDFS
Hadoop’s HDFS and MapReduce are designed for large data files. Small files can be inefficient as each occupies a block (default size is 128M).
Solution for Small Files
Use SequenceFile and MapFile as containers for unified storage.
- SequenceFile: A binary file that serializes <key, value> pairs directly into the file.
Key Points
- Key: Arbitrary Writable
- Value: Arbitrary Writable
This guide provides a comprehensive overview of Hadoop serialization and compression, enabling students to understand the key concepts and implementations discussed in the lecture.
🗃️Sequence Files and Compression in Hadoop
Sequence File Overview
“Sequence files in Hadoop are a flat file consisting of binary key-value pairs.”
Key Characteristics
- Unsorted Keys: The keys in a sequence file are not sorted.
- Record Structure: Each record consists of:
  - Record Length
  - Key Length
  - Key
  - Value
- Sync Markers: These markers are used to identify boundaries within the file, allowing for efficient reading of blocks.
Compression of Sequence Files
Sequence files support three types of compression:
| Compression Type | Description |
|---|---|
| NONE | Records are not compressed. |
| RECORD | Only the value in each record is compressed. |
| BLOCK | All records in a block are compressed. |
Importance of Sync Markers
- Sync markers enable the record reader to navigate between blocks seamlessly, which is critical for processing in a big data environment.
Block-Level Compression
Efficiency: Block-level compression is preferred over record-level compression because it can exploit similarity between adjacent records.
Structure: A block-level compressed sequence file contains:
- Number of records (uncompressed)
- Compressed key lengths
- Compressed keys
- Compressed value lengths
- Actual compressed values
Advantages of Sequence File Format
- Compression Customization: Supports both record-based and block-based compression, with block-level compression being more efficient.
- Localization Task Support: Files can be split, which improves data locality for MapReduce tasks.
- Ease of Use: The Hadoop framework provides an API for reading and writing sequence files, which simplifies changes to business logic.
Creating a Sequence File
Steps to Create a Sequence File:
- Set Up Configuration: Initialize the configuration settings.
- Obtain File System: Access the HDFS or local file system.
- Set Output Path: Define where the sequence file will be stored.
- Create Writer: Use SequenceFile.createWriter to create the writer object.
- Append Data: Use SequenceFile.Writer.append to add records.
- Finalize: Close the writer after finishing data writing.
Example Code
```java
public class CreateSequenceFile {
    // ...
}
```
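A sketch of what CreateSequenceFile might look like, following the steps above; the output path, key/value classes, and sample records are illustrative assumptions (the older FileSystem-based createWriter signature is used here):

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class CreateSequenceFile {

    public static void main(String[] args) throws IOException {
        // 1. Set up configuration and 2. obtain the file system.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // 3. Set the output path for the sequence file.
        Path outputPath = new Path("/tmp/demo.seq");

        // 4. Create the writer with the key and value classes.
        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, outputPath, IntWritable.class, Text.class);
        try {
            // 5. Append a few key-value records.
            for (int i = 0; i < 5; i++) {
                writer.append(new IntWritable(i), new Text("record-" + i));
            }
        } finally {
            // 6. Finalize: close the writer.
            writer.close();
        }
    }
}
```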
Reading a Sequence File
Steps to Read a Sequence File:
- Set Up Configuration: Initialize the configuration settings.
- Set Reading Path: Define the path of the sequence file to read.
- Create Reader: Use SequenceFile.Reader to read the file.
- Iterate Over Records: Use a loop to read key-value pairs from the sequence file.
Example Code
```java
public class ReadSequenceFile {
    // ...
}
```
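A sketch of what ReadSequenceFile might look like; the path and key/value classes are assumed to match the writing example above:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class ReadSequenceFile {

    public static void main(String[] args) throws IOException {
        // 1. Set up configuration and 2. the path of the file to read.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path inputPath = new Path("/tmp/demo.seq");

        // 3. Create the reader.
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, inputPath, conf);
        try {
            // 4. Iterate over key-value pairs until next() returns false.
            IntWritable key = new IntWritable();
            Text value = new Text();
            while (reader.next(key, value)) {
                System.out.println(key + "\t" + value);
            }
        } finally {
            reader.close();
        }
    }
}
```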
MapFile Class
Overview: MapFile files are essentially sorted SequenceFiles that maintain key-value pairs in key order.
Structure: Each MapFile consists of:
- A data file for storing key-value pairs.
- An index file for storing keys and their corresponding offsets.
Example Code for Creating a MapFile
```java
Configuration conf = new Configuration();
```
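A sketch of creating a MapFile, starting from the configuration line above; the directory name, key/value classes, and sample entries are illustrative, and keys must be appended in sorted order:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class CreateMapFile {

    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // A MapFile is a directory holding a sorted data file plus an index file.
        String mapFileDir = "/tmp/demo.map";

        MapFile.Writer writer = new MapFile.Writer(
                conf, fs, mapFileDir, Text.class, IntWritable.class);
        try {
            // Keys must be appended in ascending order; the index records
            // every Nth key together with its offset in the data file.
            writer.append(new Text("apple"), new IntWritable(1));
            writer.append(new Text("banana"), new IntWritable(2));
            writer.append(new Text("cherry"), new IntWritable(3));
        } finally {
            writer.close();
        }
    }
}
```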