Chapter 6: Hive Optimization and Miscellaneous
One of the biggest challenges Hive users face is the slow response time experienced by end users running ad hoc queries.
Hive Optimization
Although Hive efficiently handles big data, default execution often leads to slow query response times. Optimization strategies help reduce unnecessary overhead, minimize I/O operations, improve resource utilization, and achieve faster execution.
Local Mode Optimization
When data volume is small, running on a distributed cluster wastes time due to start-up latency. Local Mode lets Hive run the job on the machine itself, not across multiple nodes.
Local Mode avoids cluster overhead when:

- Input size is below the configured threshold
- Few map tasks are needed
- There is no reduce task, or only one
This is ideal for lightweight ad hoc queries.
This enables automatic Local Mode: Hive itself decides whether the job should run locally based on the conditions above.

```sql
SET hive.exec.mode.local.auto=true;
```
This sets the maximum total input size for Local Mode, so if the input data is below 50 MB, the job may run locally:

```sql
SET hive.exec.mode.local.auto.inputbytes.max=50000000;
```
This sets the maximum number of input files. If the total number of input files is below this threshold (the default is 4), Hive may run the job locally:

```sql
SET hive.exec.mode.local.auto.input.files.max=5;
```
JVM Reuse
Hive launches separate JVMs for each Map/Reduce task, which creates performance overhead.
Imagine a query needs:

- 50 map tasks
- 10 reduce tasks

Hive will then start 60 separate JVM processes.
Starting a JVM takes time and system resources.
JVM reuse allows the same JVM to run multiple tasks, reducing the cost of process initialization. It is best applied to workloads with many short-running tasks.
Instead of closing and reopening a JVM for every task, Hive keeps the JVM alive and sends the next task to it:

```sql
SET mapred.job.reuse.jvm.num.tasks=10;
```

Alternatively, set it permanently in `$HADOOP_HOME/etc/hadoop/mapred-site.xml`.
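The corresponding entry in `mapred-site.xml` can be sketched as follows; the value 10 mirrors the session-level example above, and -1 would remove the limit entirely:

```xml
<!-- Sketch: permanent JVM-reuse setting in mapred-site.xml -->
<property>
  <name>mapred.job.reuse.jvm.num.tasks</name>
  <value>10</value>
  <!-- Each JVM may run up to 10 tasks before it is retired; -1 means unlimited -->
</property>
```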
Parallel Execution
Independent stages of a Hive query normally execute sequentially. Enabling parallel execution allows concurrent processing, reducing total runtime. Effectiveness increases when system resources are not fully utilized.
Imagine a Hive query using the multi-insert syntax (the column names below are illustrative):

```sql
FROM sales
INSERT OVERWRITE TABLE sales_summary
  SELECT region, SUM(amount) GROUP BY region
INSERT OVERWRITE TABLE product_summary
  SELECT product, SUM(amount) GROUP BY product;
```

These are two insert statements in one query, both reading from the same sales table.
Without parallel execution
Hive executes:
- Create sales_summary
- After it finishes, create product_summary

Total time = time of the first insert + time of the second insert
With Parallel Execution enabled
First enable the setting:

```sql
SET hive.exec.parallel=true;
```
Now what Hive does:
- Task 1 reads from sales and computes region summary
- Task 2 reads from sales and computes product summary
- Both run at the same time, independently
Result:
- Total execution time is almost halved
- Faster completion
Enable parallel execution:

```sql
SET hive.exec.parallel=true;
```
Hive can run up to 16 tasks in parallel inside a single query:

```sql
SET hive.exec.parallel.thread.number=16;
```
Map and Reduce Quantity Optimization
Hive determines the reducer count automatically based on data volume, but tuning it improves performance by:
- Preventing skew where one reducer receives a majority of the data
- Reducing excessive reducer creation that leads to overhead
This fine-tuning is vital for large aggregation and join workloads.
- `hive.exec.reducers.bytes.per.reducer` (default 1 GB): controls the data size handled by each reducer.
- `hive.exec.reducers.max` (default 999): the maximum number of reducers allowed.
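As a sketch, both knobs can be tuned per session (the values below are illustrative):

```sql
-- Aim for roughly 256 MB of input per reducer instead of the 1 GB default
SET hive.exec.reducers.bytes.per.reducer=256000000;
-- Cap the reducer count so small clusters are not overwhelmed
SET hive.exec.reducers.max=200;
```

With these settings, a 10 GB input would request about 40 reducers (10 GB / 256 MB), well under the cap.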
Using EXPLAIN for Diagnostics
EXPLAIN provides insight into execution plans without running the query. It reveals:

- Logical operators (scan, select, group)
- Task breakdown (map vs. reduce)
- Stage dependencies and sequence

This is essential for diagnosing unnecessary I/O, shuffle operations, or poorly structured joins.
```sql
EXPLAIN SELECT SUM(number) FROM onecol;
```
Query Optimization Techniques
Map join
Map join is used when one of the joined tables is small enough to fit in memory, so it is very fast but limited to that case.
Since Hive 0.7.0, Hive can convert a join to a map join automatically with the following setting:

```sql
SET hive.auto.convert.join=true; -- default false
```
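For illustration (the table, alias, and column names are assumed), with auto conversion enabled Hive broadcasts the small table to every mapper by itself; the older MAPJOIN hint requests the same behavior explicitly:

```sql
-- dim_region is assumed to be the small table held in memory
SELECT /*+ MAPJOIN(d) */ s.region, SUM(s.amount)
FROM sales s
JOIN dim_region d ON s.region = d.region
GROUP BY s.region;
```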
Bucket map join
Bucket map join is a special type of map join applied to bucketed tables. To enable it, we need the following settings:

```sql
SET hive.auto.convert.join=true;      -- default false
SET hive.optimize.bucketmapjoin=true; -- default false
```
In bucket map join, all the join tables must be bucket tables and join on buckets columns. In addition, the buckets number in bigger tables must be a multiple of the bucket number in the small tables.
Sort-Merge Joins (SMB)
It is a join optimization technique used when both tables are large but bucketed and sorted in the same way. It requires that:

- Tables are bucketed
- Tables have the same number of buckets
- The bucketing column appears in the join key
- Tables are sorted on the join key
```sql
SET hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
```
Skew Join Optimization
When key distribution is uneven, that is, when one key has far more records than the others (data skew), enable skew detection to redistribute the heavy keys across additional reducers. This reduces shuffle traffic and unnecessary reducer load:

```sql
SET hive.optimize.skewjoin=true;
```
Sorting Behavior
- ORDER BY: global sort through a single reducer; slow
- SORT BY: partial sort within each reducer; efficient for large datasets
- CLUSTER BY: distributes and sorts on the same column; ideal for ETL pipelines
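A minimal sketch of the three variants (the table and column names are assumed); CLUSTER BY x is shorthand for DISTRIBUTE BY x combined with SORT BY x:

```sql
-- Global total order: every row funnels through one reducer
SELECT * FROM sales ORDER BY amount;

-- Each reducer's output is sorted, but there is no global order
SELECT * FROM sales SORT BY amount;

-- Equivalent to DISTRIBUTE BY amount SORT BY amount
SELECT * FROM sales CLUSTER BY amount;
```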
Partition-Based Optimization
Partitioning limits the data scanned during query execution.
Static partitioning requires specifying values explicitly
Dynamic partitioning automatically assigns partitions during load when enabled
Effective partitioning drastically reduces scan time and improves filtering performance.
Enable dynamic partitioning:

```sql
SET hive.exec.dynamic.partition=true;
```
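A sketch of a dynamic-partition load (the table and column names are assumed); letting every partition column be determined dynamically also requires nonstrict mode:

```sql
SET hive.exec.dynamic.partition=true;
-- "nonstrict" allows all partition columns to be assigned dynamically
SET hive.exec.dynamic.partition.mode=nonstrict;

-- The partition column (region) must come last in the SELECT list
INSERT OVERWRITE TABLE sales_part PARTITION (region)
SELECT order_id, amount, region FROM sales;
```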
Hive Execution Engines
| Engine | Use Case | Performance Behavior |
|---|---|---|
| MapReduce | Legacy fallback | Writes intermediate output to disk; slower |
| Tez | Modern default | Minimizes disk I/O by passing intermediate data in memory |
Switching to Tez provides significant speed gains and supports more complex DAG-based execution flows.
```sql
SET hive.execution.engine=tez;
```
Hive Security Features
Hive supports multiple authorization models to secure multi-tenant environments:
- Default Authorization
- Storage-Based Authorization
- SQL Standards-Based Authorization
- Role-based access using GRANT and REVOKE
Security ensures proper access control over tables, databases, and operations.
| Mode | Description |
|---|---|
| Default Authorization Mode | Basic control: if a user has access to the HDFS location, they can read the Hive table |
| Storage-Based Authorization | Based on HDFS directories: without permission to read the table's folder in HDFS, Hive will not allow the query |
| SQL Standards-Based Authorization | Fine-grained SQL-level access, working like security in conventional databases: `GRANT SELECT ON table TO user;`, `REVOKE SELECT ON table FROM user;` |
| Managing Access through SQL | GRANT/REVOKE roles; a role is a group of permissions |
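Under SQL Standards-Based Authorization, role management looks roughly like this (the role, table, and user names are illustrative):

```sql
-- A role bundles permissions; users inherit whatever the role is granted
CREATE ROLE analyst;
GRANT SELECT ON TABLE sales TO ROLE analyst;
GRANT ROLE analyst TO USER alice;

-- Withdraw access by revoking the role
REVOKE ROLE analyst FROM USER alice;
```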