Chapter 2: Table Management in Hive
Table management in Hive involves creating databases and tables, defining schemas, and organizing data storage.
In this section, we cover how to create and drop databases and tables, and how to manage table properties such as partitions and bucketing.
Databases
Hive supports multiple databases (namespaces) to organize tables.
You can create a new database using the CREATE DATABASE statement, optionally with properties. For example:
CREATE DATABASE myhivebook;
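The optional clauses let you attach a comment, choose a storage location, and set key/value properties. A minimal sketch (the comment text, path, and property values here are only illustrative):

-- comment, location, and properties below are example values
CREATE DATABASE IF NOT EXISTS myhivebook
COMMENT 'examples for the Hive book'
LOCATION '/user/hive/warehouse/myhivebook.db'
WITH DBPROPERTIES ('creator' = 'hive_user');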
You can inspect a database’s details (such as its location and properties) using the DESCRIBE DATABASE command.
For instance, to see extended information about the database myhivebook:
DESCRIBE DATABASE EXTENDED myhivebook;
To delete a database, use the DROP DATABASE statement.
Note: By default, Hive uses RESTRICT mode, which prevents dropping a database if it contains any tables.
You must first drop all tables in the database (or use the CASCADE option to force-drop everything). For example:
DROP DATABASE myhivebook CASCADE;
This will remove the database and all its tables. Without CASCADE, the command will fail if the database is not empty.
Creating Tables
To create a table in Hive, use the CREATE TABLE statement with the desired schema. For example, to create a simple table (the column names and types below are illustrative):
-- illustrative columns
CREATE TABLE students (
  student_id INT,
  name       STRING,
  gpa        DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
This creates a table named students with three columns.
The ROW FORMAT DELIMITED clause with FIELDS TERMINATED BY ',' indicates that the data is comma-separated (CSV format).
STORED AS TEXTFILE specifies the file format for storing data (in this case, plain text files, which is the default format).
If a table with the same name already exists, Hive will throw an error. You can avoid this by using IF NOT EXISTS:
CREATE TABLE IF NOT EXISTS students (...columns...) ...;
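You can confirm that the table exists and review its schema with the standard metadata commands:

SHOW TABLES;
DESCRIBE students;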
External Tables and Locations
By default, the above CREATE TABLE statement creates a managed (internal) table, which means Hive will manage the data’s lifecycle (more on that below).
If you want Hive to reference data at an existing location without moving it, you should create an external table.
Use the EXTERNAL keyword and specify a LOCATION:
-- illustrative columns
CREATE EXTERNAL TABLE web_logs (
  ip       STRING,
  url      STRING,
  hit_time STRING
)
LOCATION '/data/logs/2025/';
In this example, Hive will not move or delete data in /data/logs/2025/. The table simply points to that directory.
External tables are useful when the data is shared with other tools or should remain in its original location.
Partitioned Tables
Hive allows tables to be partitioned to organize data for faster queries.
You declare partitions at table creation using the PARTITIONED BY clause (the non-partition columns below are illustrative):
-- illustrative columns
CREATE TABLE sales (
  sale_id INT,
  product STRING,
  amount  DOUBLE
)
PARTITIONED BY (sale_year INT, sale_month STRING);
Here, the sales table is partitioned by year and month.
Hive will store each partition’s data in a separate directory, for example:
/user/hive/warehouse/sales/sale_year=2025/sale_month=Jan/...
Each distinct combination of sale_year and sale_month values corresponds to a subdirectory containing that partition’s data files.
Partitioning helps prune data during queries — e.g., a query with WHERE sale_year = 2025 will read only the directory for 2025 instead of scanning all data.
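As a sketch of partition pruning (using the illustrative sales columns defined above), the following query touches only the sale_year=2025 directories:

SELECT product, SUM(amount) AS total_amount
FROM sales
WHERE sale_year = 2025
GROUP BY product;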
Bucketed Tables (Clustering)
Hive also supports bucketing (clustering) of data within a table or partition.
Bucketing distributes rows into a fixed number of “buckets” (files) based on the hash of a column.
You specify bucketing in the table schema with CLUSTERED BY (the non-bucketing columns below are illustrative):
-- illustrative columns
CREATE TABLE user_logs (
  user_id  INT,
  action   STRING,
  log_time STRING
)
CLUSTERED BY (user_id) INTO 4 BUCKETS;
In this example, the data for user_logs will be split into 4 buckets according to the hash of user_id.
Each bucket is stored as a separate file.
Bucketing can be combined with partitioning (e.g., partition by date, then bucket by user_id within each date partition) to further organize the data.
We will discuss the benefits of bucketing below in the context of query performance.
Managed vs. External Tables
Hive tables can be managed (internal) or external, which affects how Hive handles the data storage:
Managed (Internal) Tables: Hive manages both the table’s metadata (schema) and the actual data files.
The data is stored in Hive’s warehouse directory (controlled by the hive.metastore.warehouse.dir property in hive-site.xml).
When you drop a managed table, Hive deletes both the metadata and the data files from HDFS.
Use managed tables when Hive alone owns the data. (By default, CREATE TABLE without the EXTERNAL keyword creates a managed table.)
External Tables: Hive only manages the metadata, and the data files remain in their original location (as specified by the LOCATION).
Dropping an external table will remove the table metadata from Hive, but will not delete the data in HDFS.
External tables are appropriate when the data is used outside of Hive or should not be moved into the warehouse (for example, data shared with HBase or external ingestion processes).
To create an external table, include the EXTERNAL keyword in the CREATE TABLE statement as shown above.
Summary of differences:
Managed table data is stored in Hive’s warehouse directory, and Hive fully controls it (no other system should write to it). Dropping the table deletes the data files along with the metadata.
External table data stays at an external location you specify. Hive does not move or delete it, and other applications can use or modify that data. Dropping the table leaves the data untouched.
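If you are unsure whether an existing table is managed or external, describe it; the Table Type field in the output shows MANAGED_TABLE or EXTERNAL_TABLE:

DESCRIBE FORMATTED web_logs;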
Loading Data into Hive Tables
Once a table is created, you can load data into it.
Hive provides a LOAD DATA command to import data from files into a table (for internal tables this will move or copy the file into Hive’s managed storage):
LOAD DATA [LOCAL] INPATH '/path/to/file.csv' INTO TABLE students;
This command loads the file file.csv into the Hive table’s storage location.
(If LOCAL is specified, the path is treated as a local filesystem path; without it, the path is assumed to be on HDFS).
You can also load data into a specific partition of a table:
LOAD DATA INPATH 'hdfs:///data/sales/2025_Jan.csv' INTO TABLE sales PARTITION(sale_year=2025, sale_month='Jan');
This would move the given file into the partition directory for sale_year=2025 and sale_month='Jan' in the sales table.
Alternatively, you can use INSERT INTO or INSERT OVERWRITE with a SELECT query to load or transform data into Hive tables.
For example, after creating a partitioned table, you might insert data from an unpartitioned source table, as explained next.
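As a minimal sketch for an unpartitioned target (assuming a staging table named students_staging with the same columns as students), INSERT OVERWRITE replaces the table’s existing contents, while INSERT INTO appends:

-- students_staging is a hypothetical staging table
INSERT OVERWRITE TABLE students
SELECT student_id, name, gpa
FROM students_staging;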
Partitioning in Hive
Partitioning is a technique to divide a table’s data into smaller parts based on the values of one or more partition columns.
Each distinct value of a partition column (or combination of values for multi-column partitioning) corresponds to a separate directory of data in HDFS.
Queries that filter on partition columns can retrieve data more efficiently by scanning only relevant partitions instead of the entire table.
For example, if you partition a log table by date, Hive will store logs for each date in its own folder (e.g., …/logs/date=2025-10-09/).
A query like SELECT * FROM logs WHERE date = '2025-10-09' will only read files in the date=2025-10-09 partition directory, skipping other dates entirely.
This can dramatically speed up query performance when working with large datasets.
Static vs. Dynamic Partitioning
Hive supports two methods of adding data into partitions:
Static Partitioning
With static partitioning, you specify the target partition values explicitly when loading or inserting data.
The user (or ETL process) determines which partition each data load goes into. Key points about static partitioning include:
You add each partition “statically” by specifying the partition key value(s) in the INSERT or LOAD command.
For example, if a table is partitioned by region, you might load a file into the partition for region='EU' explicitly.
Static partitioning is often preferred when you are loading large datasets file-by-file, as it gives you control to load one partition at a time.
It can be faster and use fewer resources than a fully dynamic load because each operation targets a single partition.
You can create or alter partitions manually. For instance, you can use ALTER TABLE … ADD PARTITION to add a new empty partition directory, then load data into it (see the sketch after this list).
Similarly, partitions can be dropped or renamed as needed.
A benefit of static partitioning is that you might derive the partition value from external information (such as a filename or date) without needing to parse each record in the data file.
For example, if your data files are named by date, you can use that date to decide the partition directory for the load.
By default, Hive’s dynamic partition mode is strict (hive.exec.dynamic.partition.mode = strict in the configuration).
In strict mode, at least one partition column must be given a static value in any INSERT statement into a partitioned table (to prevent accidentally generating too many partitions in one query).
This effectively means Hive expects static partitions unless you explicitly allow fully dynamic partitioning (described below).
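A minimal sketch of managing partitions by hand on the sales table (the partition values are just examples):

ALTER TABLE sales ADD PARTITION (sale_year=2025, sale_month='Feb');
ALTER TABLE sales DROP PARTITION (sale_year=2024, sale_month='Dec');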
Static partitioning example: Suppose you have the sales table partitioned by sale_year and sale_month as defined above.
Using static partitioning, to load data for January 2025 into this table, you could run (the source table sales_staging and its columns follow the illustrative schema used earlier):
INSERT INTO TABLE sales PARTITION(sale_year=2025, sale_month='Jan')
SELECT sale_id, product, amount
FROM sales_staging
WHERE sale_year = 2025 AND sale_month = 'Jan';
In this INSERT statement, we explicitly specify the target partition (sale_year=2025, sale_month='Jan').
All data from the source (sales_staging) that matches the WHERE clause will go into that single partition.
If that partition does not exist yet, Hive will create the directory for sale_year=2025/sale_month=Jan automatically.
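You can list the partitions that now exist on the table with:

SHOW PARTITIONS sales;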
Dynamic Partitioning
With dynamic partitioning, Hive determines the partition destinations at runtime based on the data.
You do not explicitly specify the partition values in the insert; instead, one INSERT can create multiple partitions in one go.
Characteristics of dynamic partitioning include:
A single INSERT … SELECT can populate multiple partitions in the target table.
The partition columns are included in the SELECT query, and Hive automatically routes each row to the appropriate partition directory based on those values.
Dynamic partitioning is convenient when you have a large dataset and want to split it into many partitions without writing separate load statements for each partition.
For example, you can insert a full year’s data and let Hive create partitions per month (or per day) for you.
Dynamic partitioning may be slower than static partitioning for very large data loads, because the job handles many partitions in one pass.
It is most useful when the number of partitions or their values are not known in advance or are too numerous for manual loading.
Hive requires certain settings to enable fully dynamic partitioning. Specifically, you need to allow dynamic partition creation and switch to non-strict mode:
SET hive.exec.dynamic.partition = true; (allows dynamic partition inserts)
SET hive.exec.dynamic.partition.mode = nonstrict; (allows all partition columns to be dynamic)
By default, Hive’s dynamic partition mode is “strict”, meaning at least one partition column must be manually specified (as mentioned above).
Setting it to nonstrict allows all partition columns to be filled dynamically from the data.
After data is loaded, dynamic partitions behave just like static partitions. You can query them and alter them (add/drop) as needed. The difference is only in how they were created.
Dynamic partitioning example: Using the same sales table, if you have an unpartitioned staging table containing sales data for multiple dates, you can populate all partitions in one operation using dynamic partitioning:
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- sales_staging columns follow the illustrative schema used earlier;
-- the partition columns come last in the SELECT, in PARTITION() order
INSERT INTO TABLE sales PARTITION(sale_year, sale_month)
SELECT sale_id, product, amount, sale_year, sale_month
FROM sales_staging;
Notice that the INSERT statement above does not explicitly specify any particular year or month.
The partition columns are listed in the PARTITION(…) clause without fixed values, and their values come from the SELECT query’s output (they must appear last in the SELECT, in the same order as in the PARTITION clause).
Hive will automatically create partition directories for each unique sale_year, sale_month combination in the data.
For instance, if the staging data contains sales for 2024 and 2025 across various months, Hive will create partitions like sale_year=2024/sale_month=Jan, sale_year=2024/sale_month=Feb, … sale_year=2025/sale_month=Dec as needed, and route each row to the appropriate partition.
Bucketing in Hive
Bucketing (also known as clustering) is another technique to organize data at a finer level than partitions.
When a table (or partition) is bucketed, Hive divides the data into a fixed number of buckets based on a hash of one or more columns (called the bucketing or clustered by columns).
Each bucket is stored as a separate file in HDFS.
Bucketing adds structure that can improve query efficiency in certain cases:
Even distribution: By hashing on a column like user_id and dividing by the number of buckets, Hive tries to distribute rows evenly across buckets.
This can prevent data skew and ensures that no single bucket file is overwhelmingly large or small.
Efficient sampling: Bucketing enables efficient sampling of data.
For example, you can use a query clause like TABLESAMPLE(BUCKET 3 OUT OF 8) on a table that is bucketed into 8 buckets (see the sketch after this list).
This will retrieve roughly 1/8 of the data (specifically, the third bucket from each partition), which is much faster than a random sample because it reads a subset of files completely.
Join optimization: If two tables are bucketed (and sorted) on the same key with the same number of buckets, Hive can perform bucket map joins (a type of map-side join).
This avoids a full shuffle of data during the join because matching rows will already be grouped into corresponding buckets.
In other words, the join can be performed by joining each pair of bucket files independently, which is more efficient.
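A minimal sketch of bucket-based sampling, using the purchase_logs table defined in the next example (bucketed into 8 buckets on user_id):

SELECT *
FROM purchase_logs TABLESAMPLE(BUCKET 3 OUT OF 8 ON user_id);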
To use bucketing, you define it in the table schema with CLUSTERED BY (… column list …) INTO N BUCKETS. For example (the non-bucketing columns are illustrative):
-- illustrative columns
CREATE TABLE purchase_logs (
  user_id INT,
  item    STRING,
  amount  DOUBLE
)
PARTITIONED BY (purchase_year INT)
CLUSTERED BY (user_id) INTO 8 BUCKETS;
In this table, data is first partitioned by year, and within each year partition it is further divided into 8 buckets based on the hash of user_id.
This means each year’s data is spread across 8 files (buckets).
If two tables share the same bucketing column and number of buckets (and if sorting on that column is enabled), queries joining them on that column can be optimized as described above.
Using bucketing: When inserting data into a bucketed table, you should set a Hive parameter to ensure correct bucketing:
SET hive.enforce.bucketing = true;
This ensures the hashing and bucketing logic is applied during insert, creating the correct number of bucket files.
(If you also declared a SORTED BY for sorted buckets, you would set hive.enforce.sorting=true similarly.)
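Putting it together, a minimal sketch of loading the bucketed table above (assuming a staging table named purchase_staging with matching columns):

-- purchase_staging is a hypothetical staging table
SET hive.enforce.bucketing = true;
INSERT INTO TABLE purchase_logs PARTITION (purchase_year=2025)
SELECT user_id, item, amount
FROM purchase_staging;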
Partitioning vs. Bucketing
Both partitioning and bucketing split data into multiple parts, but they have different use cases and advantages.
Here are the key differences and how to decide between them:
How data is split: Partitioning separates data into directories based on distinct values of a partition column. Each value (or combination of values) has its own folder.
Bucketing divides data into a fixed number of files (buckets) by hashing the value of a column; each bucket file contains a mix of values, determined by the hash function.
Use cases: Partitioning is very effective when there is a limited number of distinct values that are common filters in queries (e.g., partition by date, country, category).
Bucketing is useful to evenly distribute data on a column that has high cardinality (many unique values) or when you want to optimize operations like joins or sampling on that column.
You might partition by a coarse category (like year) and bucket by a finer identifier (like user_id) within each partition.
Number of subdivisions: Partitioning too finely (for example, creating a partition for every user in a large dataset) can lead to thousands of small directories and files, which is inefficient.
Bucketing, on the other hand, lets you choose a manageable number of buckets even for a very high-cardinality column, avoiding the overhead of too many small partitions.
In essence, partitioning is good for segregating data by a few logical keys, while bucketing is good for distributing data evenly when a key has many possibilities.
Query performance: Partitioning greatly speeds up queries that filter by partition columns by pruning irrelevant data – the query reads only the partitions it needs.
Bucketing doesn’t eliminate reading data the way partition pruning does, but it can make certain operations more efficient.
For example, a join on a bucketed column can be executed as a map-side join (bucket join) without shuffling, if both datasets are bucketed and use the same number of buckets on the join key.
Also, reading one bucket (or a subset of buckets) can be used for sampling or debugging large tables.
Combined usage: Partitioning and bucketing can be used together on the same table.
For instance, you might partition a sales table by date (one directory per day) and bucket each day’s data by customer_id into, say, 16 buckets.
This way, queries that specify a date range will only scan those partitions, and operations on customer_id (like joins or aggregations) benefit from the even spread across buckets.
Combining these techniques provides a multi-level data organization that can optimize different aspects of query processing.
In summary, use partitioning for coarse-grained separation of data that aligns with common query filters, and use bucketing for fine-grained segmentation to evenly distribute data or to optimize joins and sampling.
Choosing the right strategy (or combination) depends on the data and query patterns: partitions to minimize data scanned, and buckets to balance data distribution and speed up specific operations.
Hive File Formats
Hive can store table data in a variety of file formats. The choice of file format affects storage efficiency, compression, and read/write performance.
Below are common file formats supported in Hive:
TEXTFILE: This is the default storage format in Hive (plain text files, often with values separated by delimiters like commas or tabs).
Text files are human-readable but not space-efficient (no compression by default) and can be slow to parse for large datasets.
Use TEXTFILE for simplicity or when working with small datasets, or if you need to easily view/edit the raw data. (Hive treats typical CSV/TSV files as TEXTFILE format tables.)
SEQUENCEFILE: A SequenceFile is a Hadoop binary file format that stores data as binary key-value pairs.
It is splittable (meaning large files can be processed in parallel), supports compression, and is more compact than plain text.
SequenceFiles are a good choice for intermediate data storage in Hadoop because they can be read and written efficiently by MapReduce jobs.
Custom (User-Defined) Format: Hive allows you to use custom input and output formats.
If your data is in a special format that Hive doesn’t natively recognize, you can provide your own implementation of an InputFormat/OutputFormat (along with a SerDe – serializer/deserializer) to read and write the data.
This is an advanced feature used for integrating Hive with specialized or proprietary file formats.
AVRO: Avro is a row-oriented storage format with a built-in schema definition.
Using Avro in Hive means leveraging Avro’s serialization framework and schema system. Avro files include the schema as part of the file, making them self-describing.
They are compact (support compression) and splittable.
Avro is particularly useful when you need to share data across different systems or programming languages, or when your data schema may evolve over time (Avro supports schema evolution).
RCFile (Record Columnar File): RCFiles were one of the first columnar storage formats in Hadoop.
In an RCFile, data is stored in a columnar fashion within each row group, which improves compression and query performance for certain workloads (especially when queries only need specific columns).
Internally, an RCFile is structured as a flat binary file (similar to a SequenceFile) with key-value pairs, but it organizes the values by column.
RCFile was a pioneering format for Hive, but it has largely been superseded by newer columnar formats like ORC and Parquet that offer better performance and features.
ORC (Optimized Row Columnar) File: ORC is a highly optimized columnar format introduced to improve upon RCFile.
ORC provides efficient compression and faster query performance, especially for read-heavy analytics.
Data in ORC files is stored in stripes, along with metadata and indexes (like min/max values for columns in each stripe).
This allows Hive to skip large portions of data when executing queries (for example, if the min/max of a stripe indicates no relevant values, that stripe can be skipped).
ORC offers excellent compression ratios and the ability to split files for parallel processing.
It is a great choice for Hive tables where query performance and storage efficiency are priorities (e.g., large fact tables).
The trade-off is that ORC can be slightly slower to write (due to the extra indexing and compression work).
Parquet: Parquet is another modern columnar file format, widely used in the Hadoop and Spark ecosystem.
Like ORC, Parquet stores data column-by-column and is optimized for efficient storage and retrieval.
Parquet is designed to handle complex nested data structures well (using a technique to encode nested fields).
It also provides effective compression and encoding schemes for different data types.
Parquet and ORC have similar goals — improving performance for analytical queries by reading only needed columns and compressing data — and they often achieve comparable performance.
One advantage of Parquet is its broad support beyond Hive (e.g., in Spark, Impala, etc.), making it a good choice for interoperability in big data environments.
Both Parquet and ORC can significantly reduce storage size and speed up queries compared to plain text.
In general, columnar formats (ORC, Parquet, RCFile) are preferred for large analytic workloads because they allow queries to read only the needed columns and they compress data effectively.
Row-oriented formats (Text, SequenceFile, Avro) might be suitable for use cases such as streaming data ingestion, interoperating with other systems, or scenarios where entire records are accessed frequently.
The choice of format should consider factors like file size, schema evolution needs, compression, and how the data will be accessed.
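To choose a format for a table, add a STORED AS clause to the CREATE TABLE statement. A minimal sketch, copying the earlier sales data into an ORC-backed table via CREATE TABLE … AS SELECT (the name sales_orc is just an example):

-- sales_orc is a hypothetical target table name
CREATE TABLE sales_orc
STORED AS ORC
AS SELECT * FROM sales;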
Exporting Data from Hive
There are times you may want to export data from Hive tables into external files (for example, to share results or to load into another system).
Hive provides a couple of ways to do this:
INSERT OVERWRITE DIRECTORY: You can write the results of a query to an HDFS directory in a specified format.
For example, to export a table’s contents as CSV, you can run:
INSERT OVERWRITE DIRECTORY '/user/yourusername/output/orders_2025'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT * FROM orders WHERE order_year = 2025;
This will write the query results into one or more files under the HDFS directory /user/yourusername/output/orders_2025, with the fields in each record separated by commas.
(Hive will create multiple part files if the query runs in parallel.)
You can then retrieve these files from HDFS and, if needed, concatenate them into a single CSV.
This method is often used to export Hive data for use in external systems.
Piping query output to a local file: If you are using the Hive CLI or Beeline, you can redirect the output of a SELECT query to your local filesystem.
For example, on a Unix-like system:
hive -e "SELECT * FROM orders WHERE order_year = 2025;" > orders_2025.csv
This command runs the Hive query and directs all output to the file orders_2025.csv on the local machine.
Be mindful that the output will include the raw rows (with default delimiters, typically ^A for Hive text output, unless you format them) and that this approach can be slow or memory-intensive for very large results, since data is pulled through the client.
When exporting, ensure the data is in a convenient format (using a ROW FORMAT clause in the query as shown, if necessary) and note that Hive does not automatically include column headers in the output.
If a header row is needed (e.g., for a CSV), you may have to prepend it manually.
Exporting data via INSERT OVERWRITE DIRECTORY is generally more scalable for large datasets, whereas piping output is handy for quick, smaller exports or ad-hoc analysis results.