Chapter 2: Table Management in Hive
Table management in Hive involves creating databases and tables, defining schemas, and organizing data storage.
In this section, we cover how to create and drop databases and tables, and how to manage table properties such as partitions and bucketing.
Databases
Hive supports multiple databases (namespaces) to organize tables.
You can create a new database using the CREATE DATABASE statement, optionally with properties. For example:
CREATE DATABASE myhivebook;
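The optional clauses let you attach a comment, choose a storage location, and set key/value properties. A minimal sketch (the comment text, path, and property values here are only illustrative):

-- comment, location, and properties below are example values
CREATE DATABASE IF NOT EXISTS myhivebook
COMMENT 'examples for the Hive book'
LOCATION '/user/hive/warehouse/myhivebook.db'
WITH DBPROPERTIES ('creator' = 'hive_user');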
You can inspect a database’s details (such as its location and properties) using the DESCRIBE DATABASE command.
For instance, to see extended information about the database myhivebook:
DESCRIBE DATABASE EXTENDED myhivebook;
To delete a database, use the DROP DATABASE statement.
Note: By default, Hive uses RESTRICT mode, which prevents dropping a database if it contains any tables.
You must first drop all tables in the database (or use the CASCADE option to force-drop everything). For example:
DROP DATABASE myhivebook CASCADE;
This will remove the database and all its tables. Without CASCADE, the command will fail if the database is not empty.
Creating Tables
To create a table in Hive, use the CREATE TABLE statement with the desired schema. For example, to create a simple table (the column names and types below are illustrative):
-- illustrative columns
CREATE TABLE students (
  student_id INT,
  name       STRING,
  gpa        DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
This creates a table named students with three columns.
The ROW FORMAT DELIMITED clause with FIELDS TERMINATED BY ',' indicates that the data is comma-separated (CSV format).
STORED AS TEXTFILE specifies the file format for storing data (in this case, plain text files, which is the default format).
If a table with the same name already exists, Hive will throw an error. You can avoid this by using IF NOT EXISTS:
CREATE TABLE IF NOT EXISTS students (...columns...) ...;
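You can confirm that the table exists and review its schema with the standard metadata commands:

SHOW TABLES;
DESCRIBE students;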
External Tables and Locations
By default, the above CREATE TABLE statement creates a managed (internal) table, which means Hive will manage the data’s lifecycle (more on that below).
If you want Hive to reference data at an existing location without moving it, you should create an external table.
Use the EXTERNAL keyword and specify a LOCATION:
-- illustrative columns
CREATE EXTERNAL TABLE web_logs (
  ip       STRING,
  url      STRING,
  hit_time STRING
)
LOCATION '/data/logs/2025/';
In this example, Hive will not move or delete data in /data/logs/2025/. The table simply points to that directory.
External tables are useful when the data is shared with other tools or should remain in its original location.
Partitioned Tables
Hive allows tables to be partitioned to organize data for faster queries.
You declare partitions at table creation using the PARTITIONED BY clause (the non-partition columns below are illustrative):
-- illustrative columns
CREATE TABLE sales (
  sale_id INT,
  product STRING,
  amount  DOUBLE
)
PARTITIONED BY (sale_year INT, sale_month STRING);
Here, the sales table is partitioned by year and month.
Hive will store each partition’s data in a separate directory, for example:
/user/hive/warehouse/sales/sale_year=2025/sale_month=Jan/...
Each distinct combination of sale_year and sale_month values corresponds to a subdirectory containing that partition’s data files.
Partitioning helps prune data during queries — e.g., a query with WHERE sale_year = 2025 will read only the directory for 2025 instead of scanning all data.
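As a sketch of partition pruning (using the illustrative sales columns defined above), the following query touches only the sale_year=2025 directories:

SELECT product, SUM(amount) AS total_amount
FROM sales
WHERE sale_year = 2025
GROUP BY product;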
Bucketed Tables (Clustering)
Hive also supports bucketing (clustering) of data within a table or partition.
Bucketing distributes rows into a fixed number of “buckets” (files) based on the hash of a column.
You specify bucketing in the table schema with CLUSTERED BY (the non-bucketing columns below are illustrative):
-- illustrative columns
CREATE TABLE user_logs (
  user_id  INT,
  action   STRING,
  log_time STRING
)
CLUSTERED BY (user_id) INTO 4 BUCKETS;
In this example, the data for user_logs will be split into 4 buckets according to the hash of user_id.
Each bucket is stored as a separate file.
Bucketing can be combined with partitioning (e.g., partition by date, then bucket by user_id within each date partition) to further organize the data.
We will discuss the benefits of bucketing below in the context of query performance.
Managed vs. External Tables
Hive tables can be managed (internal) or external, which affects how Hive handles the data storage:
Managed (Internal) Tables: Hive manages both the table’s metadata (schema) and the actual data files.
The data is stored in Hive’s warehouse directory (controlled by the hive.metastore.warehouse.dir property in hive-site.xml).
When you drop a managed table, Hive deletes both the metadata and the data files from HDFS.
Use managed tables when Hive alone owns the data. (By default, CREATE TABLE without the EXTERNAL keyword creates a managed table.)
External Tables: Hive only manages the metadata, and the data files remain in their original location (as specified by the LOCATION).
Dropping an external table will remove the table metadata from Hive, but will not delete the data in HDFS.
External tables are appropriate when the data is used outside of Hive or should not be moved into the warehouse (for example, data shared with HBase or external ingestion processes).
To create an external table, include the EXTERNAL keyword in the CREATE TABLE statement as shown above.
Summary of differences:
Managed table data is stored in Hive’s warehouse directory, and Hive fully controls it (no other system should write to it). Dropping the table deletes the data files along with the metadata.
External table data stays at an external location you specify. Hive does not move or delete it, and other applications can use or modify that data. Dropping the table leaves the data untouched.
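If you are unsure whether an existing table is managed or external, describe it; the Table Type field in the output shows MANAGED_TABLE or EXTERNAL_TABLE:

DESCRIBE FORMATTED web_logs;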
Loading Data into Hive Tables
Once a table is created, you can load data into it.
Hive provides a LOAD DATA command to import data from files into a table (for internal tables this will move or copy the file into Hive’s managed storage):
LOAD DATA [LOCAL] INPATH '/path/to/file.csv' INTO TABLE students;
This command loads the file file.csv into the Hive table’s storage location.
(If LOCAL is specified, the path is treated as a local filesystem path; without it, the path is assumed to be on HDFS).
You can also load data into a specific partition of a table:
LOAD DATA INPATH 'hdfs:///data/sales/2025_Jan.csv' INTO TABLE sales PARTITION(sale_year=2025, sale_month='Jan');
This would move the given file into the partition directory for sale_year=2025 and sale_month='Jan' in the sales table.
Alternatively, you can use INSERT INTO or INSERT OVERWRITE with a SELECT query to load or transform data into Hive tables.
For example, after creating a partitioned table, you might insert data from an unpartitioned source table, as explained next.
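As a minimal sketch for an unpartitioned target (assuming a staging table named students_staging with the same columns as students), INSERT OVERWRITE replaces the table’s existing contents, while INSERT INTO appends:

-- students_staging is a hypothetical staging table
INSERT OVERWRITE TABLE students
SELECT student_id, name, gpa
FROM students_staging;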
Partitioning in Hive
Partitioning is a technique to divide a table’s data into smaller parts based on the values of one or more partition columns.
Each distinct value of a partition column (or combination of values for multi-column partitioning) corresponds to a separate directory of data in HDFS.
Queries that filter on partition columns can retrieve data more efficiently by scanning only relevant partitions instead of the entire table.
For example, if you partition a log table by date, Hive will store logs for each date in its own folder (e.g., …/logs/date=2025-10-09/).
A query like SELECT * FROM logs WHERE date = '2025-10-09' will only read files in the date=2025-10-09 partition directory, skipping other dates entirely.
This can dramatically speed up query performance when working with large datasets.
Static vs. Dynamic Partitioning
Hive supports two methods of adding data into partitions:
Static Partitioning
With static partitioning, you specify the target partition values explicitly when loading or inserting data.
The user (or ETL process) determines which partition each data load goes into. Key points about static partitioning include:
You add each partition “statically” by specifying the partition key value(s) in the INSERT or LOAD command.
For example, if a table is partitioned by region, you might load a file into the partition for region='EU' explicitly.
Static partitioning is often preferred when you are loading large datasets file-by-file, as it gives you control to load one partition at a time.
It can be faster and use fewer resources than a fully dynamic load because each operation targets a single partition.
You can create or alter partitions manually. For instance, you can use ALTER TABLE … ADD PARTITION to add a new empty partition directory, then load data into it (see the sketch after this list).
Similarly, partitions can be dropped or renamed as needed.
A benefit of static partitioning is that you might derive the partition value from external information (such as a filename or date) without needing to parse each record in the data file.
For example, if your data files are named by date, you can use that date to decide the partition directory for the load.
By default, Hive’s dynamic partition mode is strict (hive.exec.dynamic.partition.mode = strict in the configuration).
In strict mode, at least one partition column must be given a static value in any INSERT statement into a partitioned table (to prevent accidentally generating too many partitions in one query).
This effectively means Hive expects static partitions unless you explicitly allow fully dynamic partitioning (described below).
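A minimal sketch of managing partitions by hand on the sales table (the partition values are just examples):

ALTER TABLE sales ADD PARTITION (sale_year=2025, sale_month='Feb');
ALTER TABLE sales DROP PARTITION (sale_year=2024, sale_month='Dec');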
Static partitioning example: Suppose you have the sales table partitioned by sale_year and sale_month as defined above.
Using static partitioning, to load data for January 2025 into this table, you could run (the source table sales_staging and its columns follow the illustrative schema used earlier):
INSERT INTO TABLE sales PARTITION(sale_year=2025, sale_month='Jan')
SELECT sale_id, product, amount
FROM sales_staging
WHERE sale_year = 2025 AND sale_month = 'Jan';
In this INSERT statement, we explicitly specify the target partition (sale_year=2025, sale_month='Jan').
All data from the source (sales_staging) that matches the WHERE clause will go into that single partition.
If that partition does not exist yet, Hive will create the directory for sale_year=2025/sale_month=Jan automatically.
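You can list the partitions that now exist on the table with:

SHOW PARTITIONS sales;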
Dynamic Partitioning
With dynamic partitioning, Hive determines the partition destinations at runtime based on the data.
You do not explicitly specify the partition values in the insert; instead, one INSERT can create multiple partitions in one go.
Characteristics of dynamic partitioning include:
A single INSERT … SELECT can populate multiple partitions in the target table.
The partition columns are included in the SELECT query, and Hive automatically routes each row to the appropriate partition directory based on those values.
Dynamic partitioning is convenient when you have a large dataset and want to split it into many partitions without writing separate load statements for each partition.
For example, you can insert a full year’s data and let Hive create partitions per month (or per day) for you.
Dynamic partitioning may be slower than static partitioning for very large data loads, because the job handles many partitions in one pass.
It is most useful when the number of partitions or their values are not known in advance or are too numerous for manual loading.
Hive requires certain settings to enable fully dynamic partitioning. Specifically, you need to allow dynamic partition creation and switch to non-strict mode:
SET hive.exec.dynamic.partition = true; (allows dynamic partition inserts)
SET hive.exec.dynamic.partition.mode = nonstrict; (allows all partition columns to be dynamic)
By default, Hive’s dynamic partition mode is “strict”, meaning at least one partition column must be manually specified (as mentioned above).
Setting it to nonstrict allows all partition columns to be filled dynamically from the data.
After data is loaded, dynamic partitions behave just like static partitions. You can query them and alter them (add/drop) as needed. The difference is only in how they were created.
Dynamic partitioning example: Using the same sales table, if you have an unpartitioned staging table containing sales data for multiple dates, you can populate all partitions in one operation using dynamic partitioning:
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- sales_staging columns follow the illustrative schema used earlier;
-- the partition columns come last in the SELECT, in PARTITION() order
INSERT INTO TABLE sales PARTITION(sale_year, sale_month)
SELECT sale_id, product, amount, sale_year, sale_month
FROM sales_staging;
Notice that the INSERT statement above does not explicitly specify any particular year or month.
The partition columns are listed in the PARTITION(…) clause without fixed values, and their values come from the SELECT query’s output (they must appear last in the SELECT, in the same order as in the PARTITION clause).
Hive will automatically create partition directories for each unique sale_year, sale_month combination in the data.
For instance, if the staging data contains sales for 2024 and 2025 across various months, Hive will create partitions like sale_year=2024/sale_month=Jan, sale_year=2024/sale_month=Feb, … sale_year=2025/sale_month=Dec as needed, and route each row to the appropriate partition.
Bucketing in Hive
Bucketing (also known as clustering) is another technique to organize data at a finer level than partitions.
When a table (or partition) is bucketed, Hive divides the data into a fixed number of buckets based on a hash of one or more columns (called the bucketing or clustered by columns).
Each bucket is stored as a separate file in HDFS.
Bucketing adds structure that can improve query efficiency in certain cases:
Even distribution: By hashing on a column like user_id and dividing by the number of buckets, Hive tries to distribute rows evenly across buckets.
This can prevent data skew and ensures that no single bucket file is overwhelmingly large or small.
Efficient sampling: Bucketing enables efficient sampling of data.
For example, you can use a query clause like TABLESAMPLE(BUCKET 3 OUT OF 8) on a table that is bucketed into 8 buckets (see the sketch after this list).
This will retrieve roughly 1/8 of the data (specifically, the third bucket from each partition), which is much faster than a random sample because it reads a subset of files completely.
Join optimization: If two tables are bucketed (and sorted) on the same key with the same number of buckets, Hive can perform bucket map joins (a type of map-side join).
This avoids a full shuffle of data during the join because matching rows will already be grouped into corresponding buckets.
In other words, the join can be performed by joining each pair of bucket files independently, which is more efficient.
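A minimal sketch of bucket-based sampling, using the purchase_logs table defined in the next example (bucketed into 8 buckets on user_id):

SELECT *
FROM purchase_logs TABLESAMPLE(BUCKET 3 OUT OF 8 ON user_id);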
To use bucketing, you define it in the table schema with CLUSTERED BY (… column list …) INTO N BUCKETS. For example (the non-bucketing columns are illustrative):
-- illustrative columns
CREATE TABLE purchase_logs (
  user_id INT,
  item    STRING,
  amount  DOUBLE
)
PARTITIONED BY (purchase_year INT)
CLUSTERED BY (user_id) INTO 8 BUCKETS;
In this table, data is first partitioned by year, and within each year partition it is further divided into 8 buckets based on the hash of user_id.
This means each year’s data is spread across 8 files (buckets).
If two tables share the same bucketing column and number of buckets (and if sorting on that column is enabled), queries joining them on that column can be optimized as described above.
Using bucketing: When inserting data into a bucketed table, you should set a Hive parameter to ensure correct bucketing:
SET hive.enforce.bucketing = true;
This ensures the hashing and bucketing logic is applied during insert, creating the correct number of bucket files.
(If you also declared a SORTED BY for sorted buckets, you would set hive.enforce.sorting=true similarly.)
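Putting it together, a minimal sketch of loading the bucketed table above (assuming a staging table named purchase_staging with matching columns):

-- purchase_staging is a hypothetical staging table
SET hive.enforce.bucketing = true;
INSERT INTO TABLE purchase_logs PARTITION (purchase_year=2025)
SELECT user_id, item, amount
FROM purchase_staging;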
Partitioning vs. Bucketing
Both partitioning and bucketing split data into multiple parts, but they have different use cases and advantages.
Here are the key differences and how to decide between them:
How data is split: Partitioning separates data into directories based on distinct values of a partition column. Each value (or combination of values) has its own folder.
Bucketing divides data into a fixed number of files (buckets) by hashing the value of a column; each bucket file contains a mix of values, determined by the hash function.
Use cases: Partitioning is very effective when there is a limited number of distinct values that are common filters in queries (e.g., partition by date, country, category).
Bucketing is useful to evenly distribute data on a column that has high cardinality (many unique values) or when you want to optimize operations like joins or sampling on that column.
You might partition by a coarse category (like year) and bucket by a finer identifier (like user_id) within each partition.
Number of subdivisions: Partitioning too finely (for example, creating a partition for every user in a large dataset) can lead to thousands of small directories and files, which is inefficient.
Bucketing, on the other hand, lets you choose a manageable number of buckets even for a very high-cardinality column, avoiding the overhead of too many small partitions.
In essence, partitioning is good for segregating data by a few logical keys, while bucketing is good for distributing data evenly when a key has many possibilities.
Query performance: Partitioning greatly speeds up queries that filter by partition columns by pruning irrelevant data – the query reads only the partitions it needs.
Bucketing doesn’t eliminate reading data the way partition pruning does, but it can make certain operations more efficient.
For example, a join on a bucketed column can be executed as a map-side join (bucket join) without shuffling, if both datasets are bucketed and use the same number of buckets on the join key.
Also, reading one bucket (or a subset of buckets) can be used for sampling or debugging large tables.
Combined usage: Partitioning and bucketing can be used together on the same table.
For instance, you might partition a sales table by date (one directory per day) and bucket each day’s data by customer_id into, say, 16 buckets.
This way, queries that specify a date range will only scan those partitions, and operations on customer_id (like joins or aggregations) benefit from the even spread across buckets.
Combining these techniques provides a multi-level data organization that can optimize different aspects of query processing.
In summary, use partitioning for coarse-grained separation of data that aligns with common query filters, and use bucketing for fine-grained segmentation to evenly distribute data or to optimize joins and sampling.
Choosing the right strategy (or combination) depends on the data and query patterns: partitions to minimize data scanned, and buckets to balance data distribution and speed up specific operations.
Hive File Formats
Hive can store table data in a variety of file formats. The choice of file format affects storage efficiency, compression, and read/write performance.
Below are common file formats supported in Hive:
TEXTFILE: This is the default storage format in Hive (plain text files, often with values separated by delimiters like commas or tabs).
Text files are human-readable but not space-efficient (no compression by default) and can be slow to parse for large datasets.
Use TEXTFILE for simplicity or when working with small datasets, or if you need to easily view/edit the raw data. (Hive treats typical CSV/TSV files as TEXTFILE format tables.)
SEQUENCEFILE: A SequenceFile is a Hadoop binary file format that stores data as binary key-value pairs.
It is splittable (meaning large files can be processed in parallel), supports compression, and is more compact than plain text.
SequenceFiles are a good choice for intermediate data storage in Hadoop because they can be read and written efficiently by MapReduce jobs.
Custom (User-Defined) Format: Hive allows you to use custom input and output formats.
If your data is in a special format that Hive doesn’t natively recognize, you can provide your own implementation of an InputFormat/OutputFormat (along with a SerDe – serializer/deserializer) to read and write the data.
This is an advanced feature used for integrating Hive with specialized or proprietary file formats.
AVRO: Avro is a row-oriented storage format with a built-in schema definition.
Using Avro in Hive means leveraging Avro’s serialization framework and schema system. Avro files include the schema as part of the file, making them self-describing.
They are compact (support compression) and splittable.
Avro is particularly useful when you need to share data across different systems or programming languages, or when your data schema may evolve over time (Avro supports schema evolution).
RCFile (Record Columnar File): RCFiles were one of the first columnar storage formats in Hadoop.
In an RCFile, data is stored in a columnar fashion within each row group, which improves compression and query performance for certain workloads (especially when queries only need specific columns).
Internally, an RCFile is structured as a flat binary file (similar to a SequenceFile) with key-value pairs, but it organizes the values by column.
RCFile was a pioneering format for Hive, but it has largely been superseded by newer columnar formats like ORC and Parquet that offer better performance and features.
ORC (Optimized Row Columnar) File: ORC is a highly optimized columnar format introduced to improve upon RCFile.
ORC provides efficient compression and faster query performance, especially for read-heavy analytics.
Data in ORC files is stored in stripes, along with metadata and indexes (like min/max values for columns in each stripe).
This allows Hive to skip large portions of data when executing queries (for example, if the min/max of a stripe indicates no relevant values, that stripe can be skipped).
ORC offers excellent compression ratios and the ability to split files for parallel processing.
It is a great choice for Hive tables where query performance and storage efficiency are priorities (e.g., large fact tables).
The trade-off is that ORC can be slightly slower to write (due to the extra indexing and compression work).
Parquet: Parquet is another modern columnar file format, widely used in the Hadoop and Spark ecosystem.
Like ORC, Parquet stores data column-by-column and is optimized for efficient storage and retrieval.
Parquet is designed to handle complex nested data structures well (using a technique to encode nested fields).
It also provides effective compression and encoding schemes for different data types.
Parquet and ORC have similar goals — improving performance for analytical queries by reading only needed columns and compressing data — and they often achieve comparable performance.
One advantage of Parquet is its broad support beyond Hive (e.g., in Spark, Impala, etc.), making it a good choice for interoperability in big data environments.
Both Parquet and ORC can significantly reduce storage size and speed up queries compared to plain text.
In general, columnar formats (ORC, Parquet, RCFile) are preferred for large analytic workloads because they allow queries to read only the needed columns and they compress data effectively.
Row-oriented formats (Text, SequenceFile, Avro) might be suitable for use cases such as streaming data ingestion, interoperating with other systems, or scenarios where entire records are accessed frequently.
The choice of format should consider factors like file size, schema evolution needs, compression, and how the data will be accessed.
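To choose a format for a table, add a STORED AS clause to the CREATE TABLE statement. A minimal sketch, copying the earlier sales data into an ORC-backed table via CREATE TABLE … AS SELECT (the name sales_orc is just an example):

-- sales_orc is a hypothetical target table name
CREATE TABLE sales_orc
STORED AS ORC
AS SELECT * FROM sales;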
Exporting Data from Hive
There are times you may want to export data from Hive tables into external files (for example, to share results or to load into another system).
Hive provides a couple of ways to do this:
INSERT OVERWRITE DIRECTORY: You can write the results of a query to an HDFS directory in a specified format.
For example, to export a table’s contents as CSV, you can run:
INSERT OVERWRITE DIRECTORY '/user/yourusername/output/orders_2025'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT * FROM orders WHERE order_year = 2025;
This will write the query results into one or more files under the HDFS directory /user/yourusername/output/orders_2025, with the fields in each record separated by commas.
(Hive will create multiple part files if the query runs in parallel.)
You can then retrieve these files from HDFS and, if needed, concatenate them into a single CSV.
This method is often used to export Hive data for use in external systems.
Piping query output to a local file: If you are using the Hive CLI or Beeline, you can redirect the output of a SELECT query to your local filesystem.
For example, on a Unix-like system:
hive -e "SELECT * FROM orders WHERE order_year = 2025;" > orders_2025.csv
This command runs the Hive query and directs all output to the file orders_2025.csv on the local machine.
Be mindful that the output will include the raw rows (with default delimiters, typically ^A for Hive text output, unless you format them) and that this approach can be slow or memory-intensive for very large results, since data is pulled through the client.
When exporting, ensure the data is in a convenient format (using a ROW FORMAT clause in the query as shown, if necessary) and note that Hive does not automatically include column headers in the output.
If a header row is needed (e.g., for a CSV), you may have to prepend it manually.
Exporting data via INSERT OVERWRITE DIRECTORY is generally more scalable for large datasets, whereas piping output is handy for quick, smaller exports or ad-hoc analysis results.