Chapter 4

Hive Operators

Operators in Hive are special symbols or keywords used in expressions to perform operations on data values (columns, literals, and so on).

They are grouped by purpose: arithmetic, relational (comparison), logical, string, and complex-type operators.

Arithmetic Operators

Used for mathematical calculations.

Operator	Type	Description
A + B	All numeric	Adds A and B. The result type is the widest data type used in the expression (for example, INT + DOUBLE gives DOUBLE).
A - B	All numeric	Subtracts B from A. The result type is the widest data type used in the expression.
A * B	All numeric	Multiplies A by B. The result type is the widest data type used in the expression.
A / B	All numeric	Divides A by B. The result is a DOUBLE (double precision) in most cases, even when both operands are integers.
A % B	All numeric	Returns the remainder of A divided by B. The result type is the common type of the operands.
A & B	All numeric	Performs a bitwise AND. A result bit is 1 when the corresponding bits of both operands are 1; otherwise it is 0.
A | B	All numeric	Performs a bitwise OR. A result bit is 1 when the corresponding bit of either operand is 1; otherwise it is 0.
A ^ B	All numeric	Performs a bitwise XOR. A result bit is 1 if and only if exactly one of the corresponding operand bits is 1; otherwise it is 0.
~A	All numeric	Performs a bitwise NOT (inverts every bit of A).
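The result types and bitwise behaviour described in the table can be checked with literal values. This is a minimal sketch that assumes a Hive version allowing SELECT without a FROM clause (0.13 or later); the column aliases are illustrative only.

```sql
-- Division yields a fractional result even for integer operands,
-- while % keeps the common type of its operands.
SELECT 7 / 2 AS quotient,    -- 3.5
       7 % 2 AS remainder,   -- 1
       6 & 3 AS bit_and,     -- 2   (110 AND 011 = 010)
       6 | 3 AS bit_or,      -- 7   (110 OR  011 = 111)
       6 ^ 3 AS bit_xor,     -- 5   (110 XOR 011 = 101)
       ~6    AS bit_not;     -- -7  (every bit inverted, two's complement)
```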

select 1+9 from table_name limit 5;
select 40*5 from table_name limit 5;

Find the total revenue.

SELECT productname, price * quantity AS totalsales
FROM products p
JOIN orders o ON p.productid = o.productid;

Give a 10% discount on expensive goods (price above 1000).

SELECT productid, productname, price,
       price - (price * 0.10) AS price_after_10pct_discount
FROM products
WHERE price > 1000
LIMIT 20;

Show whether each order quantity is odd (modulus) and compute a fractional price per item

SELECT orderid, quantity,
       (quantity % 2) AS quantity_mod_2,
       CAST(p.price AS DOUBLE) / o.quantity AS price_per_item
FROM orders o
JOIN products p ON o.productid = p.productid
WHERE o.quantity > 0;

Relational Operators

Like many programming languages, Hive supports a set of relational operators. A relational operator compares two operands and produces TRUE or FALSE (or NULL when an operand is NULL).

Operator	Type	Description
`A = B`	All primitive	Returns TRUE if A is equal to B, otherwise FALSE. If A or B is NULL, the result is NULL.
`A <> B`	All primitive	Returns TRUE if A is not equal to B, otherwise FALSE. If A or B is NULL, the result is NULL.
`A < B`	All primitive	Returns TRUE if A is less than B, otherwise FALSE. If A or B is NULL, the result is NULL.
`A <= B`	All primitive	Returns TRUE if A is less than or equal to B, otherwise FALSE. If A or B is NULL, the result is NULL.
`A > B`	All primitive	Returns TRUE if A is greater than B, otherwise FALSE. If A or B is NULL, the result is NULL.
`A >= B`	All primitive	Returns TRUE if A is greater than or equal to B, otherwise FALSE. If A or B is NULL, the result is NULL.
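Because a comparison with NULL yields NULL rather than FALSE, a filter such as `country <> 'India'` silently drops rows whose country is NULL. The sketch below illustrates this with literals and with the customer table used in this chapter; `<=>` is Hive's NULL-safe equality operator, and the aliases are illustrative.

```sql
-- Comparisons involving NULL evaluate to NULL, not FALSE.
SELECT CAST(NULL AS INT) = 1                    AS eq_null,       -- NULL
       CAST(NULL AS INT) <> 1                   AS ne_null,       -- NULL
       CAST(NULL AS INT) IS NULL                AS is_null,       -- true
       CAST(NULL AS INT) <=> CAST(NULL AS INT)  AS null_safe_eq;  -- true

-- Keep customers with an unknown country as well as non-Indian ones.
SELECT custid, cust_name, country
FROM customer
WHERE country <> 'India' OR country IS NULL;
```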

Operator	Type	Description
`A LIKE B`	String	Returns NULL if A or B is NULL. Returns TRUE if string A matches the SQL pattern B, otherwise FALSE. In B, "_" matches any single character and "%" matches any sequence of characters. The pattern must match the whole string. For example, ('foobar' LIKE 'foo') returns FALSE, while ('foobar' LIKE 'foo___') and ('foobar' LIKE 'foo%') both return TRUE.
`A RLIKE B`	String	Returns NULL if A or B is NULL. Returns TRUE if any substring of A matches the Java regular expression B, otherwise FALSE. For example, ('foobar' RLIKE 'foo') returns TRUE, and so does ('foobar' RLIKE '^f.*r$').
`A REGEXP B`	String	Same as RLIKE.
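The difference between whole-string matching with LIKE and substring matching with RLIKE is easiest to see side by side. A small sketch using only literals (assuming SELECT without FROM is available); the aliases are illustrative.

```sql
SELECT 'foobar' LIKE 'foo'       AS like_prefix_only,  -- false: pattern must cover the whole string
       'foobar' LIKE 'foo___'    AS like_underscores,  -- true: three single-character wildcards
       'foobar' LIKE 'foo%'      AS like_percent,      -- true: % matches the rest
       'foobar' RLIKE 'foo'      AS rlike_substring,   -- true: regex may match a substring
       'foobar' RLIKE '^f.*r$'   AS rlike_anchored;    -- true: starts with f, ends with r
```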

Orders placed by a specific customer (custid = 1010)

SELECT orderid, custid, productid, quantity, orderdate
FROM orders
WHERE custid = 1010;

Products priced at or below 500

SELECT productid, productname, price
FROM products
WHERE price <= 500;

Customers not from 'India'

SELECT custid, cust_name, country
FROM customer
WHERE country <> 'India';

Customer names starting with "Customer_10" (in a LIKE pattern the underscore itself matches any single character)

SELECT custid, cust_name, country
FROM customer
WHERE cust_name LIKE 'Customer_10%';

Logical Operators

Logical operators in Hive are used to build logical expressions. Each returns TRUE, FALSE, or NULL depending on the Boolean values of its operands. Here NULL behaves as an "unknown" flag.

They allow us to combine multiple conditions.

Operator	Type	Description
`A AND B`	Boolean	TRUE if both A and B are TRUE. FALSE if either operand is FALSE. Otherwise the result is NULL (for example, TRUE AND NULL).
`A && B`	Boolean	Same as A AND B.
`A OR B`	Boolean	TRUE if either A or B is TRUE. FALSE if both are FALSE. Otherwise the result is NULL (for example, FALSE OR NULL).
`A || B`	Boolean	Listed in older Hive documentation as the same as A OR B. In Hive 2.2.0 and later, || is the string concatenation operator, so prefer OR.
`NOT A`	Boolean	TRUE if A is FALSE, NULL if A is NULL, otherwise FALSE.
`!A`	Boolean	Same as NOT A.
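The "unknown" behaviour of NULL can be verified directly. This is a sketch using literals only; the casts give NULL an explicit BOOLEAN type, and the aliases are illustrative.

```sql
SELECT (TRUE  AND CAST(NULL AS BOOLEAN)) AS true_and_null,   -- NULL  (unknown)
       (FALSE AND CAST(NULL AS BOOLEAN)) AS false_and_null,  -- false (FALSE decides the result)
       (TRUE  OR  CAST(NULL AS BOOLEAN)) AS true_or_null,    -- true  (TRUE decides the result)
       (FALSE OR  CAST(NULL AS BOOLEAN)) AS false_or_null,   -- NULL  (unknown)
       (NOT CAST(NULL AS BOOLEAN))       AS not_of_null;     -- NULL
```

A WHERE clause keeps a row only when its condition is TRUE, so rows for which the condition evaluates to NULL are filtered out just like FALSE rows.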

Orders in 2023 for expensive products (price > 1000)

SELECT o.orderid, o.custid, o.productid, o.quantity, o.orderdate
FROM orders o
JOIN products p ON o.productid = p.productid
WHERE year(o.orderdate) = 2023 AND p.price > 1000;

Customers located in either UK or Germany

SELECT custid, cust_name, country
FROM customer
WHERE country = 'UK' OR country = 'Germany';

Products that are not very expensive (NOT price > 1500)

SELECT productid, productname, price
FROM products
WHERE NOT (price > 1500);

Hive String Operators

Operator	Type	Description
`A || B`	Strings	Concatenates the operands; shorthand for concat(A, B). Supported as of Hive 2.2.0.
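For clusters running a Hive release older than 2.2.0, the concatenation operator can be replaced by the built-in concat() or concat_ws() functions. A sketch with literals; the sample name 'Asha' and the aliases are made up for illustration.

```sql
SELECT 'Hive' || ' ' || 'SQL'             AS with_operator,   -- Hive SQL (Hive 2.2.0 or later)
       concat('Hive', ' ', 'SQL')         AS with_function,   -- Hive SQL
       concat_ws(' - ', 'Asha', 'India')  AS with_separator;  -- Asha - India
```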

Add a "_VIP" tag to customers from India

SELECT custid,
       CONCAT(cust_name, '_VIP') AS vip_name,
       country
FROM customer
WHERE country = 'India';

Customer name and country in one column

SELECT custid,
       cust_name || ' - ' || country AS full_info
FROM customer;

Complex Type Operators

Used for arrays, maps, and structs.

Operator	Type	Description
`A[n]`	A is an Array and n is an int	Returns the nth element of array A. The first element has index 0.
`M[key]`	M is a Map<K, V> and key has type K	Returns the value corresponding to key in the map.
`S.x`	S is a struct	Returns the x field of S.
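All three access operators can be tried without creating a table by building the values inline with Hive's array(), map(), and named_struct() constructors. This is a sketch; the subquery alias t and the column names arr, m, and s are illustrative.

```sql
SELECT t.arr[0]  AS first_element,   -- 10
       t.m['b']  AS value_for_b,     -- 2
       t.s.x     AS field_x          -- 1
FROM (
  SELECT array(10, 20, 30)               AS arr,
         map('a', 1, 'b', 2)             AS m,
         named_struct('x', 1, 'y', 'two') AS s
) t;
```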

CREATE TABLE customer_orders (
  custid INT,
  orders ARRAY<INT>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY ':';

Copy the following data into a text file named customer_orders.txt and upload it to HDFS.

1001,1:2:3
1002,4:5
1003,6:7:8:9
1004,10:11
1005,12:13:14

Load the file into the table:

LOAD DATA INPATH '/Data/customer_orders.txt'
INTO TABLE customer_orders;

Access the first element of the array:

SELECT custid, orders[0] AS first_order FROM customer_orders;

Find customers who have order ID 7

SELECT custid
FROM customer_orders
WHERE array_contains(orders, 7);
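Besides indexing and array_contains(), the built-in size() function returns the number of elements in an array. A sketch against the customer_orders table loaded above; the alias order_count is illustrative.

```sql
-- Customers with at least three orders in their array.
SELECT custid,
       size(orders) AS order_count
FROM customer_orders
WHERE size(orders) >= 3;
-- With the sample file: 1001 -> 3, 1003 -> 4, 1005 -> 3
```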

Hive Functions

Functions in Hive are predefined (or user-defined) routines that process data and return results.

They are broadly categorized into built-in functions and User-Defined Functions (UDFs).

Mathematical Functions

Function Name	Return Type	Description
`round(DOUBLE X)`	DOUBLE	Returns X rounded to the nearest whole number.
`round(DOUBLE X, INT d)`	DOUBLE	Returns X rounded to d decimal places.
`bround(DOUBLE X)`	DOUBLE	Returns X rounded to the nearest whole number using the HALF_EVEN rounding mode.
`floor(DOUBLE X)`	BIGINT	Returns the largest BIGINT value that is less than or equal to X.
`ceil(DOUBLE X)`, `ceiling(DOUBLE X)`	BIGINT	Returns the smallest BIGINT value that is greater than or equal to X.
`rand()`, `rand(INT seed)`	DOUBLE	Returns a random number distributed uniformly between 0 and 1. Supplying a seed makes the generated sequence repeatable.

select productname, price*3.14 from products p;

round()

select productname, round(price*3.14, 1) from products p;

bround()

Rounds a number using banker's rounding (the HALF_EVEN method).

If the fractional part is exactly 0.5, it rounds to the nearest even integer.

select productname, price*3.14, bround(price*3.14) from products p;
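The two rounding modes differ only when the fractional part is exactly 0.5. The sketch below compares them on literal values, assuming a Hive release that includes bround() (1.3.0 or later). The comments give the numeric result; depending on the Hive version it may be displayed as 3 or 3.0.

```sql
SELECT round(2.5)   AS round_2_5,    -- 3  (tie rounds away from zero)
       bround(2.5)  AS bround_2_5,   -- 2  (tie rounds to the even neighbour)
       round(3.5)   AS round_3_5,    -- 4
       bround(3.5)  AS bround_3_5,   -- 4  (4 is already the even neighbour)
       floor(-2.5)  AS floor_neg,    -- -3 (largest integer <= -2.5)
       ceil(-2.5)   AS ceil_neg;     -- -2 (smallest integer >= -2.5)
```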

Random discount for lucky customers.

select price, rand() as discount from products p;

Function	Example	Description	Sample Output
`round(X)`	`round(45.67)`	Rounds to the nearest integer	46
`round(X,d)`	`round(45.678,2)`	Rounds to d decimal places	45.68
`bround(X)`	`bround(4.5)`	HALF_EVEN (banker's) rounding	4
`floor(X)`	`floor(45.89)`	Largest integer ≤ X	45
`ceil(X)`	`ceil(12.1)`	Smallest integer ≥ X	13
`rand()`	`rand()`	Random value between 0 and 1	0.34 (varies per call)
`rand(seed)`	`rand(10)`	Deterministic random value	repeatable for a given seed
Date Function

Function Name	Return Type	Description
`unix_timestamp()`	BIGINT	Returns the current Unix timestamp in seconds.
`to_date(string timestamp)`	string	Returns the date part of a timestamp string (a DATE value in Hive 2.1.0 and later).
`year(string date)`	INT	Returns the year part of a date or timestamp string.
`quarter(date/timestamp/string)`	INT	Returns the quarter of the year, in the range 1 to 4, for a date, timestamp, or string.
`month(string date)`	INT	Returns the month part of a date or timestamp string.
`hour(string date)`	INT	Returns the hour of the timestamp.
`minute(string date)`	INT	Returns the minute of the timestamp.
`date_sub(string startdate, int days)`	string	Subtracts the given number of days from the starting date.
`current_date`	date	Returns the current date at the start of query evaluation.
`last_day(string date)`	string	Returns the last day of the month to which the date belongs.
`trunc(string date, string format)`	string	Returns the date truncated to the unit specified by format.

Convert a date string to UNIX time

SELECT unix_timestamp('2024-11-05', 'yyyy-MM-dd') AS timestamp_value;
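The reverse conversion uses the built-in from_unixtime() function, which formats a Unix timestamp as a string. The sketch below round-trips the same date; both functions interpret the value in the session time zone, so the date comes back unchanged. The alias is illustrative.

```sql
SELECT from_unixtime(
         unix_timestamp('2024-11-05', 'yyyy-MM-dd'),  -- seconds since the epoch
         'yyyy-MM-dd'                                 -- output pattern
       ) AS round_trip;                               -- 2024-11-05
```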

Convert a timestamp to a date

SELECT to_date('2024-11-05 10:30:45') AS only_date;

Extract the year from a date

SELECT year(orderdate) AS order_year FROM orders o;

Return the quarter of the year (1–4)

SELECT quarter(orderdate) AS quarter_value FROM orders o;

Subtract a given number of days from a date

SELECT orderdate, date_sub(orderdate, 7) AS one_week_before FROM orders o;

Return the current system date

SELECT current_date();

Return the last day of the month for a given date

SELECT last_day(orderdate) AS last_day_of_month FROM orders o;

Function	Example	Output Example	Description
`unix_timestamp()`	`unix_timestamp()`	1730813242	Returns the current Unix timestamp in seconds.
`to_date()`	`to_date('2024-11-05 12:00:00')`	2024-11-05	Returns the date part (YYYY-MM-DD) of a datetime or timestamp string.
`year()`	`year('2024-11-05')`	2024	Extracts the year.
`quarter()`	`quarter('2024-11-05')`	4	Extracts the quarter (1–4).
`month()`	`month('2024-11-05')`	11	Extracts the month (1–12).
`hour()`	`hour('2024-11-05 13:40:00')`	13	Extracts the hour (0–23).
`minute()`	`minute('2024-11-05 13:40:00')`	40	Extracts the minute (0–59).
`date_sub()`	`date_sub('2024-11-05', 5)`	2024-10-31	Subtracts the given number of days from the starting date.
`current_date`	`current_date()`	2025-11-05	Returns the current date.
`last_day()`	`last_day('2024-02-12')`	2024-02-29	Returns the last day of the month containing the date.
`trunc()`	`trunc('2024-11-05','MM')`	2024-11-01	Truncates the date to the given unit (here, to the first day of the month).
Conditional operator examples (IF, CASE)

Simple IF: label each product 'Expensive' or 'Cheap'

SELECT productid, productname, price,
       IF(price > 1200, 'Expensive', 'Cheap') AS price_label
FROM products;

CASE: multi-level discount category based on price

SELECT productid, productname, price,
  CASE
    WHEN price >= 1500 THEN 'Tier A - 20% off'
    WHEN price >= 800  THEN 'Tier B - 10% off'
    ELSE 'Tier C - no discount'
  END AS discount_tier
FROM products;
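A CASE expression can also return a number and take part in arithmetic. The sketch below applies the same tiers to compute the discounted price; WHEN branches are evaluated in order, so a price of 1600 matches the first branch only. The alias discounted_price is illustrative.

```sql
SELECT productid, productname, price,
       price * CASE
                 WHEN price >= 1500 THEN 0.80   -- Tier A: 20% off
                 WHEN price >= 800  THEN 0.90   -- Tier B: 10% off
                 ELSE 1.00                      -- Tier C: no discount
               END AS discounted_price
FROM products;
```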

String Function

reverse(string X)

Reverses the characters of a string.

select reverse(cust_name) from customer c;

rpad(string str, int length, string pad)

Pads the right side of the string with pad up to the given length.

SELECT rpad(cust_name, 5, '*') AS result FROM customer;

Split product name by _

SELECT productid, productname, 
       split(productname, '_') AS name_parts
FROM products;

Function	Purpose
`reverse()`	Reverse a string
`rpad()`	Right-pad a string
`rtrim()`	Trim spaces from the right
`space()`	Create a string of N spaces
`split()`	Split a string into an array
`str_to_map()`	Convert text to a key-value map
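The functions in the table that have no example above can be tried on literal strings. This is a sketch; str_to_map() is called with ',' as the pair delimiter and ':' as the key-value delimiter, and the aliases are illustrative.

```sql
SELECT reverse('hive')                        AS reversed,     -- evih
       rpad('hi', 5, '*')                     AS right_padded, -- hi***
       rtrim('hive   ')                       AS right_trimmed,-- hive
       concat('a', space(3), 'b')             AS with_spaces,  -- a   b
       split('a_b_c', '_')[1]                 AS second_part,  -- b
       str_to_map('a:1,b:2', ',', ':')['b']   AS value_for_b;  -- 2
```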

Aggregation Function

Function Name	Return Type	Description
`count(*)`, `count(expr)`	BIGINT	count(*) returns the total number of retrieved rows, including rows that contain NULL values. count(expr) returns the number of rows for which expr is not NULL.
`sum(col)`, `sum(DISTINCT col)`	DOUBLE	Returns the sum of the elements in the group, or the sum of the distinct values of the column in the group.
`avg(col)`, `avg(DISTINCT col)`	DOUBLE	Returns the average of the elements in the group, or the average of the distinct values of the column in the group.
`min(col)`	DOUBLE	Returns the minimum value of the column in the group.
`max(col)`	DOUBLE	Returns the maximum value of the column in the group.

SELECT count(*) FROM customer;
SELECT sum(price) FROM products;
SELECT avg(price) FROM products;
SELECT min(price) FROM products;
SELECT max(price) FROM products;
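The queries above aggregate over a whole table. Combined with GROUP BY, the same functions return one row per group. A sketch using the orders table from earlier in the chapter; the aliases are illustrative.

```sql
-- One summary row per customer.
SELECT custid,
       count(*)                  AS order_count,        -- rows per customer
       count(DISTINCT productid) AS distinct_products,  -- different products ordered
       sum(quantity)             AS total_quantity
FROM orders
GROUP BY custid;
```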

UDF

A User-Defined Function (UDF) lets you extend Hive with custom logic that the built-in functions do not provide.

UDFs run inside Hive's JVM and are typically written in Java, although any JVM language can be used.

Hive has several extension points:

Simple UDF: row-wise scalar function. Extend org.apache.hadoop.hive.ql.exec.UDF.

Input – Single row

Output – Single row

UDAF (User-Defined Aggregate Function): aggregate functions in the style of SUM and AVG.

Input – Multiple rows

Output – Single row

Example: calculate a total, find an average

UDTF (User-Defined Table-Generating Function): emits 0..N rows per input row. Extend org.apache.hadoop.hive.ql.udf.generic.GenericUDTF or the older UDTF.

Input – One row

Output – Multiple rows

Example: split a string into multiple rows
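Hive's built-in explode() is itself a table-generating function, so it gives a feel for the one-row-in, many-rows-out behaviour without writing any Java. A sketch against the customer_orders table created earlier in this chapter; the aliases c, t, and order_id are illustrative.

```sql
-- Each array element becomes its own output row.
SELECT c.custid, t.order_id
FROM customer_orders c
LATERAL VIEW explode(c.orders) t AS order_id;
-- The sample row 1002,4:5 yields two rows: (1002, 4) and (1002, 5)
```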

When to use which

Use a simple UDF for trivial scalar operations with fixed types (e.g., string -> string).

Use a UDAF for computing aggregates across rows (a custom aggregator).

Use a UDTF to explode or expand one input row into multiple rows or columns.
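Once a UDF has been compiled and packaged, it is registered and called from HiveQL. The following is a sketch only: the JAR path, the class name com.example.hive.MaskName, and the function name mask_name are hypothetical placeholders, not part of Hive.

```sql
-- Make the packaged UDF classes visible to the session (hypothetical path).
ADD JAR /path/to/my-hive-udfs.jar;

-- Bind a SQL function name to the Java class (hypothetical class name).
CREATE TEMPORARY FUNCTION mask_name AS 'com.example.hive.MaskName';

-- Call it like any built-in scalar function.
SELECT custid, mask_name(cust_name) AS masked_name
FROM customer;

-- Remove the session-level binding when it is no longer needed.
DROP TEMPORARY FUNCTION IF EXISTS mask_name;
```

A temporary function lives only for the current session, which keeps experiments from leaking into the shared metastore.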

Variables in HIVE

Variables in Hive play a role similar to variables in other programming languages: they hold values that can change rather than being fixed.

We can pass a value to a variable when a script or query is executed.

hiveconf: configuration-level variable

hivevar: user-level variable for scripts

Set a variable

set emp_id=90;

Check its value

set emp_id;

Use a hiveconf variable in a query

SET limit=5;
SELECT * FROM emp LIMIT ${hiveconf:limit};

Use a hivevar variable in a query

SET hivevar:tablename=employee;
SELECT * FROM ${hivevar:tablename};
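Variables can appear anywhere in the query text, not only in LIMIT or FROM, because Hive substitutes the value into the statement before running it. A sketch that parameterises both the table and a filter threshold using the products table from this chapter; the variable names min_price and tablename are illustrative.

```sql
SET hivevar:tablename=products;
SET hivevar:min_price=1000;

-- After substitution this runs as:
--   SELECT ... FROM products WHERE price > 1000;
SELECT productid, productname, price
FROM ${hivevar:tablename}
WHERE price > ${hivevar:min_price};
```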
