Hive Chapter 1 - Hive - Big Data | Dumpling's Blog = My Port

What is Hive?

Definition: Apache Hive is an open-source data warehouse framework built on Hadoop.

It lets users query and analyze large datasets stored in HDFS (Hadoop Distributed File System) using a SQL-like language called HiveQL (HQL).

Explanation:

Instead of writing Java MapReduce programs, Hive users write SQL-like queries.

Hive internally compiles HiveQL into MapReduce, Tez, or Spark jobs for execution.

It is not a traditional RDBMS. It is a query engine designed for batch processing, so it favors throughput on large scans over low-latency row-level lookups.
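To make the translation step more concrete, here is a minimal Python sketch of how a query like `SELECT region, COUNT(*) FROM sales GROUP BY region` could be expressed as map, shuffle, and reduce phases. It is a conceptual model only, not Hive's actual compiler output; the `SALES` rows and all function names are made up for illustration.

```python
from collections import defaultdict

# Hypothetical input rows: (region, amount). In Hive these would be files in HDFS.
SALES = [("Asia", 100), ("Europe", 250), ("Asia", 75), ("Europe", 40), ("Africa", 10)]

def map_phase(rows):
    """Emit a (key, 1) pair per row, keyed by the GROUP BY column."""
    for region, _amount in rows:
        yield region, 1

def shuffle_phase(pairs):
    """Group all values that share a key, as the shuffle step does."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Aggregate each key's values; here COUNT(*) is a sum of ones."""
    return {key: sum(values) for key, values in grouped.items()}

def count_by_region(rows):
    """Chain the three phases, roughly what one MapReduce job would do."""
    return reduce_phase(shuffle_phase(map_phase(rows)))

if __name__ == "__main__":
    print(count_by_region(SALES))
```

The point of the sketch is that the user only writes the one-line query; the map, shuffle, and reduce plumbing is what Hive generates on their behalf.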

Features of Hive

SQL-like query language (HiveQL).

Data warehousing capabilities (ETL, reporting, analytics).

Scalable and fault-tolerant (runs on a Hadoop cluster).

Extensible (supports User Defined Functions).

Schema on Read: the schema is applied when data is read, not when it is stored, which makes Hive flexible with semi-structured data.
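Schema on Read can be illustrated with a short Python sketch: the raw lines are stored untouched, and the schema is only applied when a row is read. The sample lines, the `SCHEMA` list, and the function names are assumptions for illustration; turning unparseable fields into `None` is meant to resemble how Hive's default text format yields NULL, not to reproduce its reader.

```python
# Raw file contents exactly as stored; nothing was validated at write time.
RAW_LINES = ["1,Jason,30", "2,Maria,not_a_number", "3,Chen"]

# Hypothetical table schema: (column name, converter) pairs applied at read time.
SCHEMA = [("id", int), ("name", str), ("age", int)]

def read_row(line, schema, delimiter=","):
    """Apply the schema to one raw line; bad or missing fields become None."""
    fields = line.split(delimiter)
    row = {}
    for index, (column, convert) in enumerate(schema):
        try:
            row[column] = convert(fields[index])
        except (IndexError, ValueError):
            row[column] = None  # comparable to a NULL in the query result
    return row

def read_table(lines, schema):
    """Read every line through the same schema."""
    return [read_row(line, schema) for line in lines]

if __name__ == "__main__":
    for row in read_table(RAW_LINES, SCHEMA):
        print(row)
```

Because the data is never rejected at load time, changing `SCHEMA` changes how the same files are interpreted without rewriting them.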

Hive Data Storage Model

Definition: Hive divides storage into table data (the actual user data in HDFS) and metadata (schema information stored in the Metastore).

Explanation:

Table Data → Stored in HDFS as files.

Metadata → Stored in the Metastore, which is backed by a relational database (e.g., MySQL or Derby). Includes table names, column details, partitions, table type (managed/external), data directory, etc.

Without the Metastore, Hive cannot function, because every query needs schema information to interpret the files.

Hive Architecture

Definition: Hive architecture defines how Hive receives a query, compiles it, optimizes it, and executes it on Hadoop.

Main Components & Explanation:

Clients → Interfaces for submitting queries:
- CLI (Command Line Interface) – Lets users run Hive queries directly from the terminal.
- Web UI – Browser-based interface for interacting with Hive.
- JDBC/ODBC drivers – Let applications (e.g., Java programs, BI tools like Tableau/Power BI) connect to Hive.
- Thrift Server – Provides cross-language support (Python, C++, etc.) using the Thrift protocol.

Services → Core components of Hive:
- Hive Driver – Receives queries from sources such as the Web UI, CLI, Thrift, and JDBC/ODBC drivers, and passes them to the compiler.
- Compiler – Parses the query and performs semantic analysis on the query blocks and expressions. It then converts the HiveQL statement into a plan of MapReduce (or Tez/Spark) tasks.
- Execution Engine – Executes the optimized query plan (a DAG of tasks) by submitting jobs to Hadoop, and manages execution order and dependencies. It works with Tez or Spark as well as MapReduce.
- MetaStore – Stores metadata (table schemas, partitions, column types, table locations, etc.) in an RDBMS such as MySQL or Derby.
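The Execution Engine's job of running a DAG of tasks in dependency order can be sketched with Python's standard `graphlib` module (Python 3.9+). The task names in `PLAN` are hypothetical and the loop only records the order; a real engine would submit a MapReduce, Tez, or Spark job at that point.

```python
from graphlib import TopologicalSorter

# Hypothetical plan: each task maps to the set of tasks it depends on.
PLAN = {
    "scan_orders": set(),
    "scan_customers": set(),
    "join": {"scan_orders", "scan_customers"},
    "aggregate": {"join"},
    "write_result": {"aggregate"},
}

def execution_order(plan):
    """Return one valid order in which every task runs after its dependencies."""
    return list(TopologicalSorter(plan).static_order())

def run(plan):
    """Walk the plan in dependency order, recording what would be submitted."""
    submitted = []
    for task in execution_order(plan):
        # Placeholder for submitting the task as a cluster job.
        submitted.append(task)
    return submitted

if __name__ == "__main__":
    print(run(PLAN))
```

The two scans have no dependency on each other, so their relative order is not fixed; only the constraint that `join` comes after both is guaranteed.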

Hive Data Models

Hive organizes data into Tables, Partitions, and Buckets.

Tables
- Similar to RDBMS tables.
- Can be Managed (Internal) or External. Dropping a managed table deletes its data; dropping an external table removes only the metadata.
- Metadata stored in Metastore.
Partitions
- Definition: Logical division of a table based on the values of partition columns (e.g., year=2023, region=Asia). Each partition is stored as its own directory in HDFS.
- Advantage: Queries that filter on a partition column scan only the relevant partitions instead of the whole dataset → reduces latency and improves performance.
Buckets
- Definition: Further subdivision of partitions (or tables) into a fixed number of files by applying a hash function to a column.
- Useful for sampling, and can speed up joins because rows with the same key land in the same bucket.
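A rough Python model of partitions and buckets, under stated assumptions: the `/warehouse/sales` path and the `PARTITIONS` list are hypothetical, and `bucket_for` uses an integer key as its own hash, which is a simplification (Hive's real hash depends on the column type and Hive version).

```python
def partition_path(table_dir, partition):
    """Build the directory for one partition, e.g. .../year=2023/region=Asia."""
    parts = [f"{column}={value}" for column, value in partition.items()]
    return "/".join([table_dir, *parts])

def prune_partitions(partitions, **filters):
    """Keep only partitions whose column values match every filter."""
    return [
        p for p in partitions
        if all(p.get(column) == value for column, value in filters.items())
    ]

def bucket_for(key, num_buckets):
    """Assign an integer key to a bucket by hash modulo the bucket count."""
    return key % num_buckets

# Hypothetical partitions of a sales table.
PARTITIONS = [
    {"year": 2022, "region": "Asia"},
    {"year": 2023, "region": "Asia"},
    {"year": 2023, "region": "Europe"},
]

if __name__ == "__main__":
    for p in prune_partitions(PARTITIONS, year=2023):
        print(partition_path("/warehouse/sales", p))
    print(bucket_for(10, 4), bucket_for(14, 4))
```

A filter on `year` discards whole directories before any file is opened, while the modulo step shows why two tables bucketed the same way on the join key keep matching keys in the same bucket number.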

Hive Data Types

Definition: Data types specify the kind of values that can be stored in Hive columns.

1. Primitive Data Types

Integer types → TINYINT, SMALLINT, INT, BIGINT

Floating-point → FLOAT, DOUBLE

BOOLEAN → true/false

STRING, CHAR, VARCHAR → Character data

TIMESTAMP, BINARY → Time and binary values

These map closely to Java types, since Hive is implemented in Java.

2. Complex Data Types

ARRAY → Ordered collection of elements of the same type, e.g. ARRAY(10,20,30)

MAP → Key-value pairs, e.g. MAP('Hadoop',70,'HBase',80,'Hive',85)

STRUCT → Collection of named fields that may have different types, e.g. NAMED_STRUCT('name','Jason','age',30,'salary',5000.0)

UNIONTYPE → Holds a value of exactly one of several possible data types
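For readers who think in a general-purpose language, the complex types have rough Python analogues. This is only an analogy built from the example values above; the `Employee` class and `describe` function are made up for illustration.

```python
from dataclasses import dataclass

@dataclass
class Employee:
    """Rough analogue of a STRUCT: named fields that may differ in type."""
    name: str
    age: int
    salary: float

scores = [10, 20, 30]                             # ARRAY: ordered, same element type
skills = {"Hadoop": 70, "HBase": 80, "Hive": 85}  # MAP: key-value pairs
employee = Employee("Jason", 30, 5000.0)          # STRUCT: named fields

def describe():
    """Access elements by index, by key, and by field name."""
    return scores[0], skills["Hive"], employee.name

if __name__ == "__main__":
    print(describe())
```

HiveQL uses the same three access styles: an index for arrays (starting at 0), a key in brackets for maps, and dot notation for struct fields.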

HiveServer2 (HS2)

Definition: HiveServer2 is a service that lets multiple clients connect to Hive concurrently over JDBC, ODBC, or Thrift.

Explanation:

Successor to HiveServer1, which supported only one client at a time.

Supports authentication, concurrency, and session management.

Works with BI tools (Tableau, Power BI, etc.).

Commands:

Start HS2:

hive --service hiveserver2

Connect via Beeline:

beeline> !connect jdbc:hive2://localhost:10000/default

User Defined Functions (UDFs) in Hive

Definition: UDFs extend Hive by letting users define custom functions.

Types:

- UDF (User Defined Function) → Scalar function: one input row produces one output value.
- UDAF (User Defined Aggregate Function) → Aggregate function: many input rows produce one output value.
- UDTF (User Defined Table-generating Function) → One input row produces multiple output rows.
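The difference between the three kinds is easiest to see from their input and output shapes. The Python functions below only model those shapes; real Hive UDFs are typically written in Java against Hive's own classes, and the names here are hypothetical.

```python
def udf_upper(value):
    """Scalar (UDF-style): one input value in, one output value out."""
    return value.upper()

def udaf_sum(values):
    """Aggregate (UDAF-style): many input rows in, one value out."""
    total = 0
    for value in values:
        total += value
    return total

def udtf_explode(row_id, items):
    """Table-generating (UDTF-style): one input row in, several rows out."""
    for item in items:
        yield row_id, item

if __name__ == "__main__":
    print(udf_upper("hive"))
    print(udaf_sum([1, 2, 3]))
    print(list(udtf_explode(1, ["a", "b"])))
```

Hive's built-in `upper`, `sum`, and `explode` functions follow the same three patterns.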
