HBase Final Note

Chapter 1: Introduction of HBase

第一章：HBase简介

What is HBase

什么是HBase

HBase is an ==open-source, column-oriented NoSQL== distributed database
HBase是一个==开源的、面向列的NoSQL==分布式数据库
Built on top of the ==Hadoop Distributed File System== (HDFS)
构建于==Hadoop分布式文件系统==(HDFS)之上
Written in ==Java==
使用==Java==编写
Supports ==real-time, random read/write== access
支持==实时、随机读/写==访问
Ideal for ==Big Data applications== with data size ranging from ==terabytes to petabytes==
是==大数据应用==的理想选择，数据规模可从==TB级到PB级==
Designed for large amounts of data
为海量数据设计
Schema-less at the column level: only column families are declared up front
在列级别上是无模式的 (Schema-less)：只需预先声明列族

Key Features of Apache HBase

Apache HBase的主要特性

==Low latency== access to large datasets
对大型数据集的==低延迟==访问
==Random read/write== operations
==随机读/写==操作
==Consistent== read and write support
支持==一致性==的读写
==Java API== available for client access
提供==Java API==供客户端访问
Stores data ==in tables with billions of rows and millions of columns==
数据存储在表中，可容纳==数十亿行和数百万列==

Row-Oriented vs Column-oriented Database

行式数据库 vs 列式数据库

Feature 特性	Row-Oriented 行式	Column-Oriented 列式
Access	Row by row	Column by column
访问方式	按行访问	按列访问
Usage	OLTP	OLAP
用途	OLTP (联机事务处理)	OLAP (联机分析处理)
I/O	Higher for entire rows	Reduced for specific columns
I/O	读取整行时I/O较高	读取特定列时I/O减少

Hadoop vs HBase

Feature 特性	Hadoop (HDFS)	HBase
Type	File System	NoSQL Database
类型	文件系统	NoSQL数据库
Access	Sequential	Random
访问方式	顺序访问	随机访问
Latency	High	Low
延迟	高	低
Use Case	Batch Processing	Real-time Processing
使用场景	批处理	实时处理

HBase vs RDBMS

Feature 特性	HBase	RDBMS
Schema	Schema-less	Fixed Schema
模式	无模式	固定模式
Orientation	Column-based	Row-based
存储方向	基于列	基于行
Table Type	Wide, sparse	Thin
表类型	宽表、稀疏表	窄表
Data Type	Semi-structured, structured	Structured
数据类型	半结构化、结构化	结构化

HBase Storage Mechanism

HBase存储机制

Tables consist of ==column families==
表由==列族==组成
Each column family can have ==any number of columns==
每个列族可以有==任意数量的列==
Schema defines ==only column families==, not columns
模式只定义==列族==，不定义列

Chapter 2: HBase Architecture

第二章：HBase架构

Overview

概述

HBase uses ==key-value pair== format
HBase使用==键值对==格式
Supports ACID semantics in a limited context (atomicity is guaranteed at the row level)
在有限的上下文中支持ACID语义（原子性在行级别得到保证）
Provides ==auto-sharding, low latency and schema flexibility==
提供==自动分片、低延迟和模式灵活性==

Components of HBase Architecture

HBase架构组件

HMaster

HMaster is the master server in the HBase architecture. It monitors all RegionServer instances in the cluster and is the interface for all metadata changes. In a typical deployment, the Master runs on the same node as the HDFS NameNode.
HMaster是HBase架构中的主服务器。它监控集群中所有的RegionServer实例，并作为所有元数据变更的接口。在典型部署中，Master与HDFS的NameNode运行在同一节点上。
Roles Performed by HMaster in HBase
HMaster在HBase中扮演的角色
- HBase HMaster performs DDL operations (create and delete tables) and assigns regions to the Region servers
- HBase HMaster执行DDL操作（创建和删除表），并将Region分配给RegionServer。
- When a client wants to change the schema or perform any metadata operation, HMaster handles the request.
- 当客户端想要更改模式或执行任何元数据操作时，由HMaster负责处理。
- Handles load balancing of the regions across region servers. It unloads the busy servers and shifts the regions to less occupied servers.
- 处理Region在RegionServer之间的负载均衡。它会卸载繁忙服务器上的Region，并将其转移到负载较轻的服务器上。
- Create table, remove table, enable, disable, add Column, modify Column, move regions, assign regions.
- 创建表、删除表、启用表、禁用表、添加列、修改列、移动Region、分配Region。
HMaster checks the health status of region servers
HMaster检查RegionServer的健康状态
- Acts as a ==controller== and handles ==metadata operations==
- 充当==控制器==并处理==元数据操作==
- Manages and assigns ==regions== to ==RegionServers==
- 管理==Region==并将其分配给==RegionServer==
- Responsible for ==DDL operations, schema changes, load balancing==, and ==monitoring==(administrative task)
- 负责==DDL操作、模式变更、负载均衡==和==监控==（管理任务）

RegionServer

RegionServers are the worker (slave) nodes in the HBase architecture. Each one serves and manages the regions (data) assigned to it in the distributed cluster. RegionServers run on the DataNodes of the Hadoop cluster.
RegionServer是HBase架构中的工作（从属）节点。每个RegionServer负责提供和管理分配给它的Region（数据）。RegionServer运行在Hadoop集群的DataNode上。
When a RegionServer receives a read or write request from a client, it routes the request to the specific region where the target column family resides. Clients contact RegionServers directly; they do not need to go through HMaster to communicate with them.
当RegionServer收到客户端的读写请求时，它会将请求路由到目标列族所在的特定Region。客户端直接与RegionServer通信，无需经过HMaster。
The client needs HMaster only for operations that involve metadata or schema changes.
只有在进行与元数据或模式变更相关的操作时，客户端才需要HMaster。
Roles performed by Region Server in HBase (HRegion)
RegionServer在HBase中扮演的角色 (HRegion)
- Communicate with the client and handle data-related operations.
- 与客户端通信并处理数据相关操作。
- Handle read and write requests for all the regions under it.
- 处理其下所有Region的读写请求。
- Decide the size of the region by following the region size thresholds.
- 遵循Region大小阈值来决定Region的大小。
- Writes data to MemStore (RAM) first, then to disk (HFiles on HDFS).
- 先将数据写入MemStore（RAM），然后再写入磁盘（HDFS上的HFile）。
- Flushes MemStore to HDFS when it reaches a threshold.
- 当MemStore达到阈值时，将其刷写到HDFS。
- ==Manages regions==(horizontal partitions of tables)
- ==管理Region==（表的水平分区）
- Handles ==client read/write== requests
- 处理==客户端的读/写==请求
- Writes to ==Memstore(RAM)== $\Longrightarrow$ then flushes to ==HFile==(HDFS) when threshold is reached
- 写入==Memstore(RAM)== $\Longrightarrow$ 达到阈值后刷写到==HFile(HDFS)==

ZooKeeper

==Central coordinator== for configuration and synchronization.
用于配置和同步的==中央协调器==。
Tracks ==server status, network partitions, and client communication==.
跟踪==服务器状态、网络分区和客户端通信==。
Maintains ==ephemeral nodes== for RegionServers.
为RegionServer维护==临时节点==。
Regions
Region
- Tables are split into regions (range of rows).
- 表被拆分成多个Region（行的范围）。
- Horizontal partitions of tables.
- 表的水平分区。
- Allow horizontal scalability.
- 实现水平扩展。
Other Key Concepts
其他关键概念
- Atomicity: operation is all-or-nothing
- 原子性: 操作要么全部成功，要么全部失败
- Consistency: Updates are serialized via RegionServers.
- 一致性: 更新通过RegionServer进行序列化。
- Sharding: Tables are automatically split and distributed.
- 分片: 表被自动拆分和分布。
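- High Availability: HBase aims for continuous uptime through standby Masters and automatic reassignment of regions when a RegionServer fails.
- 高可用性: HBase通过备用Master以及在RegionServer故障时自动重新分配Region，力求服务持续可用。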
Optimization Techniques
优化技术
- Block Cache: In-memory caching of frequently read data blocks.
- 块缓存 (Block Cache): 对频繁读取的数据块进行内存缓存。
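- Bloom Filter: In-memory probabilistic structure that quickly tells whether a row/column is definitely absent from a StoreFile, so reads can skip files that cannot contain it.
- 布隆过滤器 (Bloom Filter): 一种内存中的概率型结构，可快速判断某行/列是否一定不在某个StoreFile中，从而让读取跳过不可能包含该数据的文件。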
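
To make the Bloom filter idea concrete, here is a minimal Java sketch of such a structure. It is not HBase's implementation: the class name `TinyBloomFilter`, the bit-array size, and the two hash functions are assumptions chosen for illustration. It shows the key property that a negative answer is definitive, while a positive answer only means "possibly present".

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

// Illustrative only: a tiny Bloom filter, not the one HBase ships.
final class TinyBloomFilter {
    private final BitSet bits;
    private final int size;

    TinyBloomFilter(int size) {
        this.size = size;
        this.bits = new BitSet(size);
    }

    // Two simple hash functions over the key bytes (a hypothetical choice).
    private int hash1(byte[] key) {
        int h = 17;
        for (byte b : key) h = h * 31 + b;
        return Math.floorMod(h, size);
    }

    private int hash2(byte[] key) {
        int h = 7;
        for (byte b : key) h = h * 131 + (b ^ 0x5f);
        return Math.floorMod(h, size);
    }

    void add(String rowKey) {
        byte[] k = rowKey.getBytes(StandardCharsets.UTF_8);
        bits.set(hash1(k));
        bits.set(hash2(k));
    }

    // false => the key was definitely never added; true => it may have been added.
    boolean mightContain(String rowKey) {
        byte[] k = rowKey.getBytes(StandardCharsets.UTF_8);
        return bits.get(hash1(k)) && bits.get(hash2(k));
    }

    public static void main(String[] args) {
        TinyBloomFilter filter = new TinyBloomFilter(1024);
        filter.add("row-001");
        System.out.println(filter.mightContain("row-001")); // true for every added key
        System.out.println(filter.mightContain("row-999")); // usually false, but may be a false positive
    }
}
```

A read path would consult such a filter per file and open only the files for which `mightContain` returns true.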

Chapter 3: HBase Shell and Commands

第三章：HBase Shell及命令

Working with HBase Shell

使用HBase Shell

Start Hadoop → start-all.sh
启动Hadoop → start-all.sh
Start HBase → start-hbase.sh
启动HBase → start-hbase.sh
Launch Shell → hbase shell
启动Shell → hbase shell
Exit → exit
退出 → exit

General Commands

通用命令

status: Shows system status.
status: 显示系统状态。
version: Current HBase version.
version: 当前HBase版本。
whoami: Current user info.
whoami: 当前用户信息。
table_help: Help with table commands.
table_help: 表相关命令的帮助。

Data Definition Language(DDL)

数据定义语言(DDL)

Command 命令	Description 描述
`create 'table','cf'`	Create table
`create 'table','cf'`	创建表
`list`	List tables
`list`	列出所有表
`describe 'table'`	Table schema
`describe 'table'`	表的模式信息
`exists 'table'`	Check table existence
`exists 'table'`	检查表是否存在
`enable/disable 'table'`	Enable/disable a table
`enable/disable 'table'`	启用/禁用一个表

Command 命令	Description 描述
`drop 'table'`	Drop table
`drop 'table'`	删除表
`alter`	Modify table (add/delete CF, replication, memory settings)
`alter`	修改表（添加/删除列族、副本、内存设置）

Data Manipulation Language(DML)

数据操作语言(DML)

Command 命令	Description 描述
`put 'table','row','cf:col','val'`	Insert data
`put 'table','row','cf:col','val'`	插入数据
`get 'table','row'`	Retrieve row
`get 'table','row'`	检索行
`delete 'table','row','cf:col'`	Delete cell
`delete 'table','row','cf:col'`	删除单元格
`deleteall 'table','row'`	Delete entire row
`deleteall 'table','row'`	删除整行
`scan 'table'`	Scan all rows
`scan 'table'`	扫描所有行
`count 'table'`	Count rows
`count 'table'`	统计行数
`truncate 'table'`	Drop and recreate table
`truncate 'table'`	清空并重建表

Snapshot and Cloning

快照与克隆

Command 命令	Description 描述
`snapshot 'source','snap'`	Create snapshot
`snapshot 'source','snap'`	创建快照
`clone_snapshot 'snap','target'`	Clone snapshot to new table
`clone_snapshot 'snap','target'`	从快照克隆出新表

Security & ACL

安全与访问控制列表(ACL)

Access Control Lists (ACLs) in HBase allow you to control access to specific tables and columns based on user roles and permissions. This is crucial for managing who can read, write, modify, or delete data in HBase (To manage user permissions on HBase resources).
HBase中的访问控制列表(ACL)允许您根据用户角色和权限来控制对特定表和列的访问。这对于管理谁可以读取、写入、修改或删除HBase中的数据至关重要（用于管理用户对HBase资源的权限）。

Permissions: READ (R), WRITE (W), EXECUTE (X), CREATE (C), ADMIN (A)
权限: READ(读), WRITE(写), EXECUTE(执行), CREATE(创建), ADMIN(管理)
Commands
命令

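# Grant permissions to a user (授予用户权限); 'perm' is a string of permission letters such as 'RW'
grant 'user','perm','table','cf','col'
# Revoke a user's permissions on a table (撤销用户权限)
revoke 'user','table'
# View permissions on a table (查看用户权限)
user_permission 'table'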

Java API for HBase

用于HBase的Java API

Maven Dependency

Maven依赖

<dependency>
  <groupId>org.apache.hbase</groupId>
  <artifactId>hbase-client</artifactId>
  <version>2.4.17</version>
</dependency>

Configuration

配置

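import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
Configuration config = HBaseConfiguration.create();
// For remote access (用于远程访问)
config.set("hbase.zookeeper.quorum", "192.168.56.102:2181");
Connection connection = ConnectionFactory.createConnection(config);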

Admin Operations

管理操作

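import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptor;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.TableDescriptor;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.util.Bytes;

Admin admin = connection.getAdmin();
TableName tableName = TableName.valueOf("Table24");

// Column family (列族)
ColumnFamilyDescriptor cf1 = ColumnFamilyDescriptorBuilder.newBuilder(Bytes.toBytes("cf1")).build();

// Table descriptor (表描述符)
TableDescriptor tableDesc = TableDescriptorBuilder.newBuilder(tableName)
                         .setColumnFamily(cf1)
                         .build();

admin.createTable(tableDesc);  // create the table (创建表)
admin.tableExists(tableName);  // check whether the table exists (检查表是否存在)
admin.listTableNames();        // list all table names (列出所有表名)
admin.disableTable(tableName); // disable the table (禁用表)
admin.enableTable(tableName);  // enable the table (启用表)
admin.disableTable(tableName); // a table must be disabled before it can be deleted
admin.deleteTable(tableName);  // delete the table (删除表)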

Chapter 4: HBase Data Model

第四章：HBase数据模型

In HBase, data is stored in tables, which consist of rows and columns. HBase is a column family-oriented data store. The HBase Data Model includes several core components: Tables, Rows, Column Families, Cells, Columns, and Versions.
在HBase中，数据存储在由行和列组成的表中。HBase是一个面向列族的数据存储。HBase数据模型包括几个核心组件：表、行、列族、单元格、列和版本。

Tables contain column families and rows, with each row identified by its row key. Each column represents an attribute of the stored objects. Rows are grouped by key range into Regions, which are distributed across RegionServers.
表包含列族和行，每一行由其行键标识。每一列代表存储对象的一个属性。行按键范围分组为Region，并分布在各个RegionServer上。

Tables

表

Data is organized into tables.
数据被组织成表。
Each table has rows and columns; each cell is versioned.
每张表都有行和列；每个单元格都有版本控制。
Rows are arranged lexicographically by their keys.
行根据其键按字典序排列。
Tables start with one region and split as data grows.
表开始时只有一个Region，随着数据增长而分裂。

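Lexicographic order here means byte-by-byte comparison of the row keys, similar to dictionary ordering on ASCII values. For example, the key row10 sorts before row2 because the byte for 1 is smaller than the byte for 2.
这里的字典序是指对行键逐字节进行比较，类似于基于ASCII值的字典式排序。例如，行键 row10 排在 row2 之前，因为字符 1 的字节值小于字符 2 的字节值。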
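
The ordering rule can be tried out in plain Java. The sketch below is only an illustration under stated assumptions: it needs Java 9 or later for `Arrays.compareUnsigned`, and the class `RowKeyOrder` and the helper `padded` are hypothetical names, not part of the HBase API. It sorts keys by their bytes and shows how zero-padding a numeric suffix makes byte order agree with numeric order.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class RowKeyOrder {
    // Sorts keys by unsigned byte-wise comparison of their UTF-8 bytes.
    static List<String> sortAsBytes(List<String> keys) {
        List<String> sorted = new ArrayList<>(keys);
        sorted.sort((a, b) -> Arrays.compareUnsigned(
                a.getBytes(StandardCharsets.UTF_8),
                b.getBytes(StandardCharsets.UTF_8)));
        return sorted;
    }

    // Hypothetical helper: zero-pad a numeric id so byte order matches numeric order.
    static String padded(String prefix, int id) {
        return String.format("%s%05d", prefix, id);
    }

    public static void main(String[] args) {
        // prints [row1, row10, row2]
        System.out.println(sortAsBytes(Arrays.asList("row2", "row10", "row1")));
        // prints [row00001, row00002, row00010]
        System.out.println(sortAsBytes(Arrays.asList(
                padded("row", 2), padded("row", 10), padded("row", 1))));
    }
}
```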

Rows

行

Identified by row keys (byte arrays).
通过行键（字节数组）来标识。
Sorted based on their keys.
根据其键进行排序。
Row key design affects data distribution and access.
行键的设计影响数据的分布和访问。

Columns

列

Multiple columns per row.
每行可以有多个列。
Grouped into column families.
列被分组到列族中。
Same-family columns are stored together physically.
同一列族的列在物理上存储在一起。

Column Families

列族

Logical and physical grouping of related columns.
相关列的逻辑和物理分组。
Share the same prefix.
共享相同的前缀。
Stored together with shared storage settings (compression, TTL, versioning).
存储在一起，并共享存储设置（压缩、TTL、版本控制）。
Recommended to have similar access patterns and data sizes.
建议将具有相似访问模式和数据大小的列放在同一列族。

Column Qualifiers

列限定符

Identify specific columns within a family.
标识一个列族中的具体列。
Used to uniquely identify cells along with row key, family, and timestamp.
与行键、列族和时间戳一起唯一地标识单元格。

Cells and Versions

单元格和版本

Intersection of row, family, and qualifier.
行、列族和限定符的交集。
Multiple versions per cell, identified by timestamps.
每个单元格可以有多个版本，通过时间戳来标识。
By default a cell keeps 1 version; this can be raised per column family with the VERSIONS attribute (for example, to 5).
默认情况下单元格保留1个版本，但可以通过列族的VERSIONS属性调高（例如调到5）。

Regions and Region Servers

Region和RegionServer

Regions are ranges of rows.
Region是行的范围。
Tables start with one region and split with growth.
表开始时只有一个Region，并随数据增长而分裂。
Region Servers manage regions and handle requests.
RegionServer管理Region并处理请求。
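
Because each region covers a contiguous range of row keys, finding the region for a row amounts to a sorted lookup on region start keys. The following Java sketch illustrates that idea only. The class `ToyRegionMap` and the server names are made up, and it compares keys as Java strings, which matches byte order only for ASCII keys; real HBase clients resolve regions through the meta table.

```java
import java.util.Map;
import java.util.TreeMap;

final class ToyRegionMap {
    // Maps each region's start key to the server hosting it (names are made up).
    private final TreeMap<String, String> startKeyToServer = new TreeMap<>();

    void addRegion(String startKey, String server) {
        startKeyToServer.put(startKey, server);
    }

    // The region holding a row is the one with the greatest start key <= rowKey.
    String locate(String rowKey) {
        Map.Entry<String, String> entry = startKeyToServer.floorEntry(rowKey);
        return entry == null ? null : entry.getValue();
    }

    public static void main(String[] args) {
        ToyRegionMap regions = new ToyRegionMap();
        regions.addRegion("", "rs1");  // first region starts at the empty key
        regions.addRegion("g", "rs2");
        regions.addRegion("p", "rs3");
        System.out.println(regions.locate("apple")); // rs1
        System.out.println(regions.locate("hbase")); // rs2
        System.out.println(regions.locate("zoo"));   // rs3
    }
}
```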

Summary Diagram

结构总结图

Table (表)
 └── Row (Row Key) (行 (行键))
      └── Column Family (列族)
           └── Column Qualifier (列限定符)
                └── Cell (Value, Timestamp) (单元格 (值, 时间戳))

Column Metadata

列元数据

Describes column attributes like name, type, compression, versioning.
描述列的属性，如名称、类型、压缩、版本控制。
Helps organize and understand data structure.
有助于组织和理解数据结构。

Timestamp

时间戳

Identifier for each version of a cell.
每个单元格版本的标识符。
Defaults to RegionServer time; custom timestamps can be set.
默认为RegionServer的时间；也可以设置自定义时间戳。

Versions

版本

Tracks changes to cell values over time.
跟踪单元格值随时间的变化。
Useful for auditing and historical queries.
对于审计和历史查询很有用。
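
How timestamps and a version limit interact can be sketched in a few lines of Java. This is a toy model, not HBase code: the class `VersionedCell` is hypothetical, and it trims old versions immediately on write, whereas HBase removes excess versions later during compaction.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.TreeMap;

final class VersionedCell {
    private final int maxVersions;
    // Keys are timestamps, ordered newest first.
    private final TreeMap<Long, String> versions = new TreeMap<>(Collections.reverseOrder());

    VersionedCell(int maxVersions) {
        this.maxVersions = maxVersions;
    }

    void put(long timestamp, String value) {
        versions.put(timestamp, value);
        while (versions.size() > maxVersions) {
            versions.pollLastEntry(); // last entry in reverse order = oldest version
        }
    }

    // Returns up to n values, newest first.
    List<String> get(int n) {
        List<String> out = new ArrayList<>();
        for (String value : versions.values()) {
            if (out.size() == n) break;
            out.add(value);
        }
        return out;
    }

    public static void main(String[] args) {
        VersionedCell cell = new VersionedCell(3);
        cell.put(1L, "a");
        cell.put(2L, "b");
        cell.put(3L, "c");
        cell.put(4L, "d"); // version "a" is dropped
        System.out.println(cell.get(2));  // [d, c]
        System.out.println(cell.get(10)); // [d, c, b]
    }
}
```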

Activity 4.1: Version Activity

实践 4.1：版本操作

Change table version setting:
更改表的版本设置：

alter 'table_nm', NAME => 'column_family', VERSIONS => '3'

Add data:
添加数据：

put 'table_nm','row_id','column_family:qualifier','value'

Retrieve data:
检索数据：

get 'table_nm','row_id', COLUMN=>'column_family:qualifier', VERSIONS=>2

Chapter 5: Schema Design Approach

第五章：模式设计方法

Design schema based on query patterns. Example: Users following other users (TwitBase use case). Define access patterns early.
根据查询模式设计模式。示例：用户关注其他用户（TwitBase用例）。尽早定义访问模式。

Key Considerations

关键考虑因素

Number of column families.
列族的数量。
Data distribution among families.
数据在列族间的分布。
Columns per family.
每个列族的列数。
Column names.
列名。
Number of versions.
版本数量。
Row key structure.
行键结构。

Schema Design Rules

模式设计规则

Know your queries.
了解你的查询。
Keep 1–3 column families.
保持1-3个列族。
Region size: 10–50 GB.
Region大小：10–50 GB。
Cell size: <10 MB, <50 MB with MOB.
单元格大小：小于10MB，使用MOB时小于50MB。
50–100 regions per table.
每个表50–100个Region。
Short column family names.
使用简短的列族名称。
Allow more regions if most writes go to few active regions.
如果大部分写入集中在少数几个活跃的Region，可以允许更多的Region。
Write load affects memory.
写入负载会影响内存。

MOB(Medium Object Storage)

MOB(中等对象存储)

For cells up to 50 MB.
用于最大50MB的单元格。
MOB stores large cells in separate MOB files, so regular compactions do not have to rewrite them repeatedly.
MOB将大单元格存放在独立的MOB文件中，使常规合并（compaction）不必反复重写它们，从而提高效率。

Split Policy Configuration

分裂策略配置

Edit hbase-site.xml
编辑 hbase-site.xml

<property>
  <name>hbase.hregion.max.filesize</name>
  <!-- 50 GB expressed in bytes -->
  <value>53687091200</value>
</property>
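
The byte value in this property is simply a size in GB multiplied out. As a rough sketch, the Java below reproduces that arithmetic and estimates how many regions a table of a given size would occupy. The class `RegionSizing` and its methods are hypothetical, and the estimate ignores compression, the active split policy, and uneven key distribution.

```java
final class RegionSizing {
    static final long GB = 1024L * 1024L * 1024L;

    // Bytes for a size given in GB, e.g. the value placed in hbase.hregion.max.filesize.
    static long gbToBytes(long gb) {
        return gb * GB;
    }

    // Rough number of regions for a table of the given size (ceiling division).
    static long estimateRegions(long tableBytes, long maxRegionBytes) {
        return (tableBytes + maxRegionBytes - 1) / maxRegionBytes;
    }

    public static void main(String[] args) {
        System.out.println(gbToBytes(50));                                 // 53687091200
        System.out.println(estimateRegions(gbToBytes(1200), gbToBytes(50))); // 24
    }
}
```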

Java Heap

Java堆

Memory managed by JVM.
由JVM管理的内存。
Stores Java objects used in HBase operations.
存储HBase操作中使用的Java对象。

Row Key Design

行键设计

Ensure uniqueness.
确保唯一性。
Optimize for access patterns.
针对访问模式进行优化。
Avoid hot spotting:
避免热点问题：
- Salting
- 加盐 (Salting)
- Hashing
- 哈希 (Hashing)
- Composite keys (e.g., department and ID combined as dept_ID)
- 复合键 (例如, 将部门与ID组合为 dept_ID)
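
Salting can be illustrated with a short Java sketch. It assumes a bucket count chosen by the designer and derives the salt from a hash of the key itself, so the same key always maps to the same prefix and can still be fetched directly. The class `SaltedKey`, the hash, and the `NN-` prefix format are illustrative choices, not an HBase API.

```java
import java.nio.charset.StandardCharsets;

final class SaltedKey {
    // Prefixes the key with a bucket number derived from a hash of the key,
    // spreading otherwise sequential keys across `buckets` key ranges.
    static String salt(String rowKey, int buckets) {
        int h = 0;
        for (byte b : rowKey.getBytes(StandardCharsets.UTF_8)) {
            h = 31 * h + (b & 0xff);
        }
        int bucket = Math.floorMod(h, buckets);
        return String.format("%02d-%s", bucket, rowKey);
    }

    public static void main(String[] args) {
        // Sequential keys (e.g. timestamps) would otherwise land in the same region.
        for (int i = 1; i <= 4; i++) {
            String key = "20240101-000" + i;
            System.out.println(salt(key, 8));
        }
    }
}
```

The trade-off is that a range scan over the original key order now has to query every bucket and merge the results.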

Supported Datatypes

支持的数据类型

Internally stored as byte arrays.
内部存储为字节数组。
Supports counters for atomic increments.
支持用于原子性递增的计数器。

Secondary Indexes

二级索引

Secondary indexes enable efficient data retrieval based on criteria other than the primary key (the row key).
二级索引能够基于主键（行键）以外的条件实现高效的数据检索。

In HBase, data is distributed across regions and RegionServers, with each table split into multiple regions by row key. HBase is optimized for fast reads and writes by row key, but it has no native secondary indexes, which complicates queries on non-key columns. Several strategies make such queries efficient, including manually maintained index tables and alternate query paths.
在HBase中，数据分布在不同的Region和RegionServer上，每个表根据行键被拆分成多个Region。HBase针对基于行键的高速读写进行了优化，但没有原生的二级索引，这使得基于非键列的查询变得复杂。有几种策略可以让这类查询变得高效，包括手动维护的索引表和备用查询路径。

Secondary indexes in HBase are additional tables or structures that enable efficient querying on columns other than the primary row key. These indexes maintain a mapping from the secondary index column to the row keys of the original table.
HBase中的二级索引是额外的表或结构，可以实现对主行键以外的列进行高效查询。这些索引维护了从二级索引列到原始表行键的映射。

HBase lacks native support.
HBase缺乏原生支持。
Create additional tables for mapping secondary to primary keys.
创建额外的表来映射二级键到主键。
Requires maintenance and consistency handling.
需要维护和处理一致性。
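
The index-table approach and its maintenance burden can be modeled in memory. In the Java sketch below, two maps stand in for the main table and the index table; the class `UserTableWithIndex` and the email column are hypothetical. Note that every write has to update both structures, which is exactly the consistency work described above.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

final class UserTableWithIndex {
    // Stand-in for the main table: row key (user id) -> email.
    private final Map<String, String> users = new HashMap<>();
    // Stand-in for the index table: email -> row keys that carry it.
    private final TreeMap<String, Set<String>> emailIndex = new TreeMap<>();

    void put(String userId, String email) {
        String old = users.put(userId, email);
        if (old != null) {
            // Remove the stale index entry so the index stays consistent.
            Set<String> ids = emailIndex.get(old);
            if (ids != null) {
                ids.remove(userId);
                if (ids.isEmpty()) emailIndex.remove(old);
            }
        }
        emailIndex.computeIfAbsent(email, k -> new TreeSet<>()).add(userId);
    }

    // Lookup by the secondary column without scanning the main table.
    Set<String> findByEmail(String email) {
        return emailIndex.getOrDefault(email, Collections.emptySet());
    }

    public static void main(String[] args) {
        UserTableWithIndex table = new UserTableWithIndex();
        table.put("u1", "a@example.com");
        table.put("u2", "a@example.com");
        table.put("u1", "b@example.com"); // u1 changes email
        System.out.println(table.findByEmail("a@example.com")); // [u2]
        System.out.println(table.findByEmail("b@example.com")); // [u1]
    }
}
```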

Alternate Query Paths

备用查询路径

Use filters and scans.
使用过滤器和扫描。
Coprocessors for custom query logic.
使用协处理器实现自定义查询逻辑。

Coprocessors

协处理器

A coprocessor in HBase is a custom Java program that runs directly within the HBase region servers. It allows developers to extend the functionality of HBase by intercepting and processing data manipulation operations (such as Put, Get, and Delete) on HBase tables.
HBase中的协处理器是直接在HBase RegionServer内部运行的自定义Java程序。它允许开发者通过拦截和处理HBase表上的数据操作（如Put、Get和Delete）来扩展HBase的功能。

In HBase, ==Coprocessors== are similar to ==triggers== or ==stored procedures== in traditional RDBMS systems. They allow you to run custom server-side logic directly on HBase RegionServers, close to the data, improving performance and enabling more complex operations.
在HBase中，==协处理器==类似于传统RDBMS系统中的==触发器==或==存储过程==。它们允许您直接在HBase RegionServer上运行自定义的服务器端逻辑，靠近数据，从而提高性能并支持更复杂的操作。

Custom Java code on Region Servers to extend the functionality.
在RegionServer上运行的自定义Java代码，用于扩展功能。
Types
类型
- Observer (RegionObserver, MasterObserver): Observer coprocessors intercept HBase operations at various stages, much like triggers. Different observer types correspond to different levels of operation within HBase (for example, region-level or master-level).
- Observer (RegionObserver, MasterObserver): 观察者协处理器在不同阶段拦截HBase操作，类似于触发器。不同类型的观察者协处理器对应HBase内部不同级别的操作（例如Region级别或Master级别）。
- Endpoint (custom logic): Endpoint coprocessors are invoked explicitly by the client and run custom logic on the RegionServer, much like stored procedures. They are typically used for server-side computation such as aggregation.
- Endpoint (自定义逻辑): 端点协处理器由客户端显式调用，在RegionServer上运行自定义逻辑，类似于存储过程。它们通常用于聚合等服务器端计算。
Uses
用途
- Validation, analytics, indexing, access control
- 数据验证、分析、索引、访问控制

Why Coprocessors

为何使用协处理器

Server-side logic.
服务器端逻辑。
Reduce client-side/network load.
减少客户端/网络负载。
Support advanced operations (filters, aggregates).
支持高级操作（过滤器、聚合）。

Deployment

部署

Static(global): via hbase-site.xml
静态（全局）：通过 hbase-site.xml
Dynamic (per table):
动态（按表）：

disable 'my_table'
alter 'my_table', METHOD => 'table_att', 'coprocessor'=>'hdfs:///path.jar|class|priority'
enable 'my_table'
describe 'my_table'

Chapter 6: MapReduce

第六章：MapReduce

MapReduce Framework

MapReduce框架

MapReduce is a programming model for processing large datasets in a distributed cluster.
MapReduce是一种用于在分布式集群中处理大规模数据集的编程模型。

Phases

阶段

Map Phase
Map阶段
- Splits input into chunks (InputSplits).
- 将输入分割成块 (InputSplits)。
- Mapper processes each split, emitting key-value pairs.
- Mapper处理每个分片，并输出键值对。
Shuffle and Sort
Shuffle和Sort阶段
- Groups all intermediate key-value pairs by key.
- 按键对所有中间键值对进行分组。
- Sorts them to prepare for reduction.
- 对它们进行排序，为Reduce阶段做准备。
Reduce Phase
Reduce阶段
- Reducer processes grouped data.
- Reducer处理分组后的数据。
- Outputs final key-value results.
- 输出最终的键值结果。

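Map tasks run in parallel with one another, as do reduce tasks. Reducers can begin fetching map output while mappers are still running, but the reduce function itself starts only after all map output is available.
Map任务之间可以并行运行，Reduce任务之间也是如此。Reducer可以在Mapper仍在运行时开始拉取Map输出，但reduce函数本身要等所有Map输出就绪后才会开始执行。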
Provides ==fault tolerance, scalability==, and ==parallelism==.
提供==容错性、可扩展性==和==并行性==。

Use Cases

使用场景

Log analysis, web indexing, machine learning, large-scale data aggregation.
日志分析、网页索引、机器学习、大规模数据聚合。
Example
示例
- Input
- 输入

DOG CAT RAT
CAR CAR RAT
DOG CAR CAT

Mapper Output
Mapper输出

(DOG,1), (CAT,1), (RAT,1), ...

Reducer Output
Reducer输出

CAR - 3
CAT - 2
DOG - 2
RAT - 2
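
The word-count example above can be reproduced without a cluster. The Java sketch below is a single-process simulation of the three phases, not Hadoop code: the class `WordCountSketch` and its method names are hypothetical, and an in-memory `TreeMap` plays the role of shuffle and sort.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class WordCountSketch {
    // Map phase: emit (word, 1) for every word in one input line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) out.add(new SimpleEntry<>(word, 1));
        }
        return out;
    }

    // Shuffle and sort: group values by key; the TreeMap keeps keys sorted.
    static TreeMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the grouped values for each key.
    static TreeMap<String, Integer> reduce(TreeMap<String, List<Integer>> grouped) {
        TreeMap<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int value : entry.getValue()) sum += value;
            counts.put(entry.getKey(), sum);
        }
        return counts;
    }

    static TreeMap<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) pairs.addAll(map(line));
        return reduce(shuffle(pairs));
    }

    public static void main(String[] args) {
        // prints {CAR=3, CAT=2, DOG=2, RAT=2}
        System.out.println(run(Arrays.asList("DOG CAT RAT", "CAR CAR RAT", "DOG CAR CAT")));
    }
}
```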

When using HBase, each region = 1 InputSplit → 10 regions = 10 Mappers.
当使用HBase时，每个Region = 1个InputSplit → 10个Region = 10个Mapper。

Life Cycle of a MapReduce Job

MapReduce作业的生命周期

Job Submission
作业提交
- Specifies input, output, mapper/reducer classes, configs.
- 指定输入、输出、Mapper/Reducer类、配置。
- Uses HDFS for input/output.
- 使用HDFS作为输入/输出。
Job Initialization
作业初始化
- YARN’s ResourceManager allocates memory, CPU.
- YARN的ResourceManager分配内存、CPU。
- Communicates with NodeManagers for launching containers.
- 与NodeManager通信以启动容器。
Job Execution
作业执行
- Mappers process splits → Emit intermediate key-value pairs.
- Mapper处理分片 → 输出中间键值对。
- Shuffle/sort → Group by key → Reducers process.
- Shuffle/Sort → 按键分组 → Reducer处理。
Job Completion
作业完成
- Output stored in HDFS.
- 输出存储在HDFS中。
- Job status reported.
- 报告作业状态。
- Can be used in downstream apps.
- 可用于下游应用。

HBase and MapReduce Integration

HBase与MapReduce集成

TableMapReduceUtil: Utility for integrating MapReduce with HBase.
TableMapReduceUtil: 用于将MapReduce与HBase集成的实用工具。
HBase Input:
HBase输入:
- Configure HBase table as input.
- 配置HBase表作为输入。
- Define scan parameters, mapper class, output key/value classes.
- 定义扫描参数、Mapper类、输出的键/值类。
HBase Output
HBase输出
- Configure HBase as output.
- 配置HBase作为输出。
- Specify reducer class (optional).
- 指定Reducer类（可选）。
Scan Caching: Use scan.setCaching(100); to fetch 100 rows per RPC call.
扫描缓存: 使用 scan.setCaching(100); 在每次RPC调用中获取100行数据。

Chapter 7: Administering HBase

第七章：管理HBase

Deploying HBase in a Fully Distributed Setup
在完全分布式配置中部署HBase
Core Components
核心组件
- HBase Master
- HBase Master
- ZooKeeper
- ZooKeeper
- RegionServers
- RegionServers
- HDFS DataNodes
- HDFS DataNodes
Cluster Planning
集群规划
- Hardware
- 硬件
  - Sufficient CPU, RAM, and disk
  - 充足的CPU、内存和磁盘
  - Commodity hardware is fine
  - 使用普通商用硬件即可
- Networking
- 网络
  - High bandwidth and low-latency network
  - 高带宽和低延迟的网络
- Storage
- 存储
  - Use HDFS or cloud storage like S3
  - 使用HDFS或像S3这样的云存储
- Fault Tolerance
- 容错性
  - Use replication and redundancy
  - 使用复制和冗余
- Cluster Architecture
- 集群架构
  - Plan number of Masters and RegionServers
  - 规划Master和RegionServer的数量
- Security
- 安全
  - Authentication, authorization, encryption.
  - 认证、授权、加密。
- Monitoring
- 监控
  - Use Ambari, Cloudera Manager, Prometheus, ganglia.
  - 使用Ambari、Cloudera Manager、Prometheus、Ganglia。
- Scalability
- 可扩展性
  - Plan for horizontal scaling.
  - 为水平扩展进行规划。
- Backup & Recovery
- 备份与恢复
  - Regular backups and recovery testing.
  - 定期备份和恢复测试。
- Documentation & Training
- 文档与培训
  - Well-documented procedures and trained admins.
  - 完善的流程文档和训练有素的管理员。
Prototype Cluster
原型集群
- Best for learning, testing, and experimental purposes.
- 最适合学习、测试和实验目的。
- Collocate Name Node, Resource Manager, HBase Master, and Zookeeper.
- 将NameNode、Resource Manager、HBase Master和Zookeeper部署在同一节点。
- Less than 5 nodes.
- 少于5个节点。
- Limited capacity, no HA needed.
- 容量有限，不需要高可用性(HA)。
- OK to fail; used for dev/testing.
- 可以失败；用于开发/测试。
Small Production Cluster
小型生产集群
- Up to 10 nodes.
- 最多10个节点。
- NameNode and Resource Manager collocated.
- NameNode和Resource Manager部署在同一节点。
- HBase Master on separate hardware.
- HBase Master部署在独立的硬件上。
- One ZooKeeper node is enough.
- 一个ZooKeeper节点就足够了。
- Collocation is okay with light workload.
- 在轻负载情况下，混合部署是可以的。
Medium Production Cluster
中型生产集群
- Up to 50 nodes.
- 最多50个节点。
- No HBase-MapReduce collocation.
- 不要将HBase和MapReduce混合部署。
- Three ZooKeeper and two/three HBase Masters.
- 三个ZooKeeper和两到三个HBase Master。
- Separate hardware for NameNode.
- 为NameNode使用独立的硬件。
Large Production Cluster
大型生产集群
- Same as medium, but with 5 ZooKeeper instances.
- 与中型集群相同，但有5个ZooKeeper实例。
- Collocate HBase Master with ZooKeeper.
- 将HBase Master与ZooKeeper部署在同一节点。
- Dedicated disk for ZooKeeper.
- 为ZooKeeper使用专用磁盘。
Ganglia Monitoring
Ganglia监控
- Ganglia provides web interface to visualize all cluster information.
- Ganglia提供Web界面以可视化所有集群信息。
- Scalable, distributed monitoring system for clusters.
- 面向集群的可扩展、分布式监控系统。
- Lightweight and efficient.
- 轻量且高效。
Use Cases
用例
- Cluster and grid monitoring
- 集群和网格监控
- Data center resource usage
- 数据中心资源使用情况
- System health and performance optimization
- 系统健康与性能优化

Chapter 8: HBase Tuning

第八章：HBase调优

Garbage Collection (GC) in HBase

HBase中的垃圾回收 (GC)

Definition: Garbage Collection is a process by which the JVM reclaims memory from objects that are no longer in use, preventing memory leaks.
定义：垃圾回收是JVM回收不再使用的对象所占用的内存，以防止内存泄漏的过程。
Importance
重要性
- Crucial for ==long-running HBase applications==
- 对==长时间运行的HBase应用==至关重要
- Helps maintain ==memory efficiency== and ==system stability==
- 有助于维持==内存效率==和==系统稳定性==
Goal of GC Tuning
GC调优的目标
- ==Minimize pause time==
- ==最小化暂停时间==
- ==Optimize performance== of the HBase cluster
- ==优化HBase集群的性能==

Heap Size Tuning

堆大小调优

What is Heap Size?

什么是堆大小？

Memory allocated to the JVM for storing objects and data structures.
分配给JVM用于存储对象和数据结构的内存。
Used for MemStore, BlockCache, and other internal HBase processes.
用于MemStore、BlockCache及其他HBase内部进程。

Why Heap Size Matters

堆大小为何重要？

Heap Size Condition	Impact
堆大小情况	影响
Too Small	Frequent GC, system slowdowns, crashes
过小	频繁的GC、系统变慢、崩溃
Too Large	Longer GC pauses, memory contention with other apps
过大	更长的GC暂停时间、与其他应用的内存竞争

How to Configure Heap Size

如何配置堆大小

Edit hbase-env.sh file
编辑 hbase-env.sh 文件

export HBASE_HEAPSIZE=8000

Monitoring and Analyzing GC

监控与分析GC

Tools Used
使用工具
- JVisualVM
- JVisualVM
- GC Logs
- GC日志
- Third-party monitoring tools
- 第三方监控工具
To enable GC logging, add to hbase-env.sh
要启用GC日志记录，添加到 hbase-env.sh

export HBASE_OPTS="$HBASE_OPTS -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/usr/local/hbase/logs/gc-hbase.log"

To fine-tune CMS GC behavior
微调CMS GC行为

export HBASE_OPTS="$HBASE_OPTS -XX:CMSInitiatingOccupancyFraction=60"

Memory Configuration

内存配置

For Master Process

针对Master进程

export HBASE_MASTER_OPTS="$HBASE_MASTER_OPTS $HBASE_JMX_BASE -Xmx2000m -Xms2000m -Xmn750m -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70"

-Xmx / -Xms: Maximum and initial heap size (2000 MB, about 2 GB)
-Xmx / -Xms: 最大和初始堆大小 (2000 MB，约2 GB)
-Xmn: Young generation size (750 MB)
-Xmn: 新生代大小 (750 MB)

For RegionServer Process

针对RegionServer进程

export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS $HBASE_JMX_BASE -Xmx6000m -Xms6000m -Xmn2250m -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70"

Memstore and Compaction

Memstore与合并(Compaction)

MemStore: In-memory component where writes are first stored.
MemStore: 内存中的组件，写入操作首先存储在此。
When full, it is flushed to disk into immutable ==StoreFiles==.
当写满时，它会被刷写到磁盘，形成不可变的==StoreFiles==。
Compaction: Merges multiple StoreFiles for optimization.
Compaction (合并): 为了优化而合并多个StoreFiles。
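
The interplay of MemStore, flush, and compaction can be modeled in a few lines. The Java sketch below is a toy, not HBase internals: the class `ToyStore` is hypothetical, the flush threshold counts entries rather than bytes, and each "StoreFile" is just a sorted in-memory map.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

final class ToyStore {
    private final int flushThreshold; // max entries held in memory (real HBase uses bytes)
    private TreeMap<String, String> memStore = new TreeMap<>();
    private final List<TreeMap<String, String>> storeFiles = new ArrayList<>(); // oldest first

    ToyStore(int flushThreshold) {
        this.flushThreshold = flushThreshold;
    }

    // Writes land in the MemStore first; reaching the threshold triggers a flush.
    void put(String rowKey, String value) {
        memStore.put(rowKey, value);
        if (memStore.size() >= flushThreshold) flush();
    }

    // Flush: the sorted in-memory map becomes a new "StoreFile" that is no longer modified.
    void flush() {
        if (memStore.isEmpty()) return;
        storeFiles.add(memStore);
        memStore = new TreeMap<>();
    }

    // Compaction: merge all files into one; values from newer files win.
    void compact() {
        TreeMap<String, String> merged = new TreeMap<>();
        for (TreeMap<String, String> file : storeFiles) merged.putAll(file);
        storeFiles.clear();
        if (!merged.isEmpty()) storeFiles.add(merged);
    }

    // Read: check the MemStore first, then files from newest to oldest.
    String get(String rowKey) {
        if (memStore.containsKey(rowKey)) return memStore.get(rowKey);
        for (int i = storeFiles.size() - 1; i >= 0; i--) {
            String value = storeFiles.get(i).get(rowKey);
            if (value != null) return value;
        }
        return null;
    }

    int storeFileCount() {
        return storeFiles.size();
    }

    public static void main(String[] args) {
        ToyStore store = new ToyStore(2);
        store.put("a", "1");
        store.put("b", "2"); // first flush
        store.put("a", "3");
        store.put("c", "4"); // second flush
        System.out.println(store.storeFileCount()); // 2
        store.compact();
        System.out.println(store.storeFileCount()); // 1
        System.out.println(store.get("a"));         // 3
    }
}
```

Fewer files after compaction means a read has fewer places to look, which is the optimization the notes refer to.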

Data Locality

数据本地性

Definition: Placing data close to its processing node.
定义: 将数据放置在靠近其处理节点的位置。
Importance: Reduces network I/O, boosts performance.
重要性: 减少网络I/O，提升性能。
Working
工作原理
- HDFS stores data
- HDFS存储数据
- RegionServer manages local data
- RegionServer管理本地数据
- Rewriting (compaction) helps maintain locality
- 重写（合并）有助于维持本地性

Compression

压缩

Purpose: Save disk space and enhance read/write speeds.
目的: 节省磁盘空间并提高读写速度。
Trade-Off: Reduces I/O but increases CPU load due to compression and decompression.
权衡: 减少了I/O，但因压缩和解压而增加了CPU负载。
Supported Algorithms: GZIP, Snappy, LZO, LZ4
支持的算法: GZIP, Snappy, LZO, LZ4
commands
命令

create 'my_table', {NAME => 'cf', COMPRESSION => 'GZ'}
alter 'my_table', {NAME => 'cf', COMPRESSION => 'GZ'}

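Note: Codec availability depends on the HBase build and on the libraries installed on each RegionServer, so verify that GZIP (GZ) works in your HBase 2.x environment before relying on it.
注意：编解码器是否可用取决于HBase的构建版本以及各RegionServer上安装的库，因此在依赖GZIP (GZ) 之前，请先确认它在您的HBase 2.x环境中可用。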

Optimizing Region Splits

优化Region拆分

Pre-Splitting Regions
预拆分Region
- Prevents hotspots
- 防止热点
- Ensures balanced workload from the beginning
- 从一开始就确保负载均衡
- Ideal during initial table creation
- 在初始建表时最为理想
Command Example
命令示例

create 'my_table', 'cf', {SPLITS => ['split1', 'split2', 'split3']}
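
Split points like the ones in the command above have to match the shape of the row keys. As a sketch, the Java below computes evenly spaced split keys under the assumption that every row key starts with a two-digit lowercase hex prefix (for example, from hashing). The class `SplitPoints` and the method `hexSplits` are hypothetical helpers, not HBase utilities.

```java
import java.util.ArrayList;
import java.util.List;

final class SplitPoints {
    // Returns regions-1 split keys that divide the prefix space 00..ff evenly.
    static List<String> hexSplits(int regions) {
        if (regions < 1 || regions > 256) {
            throw new IllegalArgumentException("regions must be between 1 and 256");
        }
        List<String> splits = new ArrayList<>();
        for (int i = 1; i < regions; i++) {
            splits.add(String.format("%02x", i * 256 / regions));
        }
        return splits;
    }

    public static void main(String[] args) {
        // prints [40, 80, c0], usable as SPLITS => ['40', '80', 'c0']
        System.out.println(hexSplits(4));
    }
}
```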

Alter Region Split Threshold
修改Region拆分阈值

alter 'my_table', {METHOD => 'table_att', MAX_FILESIZE => '1073741824'}

Load Balancing

负载均衡

Ensures ==equal distribution== of regions across RegionServers.
确保Region在所有RegionServer上==均匀分布==。
Controlled by ==Balancer== (in Master).
由Master中的==Balancer==控制。
Runs every 5 mins by default (hbase.balancer.period).
默认每5分钟运行一次 (hbase.balancer.period)。
Key Properties
关键属性
- hbase.balancer.period: Time interval for balancer to run.
- hbase.balancer.period: Balancer运行的时间间隔。
- hbase.balancer.max.balancing: Max time allowed for a balancing run.
- hbase.balancer.max.balancing: 一次均衡运行所允许的最大时间。

Merging Regions

合并Region

Helps reduce number of regions, improving efficiency.
有助于减少Region数量，提高效率。
Needed when auto-split results in too many small regions.
当自动拆分导致过多小Region时需要。
Commands
命令

list_regions 'example_table'
merge_region 'region_name1', 'region_name2'

Summary
总结

Tuning Area 调优领域	Key Action 关键操作
GC Tuning	Analyze logs, adjust heap size, CMS settings
GC调优	分析日志，调整堆大小，CMS设置
Heap Size	Set appropriate JVM heap, monitor GC
堆大小	设置合适的JVM堆，监控GC
Data Locality	Co-locate RegionServers with DataNodes
数据本地性	将RegionServer与DataNode同地部署
Compression	Use suitable algorithms for disk and speed
压缩	使用合适的算法以优化磁盘和速度
Region Splits	Pre-split tables to avoid hotspots
Region拆分	预拆分表以避免热点
Load Balancing	Enable periodic balancing with tuning
负载均衡	启用周期性均衡并进行调优
Region Merging	Reduce number of regions when needed
Region合并	在需要时减少Region数量