🏗️ HBase Architecture & Features

Overview of HBase

HBase is a NoSQL database that provides ==ACID== (Atomicity, Consistency, Isolation, Durability) guarantees at the level of a single row, which makes it suitable for high-scale, real-time applications. It does not offer general multi-row transactions.

Its schema is flexible: column families are declared when a table is created, but columns within a family can be added at write time without a predefined model.

HBase offers database-like access to ==Hadoop-scale storage==, enabling efficient read and write operations on subsets of data without scanning the entire dataset.

Key Objectives

  • Architectural Overview of HBase
  • Three Major Components of HBase
  • HBase Features
    • Consistency
    • Atomic Read and Write
    • Sharding
    • High Availability
    • Real-Time Processing

🏗️ Architectural Overview of HBase

HBase architecture consists of several key components:

  1. HMaster
  2. HBase Region Server
  3. Regions and Zookeeper

Components of HBase Architecture

| Component | Description |
| --- | --- |
| HMaster | Manages Region Servers, performs DDL operations, assigns regions, monitors servers, and coordinates recovery. |
| Region Server | Handles data-related operations for multiple regions and communicates with clients. |
| Zookeeper | Provides configuration management, naming, and synchronization services for HBase. |

HMaster Role

  • Performs Data Definition Language (DDL) operations (create/delete tables).
  • Assigns and reassigns regions to Region Servers.
  • Monitors Region Server instances and coordinates recovery activities.

Zookeeper Role

  • Maintains configuration information and assists HMaster in managing the environment.
  • Uses ephemeral nodes to track available Region Servers: a server's node disappears when its session ends, which signals a failure.
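
The ephemeral-node idea above can be sketched with a small in-memory model. This is a toy stand-in, not the ZooKeeper client API: the class `EphemeralRegistry`, its methods, the session ids, and the node paths are all hypothetical names chosen for illustration.

```python
class EphemeralRegistry:
    """Toy model of ephemeral nodes (hypothetical, not the ZooKeeper API)."""

    def __init__(self):
        self._nodes = {}  # node path -> id of the session that created it

    def register(self, session_id, path):
        # An ephemeral node lives only as long as the session that created it.
        self._nodes[path] = session_id

    def expire_session(self, session_id):
        # When a session ends (crash, network loss), all of its nodes vanish.
        removed = [p for p, s in self._nodes.items() if s == session_id]
        for path in removed:
            del self._nodes[path]
        return removed

    def live_servers(self):
        # What a watcher such as the master would see as "currently alive".
        return sorted(self._nodes)


registry = EphemeralRegistry()
registry.register("session-1", "/servers/server-a")  # illustrative paths
registry.register("session-2", "/servers/server-b")
lost = registry.expire_session("session-1")           # server-a fails
print(lost, registry.live_servers())
```

The point of the design is that no server has to announce its own failure: the disappearance of its node is the signal.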

🌐 HBase Region Server

  • HBase tables are divided into ==Regions== by row key range.
  • A Region is the basic unit for distributing a table across the cluster, and it holds data for each of the table's column families.
  • A Region Server typically runs alongside an ==HDFS DataNode== on the same machine and is responsible for read/write operations.
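
Because regions are defined by row key range, finding the region for a row amounts to a search over sorted start keys. A minimal sketch, assuming a made-up layout of four regions; the start keys, server names, and the function `locate_region` are hypothetical.

```python
from bisect import bisect_right

# Hypothetical layout: each region starts at a row key and runs up to
# (but not including) the next region's start key.
REGION_START_KEYS = ["", "g", "n", "t"]        # "" means "from the first key"
REGION_SERVERS = ["rs1", "rs2", "rs1", "rs3"]  # server hosting each region


def locate_region(row_key):
    """Return (region index, server) for the region whose range covers row_key."""
    # bisect_right counts the start keys <= row_key; the last of them wins.
    index = bisect_right(REGION_START_KEYS, row_key) - 1
    return index, REGION_SERVERS[index]


print(locate_region("apple"))
print(locate_region("g"))
print(locate_region("zebra"))
```

A client only needs the list of start keys to route a request, which is why a read or write never has to touch regions outside the target range.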

Region Server Responsibilities

| Responsibility | Description |
| --- | --- |
| Data Communication | Communicates with clients and manages data-related operations. |
| Read/Write Handling | Handles read/write requests for all regions under its management. |
| Region Size Management | Determines region sizes based on defined thresholds. |

🛠️ HBase Features

Consistency

  • HBase supports a strong consistency model: all reads and writes for a given row go through the single Region Server that hosts its region, so updates to that row are serialized.
  • It supports atomic increment operations on cells used as counters (a cell holding a 64-bit integer rather than a separate datatype), which is useful for counting.
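
To illustrate why an atomic increment matters, here is a minimal sketch of a counter whose read-modify-write step runs under a lock. It is not the HBase client API; `CounterTable`, the row key, and the column name are hypothetical, and the single lock is a simplification.

```python
import threading


class CounterTable:
    """Toy table whose cells hold integer counters (hypothetical names)."""

    def __init__(self):
        self._cells = {}               # (row, column) -> int
        self._lock = threading.Lock()  # one table-wide lock, for simplicity

    def increment(self, row, column, amount=1):
        # The read, the addition, and the write happen under the lock,
        # so concurrent increments cannot overwrite each other.
        with self._lock:
            value = self._cells.get((row, column), 0) + amount
            self._cells[(row, column)] = value
            return value

    def get(self, row, column):
        with self._lock:
            return self._cells.get((row, column), 0)


table = CounterTable()


def worker():
    for _ in range(1000):
        table.increment("page:home", "stats:views")


threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

total = table.get("page:home", "stats:views")
print(total)
```

Without the lock, two threads could read the same old value and both write back old + 1, losing an update; with it, every increment is counted.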

Atomic Read and Write

  • Atomicity ensures that an operation is all-or-nothing: if one part fails, the entire operation fails, which maintains system integrity. In HBase this guarantee applies to mutations within a single row, even when they span several column families.
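
The all-or-nothing behaviour can be sketched by preparing the changes on a private copy of the row and publishing the copy only if every change succeeds. This is an illustration of the idea only; `RowStore`, `mutate_row`, and the validation rule (rejecting `None` values) are assumptions made for the example.

```python
import copy


class RowStore:
    """Toy row store with all-or-nothing row mutations (hypothetical)."""

    def __init__(self):
        self.rows = {}  # row key -> {column: value}

    def mutate_row(self, row_key, puts):
        """Apply every (column, value) pair to one row, or none of them."""
        working = copy.deepcopy(self.rows.get(row_key, {}))
        for column, value in puts:
            if value is None:  # stand-in for any failure in the middle
                raise ValueError(f"missing value for column {column!r}")
            working[column] = value
        # Reached only if every put succeeded: publish the new row in one step.
        self.rows[row_key] = working


store = RowStore()
store.mutate_row("user1", [("info:name", "Ada"), ("info:city", "London")])

try:
    # The second put fails, so the first one must not become visible either.
    store.mutate_row("user1", [("info:name", "Grace"), ("info:city", None)])
except ValueError:
    pass

print(store.rows["user1"])
```

After the failed call the row still holds its original values, which is the property the bullet above describes.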

Sharding

  • HBase tables consist of regions that can be split, automatically or manually, into two smaller regions, which facilitates horizontal scaling.
  • Auto sharding splits a region dynamically once it exceeds a configured size, so a growing table is divided into manageable parts.
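
A rough sketch of size-triggered splitting, assuming a region is just a sorted list of row keys and the threshold is a row count. Both are simplifications for illustration: `SPLIT_THRESHOLD`, `split_region`, and `insert_row` are hypothetical names, and a real size threshold would be measured in stored bytes.

```python
from bisect import insort

SPLIT_THRESHOLD = 4  # hypothetical limit: maximum rows per region


def split_region(rows):
    """Split a sorted list of row keys at its midpoint into two regions."""
    mid = len(rows) // 2
    return rows[:mid], rows[mid:]


def insert_row(regions, row_key):
    """Insert row_key into the right region, splitting it if it grows too large.

    regions is a list of sorted row-key lists, ordered by key range.
    """
    # Pick the last region whose first key is <= row_key (or the first region).
    target = 0
    for i, region in enumerate(regions):
        if region and region[0] <= row_key:
            target = i
    insort(regions[target], row_key)
    if len(regions[target]) > SPLIT_THRESHOLD:
        left, right = split_region(regions[target])
        regions[target:target + 1] = [left, right]


regions = [[]]  # a new table starts with a single empty region
for key in ["a", "b", "c", "d", "e", "f"]:
    insert_row(regions, key)
print(regions)
```

Each split leaves two regions that can be hosted by different servers, which is how the table scales horizontally as it grows.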

High Availability

  • HBase achieves high availability through ==region replication==, where multiple replicas of a region can exist on different Region Servers. Writes still go to the primary replica; the additional replicas serve reads.
  • By default, region replication is set to 1 (no additional replicas), but it can be increased to improve fault tolerance.

Real-Time Processing

  • HBase supports ==block cache== and ==Bloom filters== for efficient real-time query processing.
  • Block cache improves read performance by caching data blocks in memory, reducing access time for subsequent reads.

MemStore

  • MemStore acts as a write cache: incoming data is held in memory, sorted by key, and later flushed to disk as a new StoreFile.
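
The write-cache behaviour can be sketched as an in-memory buffer that is flushed to an immutable, sorted "file" once it reaches a threshold. This is a simplified model: the class below only borrows the MemStore name, `flush_threshold` is a hypothetical entry count (a real threshold would be a memory size), and flushed files are plain Python lists rather than files on disk.

```python
class MemStore:
    """Toy write buffer that flushes sorted snapshots (simplified model)."""

    def __init__(self, flush_threshold=3):
        self.flush_threshold = flush_threshold  # hypothetical: entries, not bytes
        self.buffer = {}        # recent writes held in memory
        self.store_files = []   # each flush: a sorted list of (key, value)

    def put(self, key, value):
        self.buffer[key] = value
        if len(self.buffer) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Write the buffer out in key order, then start with an empty buffer.
        if self.buffer:
            self.store_files.append(sorted(self.buffer.items()))
            self.buffer = {}

    def get(self, key):
        # The newest data wins: check memory first, then files newest-first.
        if key in self.buffer:
            return self.buffer[key]
        for store_file in reversed(self.store_files):
            for k, v in store_file:
                if k == key:
                    return v
        return None


store = MemStore(flush_threshold=3)
store.put("b", 2)
store.put("a", 1)
store.put("c", 3)    # third entry triggers a flush
store.put("a", 10)   # newer value for "a" stays in memory for now
print(store.store_files, store.buffer, store.get("a"), store.get("b"))
```

Buffering turns many small random writes into one sequential write of sorted data, at the cost of reads having to consult both memory and the flushed files.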

Summary of HBase Features

| Feature | Description |
| --- | --- |
| Consistency | Ensures strong consistency with serialized updates. |
| Atomic Read/Write | Provides atomic operations for data integrity. |
| Sharding | Dynamically splits and distributes regions to manage large datasets. |
| High Availability | Uses region replication to ensure operational performance and fault tolerance. |
| Real-Time Processing | Supports efficient querying through caching mechanisms. |

Block Cache

“Block cache helps in reducing disk I/O for retrieving data.”

  • The ==block cache== keeps recently read data blocks in memory, so later reads that fall in the same block are served quickly without disk I/O.
  • It is configurable at the ==table’s column family level==, meaning different column families can have different cache priorities or even disable the block cache entirely.
  • Applications can tune these settings to suit their data sizes and access patterns.
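
A minimal sketch of a block cache with least-recently-used eviction. The eviction policy, the capacity of two blocks, and the names `BlockCache` and `read_block_from_disk` are assumptions for illustration; the "disk" is a plain function that records which blocks it was asked for.

```python
from collections import OrderedDict


class BlockCache:
    """Toy LRU cache of data blocks (illustrative, with assumed names)."""

    def __init__(self, capacity, read_block_from_disk):
        self.capacity = capacity
        self.read_block_from_disk = read_block_from_disk
        self.blocks = OrderedDict()  # block id -> data, oldest first
        self.hits = 0
        self.misses = 0

    def get_block(self, block_id):
        if block_id in self.blocks:
            self.hits += 1
            self.blocks.move_to_end(block_id)  # mark as most recently used
            return self.blocks[block_id]
        self.misses += 1
        data = self.read_block_from_disk(block_id)  # the expensive path
        self.blocks[block_id] = data
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # evict the least recently used
        return data


disk_reads = []


def fake_disk_read(block_id):
    disk_reads.append(block_id)
    return f"data-for-block-{block_id}"


cache = BlockCache(capacity=2, read_block_from_disk=fake_disk_read)
for block_id in [1, 1, 2, 3, 1]:
    cache.get_block(block_id)
print(cache.hits, cache.misses, disk_reads)
```

The second read of block 1 is served from memory; the last one goes back to disk because block 1 was evicted when block 3 arrived, which shows how cache size and access pattern interact.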

Bloom Filter

“Bloom Filters provide an in-memory structure to reduce disk reads to only the files likely to contain that Row.”

  • A ==Bloom Filter== is an efficient mechanism used to test whether a ==StoreFile== contains a specific row or row-column cell. It can return false positives but never false negatives.
  • Without Bloom filters, the only way to decide whether a StoreFile might contain a row key is to check the StoreFile’s block index, which stores the start row key of each block.
  • Bloom Filters act as an ==in-memory index==, significantly reducing disk reads by narrowing the search to files that are likely to contain the specified row.
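
The filtering step can be sketched with a small Bloom filter built on `hashlib`. The bit-array size, the number of hash functions, the file names, and the helper `files_to_read` are all hypothetical choices for the example, not values or APIs taken from HBase.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, key):
        # Derive several bit positions by hashing the key with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # If any position is unset, the key was definitely never added.
        return all(self.bits[pos] for pos in self._positions(key))


def files_to_read(store_files, row_key):
    """Keep only the files whose filter says the row may be present."""
    return [name for name, bloom in store_files if bloom.might_contain(row_key)]


bloom_1, bloom_2 = BloomFilter(), BloomFilter()
for key in ["row-001", "row-002"]:
    bloom_1.add(key)
for key in ["row-500", "row-501"]:
    bloom_2.add(key)

store_files = [("file-1", bloom_1), ("file-2", bloom_2)]
print(files_to_read(store_files, "row-500"))
```

A file that fails the check can be skipped without any disk read; a file that passes still has to be read, because a positive answer may be a false positive.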

HBase Features

HBase provides ==atomic read and write== operations at the row level, ensuring data consistency and reliability.

Features of HBase Region Server

| Feature | Description |
| --- | --- |
| Consistency | Guarantees that a read of a row returns its most recently written value. |
| High Availability | Ensures that the system remains operational even during failures. |
| Sharding | Distributes data across multiple servers to improve performance. |

Region Servers provide all of the above features.

Key Components of HBase Architecture

  • HMaster Server: Manages the overall HBase operations and performs DDL operations such as creating and deleting tables.
  • HBase Region Server: Handles read and write requests for the regions it manages.
  • Regions: The basic units by which tables are distributed across the cluster; each region covers a range of rows and contains the table's column families.
  • Zookeeper: Provides services such as maintaining configuration information, naming, and distributed synchronization.

Storage Structure

  • HBase stores its data on HDFS as ==StoreFiles (HFiles)==, a block-based binary file format.
  • HBase can run with multiple masters, but only a single master is active at any time; the others stand by as backups.

Summary of HBase Features

  • Low-latency random read and write operations on top of HDFS.
  • HBase tables are partitioned into multiple regions, each storing multiple rows.
  • Key features include:
    • Consistency
    • Atomic Read and Write
    • Sharding
    • High Availability
    • Real-Time Processing