📊 HBase Data Model

📊 HBase 数据模型

Overview of HBase Data Model

HBase 数据模型概述

The ==Apache HBase Data Model== is designed to manage ==structured== or ==semi-structured data== that may vary in field size, data type, and columns. HBase stores data in tables, which consist of rows and columns, and its schema differs significantly from traditional relational database tables.
==Apache HBase 数据模型==旨在管理字段大小、数据类型和列可能变化的==结构化==或==半结构化数据==。HBase将数据存储在由行和列组成的表中,其模式与传统关系数据库表有显著不同。

Key Terminologies

关键术语

  • Tables: The primary structure for data storage in HBase.
  • 表 (Tables):HBase 中用于数据存储的主要结构。
  • Row: A single record in a table, indexed by a unique ==row key==.
  • 行 (Row):表中的一条记录,通过唯一的==行键 (row key)==进行索引。
  • Column: A data field within a row, belonging to a column family.
  • 列 (Column):行内的一个数据字段,隶属于一个列族。
  • Column Families: Groups of related columns that are stored and accessed together.
  • 列族 (Column Families):相关列的分组,它们被一起存储和访问。
  • Column Qualifier: A specific identifier for a column within a column family.
  • 列限定符 (Column Qualifier):列族内列的特定标识符。
  • Cell: The intersection of a row and a column, storing a value.
  • 单元格 (Cell):行和列的交集,用于存储一个值。
  • Column Meta Data: Information about the columns in a table.
  • 列元数据 (Column Meta Data):关于表中列的信息。
  • Time Stamp: A marker indicating when the data was written.
  • 时间戳 (Time Stamp):一个标记,指示数据写入的时间。
  • Versions: Different instances of data stored in the same cell identified by timestamps.
  • 版本 (Versions):存储在同一单元格中、由时间戳标识的不同数据实例。

HBase Data Storage

HBase 数据存储

In HBase, data is organized in tables with the following characteristics:
在 HBase 中,数据以表的形式组织,具有以下特点:

  • Column Family-Oriented: Data is stored in column families, which group related data together.
  • 面向列族 (Column Family-Oriented):数据存储在列族中,列族将相关数据分组在一起。
  • Row Key: Each row is indexed by a unique ==row key== for quick lookups. Rows are stored in a sorted order based on this key.
  • 行键 (Row Key):每行都由唯一的==行键==索引,以便快速查找。行根据此键按排序顺序存储。
  • Regions: Tables are divided into regions, which are distributed across ==Region Servers== in the cluster to enhance scalability and performance.
  • 区域 (Regions):表被划分为多个区域,这些区域分布在集群中的==区域服务器 (Region Servers)==上,以增强可伸缩性和性能。

Table Structure

表结构

| Component 组件 | Description 描述 |
| --- | --- |
| Row Key | Primary key for each record, stored as a byte array. |
| 行键 (Row Key) | 每条记录的主键,以字节数组形式存储。 |
| Column Family | Groups columns with a shared prefix; members are stored together. |
| 列族 (Column Family) | 将具有共享前缀的列分组;成员一起存储。 |
| Column | Belongs to a column family and can be added dynamically. |
| 列 (Column) | 属于一个列族,并且可以动态添加。 |
| Value (Cell) | The actual data stored, represented as a byte array. |
| 值 (单元格) | 存储的实际数据,表示为字节数组。 |
| Version Number | Indicates the version of the data, defaulting to the system timestamp. |
| 版本号 (Version Number) | 指示数据版本,默认为系统时间戳。 |

Physical Properties of HBase Tables

HBase 表的物理属性

  • All rows are arranged in lexicographic order based on the row key.
  • 所有行都根据行键按字典顺序排列。
  • Tables are divided into multiple regions, which can split when they exceed a certain size.
  • 表被划分为多个区域,当区域超过一定大小时可以进行分裂。
  • Different regions can be distributed to various region servers for load balancing.
  • 不同的区域可以分布到不同的区域服务器上以实现负载均衡。
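
The ordering and region rules above can be sketched in a few lines of Python. This is an illustrative model only, not HBase code: the `region_start_keys` list and the `find_region` helper are hypothetical names. It shows that byte-wise lexicographic order puts `row10` before `row2`, and that a sorted list of region start keys is enough to locate the region for a row key.

```python
from bisect import bisect_right

# Row keys are byte arrays. Python compares bytes lexicographically,
# which mirrors the row ordering described above.
row_keys = [b"row10", b"row2", b"row1", b"row3"]
sorted_keys = sorted(row_keys)

# Hypothetical split points: region i holds keys in [start_i, start_{i+1}).
# The first region starts at the empty key.
region_start_keys = [b"", b"row2", b"row3"]


def find_region(row_key: bytes) -> int:
    """Return the index of the region whose key range contains row_key."""
    return bisect_right(region_start_keys, row_key) - 1


print(sorted_keys)
print([find_region(k) for k in sorted_keys])
```

Because ordering is byte-wise rather than numeric, numeric identifiers used as row keys are commonly zero-padded (for example `row02`, `row10`) when numeric order matters.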

HBase Table Characteristics

HBase 表的特性

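  • Maximum Versions: Each column family sets how many versions of a cell are retained. The default was 3 in older releases and is 1 since HBase 0.96; it can be adjusted to application needs.
  • 最大版本数 (Maximum Versions):每个列族可设置单元格保留的版本数。旧版本默认值为 3,自 HBase 0.96 起默认值为 1,可根据应用需求进行调整。
  • Compression Algorithm: Compression is configured per column family. Snappy is a common choice because it is fast; other codecs such as GZ and LZ4 are also supported.
  • 压缩算法 (Compression Algorithm):压缩按列族配置。Snappy 因速度快而常被选用;也支持 GZ、LZ4 等其他编解码器。
  • In-memory Storage: A column family can be flagged IN_MEMORY, which gives its blocks higher priority in the block cache. This is a hint for performance, not a guarantee that the data stays in memory.
  • 内存存储 (In-memory Storage):列族可以标记为 IN_MEMORY,使其数据块在块缓存中具有更高的优先级。这只是一个性能提示,并不保证数据始终驻留在内存中。
  • Bloom Filter: A probabilistic data structure that quickly tells whether a row key is definitely absent from a store file or possibly present, so reads can skip files that cannot contain the key.
  • 布隆过滤器 (Bloom Filter):一种概率型数据结构,可快速判断某个行键在存储文件中肯定不存在还是可能存在,从而让读取操作跳过不可能包含该键的文件。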
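
To make the Bloom filter idea concrete, here is a minimal sketch in Python. It is not HBase's implementation: the class name, the bit-array size, the number of hashes, and the use of SHA-256 are all assumptions chosen for readability. It shows the property that matters for lookups: a negative answer is certain, a positive answer only means "possibly present".

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter over row keys (illustrative, not HBase's code)."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key: bytes):
        # Derive num_hashes bit positions by salting the key with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: bytes) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: bytes) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(key)
        )


store_file_filter = BloomFilter()
for row_key in (b"user1", b"user2"):
    store_file_filter.add(row_key)

print(store_file_filter.might_contain(b"user1"))
```

A read path would consult such a filter first and open the store file only when `might_contain` returns `True`.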

Table Properties

表属性

| Property 属性 | Description 描述 |
| --- | --- |
| Size | Can support hundreds of millions of rows and millions of columns. |
| 大小 (Size) | 可支持数亿行和数百万列。 |
| Column-oriented | Allows independent storage and permission control for column families. |
| 面向列 (Column-oriented) | 允许对列族进行独立的存储和权限控制。 |
| Sparse | Null columns do not occupy storage, enabling sparse design. |
| 稀疏 (Sparse) | Null 列不占用存储空间,从而支持稀疏设计。 |

Creating a Table in HBase

在 HBase 中创建表

The command to create a table is structured as follows:
创建表的命令结构如下:

create 'table name', 'column family', 'column family'

Example Usage

使用示例

In a traditional RDBMS, columns are fixed, but in HBase, new columns can be added dynamically. For instance:
在传统的关系型数据库管理系统 (RDBMS) 中,列是固定的,但在 HBase 中,可以动态添加新列。例如:

  • Define a user table with a column family info:
  • 定义一个用户表,包含一个列族 info
    • info:name = niit
    • info:age = 30
    • info:sex = male
    • Adding a new property: info:newProperty
    • 添加一个新属性:info:newProperty
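
The example above can be modelled with a small Python class. This is a sketch of the data model only, not the HBase client API: the `Table` class and its `put` and `get` methods are hypothetical, and strings stand in for the byte arrays HBase actually stores. It shows that column families are fixed when the table is created, while qualifiers such as `newProperty` can appear on any row at write time.

```python
class Table:
    """Toy model: fixed column families, dynamic column qualifiers."""

    def __init__(self, name, families):
        self.name = name
        self.families = set(families)  # declared at table creation
        self.rows = {}                 # row key -> {(family, qualifier): value}

    def put(self, row, column, value):
        family, _, qualifier = column.partition(":")
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        # No schema change is needed for a qualifier that was never seen before.
        self.rows.setdefault(row, {})[(family, qualifier)] = value

    def get(self, row):
        cells = self.rows.get(row, {})
        return {f"{f}:{q}": v for (f, q), v in sorted(cells.items())}


user = Table("user", ["info"])
user.put("row1", "info:name", "niit")
user.put("row1", "info:age", "30")
user.put("row1", "info:sex", "male")
user.put("row1", "info:newProperty", "added later")
user.put("row2", "info:name", "other")  # sparse: row2 stores only one column

print(user.get("row1"))
print(user.get("row2"))
```

Rows that never set a column store nothing for it, which is the sparse behaviour described in the table properties.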

Column Families and Qualifiers

列族和列限定符

Column Families

列族

Columns in HBase are grouped into column families, which must be defined at schema creation. Members of a column family are stored together on the filesystem, which enhances performance.
在 HBase 中,列被分组成列族,列族必须在模式创建时定义。一个列族的成员在文件系统上存储在一起,这可以提高性能。

Column Qualifiers

列限定符

Column qualifiers provide unique names for data values within a column family. They can vary in content and length, enabling flexibility in data representation.
列限定符为列族内的数据值提供唯一的名称。它们的内容和长度可以变化,从而实现了数据表示的灵活性。

Cells in HBase

HBase 中的单元格

Cells are unique combinations of row key, column family, and column qualifier, with each cell identified by a specific key.
单元格是行键、列族和列限定符的唯一组合,每个单元格都由一个特定的键来标识。

Key Characteristics of Cells

单元格的主要特点

  • Cells allow for quick data retrieval and modifications.
  • 单元格允许快速的数据检索和修改。
  • HBase keeps cells sorted by their storage key (row key, column family, column qualifier, timestamp), enabling efficient access.
  • HBase 按存储键(行键、列族、列限定符、时间戳)对单元格排序存储,从而实现高效访问。
  • Cells are first organized by column family and then by column qualifier.
  • 单元格首先按列族组织,然后按列限定符组织。

Example Structure of a Cell

单元格结构示例

  • User data can be organized as:
  • 用户数据可以组织如下:
    • Row Key: User’s email address
    • 行键:用户的电子邮件地址
    • Column Family: name
    • 列族:name
    • Column Qualifier: first, last
    • 列限定符:first, last

This model supports a wide variety of columns per row and allows for heterogeneous sets of columns across different rows.
该模型支持每行有多种多样的列,并允许不同行之间有异构的列集合。

🗃️ HBase Data Model

🗃️ HBase 数据模型

Key-Value Concept

键值 (Key-Value) 概念

“The KeyValue class is the heart of data storage in HBase, wrapping a byte array and taking offsets and lengths to interpret the content.”
“KeyValue 类是 HBase 中数据存储的核心,它包装了一个字节数组,并通过偏移量和长度来解释其内容。”

KeyValue Format

KeyValue 格式

The KeyValue format inside a byte array consists of:
字节数组内的 KeyValue 格式包括:

  • keylength
  • 键长度 (keylength)
  • valuelength
  • 值长度 (valuelength)
  • key
  • 键 (key)
  • value
  • 值 (value)

Key Breakdown

键的分解

The Key is further decomposed into:
键 (Key) 被进一步分解为:

  • rowlength
  • 行长度 (rowlength)
  • row (i.e., the rowkey)
  • 行 (row) (即行键)
  • columnfamilylength
  • 列族长度 (columnfamilylength)
  • columnfamily
  • 列族 (columnfamily)
  • columnqualifier
  • 列限定符 (columnqualifier)
  • timestamp
  • 时间戳 (timestamp)
  • keytype (e.g., Put, Delete, DeleteColumn, DeleteFamily)
  • 键类型 (keytype) (例如:Put, Delete, DeleteColumn, DeleteFamily)
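
The layout above can be illustrated by packing and unpacking the fields with Python's `struct` module. This is a sketch under stated assumptions, not HBase code: the field widths (4-byte lengths, 2-byte row length, 1-byte family length, 8-byte timestamp, 1-byte key type), the big-endian byte order, and the type code `4` for Put are assumptions based on the commonly documented KeyValue encoding, and the function names are hypothetical.

```python
import struct

KEY_TYPE_PUT = 4  # assumed type code for Put


def encode_keyvalue(row, family, qualifier, timestamp, key_type, value):
    """Pack one cell as: keylength, valuelength, key, value."""
    key = (
        struct.pack(">H", len(row)) + row            # rowlength, row
        + struct.pack(">B", len(family)) + family    # columnfamilylength, columnfamily
        + qualifier                                  # columnqualifier (no length field)
        + struct.pack(">qB", timestamp, key_type)    # timestamp, keytype
    )
    return struct.pack(">ii", len(key), len(value)) + key + value


def decode_keyvalue(buf):
    """Interpret the byte array using offsets and lengths."""
    key_len, value_len = struct.unpack_from(">ii", buf, 0)
    pos = 8
    key_end = pos + key_len
    (row_len,) = struct.unpack_from(">H", buf, pos)
    pos += 2
    row = buf[pos:pos + row_len]
    pos += row_len
    (family_len,) = struct.unpack_from(">B", buf, pos)
    pos += 1
    family = buf[pos:pos + family_len]
    pos += family_len
    # The qualifier fills the space before the 9 trailing bytes
    # (8-byte timestamp + 1-byte key type).
    qualifier = buf[pos:key_end - 9]
    timestamp, key_type = struct.unpack_from(">qB", buf, key_end - 9)
    value = buf[key_end:key_end + value_len]
    return {
        "row": row, "family": family, "qualifier": qualifier,
        "timestamp": timestamp, "type": key_type, "value": value,
    }


kv = encode_keyvalue(b"user1", b"info", b"name", 1700000000000, KEY_TYPE_PUT, b"niit")
print(decode_keyvalue(kv))
```

Note that the qualifier carries no length of its own: its size is whatever remains of the key after the other fields are accounted for.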

Timestamp

时间戳

A timestamp is written alongside each value, serving as an identifier for a given version of a value. By default, it reflects the time on the RegionServer when the data was written. However, a different timestamp can be specified when inserting data.
每个值旁边都会写入一个时间戳,作为该值特定版本的标识符。默认情况下,它反映了数据写入时 RegionServer 上的时间。但是,在插入数据时可以指定一个不同的时间戳。

  • The timestamp is specified using a long integer, typically containing time instances such as those returned by java.util.Date.getTime() or System.currentTimeMillis(). This represents the difference, measured in milliseconds, from midnight, January 1, 1970 UTC.
  • 时间戳使用长整型指定,通常包含时间实例,例如由 java.util.Date.getTime() 或 System.currentTimeMillis() 返回的值。这表示从协调世界时 (UTC) 1970年1月1日午夜开始经过的毫秒数。
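
As a quick illustration of the millisecond convention, the Python sketch below produces a timestamp of the same kind as `System.currentTimeMillis()` and converts one back to a UTC date. The helper names `current_millis` and `millis_to_utc` are hypothetical.

```python
import time
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def current_millis() -> int:
    """Milliseconds elapsed since midnight, January 1, 1970 UTC."""
    return time.time_ns() // 1_000_000


def millis_to_utc(timestamp_ms: int) -> datetime:
    """Convert a millisecond timestamp back to a UTC datetime."""
    return EPOCH + timedelta(milliseconds=timestamp_ms)


now_ms = current_millis()
print(now_ms, millis_to_utc(now_ms).isoformat())
```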

Cell Specification

单元格规范

A {row, column, version} tuple precisely identifies a cell in HBase. It is possible to have multiple cells with the same row and column but differing only in the version dimension.
一个 {row, column, version} 元组精确地标识了 HBase 中的一个单元格。可能存在多个单元格具有相同的行和列,仅在版本维度上有所不同。

| Dimension 维度 | Type 类型 | Description 描述 |
| --- | --- | --- |
| Row | Byte array | The primary key for records in the table |
| 行 | 字节数组 | 表中记录的主键 |
| Column | Byte array | Grouped into column families |
| 列 | 字节数组 | 分组到列族中 |
| Version | Long integer | Specifies the version of the cell |
| 版本 | 长整型 | 指定单元格的版本 |

Version Storage

版本存储

The HBase version dimension is stored in decreasing order, ensuring that when reading from a store file, the most recent values are encountered first.
HBase 的版本维度按降序存储,确保在从存储文件中读取时,最先遇到的是最新的值。
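
The descending version order can be sketched with a sort key that negates the timestamp. This is an illustrative model, not HBase code: the tuple layout `(row, column, timestamp, value)` and the helper names are assumptions.

```python
def sort_cells(cells):
    """Order cells by row, then column, then newest timestamp first."""
    return sorted(cells, key=lambda c: (c[0], c[1], -c[2]))


def read_versions(cells, row, column, max_versions=1):
    """Return up to max_versions (timestamp, value) pairs, newest first."""
    matching = [c for c in sort_cells(cells) if c[0] == row and c[1] == column]
    return [(ts, value) for _, _, ts, value in matching[:max_versions]]


cells = [
    ("row1", "info:name", 100, "a"),
    ("row1", "info:name", 300, "c"),
    ("row1", "info:name", 200, "b"),
    ("row1", "info:age", 150, "30"),
]

print(read_versions(cells, "row1", "info:name"))
print(read_versions(cells, "row1", "info:name", max_versions=2))
```

Because the newest version sorts first, a reader that wants only the latest value can stop at the first matching cell.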

Classes and Interfaces

类和接口

Overview of Key Classes and Interfaces

关键类和接口概述

| Class or Interface 类或接口 | Package 包 | Description 描述 |
| --- | --- | --- |
| ColumnCountGetFilter | org.apache.hadoop.hbase.filter | Simple filter that returns the first N columns on a row. Unsuited for Scan filters. |
| ColumnCountGetFilter | org.apache.hadoop.hbase.filter | 简单的过滤器,返回一行中的前 N 列。不适用于 Scan 过滤器。 |
| ColumnFamilyDescriptor | org.apache.hadoop.hbase.client | Contains information about a column family, like versions and compression settings. |
| ColumnFamilyDescriptor | org.apache.hadoop.hbase.client | 包含有关列族的信息,如版本和压缩设置。 |
| Cell | org.apache.hadoop.hbase | Interface for a single stored cell (row, column family, column qualifier, timestamp, type, value). Comparing cells is only meaningful for keys in the same table. |
| Cell | org.apache.hadoop.hbase | 表示单个存储单元格的接口(行、列族、列限定符、时间戳、类型、值)。单元格之间的比较仅对同一表中的键有意义。 |
| VersionInfo | org.apache.hadoop.hbase.util | Finds the version information for HBase. |
| VersionInfo | org.apache.hadoop.hbase.util | 查找 HBase 的版本信息。 |
| TimestampsFilter | org.apache.hadoop.hbase.filter | Returns only cells with timestamps (versions) in the specified list. |
| TimestampsFilter | org.apache.hadoop.hbase.filter | 仅返回时间戳(版本)在指定列表中的单元格。 |
| Table | org.apache.hadoop.hbase.client | Extends Closeable. Used to communicate with a single HBase table for operations like get, put, delete, or scan. |
| Table | org.apache.hadoop.hbase.client | 继承 Closeable。用于与单个 HBase 表通信,以执行 get、put、delete 或 scan 等操作。 |
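
The behaviour of the two filters in the table can be approximated in plain Python. These functions are not the HBase API; `first_n_columns` and `with_timestamps` are hypothetical helpers that model what ColumnCountGetFilter and TimestampsFilter are described as doing, applied to one row held as a dictionary of `column -> {timestamp: value}`.

```python
def first_n_columns(row_cells, n):
    """Model of ColumnCountGetFilter: keep the first n columns of a row."""
    return {col: row_cells[col] for col in sorted(row_cells)[:n]}


def with_timestamps(row_cells, timestamps):
    """Model of TimestampsFilter: keep only versions with the given timestamps."""
    wanted = set(timestamps)
    result = {}
    for col, versions in row_cells.items():
        kept = {ts: value for ts, value in versions.items() if ts in wanted}
        if kept:
            result[col] = kept
    return result


row = {
    "info:name": {100: "niit", 200: "NIIT"},
    "info:age": {100: "30"},
    "info:sex": {200: "male"},
}

print(first_n_columns(row, 2))
print(with_timestamps(row, [200]))
```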

Key Concepts Recap

关键概念回顾

  • HBase stores data as a table consisting of rows and columns.
  • HBase 将数据存储为由行和列组成的表。
  • RowKey: A byte array serving as the primary key for each record, enhancing data access speed.
  • 行键 (RowKey):一个作为每条记录主键的字节数组,可提高数据访问速度。
  • Columns are grouped into ==column families==, with all members sharing a prefix.
  • 列被分组到==列族==中,所有成员共享一个前缀。
  • ==Column qualifiers== are specific names assigned to data values for accurate identification.
  • ==列限定符==是分配给数据值的特定名称,用于准确识别。
  • Each cell in HBase is identified by a unique combination of row key, column family, and column qualifier, and can hold multiple timestamped versions of its value.
  • HBase 中的每个单元格都由行键、列族和列限定符的唯一组合来标识,并且可以保存其值的多个带时间戳的版本。
  • The ==KeyValue== class is crucial for data storage, wrapping a byte array for content interpretation.
  • ==KeyValue== 类对于数据存储至关重要,它包装一个字节数组以解释内容。