🗂️ HDFS Distributed File System
Overview of HDFS
“Hadoop Distributed File System (HDFS) is a distributed file system built on the principles of write-once, read-many access, high fault tolerance, and high throughput.”
Key Features of HDFS
- Write Once, Read Many: Files are written once and read many times; data cannot be modified in place, which simplifies consistency and keeps reads efficient.
- High Fault Tolerance: Designed to handle hardware failures gracefully by replicating data blocks across DataNodes.
- High Throughput: Optimized for large data sets and batch processing.
Components
- NameNode: Maintains the file system namespace and block metadata, and tells clients which DataNodes hold their data.
- DataNode: Stores the actual data blocks and serves read and write requests from file system clients.
🛠️ Hadoop Shell Commands
HDFS is accessed through a set of shell commands to perform various file operations, such as moving, deleting, and changing file permissions. Below are the essential commands categorized by their functionality.
Basic Commands: View and Create Files and Directories
| Command Syntax | Description |
|---|---|
| `hdfs dfs -ls <path>` | Lists all the files in the specified path. Use `-ls -R` (or the deprecated `-lsr`) for a recursive listing. |
| `hdfs dfs -mkdir <folder name>` | Creates a directory. Note: there is no home directory by default in HDFS. |
| `hdfs dfs -touchz <file_path>` | Creates an empty (zero-length) file. |
| `hdfs dfs -touch <hdfs-file-path>` | Creates a file without any content and updates its access and modification times. |
| `hdfs dfs -cat <path>` | Prints the contents of the specified file. |
| `hdfs dfs -tail [-f] URI` | Outputs the last 1 KB of the file. The `-f` option follows the file for continuous output. |
| `hdfs dfs -test -[ezd] URI` | Checks file properties: `-e` for existence, `-z` for zero length, `-d` for directory status. |
| `hdfs dfs -text <src>` | Outputs the source file in text format. |
Count and Size Commands
| Command Syntax | Description |
|---|---|
| `hdfs dfs -count /hdfs-file-path` | Counts the number of directories and files, and the total size, under the given HDFS path. |
| `hdfs dfs -du <dirName>` | Displays the size of each file in the specified directory. |
| `hdfs dfs -dus <dirName>` | Displays the total (summary) size of the specified directory/file. |
| `hadoop fs -du hdfs://master:54310/hbase` | Displays the size of all HBase files (using an explicit HDFS URI). |
Copy and Move Commands
| Command Syntax | Description |
|---|---|
| `hdfs dfs -put /local-file-path /hdfs-file-path` | Copies a file/folder from the local file system to HDFS. |
| `hdfs dfs -copyFromLocal <local file path> <dest>` | Copies files/folders from the local file system to HDFS. |
| `hdfs dfs -copyToLocal <srcfile (on hdfs)> <local file dest>` | Copies files/folders from HDFS to the local file system. |
| `hdfs dfs -moveFromLocal <local src> <dest (on hdfs)>` | Moves files from the local file system to HDFS. |
| `hdfs dfs -cp <src (on hdfs)> <dest (on hdfs)>` | Copies files within HDFS. |
| `hadoop fs -getmerge <src> <localdst> [addnl]` | Merges the files in a source directory into a single local target file; the optional `addnl` flag adds a newline after each file. |
Remove or Delete Commands
| Command Syntax | Description |
|---|---|
| `hdfs dfs -rmr <filename/directoryName>` | Recursively deletes a file or directory from HDFS (older form of `-rm -r`). |
| `hdfs dfs -rm /file-name` | Deletes the specified file from HDFS. |
| `hdfs dfs -rmdir /directory-name` | Removes a directory only if it is empty. |
| `hdfs dfs -expunge` | Empties the trash (recycle bin) in HDFS. |
Other Commands
| Command Syntax | Description |
|---|---|
| `hdfs dfs -stat <hdfs file>` | Prints status information, such as the last modified time, of the directory or file. |
| `hdfs dfs -setrep -R -w <replication factor> <file>` | Changes the replication factor of a file/directory in HDFS; `-w` waits until replication completes. |
| `hdfs dfs -chgrp [-R] GROUP URI` | Changes the group ownership of the specified file/directory. |
| `hdfs dfs -chmod [-R] <MODE> URI` | Changes the permissions of files/directories. |
| `hdfs dfs -chown [-R] [OWNER][:[GROUP]] URI` | Changes the owner of the file/directory. |
🛠️ Hadoop Administration Commands
| Command Syntax | Description |
|---|---|
| `hadoop distcp [OPTIONS] <src_url> <dest_url>` | Copies data within or between clusters. |
| `hadoop fsck [GENERIC_OPTIONS] <path>` | Checks the health of the file system. |
| `hadoop jar <jar> [mainClass] args...` | Runs a jar file, e.g. to execute MapReduce code. |
| `hadoop balancer [-threshold <threshold>]` | Runs the cluster balancing tool. |
| `hadoop dfsadmin [GENERIC_OPTIONS]` | Runs the HDFS administration client. |
🗂️ Hadoop Commands and HDFS Architecture
🛠️ Hadoop Namenode Command
The namenode command is used to manage the Hadoop NameNode. The general syntax is:
```
hadoop namenode [-format] | [-upgrade] | [-rollback] | [-finalize] | [-importCheckpoint]
```
Options and Descriptions
| Option | Description |
|---|---|
| `-format` | Formats the NameNode: it is started, formatted, and then shut down. |
| `-upgrade` | Starts the NameNode with the upgrade option after a new version of Hadoop has been distributed. |
| `-rollback` | Rolls the NameNode back to the previous version; the cluster must be stopped first. |
| `-finalize` | Deletes the previous state of the file system, making the most recent upgrade permanent; rollback is no longer possible afterwards. |
| `-importCheckpoint` | Loads the NameNode image from the checkpoint directory and saves it into the current one. |
🚀 Hadoop Job Command
The Hadoop Job Command is essential for interacting with MapReduce jobs. The syntax is:
```
hadoop job [GENERIC_OPTIONS] [-submit <job-file>] | [-status <job-id>] | [-counter <job-id> <group-name> <counter-name>] | [-kill <job-id>] | [-events <job-id> <from-event-#> <#-of-events>] | [-history [all]] | [-list [all]] | [-kill-task <task-id>] | [-fail-task <task-id>]
```
Options and Descriptions
| Option | Description |
|---|---|
| `-submit <job-file>` | Submits a job for execution. |
| `-status <job-id>` | Prints the map and reduce completion percentages and all job counters. |
| `-counter <job-id> <group-name> <counter-name>` | Prints the value of the specified counter. |
| `-kill <job-id>` | Kills the specified job. |
| `-events <job-id> <from-event-#> <#-of-events>` | Prints the event details received by the JobTracker within the given range. |
| `-history [all]` | Prints job details, including failures and the cause of job termination. |
| `-list [all]` | `-list all` shows all jobs; `-list` shows only jobs that have not yet completed. |
| `-kill-task <task-id>` | Kills the specified task; killed tasks are not counted against failed attempts. |
| `-fail-task <task-id>` | Fails the specified task; failed tasks are counted against failed attempts. |
📁 HDFS Write Architecture
HDFS follows the Write Once – Read Many principle. Data cannot be edited once stored, but new data can be appended.
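Because stored bytes cannot be edited in place, the only way to add data to an existing file is to append to it. Below is a minimal sketch using the `FileSystem.append()` API (append support must be enabled on the cluster); the NameNode URI and file path are placeholders, not values from these notes.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsAppendExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode URI; replace with your cluster's fs.defaultFS value
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), conf);

        // Existing bytes cannot be modified, but append() adds new data at the end of the file
        try (FSDataOutputStream out = fs.append(new Path("/data/log.txt"))) {
            out.writeBytes("one more record\n");
        }
    }
}
```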
HDFS Write Request Steps
- Client Interaction: The client interacts with the NameNode to gain permission and obtain the IPs of DataNodes for writing.
- DataNode Interaction: The client writes data directly to the DataNodes, which creates replicas based on the defined replication factor (e.g., 3).
Replication Factor
If the replication factor is 3, then 3 copies of each data block are stored on different DataNodes.
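The replication factor can also be checked or changed per file from the Java API. A minimal sketch, assuming `fs` is a `FileSystem` handle obtained as in the Java examples later in these notes and the path is illustrative:

```java
// fs is an org.apache.hadoop.fs.FileSystem handle; the path below is illustrative
Path file = new Path("/niit/dependencies.txt");
boolean requested = fs.setReplication(file, (short) 3);   // ask for 3 replicas per block
short current = fs.getFileStatus(file).getReplication();  // read back the stored factor
System.out.println("setReplication accepted: " + requested + ", factor now: " + current);
```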
📥 HDFS Read Architecture
The HDFS read operation is straightforward. Here’s how it works:
HDFS Read Request Steps
HDFS 读取请求步骤
- The client requests block metadata from the NameNode.
- 客户端向 NameNode (名称节点) 请求块元数据。
- The NameNode returns a list of DataNodes where each block is stored.
- NameNode 返回存储每个块的 DataNodes (数据节点) 列表。
- The client connects to the DataNodes and reads data in parallel (e.g., Block A from DataNode 1 and Block B from DataNode 3).
- 客户端连接到 DataNodes 并并行读取数据(例如,从 DataNode 1 读取块 A,从 DataNode 3 读取块 B)。
- The client combines the blocks to reconstruct the original file.
- 客户端组合这些块以重建原始文件。
Acknowledgment Flow
After creating the required replicas during a write operation, an acknowledgment is sent to the client.
📝 Client File Writing Operation
The file operation classes in Hadoop are found in the org.apache.hadoop.fs package. These APIs support operations including opening, reading, writing, and deleting files.
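The snippets in the following sections assume a `FileSystem` handle obtained roughly as below. This is a minimal sketch, and the NameNode URI is a placeholder for your cluster address.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class HdfsClient {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode URI; replace with your cluster's fs.defaultFS value
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), conf);
        System.out.println("Connected to " + fs.getUri());
    }
}
```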
Writing to HDFS
- The client sends a create request to the DistributedFileSystem.
- The DistributedFileSystem makes an RPC call to the NameNode to create a new file.
- The client receives an FSDataOutputStream to start writing data.
File Write Process
- Data is streamed to the first DataNode, which forwards it to the next in the pipeline.
- An internal queue, known as the ack queue, maintains packets waiting for acknowledgment.
Example Code for Writing to HDFS
```java
InputStream inputStream = new BufferedInputStream(new FileInputStream("/Data/dependencies.txt"));
```
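The line above only opens the local source file; the rest of the original example is not shown in these notes. A fuller sketch of the same write flow, copying a local file into HDFS through an `FSDataOutputStream`, might look like the following (the NameNode URI and target path are illustrative):

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        // Local source file, as in the snippet above
        InputStream inputStream = new BufferedInputStream(
                new FileInputStream("/Data/dependencies.txt"));

        Configuration conf = new Configuration();
        // Placeholder NameNode URI; adjust to your cluster
        FileSystem hdfs = FileSystem.get(URI.create("hdfs://localhost:9000"), conf);

        // create() asks the NameNode for a new file and returns an FSDataOutputStream
        FSDataOutputStream outputStream = hdfs.create(new Path("/niit/dependencies.txt"));

        // Stream the local bytes into the HDFS write pipeline; 'true' closes both streams when done
        IOUtils.copyBytes(inputStream, outputStream, 4096, true);
    }
}
```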
📖 Client File Read Operation
Reading from HDFS
- The client opens the file using the open() function of FileSystem.
- The DistributedFileSystem retrieves data block information from the metadata node.
- The client reads data from the nearest DataNodes.
Example Code for Reading from HDFS
```java
InputStream inputStream = hdfs.open(path);
```
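As with the write example, only the `open()` call survives in these notes. A fuller sketch of the read path, assuming an illustrative NameNode URI and file path, could be:

```java
import java.io.InputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode URI; adjust to your cluster
        FileSystem hdfs = FileSystem.get(URI.create("hdfs://localhost:9000"), conf);

        Path path = new Path("/niit/dependencies.txt");  // illustrative file to read
        InputStream inputStream = hdfs.open(path);       // open() fetches block locations from the NameNode

        try {
            // Copy the file contents to standard output
            IOUtils.copyBytes(inputStream, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(inputStream);
        }
    }
}
```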
Handling Errors
If communication fails with a DataNode, the client attempts to connect to another DataNode containing the required data block.
🗂️ File Operations
Renaming Files
To rename a file in HDFS, you need to pass:
- `src`: The original file name.
- `dst`: The new file name.
```java
fs.rename(currentName, reName);
```
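rename() returns a boolean indicating whether the operation succeeded. A minimal sketch, assuming `fs` is an open `FileSystem` handle and both file names are illustrative:

```java
// fs is an org.apache.hadoop.fs.FileSystem handle; both paths are illustrative
Path currentName = new Path("/niit/old_name.txt");
Path reName = new Path("/niit/new_name.txt");
boolean renamed = fs.rename(currentName, reName);
System.out.println(renamed ? "rename succeeded" : "rename failed");
```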
Deleting Files
To delete a file, the method signature is:
```java
public abstract boolean delete(Path f, boolean recursive) throws IOException
```
- `f`: The path of the file to be deleted.
- `recursive`: If true, deletes the directory and its contents.
🗂️ HDFS Operations
File Deletion
To delete a file in HDFS, you can use the following code:
```java
Path delef = new Path("/home/test_2.txt");
```
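The snippet above only builds the Path. A short sketch of the full call, assuming `fs` is an open `FileSystem` handle (the path comes from these notes, the rest is illustrative):

```java
Path delef = new Path("/home/test_2.txt");
// 'true' enables recursive deletion, so a directory is removed together with its contents
boolean isDeleted = fs.delete(delef, true);
System.out.println(isDeleted ? "deletion successful" : "deletion failed");
```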
Return Values:
- True: Indicates successful deletion.
- False: Indicates failed deletion.
Directory Creation
To create a directory at a specified location, use:
```java
public boolean mkdirs(Path f) throws IOException
```
FileSystem Instance:
```java
FileSystem fs2 = FileSystem.get(URI.create(address), conf);
```
Creating the Directory:
```java
boolean isCreated = fs2.mkdirs(new Path(dir));
```
Return Values:
- True: Indicates successful creation.
- False: Indicates failed creation.
Example Output
```java
if (isCreated) {
    System.out.println("Directory created successfully");
} else {
    System.out.println("Directory creation failed");
}
```
Listing Folder Contents
To display all directories and files in the current directory, use:
```java
public abstract FileStatus[] listStatus(Path f) throws FileNotFoundException, IOException
```
FileStatus:
- Represents the status of a file or directory, including its path information.
Example Code
```java
String uri = "hdfs://niit:9000/niit";
```
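The snippet above only defines the URI used in these notes. A fuller sketch of listing that directory with listStatus(), assuming `conf` is a `Configuration` as in the earlier examples:

```java
// conf is an org.apache.hadoop.conf.Configuration; the URI below is the one used in these notes
FileSystem fs3 = FileSystem.get(URI.create("hdfs://niit:9000/niit"), conf);
FileStatus[] statuses = fs3.listStatus(new Path("hdfs://niit:9000/niit"));
for (FileStatus status : statuses) {
    // getPath() gives the full path; isDirectory() distinguishes folders from files
    System.out.println((status.isDirectory() ? "dir:  " : "file: ") + status.getPath());
}
```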