Chapter 4

  1. The Sqoop export tool exports a set of files from HDFS back to an RDBMS.
  2. The target table must already exist in the database. The input files are read and parsed into a set of records according to the user-specified delimiters.
  3. The default operation is to transform these into a set of INSERT statements that insert the records into the database.
  4. In “update mode,” Sqoop will generate UPDATE statements that replace existing records in the database, and in “call mode” Sqoop will call a stored procedure for each record.
  5. Syntax: sqoop export (generic-args) (export-args). An example follows the table below.
  6. Export control arguments:
Argument                                   Description
--columns <col,col,col…>                   Columns to export to table
--direct                                   Use direct export fast path
--export-dir <dir>                         HDFS source path for the export
-m,--num-mappers <n>                       Use n map tasks to export in parallel
--table <table-name>                       Table to populate
--call <stored-proc-name>                  Stored procedure to call
--update-key <col-name>                    Anchor column to use for updates. Use a comma-separated list of columns if there is more than one column.
--update-mode <mode>                       Specify how updates are performed when new rows are found with non-matching keys in the database. Legal values for mode include updateonly (default) and allowinsert.
--input-null-string <null-string>          The string to be interpreted as null for string columns
--input-null-non-string <null-string>      The string to be interpreted as null for non-string columns
--staging-table <staging-table-name>       The table in which data will be staged before being inserted into the destination table.
--clear-staging-table                      Indicates that any data present in the staging table can be deleted.
--batch                                    Use batch mode for underlying statement execution.
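
For example, a minimal export that loads the files under /results/bar_data into an existing table bar (the foo database, bar table, and path are illustrative names, reused in the call-mode example later in this chapter):

$ sqoop export --connect jdbc:mysql://db.example.com/foo --table bar \
    --export-dir /results/bar_data -m 4
# -m 4 (four parallel map tasks) is shown for illustration; 4 is also Sqoop's default
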
  1. When using the export tool, you must use the --export-dir parameter to specify the HDFS directory that contains the source data, and either specify the destination table with --table or the stored procedure to call with --call. Note that --table and --call cannot be used at the same time.

  2. By default, all columns in the table are selected for export. You can also use the --columns parameter to choose which columns to export and to control the order in which they are exported; its value is a comma-separated list of column names.

  3. Some databases provide a direct mode for exports as well. Use the --direct argument to select this code path, which may offer higher performance than the standard JDBC code path.

  4. Since Sqoop breaks the export process down into multiple transactions, a failed export job may leave partial data committed to the database. This can cause subsequent jobs to fail due to insert collisions in some cases, or lead to duplicated data in others. You can avoid this problem by specifying a staging table via the --staging-table option; it acts as an auxiliary table in which the exported data is staged, and the staged data is finally moved to the destination table in a single transaction (see the sketch after this list).

  5. In order to use the staging facility, you must create the staging table prior to running the export job. This table must be structurally identical to the target table. This table should either be empty before the export job runs, or the --clear-staging-table option must be specified. If the staging table contains data and the --clear-staging-table option is specified, Sqoop will delete all of the data before starting the export job.

  6. There are two modes for exporting data: insert mode and update mode. Insert mode is generally used for a full export, that is, exporting data into an empty table. Update mode is typically used for incremental export, and it has two variants: the default updateonly, and allowinsert.

    1. Insert mode: If --update-key is not specified, Sqoop completes the export using the default insert mode, converting each record into an INSERT statement against the database table. If the target table has constraints such as a unique constraint, be careful when using insert mode to avoid violating them: if one record fails to insert, the entire export job will eventually fail. Insert mode is typically used to export data into a new, empty table.
    2. Update mode: If --update-key is specified, the export is done in update mode. The default update mode is updateonly (you can also set it explicitly with --update-mode updateonly). In updateonly mode, Sqoop only modifies rows that already exist in the database table: each input record is converted into an UPDATE statement, and which row it modifies is determined by the column(s) named in --update-key. If an UPDATE statement matches no row in the database, no new data is inserted, but no error is reported either, and the export continues. In short, new records are not exported to the database; only existing records are updated. You can instead specify --update-mode allowinsert, which updates existing records and inserts new ones at the same time; whether each record is updated or inserted is again determined by the column(s) named in --update-key.

  7. Arguments Explained:

    1. The --export-dir argument and one of --table or --call are required. These specify the table to populate in the database (or the stored procedure to call), and the directory in HDFS that contains the source data.

    2. Legal values for --update-mode include updateonly (default) and allowinsert. In updateonly mode data is only updated, never inserted; if you set --update-mode to allowinsert, rows that do not yet exist in the target table are inserted while existing rows are updated.

    3. --update-key specifies the column(s) to match on when performing the update, separated by commas if there is more than one. Note that an update only occurs when all reference columns match.

  8. In call mode, Sqoop calls a stored procedure for each record to insert or update data. Stored procedures need to be created on the database in advance.

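A sketch of the staging and update options discussed above (foo/bar are the illustrative names used throughout; the bar_stage table and the id key column are hypothetical; note that staging is not available for --direct exports or together with --update-key):

# Insert-mode export staged through bar_stage (must exist and mirror bar's structure)
$ sqoop export --connect jdbc:mysql://db.example.com/foo --table bar \
    --staging-table bar_stage --clear-staging-table \
    --export-dir /results/bar_data

# Update-mode export keyed on id; allowinsert also inserts rows with no matching key
$ sqoop export --connect jdbc:mysql://db.example.com/foo --table bar \
    --update-key id --update-mode allowinsert \
    --export-dir /results/bar_data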

Call Mode

In call mode, Sqoop calls a stored procedure for each record to insert or update data. The stored procedure needs to be created in the database in advance. For example, to export data into a bar table in MySQL's foo database, create a stored procedure named barproc as follows:

use foo;
delimiter $$
CREATE PROCEDURE barproc (IN c1 TYPE1, IN c2 TYPE2…)
BEGIN
  INSERT INTO bar(col1, col2…) VALUES(c1, c2…);
END$$
delimiter ;

Then write the Sqoop command; Sqoop will call the stored procedure barproc once for each record in /results/bar_data:

$ sqoop-export --connect jdbc:mysql://db.example.com/foo --call barproc \
    --export-dir /results/bar_data

Exports and Transactions

  1. The Sqoop export operation turns the input records into INSERT statements; a single multi-row INSERT statement can insert up to 100 records.
  2. Tasks that write to the database commit their current transaction periodically (by default every 100 statements, i.e., every 10,000 records), which guarantees that transaction buffers do not grow without bound and avoids the risk of memory exhaustion.
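
Both sizes can be overridden with -D properties placed before the tool arguments. The values shown below are the defaults, so this command behaves like a plain export (foo/bar are the illustrative names from above):

$ sqoop export -Dsqoop.export.records.per.statement=100 \
    -Dsqoop.export.statements.per.transaction=100 \
    --connect jdbc:mysql://db.example.com/foo --table bar \
    --export-dir /results/bar_data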

Export Failure Handling

  1. Exports may fail for a number of reasons:

    1. Loss of connectivity from the Hadoop cluster to the database (due either to a hardware fault or to a server software crash)
    2. Attempting to INSERT a row that violates a consistency constraint (for example, inserting a duplicate primary key value)
    3. Attempting to parse an incomplete or malformed record from the HDFS source data
    4. Attempting to parse records using incorrect delimiters (such as , ; : -)
    5. Capacity issues (such as insufficient RAM or disk space)

Exporting Data from Hive

  1. Sqoop does not support exporting data directly from Hive tables; you can only export the table's underlying data from HDFS with the --export-dir option. The steps are as follows:

    1. Determine the structure of the Hive table to be exported, whether it is a partitioned table, whether compression is enabled, etc.
    2. Determine the actual storage location in HDFS for the data in the Hive table.
    3. Determine the delimiter settings for the source data.
    4. In the database, create a target table for the exported data whose structure matches the Hive table.
    5. Write a Sqoop export command that loads the data into the target table in the database (see the sketch after this list), noting that:
      1. Use --export-dir to correctly specify the HDFS directory where the Hive data is located
      2. The delimiter settings should match those of the source table
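
A sketch under assumptions: a non-partitioned, uncompressed text-format Hive table mydb.mytable at the default warehouse location, using Hive's default field delimiter \001; the database names, table names, and path are all hypothetical:

# Find the table's actual HDFS location
$ hive -e "DESCRIBE FORMATTED mydb.mytable" | grep Location

# Export those files, matching Hive's default field delimiter
$ sqoop export --connect jdbc:mysql://db.example.com/foo --table mytable \
    --export-dir /user/hive/warehouse/mydb.db/mytable \
    --input-fields-terminated-by '\001'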

Exporting Data from HBase

  1. Sqoop does not support exporting data directly from HBase tables, but it can be done indirectly with the help of Hive tables (see the sketch after this list):

    1. Create an external table in Hive backed by the HBase table.
    2. Create the Hive internal table based on the Hive external table we just created.
    3. Load the data from the Hive external table into the Hive internal table.
    4. Export the data from the Hive internal table to the target table that we created in advance in the database.
    5. Clean up the Hive temporary table if necessary.
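
A sketch under assumptions: an HBase table bar with a single column family cf, mapped into Hive as bar_hbase and copied into a plain internal table bar_export; these names, the column list, and the warehouse path are all hypothetical:

# Steps 1-3 in Hive: map the HBase table, then materialize it as an internal table
$ hive -e "
CREATE EXTERNAL TABLE bar_hbase (id STRING, col1 STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,cf:col1')
TBLPROPERTIES ('hbase.table.name' = 'bar');
CREATE TABLE bar_export (id STRING, col1 STRING);
INSERT OVERWRITE TABLE bar_export SELECT * FROM bar_hbase;"

# Step 4: export the internal table's files to the pre-created database table
$ sqoop export --connect jdbc:mysql://db.example.com/foo --table bar \
    --export-dir /user/hive/warehouse/bar_export \
    --input-fields-terminated-by '\001'

# Step 5: clean up the temporary Hive tables if necessary
$ hive -e "DROP TABLE bar_export; DROP TABLE bar_hbase;"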

Export Results Validation

  1. To validate the exported results, you can use the --validate option.

Validation Arguments

Argument                                    Description
--validate                                  Enable validation of data copied; supports single table copy only.
--validator <class-name>                    Specify the validator class to use.
--validation-threshold <class-name>         Specify the validation threshold class to use.
--validation-failurehandler <class-name>    Specify the validation failure handler class to use.
  1. The purpose of validation is to compare the records before and after exporting, and thus to know whether the exported results are as expected.

  2. There are 3 basic interfaces:

    1. ValidationThreshold: determines whether the error margin between the source and target is acceptable: Absolute, Percentage Tolerant, etc. The default implementation is AbsoluteValidationThreshold, which ensures the row counts from source and target are the same.
    2. ValidationFailureHandler: responsible for handling failures: log an error/warning, abort, etc. The default implementation is LogOnFailureHandler, which logs a warning message to the configured logger.
    3. Validator: drives the validation logic by delegating the decision to the ValidationThreshold and failure handling to the ValidationFailureHandler. The default implementation is RowCountValidator, which validates the row counts from the source and the target.
  3. These interfaces can be found under the org.apache.sqoop.validation package. For example:

$ sqoop export --connect jdbc:mysql://db.example.com/foo --table bar \
    --export-dir /results/bar_data --validate
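
The default classes can also be named explicitly. The sketch below mirrors the validation example in the Sqoop user guide; org.apache.sqoop.validation.AbortOnFailureHandler is an alternative handler that aborts on failure instead of logging:

$ sqoop export --connect jdbc:mysql://db.example.com/foo --table bar \
    --export-dir /results/bar_data --validate \
    --validator org.apache.sqoop.validation.RowCountValidator \
    --validation-threshold org.apache.sqoop.validation.AbsoluteValidationThreshold \
    --validation-failurehandler org.apache.sqoop.validation.AbortOnFailureHandler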

Export Performance Optimization

  1. When the target database supports it, using the --direct parameter in the command may improve export performance.

  2. By default, Sqoop's export bundles up to 100 rows into each INSERT statement (see Exports and Transactions above). To improve export speed further when the data volume is large, you can:

    1. Add the --batch option to the command to enable JDBC batch processing
    2. Raise the number of rows inserted in bulk by each SQL statement (the sqoop.export.records.per.statement property)
    3. Set the number of statements submitted in a single transaction (the sqoop.export.statements.per.transaction property)

  3. With -Dsqoop.export.statements.per.transaction=10000, we can specify how many INSERT statements are executed in a single transaction. A higher value generally helps performance, but the effect depends on the database.

  4. By adding -Djdbc.transaction.isolation=TRANSACTION_READ_UNCOMMITTED to the Sqoop export command, you can change the database's transaction isolation level to read uncommitted, improving export speed and avoiding deadlocks and similar problems at the cost of a lower isolation level. A combined sketch follows.
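
A combined sketch of these options (foo/bar as before; the property values are illustrative, not recommendations):

$ sqoop export -Dsqoop.export.records.per.statement=1000 \
    -Dsqoop.export.statements.per.transaction=10000 \
    -Djdbc.transaction.isolation=TRANSACTION_READ_UNCOMMITTED \
    --connect jdbc:mysql://db.example.com/foo --table bar \
    --export-dir /results/bar_data --batch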