Chapter 3

  1. The import tool can be used to import a single table from an RDBMS into HDFS. Each row of the table is represented as a separate record in HDFS.

  2. Records can be stored in text format (one record per line), or in binary format as Avro or SequenceFile.

  3. Sqoop Import Tool Usage:

    sqoop import (generic-args) (import-args)  
  4. Database Connection Parameters - Sqoop is designed to import tables from a database into HDFS. To do so, you must specify a connect string that describes how to connect to the database. The connect string is similar to a URL, and is passed to the Sqoop command with the --connect argument. This describes the server and database to connect to; it may also specify the port. For example: sqoop import --connect jdbc:mysql://database.example.com/employees

  5. You should use the full hostname or IP address of the database host that can be seen by all your remote nodes.

  6. You might need to authenticate against the database before you can access it. You can use the --username argument to supply a username to the database. Sqoop provides a couple of different ways to supply a password to the database, secure and non-secure, which are detailed below.

  7. Secure way of supplying the password to the database. You should save the password in a file in the user's home directory with 400 permissions and specify the path to that file with the --password-file argument; this is the preferred method of entering credentials. Sqoop will then read the password from the file and pass it to the MapReduce cluster using secure means, without exposing the password in the job configuration. The file containing the password can be on either the local FS or HDFS. For example: sqoop import --connect jdbc:mysql://127.0.0.1:3306/employees --username niit --password-file ${user.home}/.password

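    A minimal sketch of preparing such a password file is shown below. The file name .password, the placeholder password, and the HDFS path /user/niit are assumptions for this example; only the 400 permissions and the --password-file usage come from the description above.

    # create the password file without a trailing newline and lock down its permissions
    echo -n "MySecretPassword" > $HOME/.password
    chmod 400 $HOME/.password

    # optionally copy it into HDFS and point --password-file at the HDFS path instead
    hdfs dfs -put $HOME/.password /user/niit/.password
    sqoop import --connect jdbc:mysql://127.0.0.1:3306/employees \
      --username niit --password-file /user/niit/.password --table employees
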
  8. Another way of supplying passwords is using the -P argument, which will read a password from a console prompt.

  9. You can also supply the password on the command line with the --password argument, but this is not secure: other users may be able to see the password, and credentials may still be transmitted between nodes of the MapReduce cluster using insecure means. For example: sqoop import --connect jdbc:mysql://database.example.com/employees --username aaron --password 12345

  10. Sqoop typically imports data from a table; use the --table argument to select which table to import. For example, --table employees. This argument can also identify a VIEW or other table-like entity in a database.

  11. Import control arguments:

Argument Description
--append Append data to an existing dataset in HDFS
--as-avrodatafile Imports data to Avro Data Files
--as-sequencefile Imports data to SequenceFiles
--as-textfile Imports data as plain text (default)
--as-parquetfile Imports data to Parquet Files
--boundary-query <statement> Boundary query to use for creating splits
--columns <col,col…> Columns to import from the table
--delete-target-dir Delete the import target directory if it exists
--direct Use the direct connector if one exists for the database
--fetch-size <n> Number of entries to read from the database at once
-m, --num-mappers <n> Use n map tasks to import in parallel
-e, --query <statement> Import the results of statement
--split-by <column-name> Column of the table used to split work units. Cannot be used with the --autoreset-to-one-mapper option
--split-limit <n> Upper limit for each split size. This only applies to Integer and Date columns. For date or timestamp fields it is calculated in seconds
--autoreset-to-one-mapper Import should use one mapper if a table has no primary key and no split-by column is provided. Cannot be used with the --split-by <col> option
--table <table-name> Table to read
--target-dir <dir> HDFS destination dir
--temporary-rootdir <dir> HDFS directory for temporary files created during import (overrides default "_sqoop")
--warehouse-dir <dir> HDFS parent for the table destination
--where <where clause> WHERE clause to use during import
-z, --compress Enable compression
--compression-codec <c> Use Hadoop codec (default gzip)
--null-string <null-string> The string to be written for a null value for string columns
--null-non-string <null-string> The string to be written for a null value for non-string columns
  1. The --null-string and --null-non-string arguments are optional. If not specified, then the string "null" will be used.
  2. You can select a subset of columns and control their ordering by using the --columns argument. This should include a comma-delimited list of columns to import. For example: --columns "name,employee_id,jobtitle".
  3. You can control which rows are imported by adding a SQL WHERE clause to the import statement. By default, Sqoop generates statements of the form SELECT <column list> FROM <table name>. You can append a WHERE clause to this with the --where argument. For example: --where "id > 400". Only rows where the id column has a value greater than 400 will be imported.
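    As an illustrative sketch of combining these arguments, the command below imports only selected columns of the rows matching a condition; the employees table, the column names, and the /user/niit/employees_recent target directory are assumptions for this example, not values from the text above.

    # import three columns of the rows with id > 400 into an explicit target directory
    sqoop import \
      --connect jdbc:mysql://database.example.com/employees \
      --username niit --password-file ${user.home}/.password \
      --table employees \
      --columns "name,employee_id,jobtitle" \
      --where "id > 400" \
      --target-dir /user/niit/employees_recent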

Free-form Query Imports

  1. Sqoop can also import the result set of an arbitrary SQL query. Instead of using the --table, --columns, and --where arguments, you can specify a SQL statement with the --query argument.
  2. When importing a free-form query, you must specify a destination directory with --target-dir.
  3. If you want to import the results of a query in parallel, then each map task will need to execute a copy of the query, with results partitioned by bounding conditions inferred by Sqoop. Your query must include the token $CONDITIONS which each Sqoop process will replace with a unique condition expression. You must also select a splitting column with --split-by.
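    The sketch below shows the shape such a command can take. The two joined tables a and b, the split column a.id, and the target directory are hypothetical placeholders; the required pieces from the text above are the $CONDITIONS token, --split-by, and --target-dir. The single quotes keep the shell from expanding $CONDITIONS.

    # parallel import of a free-form join query; Sqoop substitutes $CONDITIONS per split
    sqoop import \
      --connect jdbc:mysql://database.example.com/employees \
      --username niit --password-file ${user.home}/.password \
      --query 'SELECT a.*, b.* FROM a JOIN b ON (a.id = b.id) WHERE $CONDITIONS' \
      --split-by a.id \
      --target-dir /user/niit/joinresults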

Controlling Parallelism

  1. Sqoop imports data in parallel from most database sources.
  2. You can specify the number of map tasks (parallel processes) to use to perform the import by using the -m or --num-mappers argument.
  3. By default, four tasks are used. Some databases may see improved performance by increasing this value to 8 or 16.
  4. When performing parallel imports, Sqoop will identify the primary key column (if present) in a table and use it as the splitting column.
  5. The low and high values for the splitting column are retrieved from the database, and the map tasks operate on evenly-sized components of the total range.
  6. If the actual values for the primary key are not uniformly distributed across its range, then this can result in unbalanced tasks.
  7. In that case, you should explicitly choose a different column with the --split-by argument.
  8. If a table does not have a primary key defined and --split-by <col> is not provided, then the import will fail unless the number of mappers is explicitly set to one with the --num-mappers 1 option or the --autoreset-to-one-mapper option is used.
  9. The option --autoreset-to-one-mapper is typically used with the import-all-tables tool to automatically handle tables without a primary key in a schema.
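    As a sketch of these options, the command below raises the parallelism and chooses an explicit split column; the employees table, the dept_id column, and the mapper count of 8 are illustrative assumptions, not recommendations from the text.

    # import with 8 parallel map tasks, splitting the work on an explicitly chosen column
    sqoop import \
      --connect jdbc:mysql://database.example.com/employees \
      --username niit --password-file ${user.home}/.password \
      --table employees \
      --split-by dept_id \
      --num-mappers 8 \
      --target-dir /user/niit/employees_parallel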

Controlling the Import Process

  1. --target-dir is incompatible with --warehouse-dir.
  2. When using direct mode, you can specify additional arguments which should be passed to the underlying tool.
  3. By default, imports go to a new target location. If the destination directory already exists in HDFS, Sqoop will refuse to import and overwrite that directory’s contents.
  4. If you use the --append argument, Sqoop will import data to a temporary directory and then rename the files into the normal target directory in a manner that does not conflict with existing filenames in that directory.
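    The sketch below illustrates the interplay of these directory options; the table name and paths are placeholders for this example.

    # overwrite semantics: remove the existing target directory before importing
    sqoop import --connect jdbc:mysql://database.example.com/employees \
      --table employees --target-dir /user/niit/employees \
      --delete-target-dir

    # append semantics: stage in a temporary directory, then move files in without name clashes
    sqoop import --connect jdbc:mysql://database.example.com/employees \
      --table employees --target-dir /user/niit/employees \
      --append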

Incremental Import

  1. Incremental import: Sqoop provides an incremental import mode which can be used to retrieve only rows newer than some previously-imported set of rows.
  2. The following arguments control incremental imports:
Argument Description
--check-column (col) Specifies the column to be examined when determining which rows to import. (The column should not be of type CHAR/NCHAR/VARCHAR/VARNCHAR/LONGVARCHAR/LONGNVARCHAR.)
--incremental (mode) Specifies how Sqoop determines which rows are new. Legal values for mode include append and lastmodified.
--last-value (value) Specifies the maximum value of the check column from the previous import.
  1. Sqoop supports two types of incremental imports: append and lastmodified. You can use the --incremental argument to specify the type of incremental import to perform.
  2. You should specify append mode when importing a table where new rows are continually being added with increasing row id values. You specify the column containing the row’s id with --check-column. Sqoop imports rows where the check column has a value greater than the one specified with --last-value.
  3. An alternate table update strategy supported by Sqoop is called lastmodified mode. You should use this when rows of the source table may be updated, and each such update will set the value of a last-modified column to the current timestamp. Rows where the check column holds a timestamp more recent than the timestamp specified with --last-value are imported.
  4. At the end of an incremental import, the value which should be specified as --last-value for a subsequent import is printed to the screen. When running a subsequent import, you should specify --last-value in this way to ensure you import only the new or updated data. This is handled automatically by creating an incremental import as a saved job, which is the preferred mechanism for performing a recurring incremental import. See the section on saved jobs later in this document for more information. A command sketch follows below.
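    A minimal sketch of an append-mode incremental import is shown below; the employees table, the id check column, and the last-value of 400 are illustrative assumptions.

    # import only rows whose id is greater than the value recorded by the previous run
    sqoop import \
      --connect jdbc:mysql://database.example.com/employees \
      --username niit --password-file ${user.home}/.password \
      --table employees \
      --target-dir /user/niit/employees \
      --incremental append \
      --check-column id \
      --last-value 400
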
  5. Selecting a file format for the import: you can import data in one of two file formats: delimited text or SequenceFiles.
  6. Delimited text is the default import format. You can also specify it explicitly by using the --as-textfile argument. This argument will write string-based representations of each record to the output files, with delimiter characters between individual columns and rows. These delimiters may be commas, tabs, or other characters. (The delimiters can be selected; see “Output line formatting arguments.”)
  7. Output line formatting arguments:
Argument Description
--enclosed-by <char> Sets a required field enclosing character
--escaped-by <char> Sets the escape character
--fields-terminated-by <char> Sets the field separator character
--lines-terminated-by <char> Sets the end-of-line character
--mysql-delimiters Uses MySQL’s default delimiter set: fields: , lines: \n escaped-by: \ optionally-enclosed-by: '
--optionally-enclosed-by <char> Sets a field enclosing character
  1. Delimiters may be specified as:

    a. A character (--fields-terminated-by X)

    b. An escape character (--fields-terminated-by \t). Supported escape characters are:

    c. \b (backspace)
    d. \n (newline)
    e. \r (carriage return)
    f. \t (tab)
    g. \" (double-quote)
    h. \' (single-quote)
    i. \\ (backslash)
    j. \0 (NUL)
      This will insert NUL characters between fields or lines, or will disable enclosing/escaping if used for one of the --enclosed-by, --optionally-enclosed-by, or --escaped-by arguments.
  2. The default delimiters are:

    a. a comma (,) for fields,
    b. a newline (\n) for records,
    c. no quote character, and
    d. no escape character.
  3. The --mysql-delimiters argument is a shorthand argument which uses the default delimiters for the mysqldump program. If you use the mysqldump delimiters in conjunction with a direct-mode import (with --direct), very fast imports can be achieved.

  4. While the choice of delimiters is most important for a text-mode import, it is still relevant if you import to SequenceFiles with --as-sequencefile. The generated class’ toString() method will use the delimiters you specify, so subsequent formatting of the output data will rely on the delimiters you choose.

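    A hedged sketch of overriding the default delimiters is shown below; the tab/newline choices, the employees table, and the target directory are illustrative only. Single quotes keep the shell from interpreting the backslashes before Sqoop sees them.

    # write tab-separated, newline-terminated records, optionally enclosing fields in double quotes
    sqoop import \
      --connect jdbc:mysql://database.example.com/employees \
      --table employees \
      --target-dir /user/niit/employees_tsv \
      --fields-terminated-by '\t' \
      --lines-terminated-by '\n' \
      --escaped-by '\\' \
      --optionally-enclosed-by '\"'
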
  5. Input parsing arguments:

Argument Description
--input-enclosed-by <char> Sets a required field encloser
--input-escaped-by <char> Sets the input escape character
--input-fields-terminated-by <char> Sets the input field separator
--input-lines-terminated-by <char> Sets the input end-of-line character
--input-optionally-enclosed-by <char> Sets a field enclosing character
  1. When Sqoop imports data to HDFS, it generates a Java class which can reinterpret the text files that it creates when doing a delimited-format import. The delimiters are chosen with arguments such as --fields-terminated-by; this controls both how the data is written to disk, and how the generated parse() method reinterprets this data. The delimiters used by the parse() method can be chosen independently of the output arguments, by using --input-fields-terminated-by, and so on.
  2. SequenceFiles are a binary format that stores individual records in custom record-specific data types. These data types are manifested as Java classes. Sqoop will automatically generate these data types for you. This format supports exact storage of all data in binary representations, and is appropriate for storing binary data (for example, VARBINARY columns), or data that will be principally manipulated by custom MapReduce programs (reading from SequenceFiles is higher-performance than reading from text files, as records do not need to be parsed).
  3. Avro data files are a compact, efficient binary format that provides interoperability with applications written in other programming languages. Avro also supports versioning, so that when, e.g., columns are added or removed from a table, previously imported data files can be processed along with new ones.
  4. By default, data is not compressed. You can compress your data by using the deflate (gzip) algorithm with the -z or --compress argument, or specify any Hadoop compression codec using the --compression-codec argument. This applies to SequenceFile, text, and Avro files.
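    As an illustrative sketch of combining a binary file format with compression, the command below writes Avro data files compressed with Snappy; the table name, the target directory, and the choice of codec are assumptions for this example.

    # store the imported records as compressed Avro data files
    sqoop import \
      --connect jdbc:mysql://database.example.com/employees \
      --table employees \
      --target-dir /user/niit/employees_avro \
      --as-avrodatafile \
      --compress \
      --compression-codec org.apache.hadoop.io.compress.SnappyCodec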

Importing Data into Hive

  1. Sqoop’s import tool’s main function is to upload your data into files in HDFS. If you have a Hive metastore associated with your HDFS cluster, Sqoop can also import the data into Hive by generating and executing a CREATE TABLE statement to define the data’s layout in Hive. Importing data into Hive is as simple as adding the --hive-import option to your Sqoop command line.
  2. If the Hive table already exists, you can specify the --hive-overwrite option to indicate that the existing table in Hive must be replaced. After your data is imported into HDFS (or this step is omitted), Sqoop will generate a Hive script containing a CREATE TABLE operation defining your columns using Hive’s types, and a LOAD DATA INPATH statement to move the data files into Hive’s warehouse directory.
  3. Hive arguments:
Argument Description
--hive-home <dir> Override $HIVE_HOME
--hive-import Import tables into Hive (uses Hive’s default delimiters if none are set)
--hive-overwrite Overwrite existing data in the Hive table
--create-hive-table If set, then the job will fail if the target Hive table exists. By default this property is false
--hive-table <table-name> Sets the table name to use when importing to Hive
--hive-drop-import-delims Drops \n, \r, and \01 from string fields when importing to Hive
--hive-delims-replacement Replace \n, \r, and \01 in string fields with a user-defined string when importing to Hive
--hive-partition-key Name of the Hive field on which partitions are sharded
--hive-partition-value <v> String value that serves as the partition key for data imported into Hive in this job
--map-column-hive <map> Override the default mapping from SQL type to Hive type for configured columns. If you specify commas in this argument, use URL-encoded keys and values, for example, use DECIMAL(1%2C%201) instead of DECIMAL(1,1)
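
    A minimal sketch of a Hive import built from these options is shown below; the employees table and the default.employees Hive table name are placeholders for this example.

    # import into Hive, replacing any existing data in the target Hive table
    sqoop import \
      --connect jdbc:mysql://database.example.com/employees \
      --username niit --password-file ${user.home}/.password \
      --table employees \
      --hive-import \
      --hive-table default.employees \
      --hive-overwrite
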
  1. The import-all-tables tool imports a set of tables from an RDBMS to HDFS. Data from each table is stored in a separate directory in HDFS. For the import-all-tables tool to be useful, the following conditions must be met:

    a. Each table must have a single-column primary key, or the --autoreset-to-one-mapper option must be used.
    b. You must intend to import all columns of each table.
    c. You must not intend to use a non-default splitting column, nor impose any conditions via a WHERE clause.
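    The sketch below shows a typical invocation; the employees database and the warehouse directory are illustrative assumptions.

    # import every table of the database into per-table subdirectories under the warehouse dir
    sqoop import-all-tables \
      --connect jdbc:mysql://database.example.com/employees \
      --username niit --password-file ${user.home}/.password \
      --warehouse-dir /user/niit/warehouse \
      --autoreset-to-one-mapper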

Import Results Validation

  1. To validate the imported results, you can use the --validate option.

Validation Arguments

Argument Description
--validate Enable validation of the data copied; supports single table copy only
--validator <class-name> Specify the validator class to use
--validation-threshold <class-name> Specify the validation threshold class to use
--validation-failurehandler <class-name> Specify the validation failure handler class to use
  1. The purpose of validation is to compare the records in the source and the target after the copy, and thus know whether the imported results are as expected.
  2. There are 3 basic interfaces:

    a. Validation Threshold: determines whether the error margin between the source and target is acceptable: Absolute, Percentage Tolerant, etc. The default implementation is AbsoluteValidationThreshold, which ensures the row counts from source and target are the same.

    b. Validation FailureHandler: responsible for handling failures: log an error/warning, abort, etc. The default implementation is LogOnFailureHandler, which logs a warning message to the configured logger.

    c. Validator: drives the validation logic by delegating the decision to the ValidationThreshold and delegating failure handling to the ValidationFailureHandler. The default implementation is RowCountValidator, which validates the row counts from the source and the target.
  3. These interfaces can be found under the org.apache.sqoop.validation package. Here is a simple example of importing the bar table from a database into HDFS with row count validation enabled: sqoop import --connect jdbc:mysql://db.example.com/foo --table bar --target-dir /results/bar_data --validate

Import Performance Optimization

  1. Sqoop import jobs use 4 map tasks by default, which is not always the best choice. The number of map tasks can be specified with the -m or --num-mappers option; consider the following guidelines:

    a. When the amount of data is smaller than the block size defined in HDFS, only one map task should be used. This effectively reduces the execution time of the MapReduce job, reduces the number of generated files, and saves disk space.

    b. When the amount of data is large, performance can be improved by increasing the parallelism, but more is not always better. Normally the parallelism should not exceed the maximum number of virtual CPUs that can be requested from YARN for a MapReduce task on that node (the corresponding configuration item is yarn.scheduler.maximum-allocation-vcores, which can be configured in yarn-site.xml with a default value of 4).
  2. You can use --fetch-size to specify the maximum number of rows to read from the database at a time during the import; the default value is 1000. It is recommended to consider the following aspects:

    a. Whether the table to be imported is a wide table, and whether it contains large object fields or long text fields.

    b. Database performance.
  3. Using --direct mode to import sometimes can improve performance if the database supports it. Also, using the --relaxed-isolation option to instruct Sqoop to import data using the read uncommitted isolation level can improve data transfer speed if the database supports it.

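    A hedged sketch pulling these tuning options together is shown below; the mapper count, fetch size, table name, and paths are illustrative assumptions, and --direct and --relaxed-isolation should only be used if the database supports them.

    # JDBC path: tune parallelism, fetch size, and isolation level (values are illustrative)
    sqoop import --connect jdbc:mysql://database.example.com/employees \
      --username niit --password-file ${user.home}/.password \
      --table employees --target-dir /user/niit/employees_fast \
      --num-mappers 8 --fetch-size 5000 --relaxed-isolation

    # direct path: delegate to the database's native dump tool when a direct connector exists
    sqoop import --connect jdbc:mysql://database.example.com/employees \
      --username niit --password-file ${user.home}/.password \
      --table employees --target-dir /user/niit/employees_direct \
      --direct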