Chapter 5 - A saved job retains all the information needed to execute a specified Sqoop command; once a saved job is created, it can be executed at any time. By default, job descriptions are saved to a private repository stored...
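The saved-job workflow above can be sketched with the `sqoop job` tool. This is a sketch, not runnable without a Sqoop installation and a reachable database; the job name, JDBC URL, and table are placeholders.

```shell
# Create a saved job named "import-orders" (all names/URLs are placeholders).
# Everything after the standalone "--" is the Sqoop command the job stores.
sqoop job --create import-orders -- \
    import \
    --connect jdbc:mysql://db.example.com/shop \
    --username reporting \
    --table orders

# List saved jobs and inspect a stored definition.
sqoop job --list
sqoop job --show import-orders

# Execute the saved job whenever needed.
sqoop job --exec import-orders
```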

Chapter 4 - The Sqoop export tool can export a set of files from HDFS back to an RDBMS. The target table must already exist in the database. The input files are read and parsed into a set of records according to the user-specified...
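A minimal export invocation might look like the following sketch; it assumes a Sqoop installation, and the connection string, table, HDFS path, and delimiter are placeholders.

```shell
# Export delimited HDFS files into an existing RDBMS table.
# The target table "sales" must already exist in the database.
sqoop export \
    --connect jdbc:mysql://db.example.com/shop \
    --username reporting \
    --table sales \
    --export-dir /user/hive/warehouse/sales \
    --input-fields-terminated-by ','   # how the input records are parsed
```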

Chapter 3 - The import tool can be used to import a single table from an RDBMS into HDFS. Each row of the table is represented as a separate record in HDFS. Records can be stored in text format (one record per line), or in binary...
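The text-versus-binary storage choice above maps directly onto import flags. A sketch, assuming a Sqoop installation; connection details and table names are placeholders.

```shell
# Import one table into HDFS as delimited text (one record per line).
sqoop import \
    --connect jdbc:mysql://db.example.com/shop \
    --username reporting \
    --table customers \
    --target-dir /data/customers \
    --fields-terminated-by '\t' \
    --num-mappers 4      # split the import across 4 parallel map tasks

# The same table in a binary container format instead of text.
sqoop import \
    --connect jdbc:mysql://db.example.com/shop \
    --username reporting \
    --table customers \
    --as-sequencefile
```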

Chapter 2 - Sqoop is a collection of related tools. To use Sqoop, you specify the tool you want to use and the arguments that control that tool. Sqoop-Tool - The advantage of using the alias scripts is that you can avoid spelling errors by pressing...
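The tool-plus-arguments pattern, and the alias-script alternative, can be sketched as follows (connection string and table are placeholders; requires a Sqoop installation):

```shell
# Generic form: sqoop TOOL [tool-arguments]
sqoop help               # list the available tools
sqoop help import        # show usage for one tool

# These two invocations are equivalent; sqoop-import is an alias
# script that hard-codes the tool name.
sqoop import  --connect jdbc:mysql://db.example.com/shop --table orders
sqoop-import  --connect jdbc:mysql://db.example.com/shop --table orders
```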

Sqoop is a tool for efficient bulk data transfer between the Hadoop Distributed File System (HDFS) and relational database management systems (RDBMS). Sqoop graduated from the Apache Incubator to become an Apache top-level project in March 2012.

FLUME NOTES - Apache Flume is a distributed, highly reliable, and highly available tool for collecting, aggregating, and moving large amounts of log data from many different sources to a centralized data store. It is the top...
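The source-to-store pipeline described above is wired together in an agent properties file. A minimal single-agent configuration sketch; the agent name, port, and component names are placeholders.

```properties
# Agent "a1": one netcat source -> in-memory channel -> logger sink.
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: read lines of text from a TCP socket.
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Channel: buffer events in memory between source and sink.
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

# Sink: log events at INFO level (useful for testing).
a1.sinks.k1.type = logger

# Bind the pieces together (note: sources use "channels", sinks "channel").
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```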

Chapter 8 - Machine learning, a branch of artificial intelligence, is dedicated to creating models and algorithms that enable computers to learn from data and improve from previous experience without being explicitly programmed for each specific...
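The "learn from data rather than explicit rules" idea can be illustrated with a tiny least-squares fit in plain Python; the data and function name here are made up for illustration.

```python
def fit_line(xs, ys):
    """Learn slope a and intercept b of y = a*x + b from example
    pairs by ordinary least squares: the parameters come from the
    data, not from hand-coded rules."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    a = cov / var
    b = mean_y - a * mean_x
    return a, b

# Toy training data generated from the rule y = 2x + 1.
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]
a, b = fit_line(xs, ys)
print(a, b)  # the fitted parameters recover the underlying rule: 2.0 1.0
```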

Chapter 7 - Spark Streaming is an extension of Apache Spark that enables scalable, high-throughput, fault-tolerant stream processing of real-time data streams. It allows you to process real-time data from various sources such as Kafka,...
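A minimal DStream word-count sketch of the idea above. This assumes a local Spark installation with PySpark available and text arriving on a TCP socket; the host, port, and app name are placeholders, and it is not runnable without Spark.

```python
# Sketch only: requires a Spark installation; host/port are placeholders.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, 5)  # process the stream in 5-second micro-batches

# Treat lines of text arriving on a TCP socket as the input stream.
lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()  # print each batch's word counts

ssc.start()
ssc.awaitTermination()
```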

Chapter 6 - Spark SQL - Spark SQL is a module for structured data processing in Apache Spark. It allows users to run SQL queries, use the DataFrame and Dataset APIs, and seamlessly mix SQL queries with Spark programs.
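The SQL/DataFrame mixing described above can be sketched as follows. This assumes a local Spark installation with PySpark; the input file and view name are hypothetical, and the sketch is not runnable without Spark.

```python
# Sketch only: requires a Spark installation; "people.json" is a placeholder.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SparkSQLExample").getOrCreate()

# DataFrame API over structured data.
df = spark.read.json("people.json")
df.select("name", "age").filter(df.age > 21).show()

# Expose the same DataFrame to SQL and mix the two styles freely.
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 21").show()
```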

Table Management in Hive - Table management in Hive involves creating databases and tables, defining schemas, and organizing data storage. In this section, we cover how to create and drop databases and tables, and how to manage table properties such as partitions and...
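The create/drop and partition operations listed above map onto HiveQL DDL. A sketch assuming a running Hive installation; the database, table, and column names are placeholders.

```sql
-- Create a database and a partitioned table (names are placeholders).
CREATE DATABASE IF NOT EXISTS sales_db;
USE sales_db;

CREATE TABLE IF NOT EXISTS orders (
    order_id  INT,
    amount    DOUBLE
)
PARTITIONED BY (order_date STRING)   -- partition column stays out of the schema
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Manage table properties and inspect partitions.
ALTER TABLE orders SET TBLPROPERTIES ('comment' = 'demo table');
SHOW PARTITIONS orders;

-- Dropping in reverse order of creation.
DROP TABLE IF EXISTS orders;
DROP DATABASE IF EXISTS sales_db;
```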