What is Hive?
Definition: Apache Hive is an open-source data warehouse framework built on Hadoop. It allows users to query and analyze large datasets stored in HDFS (Hadoop Distributed File System) using a SQL-like language called HiveQL...

Chapter 4
RDD (Resilient Distributed Dataset) is Spark’s core low-level abstraction. An RDD is an immutable, partitioned collection of elements that can be processed in parallel. It is resilient because Spark can recompute lost partitions using the RDD’s lineage (the sequence of operations that...
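The idea of rebuilding a lost partition from lineage can be sketched in plain Python. This is only a toy model under stated assumptions, not Spark's implementation or API: the class `ToyRDD` and its methods `compute` and `lose_partition` are hypothetical names invented for illustration, and the base data is assumed to live in reliable storage.

```python
from typing import Callable, List, Optional


class ToyRDD:
    """Hypothetical toy stand-in for an RDD (not Spark's API)."""

    def __init__(self,
                 source: Optional[List[list]] = None,
                 parent: Optional["ToyRDD"] = None,
                 fn: Optional[Callable[[list], list]] = None):
        self._source = source   # base data, one list per partition
        self._parent = parent   # lineage: the dataset this one derives from
        self._fn = fn           # lineage: function applied to each parent partition
        self._cache = {}        # partitions that are currently materialized

    @property
    def num_partitions(self) -> int:
        if self._parent is None:
            return len(self._source)
        return self._parent.num_partitions

    def map(self, f: Callable) -> "ToyRDD":
        # Returns a new dataset; the existing one is never modified.
        return ToyRDD(parent=self, fn=lambda part: [f(x) for x in part])

    def compute(self, i: int) -> list:
        # Rebuild partition i from lineage whenever it is not materialized.
        if i not in self._cache:
            if self._parent is None:
                self._cache[i] = list(self._source[i])
            else:
                self._cache[i] = self._fn(self._parent.compute(i))
        return self._cache[i]

    def lose_partition(self, i: int) -> None:
        # Simulate losing one materialized partition (e.g. a failed worker).
        self._cache.pop(i, None)

    def collect(self) -> list:
        return [x for i in range(self.num_partitions) for x in self.compute(i)]


base = ToyRDD(source=[[1, 2], [3, 4]])   # two partitions
doubled = base.map(lambda x: x * 2)
before = doubled.collect()               # [2, 4, 6, 8]
doubled.lose_partition(1)                # drop the second partition
after = doubled.collect()                # partition 1 is recomputed from lineage
```

Only the lost partition is recomputed in this sketch; partition 0 is still served from the cache, which mirrors the per-partition recovery described above.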

Chapter 5
Spark Job Introduction
Transformations on a Spark RDD are lazy: they do not perform any computation by themselves. As transformations are applied, Spark automatically builds the RDD lineage graph (a logical DAG), and the work is carried out only when an action is executed...
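The split between transformations and actions can be illustrated with a small pure-Python sketch. It is a toy model, not Spark code: `LazyPipeline` and its `lineage` method are hypothetical names, and the recorded list of steps stands in for the logical DAG.

```python
class LazyPipeline:
    """Hypothetical toy model of lazy transformations (not Spark's API)."""

    def __init__(self, data, steps=()):
        self._data = list(data)
        self._steps = tuple(steps)   # recorded (name, function) pairs

    def map(self, f):
        # Transformation: only records the step, runs nothing.
        return LazyPipeline(self._data, self._steps + (("map", f),))

    def filter(self, pred):
        # Transformation: only records the step, runs nothing.
        return LazyPipeline(self._data, self._steps + (("filter", pred),))

    def lineage(self):
        # Names of the recorded steps, in the order they were applied.
        return [name for name, _ in self._steps]

    def collect(self):
        # Action: replays every recorded step over the data.
        out = self._data
        for name, f in self._steps:
            if name == "map":
                out = [f(x) for x in out]
            else:
                out = [x for x in out if f(x)]
        return out


calls = []

def square(x):
    calls.append(x)   # record each invocation so laziness is observable
    return x * x

pipeline = LazyPipeline(range(5)).map(square).filter(lambda x: x % 2 == 0)
calls_before_action = len(calls)   # 0: nothing has run yet
result = pipeline.collect()        # the action triggers the computation
calls_after_action = len(calls)    # 5: square ran once per element
```

Before `collect()` is called, `square` has not been invoked at all; the pipeline only holds the plan `["map", "filter"]`.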

Chapter 3
The Spark framework is developed in Scala, and therefore developing Spark applications in Scala is natural. While Spark provides APIs for Python, Java, and R, Scala remains the most native and concise language for Spark development...

Chapter 2
Spark runs in a variety of modes. On a single machine it can run in either local mode or pseudo-distributed mode. When running in cluster mode in a distributed setup, the underlying resource scheduling can be handled by Mesos, YARN, or Spark’s own Standalone mode. Before each...
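In practice, the run mode is usually selected through Spark's master URL (for example via `--master` on `spark-submit`). The helper below is a small sketch of how those URLs are commonly written; the function name `master_url`, its parameters, and the default host are assumptions made for illustration, and the default ports (7077 for Standalone, 5050 for Mesos) are the conventional ones, which a real cluster may override.

```python
# Conventional default ports; an actual cluster may be configured differently.
DEFAULT_PORTS = {"standalone": 7077, "mesos": 5050}


def master_url(mode: str, host: str = "localhost", port: int = None,
               cores: str = "*") -> str:
    """Build a Spark master URL string for the given run mode (hypothetical helper)."""
    if mode == "local":
        # local[N] runs Spark in one JVM with N worker threads; * means all cores.
        return f"local[{cores}]"
    if mode == "yarn":
        # On YARN the cluster location comes from the Hadoop configuration.
        return "yarn"
    if mode in DEFAULT_PORTS:
        scheme = "spark" if mode == "standalone" else "mesos"
        return f"{scheme}://{host}:{port or DEFAULT_PORTS[mode]}"
    raise ValueError(f"unknown mode: {mode}")


local_master = master_url("local")                               # local[*]
standalone_master = master_url("standalone", host="master-node") # spark://master-node:7077
```

The host name `master-node` is a placeholder, not a value taken from the notes.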

Chapter 1
Spark Background
Although MapReduce is suitable for most batch processing work and became the preferred technology for enterprise big data processing in the era of big data, its limitations prompted the creation of Spark. MapReduce...