📊 Big Data Concept

What is Big Data?

“Big Data refers to data sets so large that current mainstream software tools cannot capture, manage, process, and organize them within a reasonable time into information that helps enterprises make better business decisions.”

Characteristics of Big Data

Volume: The sheer size of the data; size is a key factor in determining its potential value.
Variety: Heterogeneous sources and types of data, both structured and unstructured.
Velocity: The speed at which data is generated and processed.
Veracity: The quality and trustworthiness of the data being analyzed.

Examples of Big Data

Source	Description
NYSE	Generates about 1 terabyte of new trade data per day.
Social Media	Facebook ingests over 500 terabytes of new data daily.
Travel	A single jet engine can generate over 10 terabytes of data in 30 minutes of flight.
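
To make the volumes in the table easier to compare, they can be converted into average data rates. The following is a rough sketch that assumes decimal units (1 TB = 10^12 bytes) and a perfectly even rate over the period; the helper name `rate_mb_per_s` is made up for this illustration.

```python
# Convert the quoted volumes into average rates.
# Assumption: decimal units, 1 TB = 10**12 bytes, 1 MB = 10**6 bytes.
TB = 10**12
MB = 10**6
DAY = 24 * 60 * 60  # seconds in one day


def rate_mb_per_s(terabytes: float, seconds: float) -> float:
    """Average rate in MB/s for a volume produced evenly over a period."""
    return terabytes * TB / seconds / MB


rates = {
    "NYSE (1 TB per day)": rate_mb_per_s(1, DAY),
    "Facebook (500 TB per day)": rate_mb_per_s(500, DAY),
    "Jet engine (10 TB per 30 min)": rate_mb_per_s(10, 30 * 60),
}

for name, rate in rates.items():
    print(f"{name}: {rate:,.1f} MB/s")
```

Under these assumptions the NYSE figure works out to roughly 11.6 MB/s, while the other two are in the range of several gigabytes per second, which illustrates why Velocity is treated as its own characteristic.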

Types of Big Data

Structured Data

Definition: Data that can be stored, accessed, and processed in a fixed format.
Example: An ‘Employee’ table in a database.

Unstructured Data

Definition: Data with no known form or structure, which makes it hard to process and to extract value from.
Example: The output returned by a Google search.

Semi-Structured Data

Definition: Data that carries some structure (such as tags or keys) but does not follow a fixed schema, e.g., there is no table definition as in a relational DBMS.
Example: Personal data stored in an XML file.
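
The difference between structured and semi-structured data can be seen in a few lines of code. This is a minimal sketch using only the Python standard library; the table name, field names, and sample records are invented for illustration.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Structured: a fixed schema, so every row has exactly the same columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (id INTEGER, name TEXT, dept TEXT)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?)",
    [(1, "Li Wei", "Finance"), (2, "Ana Silva", "Admin")],
)
structured_rows = conn.execute(
    "SELECT name, dept FROM employee ORDER BY id"
).fetchall()

# Semi-structured: tags describe each value, but records may differ in
# which fields they contain (the second record has no <sex> element).
xml_doc = """
<people>
  <rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
  <rec><name>Seema R.</name><age>41</age></rec>
</people>
"""
root = ET.fromstring(xml_doc.strip())
people = [{child.tag: child.text for child in rec} for rec in root.iter("rec")]

print(structured_rows)
print(people)
```

The relational table rejects or pads anything that does not fit its columns, whereas the XML records are self-describing and can vary from one record to the next.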

The 4 V’s of Big Data

Characteristic	Description
Volume	The size of the data, which plays a crucial role in determining its value.
Variety	The different types and sources of data, including emails, photos, videos, and more.
Velocity	The speed at which data is generated by sources such as social media and sensors.
Veracity	The quality of the data analyzed; high-veracity data is valuable, while low-veracity data contains a lot of noise.

Advantages of Big Data Processing

Benefit	Description
Utilizing Outside Intelligence	Access to social data lets organizations refine their business strategies.
Improved Customer Service	New systems built on Big Data technologies improve how customer feedback is evaluated.
Early Risk Identification	Lets businesses identify risks to a product or service early.
Better Operational Efficiency	Big Data technologies can provide a staging area for new data before it is processed.

Hadoop Ecosystem

Key Tools and Features

HBase:
- An open-source, non-relational distributed database (NoSQL).
- Modeled after Google’s Bigtable and capable of handling large datasets.
- Runs on HDFS and provides fault-tolerant storage for sparse data.
Apache Hive:
- Data warehouse software built on Apache Hadoop for querying large datasets.
- Converts SQL-like (HiveQL) queries into MapReduce jobs for data processing.
- Drawbacks include limited transaction support and slow query speed compared with a traditional RDBMS.
Apache Storm:
- A distributed real-time computing system for processing streaming data.
- Capable of high processing rates, such as one million tuples processed per second.
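
The idea that Hive turns a SQL query into MapReduce work can be sketched in plain Python. The example below imitates how a query such as `SELECT dept, COUNT(*) FROM employee GROUP BY dept` could be expressed as a map phase, a shuffle, and a reduce phase. It is a simplified illustration, not Hive's actual query planner; the sample rows and function names are assumptions.

```python
from collections import defaultdict

# Hypothetical rows of an 'employee' table: (name, dept)
rows = [("Li", "Finance"), ("Ana", "Admin"), ("Raj", "Finance"), ("Eva", "IT")]


def map_phase(row):
    """Emit (key, 1) for each row; the key is the GROUP BY column."""
    _name, dept = row
    yield dept, 1


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Aggregate the values of one key: COUNT(*) becomes a sum of ones."""
    return key, sum(values)


mapped = [pair for row in rows for pair in map_phase(row)]
result = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(result)
```

On a real cluster the map and reduce functions run in parallel on different nodes, and the start-up cost of those jobs is one reason batch queries tend to have high latency.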

Main Features of Hadoop

Designed to store huge datasets across clusters of commodity hardware.
Uses a distributed approach to handle massive volumes of information.

Summary of Hadoop Ecosystem Tools

Tool	Description
HBase	NoSQL database for large datasets with fault-tolerant storage.
Hive	Data warehouse for managing and querying large datasets.
Storm	Real-time computing system for streaming data processing.
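
What "storage for sparse data" means for HBase can be illustrated with a toy model. The class below is not the HBase client API; it is a small in-memory sketch, with invented names, of a table keyed by row key and `family:qualifier` column in which cells that were never written take up no space.

```python
from collections import defaultdict


class SparseTable:
    """Toy model of a sparse, column-family table (not the HBase API)."""

    def __init__(self):
        # row key -> {"family:qualifier": value}; absent cells are not stored
        self._rows = defaultdict(dict)

    def put(self, row_key, column, value):
        self._rows[row_key][column] = value

    def get(self, row_key, column, default=None):
        return self._rows.get(row_key, {}).get(column, default)

    def scan(self, start, stop):
        """Yield rows with start <= key < stop, in sorted key order."""
        for key in sorted(self._rows):
            if start <= key < stop:
                yield key, dict(self._rows[key])

    def stored_cells(self):
        return sum(len(columns) for columns in self._rows.values())


table = SparseTable()
table.put("user#001", "info:name", "Li Wei")
table.put("user#001", "info:email", "li@example.com")
table.put("user#002", "info:name", "Ana Silva")
table.put("user#002", "stats:logins", 7)

print(table.get("user#002", "info:email"))   # never written, so None
print(list(table.scan("user#001", "user#002")))
print(table.stored_cells())
```

Two rows with three distinct columns would need six cells in a fixed-schema table; here only the four cells that were actually written are stored.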

🖥️ Apache Storm and Hadoop

Apache Storm

Scalability

Storm is designed to scale horizontally: users add machines and adjust the corresponding parallelism settings of the topology.

Cluster Coordination

Uses Apache Zookeeper for cluster coordination, which keeps large clusters operating reliably.

Fault Tolerance

“Once a topology is submitted, Storm runs it until the topology is killed. If an error occurs during execution, Storm reassigns the affected tasks.”

Because failed tasks are reassigned to other nodes, the failure of a single node does not stop the application.
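
The reassignment behaviour described above can be sketched as follows. This is a simplified model, not Storm's scheduler: it assumes a plain round-robin assignment, and the task names, node names, and function names are hypothetical.

```python
def assign(tasks, workers):
    """Spread tasks over the given workers round-robin."""
    if not workers:
        raise RuntimeError("no live workers available")
    return {task: workers[i % len(workers)] for i, task in enumerate(tasks)}


def reassign_after_failure(assignment, failed, workers):
    """Move only the tasks of the failed worker onto the surviving workers."""
    survivors = [w for w in workers if w != failed]
    orphaned = [task for task, worker in assignment.items() if worker == failed]
    updated = dict(assignment)
    updated.update(assign(orphaned, survivors))
    return updated


workers = ["node-a", "node-b", "node-c"]
tasks = ["spout-0", "bolt-0", "bolt-1", "bolt-2"]

before = assign(tasks, workers)
after = reassign_after_failure(before, failed="node-a", workers=workers)

print(before)
print(after)
```

Tasks that were running on healthy nodes stay where they are; only the work of the failed node moves, so the topology as a whole keeps running.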

Low Latency

Storm is a real-time computing system designed to process each incoming record with low latency.
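
Low latency in a streaming system comes from handling each record as it arrives rather than waiting for a complete batch. The sketch below imitates that style with Python generators. The spout and bolt names borrow Storm's vocabulary, but the code is an illustration only and does not use the Storm API.

```python
from collections import Counter


def sentence_spout(sentences):
    """Source of the stream: emits one tuple (here, a sentence) at a time."""
    for sentence in sentences:
        yield sentence


def split_bolt(stream):
    """Split each incoming sentence into lower-cased words."""
    for sentence in stream:
        for word in sentence.lower().split():
            yield word


def count_bolt(stream):
    """Keep a running count and emit the updated count after every word."""
    counts = Counter()
    for word in stream:
        counts[word] += 1
        yield word, counts[word]


pipeline = count_bolt(split_bolt(sentence_spout(["big data", "big clusters"])))
updates = list(pipeline)
print(updates)
```

Each word produces an updated count immediately, so a consumer sees results while the input is still arriving.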

Apache Zookeeper

Role in Hadoop Ecosystem

Apache Zookeeper is a coordination service that manages the various services running in a distributed environment such as a Hadoop cluster.
Prior to Zookeeper, coordinating services was time-consuming and complex because of problems with synchronization and configuration maintenance.

Benefits of Zookeeper

Simplifies synchronization, configuration maintenance, grouping, and naming, which saves time and improves efficiency.
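
The four services listed above can be pictured with a small in-memory stand-in. The class below is a toy sketch and not the ZooKeeper API: it assumes a flat dictionary of paths, and every name in it (`MiniCoordinator`, `try_lock`, the paths, and the session ids) is invented for illustration.

```python
class MiniCoordinator:
    """Toy in-memory stand-in for a coordination service (not ZooKeeper)."""

    def __init__(self):
        self._nodes = {}  # path -> (data, owning session or None)

    def create(self, path, data=b"", ephemeral_owner=None):
        if path in self._nodes:
            raise KeyError(f"{path} already exists")
        self._nodes[path] = (data, ephemeral_owner)

    def get(self, path):
        return self._nodes[path][0]

    def children(self, parent):
        prefix = parent.rstrip("/") + "/"
        return sorted(
            path[len(prefix):]
            for path in self._nodes
            if path.startswith(prefix) and "/" not in path[len(prefix):]
        )

    def close_session(self, session):
        """Ending a session removes the ephemeral nodes it created."""
        self._nodes = {p: v for p, v in self._nodes.items() if v[1] != session}


def try_lock(coordinator, path, session):
    """Synchronization: only the first creator of the path gets the lock."""
    try:
        coordinator.create(path, ephemeral_owner=session)
        return True
    except KeyError:
        return False


zk = MiniCoordinator()
zk.create("/config/batch_size", b"500")                 # configuration and naming
zk.create("/workers/w1", ephemeral_owner="session-1")   # group membership
zk.create("/workers/w2", ephemeral_owner="session-2")

members_before = zk.children("/workers")
zk.close_session("session-1")                           # worker w1 disappears
members_after = zk.children("/workers")

lock_results = [
    try_lock(zk, "/locks/job", "session-2"),
    try_lock(zk, "/locks/job", "session-3"),
]
print(members_before, members_after, lock_results)
```

Keeping this shared state in one well-defined service is what spares each distributed application from implementing its own synchronization and configuration handling.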

Apache Sqoop

Data Import and Export

Sqoop is an Apache project that lets users import data from relational databases into Hadoop for processing and export the analysis results back to the database.

Import Process

The import process runs a MapReduce job that connects to the database (e.g., MySQL) and reads data from its tables.
By default, it runs four map tasks in parallel to speed up the import, and each task writes its share of the rows to a separate file in the same directory.
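
How a table might be divided among the four map tasks can be sketched as follows. This is an illustration of range-based splitting on a numeric key, not Sqoop's actual implementation: the `orders` table, the `id` column, and the even split are assumptions, while the `part-m-NNNNN` file names follow the usual naming of MapReduce map outputs.

```python
def split_ranges(min_id, max_id, num_mappers=4):
    """Divide the inclusive key range [min_id, max_id] into contiguous chunks."""
    total = max_id - min_id + 1
    base, extra = divmod(total, num_mappers)
    ranges = []
    start = min_id
    for i in range(num_mappers):
        size = base + (1 if i < extra else 0)
        if size == 0:
            continue  # fewer rows than mappers: skip empty splits
        ranges.append((start, start + size - 1))
        start += size
    return ranges


# Each map task reads its own slice of the table and writes its own file.
plan = [
    (f"part-m-{i:05d}", f"SELECT * FROM orders WHERE id BETWEEN {lo} AND {hi}")
    for i, (lo, hi) in enumerate(split_ranges(1, 1000))
]

for filename, query in plan:
    print(filename, "<-", query)
```

Because the slices do not overlap, the tasks can run in parallel without coordinating with each other, which is where the speed-up comes from.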

Hadoop Overview

Introduction to Hadoop

Hadoop is an open-source framework, written in Java and overseen by the Apache Software Foundation, for storing and processing large data sets on clusters of commodity hardware.

Components of Hadoop

Consists of the following main components:
- Hadoop Distributed File System (HDFS) for storage
- YARN (Yet Another Resource Negotiator) for cluster resource management
- MapReduce for batch processing

History of Hadoop

Year	Event
2002	Doug Cutting and Mike Cafarella start working on Apache Nutch and run into the challenge of storing and processing very large data sets.
2003	Google publishes its paper on GFS (Google File System), describing distributed storage for efficient data access.
2004	Google publishes its paper on MapReduce, a model that simplifies large-scale data processing.
2005	Nutch adopts NDFS (Nutch Distributed File System) together with a MapReduce implementation.
2006	Cutting joins Yahoo; Hadoop is introduced with HDFS, and the first version, 0.1.0, is released.
2007	Yahoo operates two clusters of 1000 machines.
2008	Hadoop sorts 1 terabyte of data on a cluster of about 900 nodes in 209 seconds.
2013	Release of Hadoop 2.2.
2017	Release of Hadoop 3.0.

Hadoop Distributions

Many companies have developed their own commercial distributions of Hadoop, including:
- Cloudera Hadoop Distribution
- Hortonworks Hadoop Distribution
- MapR Hadoop Distribution
- Pivotal HD

Criteria to Evaluate Hadoop Distributions

Criteria	Description
Performance	Emphasis on low latency and raw performance. Early projects focused on throughput, but current workloads also need real-time capabilities.
Scalability	Ability to scale across nodes, tables, and files without heavy administrative burden or excessive cost.
Reliability	Hadoop is fault-tolerant: because data is replicated, it remains reliable even when nodes fail.

Real-time Use Case

Example: An online gaming company that uses Hadoop to track millions of users and billions of events can increase revenue by analyzing the streaming data in real time and giving players timely advice.

Conclusion on Reliability

Hadoop detects and handles faults reliably. It stores data across the machines of a cluster so that the data remains accessible even when individual nodes fail.
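
Why replication keeps data accessible can be shown with a small simulation. The sketch assumes a replication factor of 3 (the HDFS default) and a simple rotating placement; real HDFS placement is rack-aware and more involved, and the node names and function names here are hypothetical.

```python
def place_blocks(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes by simple rotation."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


def readable_blocks(placement, failed):
    """Blocks that still have at least one replica on a live node."""
    return {
        block
        for block, holders in placement.items()
        if any(node not in failed for node in holders)
    }


nodes = ["dn1", "dn2", "dn3", "dn4", "dn5"]
placement = place_blocks(4, nodes)

one_down = readable_blocks(placement, failed={"dn2"})
two_down = readable_blocks(placement, failed={"dn1", "dn2"})
three_down = readable_blocks(placement, failed={"dn1", "dn2", "dn3"})

print(placement)
print(one_down, two_down, three_down)
```

With three replicas, every block survives the loss of any two nodes in this model; a block becomes unreadable only when all three nodes holding it fail at once.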

🗄️ Big Data Concepts

Data Warehouse Software

Which of the following is the data warehouse software that provides query and management of large data sets stored in a distributed environment?

A. Sqoop
B. Hive (Correct Answer)
C. Zookeeper
D. HBase

Activity 1.1: Understanding Case Study of DiDi

Questions and Answers

Who introduced Hadoop?
- A. James Gosling
- B. Bjarne Stroustrup
- C. Dennis MacAlistair Ritchie
- D. Doug Cutting (Correct Answer)
Which of the following is not a common processing tool for big data?
- A. Hive
- B. Zookeeper (Correct Answer)
- C. HBase
- D. ETL
Which of the following descriptions of Hadoop is incorrect?
- A. The core design of the Hadoop framework is HDFS and MapReduce.
- B. Hadoop can dynamically move data between nodes and keep the nodes dynamically balanced.
- C. Hadoop automatically saves multiple copies of data and automatically redistributes failed tasks.
- D. Hadoop can only store text files. (Correct Answer)
Apache ____________ coordinates with various services in a distributed environment.
- A. Zookeeper (Correct Answer)
- B. HBase
- C. Hadoop
- D. Hive
Which of the following provides a database system between NoSQL and RDBMS?
- A. Hive
- B. HBase (Correct Answer)
- C. Sqoop
- D. Storm

Key Concepts of Big Data

Definition of Big Data:

“A collection of data that is huge in volume, yet growing exponentially with time.”

Examples of Big Data

NYSE
Social Media
Travel

Types of Big Data

Structured
Unstructured
Semi-structured

The Four Vs of Big Data

V	Description
Volume	The scale of data
Variety	The different types of data
Value	The worth of the data stored (some formulations, including the 4 V’s table earlier in these notes, list Veracity as the fourth V instead)
Velocity	The speed at which data is generated and processed

Big Data Ecosystem Tools

The following tools are commonly used within the Big Data ecosystem:

Tool	Purpose
Hadoop	Framework for distributed storage and processing
HBase	NoSQL database that runs on top of HDFS
Hive	Data warehouse software for querying and managing data
Storm	Real-time computation system
Zookeeper	Coordination service for distributed applications
Sqoop	Tool for transferring data between Hadoop and relational databases
Mahout	Library for scalable machine learning algorithms

Variants of Hadoop (Distributions)

Cloudera Hadoop Distribution
Hortonworks Hadoop Distribution
MapR Hadoop Distribution
Pivotal HD

Criteria to Evaluate Hadoop Distribution

When evaluating Hadoop distributions, consider the following criteria:

Criteria	Description
Performance	Speed and efficiency in handling data
Scalability	Ability to grow with increasing data volumes
Reliability	Consistency and dependability of the system