📊 Big Data Concept
What is Big Data?
“Big Data refers to information whose volume is too large for current mainstream software tools to capture, manage, process, and organize, within a reasonable time, into a form that helps enterprises make better business decisions.”
Characteristics of Big Data
- Volume: The enormous size of the data, which is crucial in determining its value.
- Variety: Heterogeneous sources and types of data, both structured and unstructured.
- Velocity: The speed at which data is generated and processed.
- Veracity: The quality of the content being analyzed.
Examples of Big Data
| Source | Description |
|---|---|
| NYSE | Generates about 1 terabyte of new trade data per day. |
| Social Media | Facebook ingests over 500 terabytes of new data daily. |
| Travel | A single jet engine can generate over 10 terabytes of data in 30 minutes. |
Types of Big Data
Structured Data
- Definition: Data that can be stored, accessed, and processed in a fixed format.
- Example: An ‘Employee’ table in a database.
Unstructured Data
- Definition: Data with unknown form or structure, which makes processing and value extraction difficult.
- Example: The output from Google Search.
Semi-Structured Data
- Definition: Data that combines structured and unstructured forms: it is organized by tags or markers but has no fixed schema (e.g., no table definition in a relational DBMS).
- Example: Personal data in an XML file.
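The difference between structured and semi-structured data can be made concrete with a minimal sketch. The `Employee` rows, the XML snippet, and the helper `parse_people` are all hypothetical illustrations, not data from any real system. The table rows share one fixed set of columns, while the XML records are tagged but do not all carry the same fields.

```python
import xml.etree.ElementTree as ET

# Structured: every row follows the same fixed schema (hypothetical Employee table).
EMPLOYEE_COLUMNS = ("employee_id", "name", "department")
employee_rows = [
    (1, "Alice", "Finance"),
    (2, "Bob", "Engineering"),
]

# Semi-structured: tags describe the data, but records need not share the same fields.
PEOPLE_XML = """
<people>
  <person><name>Alice</name><age>35</age></person>
  <person><name>Bob</name><email>bob@example.com</email></person>
</people>
"""


def parse_people(xml_text):
    """Return one dict per <person>, keeping whatever fields each record has."""
    root = ET.fromstring(xml_text.strip())
    return [
        {child.tag: child.text for child in person}
        for person in root.findall("person")
    ]


people = parse_people(PEOPLE_XML)
for person in people:
    print(sorted(person))  # field names differ from record to record
```

Every tuple in `employee_rows` can be read by position because the schema is fixed. Each parsed XML record has to be inspected to find out which fields it contains.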
The 4 V’s of Big Data
| Characteristic | Description |
|---|---|
| Volume | The size of data, which plays a crucial role in determining its value. |
| Variety | The different types of data sources, including emails, photos, videos, and more. |
| Velocity | The speed at which data is generated from various sources like social media and sensors. |
| Veracity | The quality of data analyzed; high-veracity data is valuable, while low-veracity data contains noise. |
Advantages of Big Data Processing
| Benefit | Description |
|---|---|
| Utilizing Outside Intelligence | Access to social data allows organizations to refine their business strategies. |
| Improved Customer Service | New systems using Big Data technologies enhance feedback evaluation. |
| Early Risk Identification | Enables businesses to identify product/service risks early. |
| Better Operational Efficiency | Facilitates the creation of staging areas for data before processing. |
Hadoop Ecosystem
Key Tools and Features
HBase:
- An open-source, non-relational distributed database (NoSQL).
- Modeled after Google’s BigTable, capable of handling large datasets.
- Runs on HDFS and provides fault-tolerant storage for sparse data.
Apache Hive:
- Data warehouse software built on Apache Hadoop for querying large datasets.
- Converts SQL queries into MapReduce jobs for data processing.
- Drawbacks include a lack of transaction support and slow query speed.
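To make the idea of turning a SQL query into a MapReduce job more concrete, here is a minimal sketch in plain Python. It does not use Hive or Hadoop; the function names and the sample rows are hypothetical. It mimics how a query such as `SELECT department, COUNT(*) FROM employee GROUP BY department` could be expressed as a map phase, a shuffle, and a reduce phase.

```python
from collections import defaultdict


def map_phase(rows):
    """Map: emit a (department, 1) pair for every input row."""
    for row in rows:
        yield row["department"], 1


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: sum the values for each key, giving COUNT(*) per department."""
    return {key: sum(values) for key, values in grouped.items()}


def count_by_department(rows):
    """Run the three phases in order on an in-memory list of rows."""
    return reduce_phase(shuffle(map_phase(rows)))


# Hypothetical sample rows standing in for an 'employee' table.
employees = [
    {"name": "Alice", "department": "Finance"},
    {"name": "Bob", "department": "Engineering"},
    {"name": "Carol", "department": "Finance"},
]
print(count_by_department(employees))
```

In a real cluster the map and reduce phases run in parallel on many machines; this single-process sketch only shows the shape of the computation.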
Apache Storm:
- A distributed real-time computing system for processing streaming data.
- Capable of achieving high processing rates, such as one million computations per second.
Main Features of Hadoop
- Designed for storing huge datasets across clusters of commodity hardware.
- Employs a distributed approach for handling massive volumes of information.
Summary of Hadoop Ecosystem Tools
| Tool | Description |
|---|---|
| HBase | NoSQL database for large datasets, with fault-tolerant storage. |
| Hive | Data warehouse for managing and querying large datasets. |
| Storm | Real-time computing system for streaming data processing. |
🖥️ Apache Storm and Hadoop
Apache Storm
Scalability
- Storm is designed to scale: users add machines and adjust the corresponding topology settings.
Cluster Coordination
- Uses Apache ZooKeeper for cluster coordination, which keeps large clusters operating reliably.
Fault Tolerance
“Once the topology is submitted, Storm runs it until the topology is killed or shut down. If an error occurs during execution, Storm reassigns tasks.”
- Because work is distributed, the failure of a single node does not stop the application, which improves fault tolerance.
Low Latency
- As a real-time computing system, Storm is designed to process information with low latency.
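The way a Storm topology processes a stream one item at a time can be sketched with plain Python generators. This is only an illustration and does not use the Storm API. Storm calls its stream sources spouts and its processing steps bolts, and the function names below borrow those terms, but the functions themselves and the sample sentences are hypothetical.

```python
from collections import Counter


def sentence_spout(sentences):
    """Source of the stream: emits one sentence at a time."""
    for sentence in sentences:
        yield sentence


def split_bolt(stream):
    """Processing step: splits each incoming sentence into lowercase words."""
    for sentence in stream:
        for word in sentence.lower().split():
            yield word


def count_bolt(stream):
    """Processing step: keeps a running count and emits it after every word."""
    counts = Counter()
    for word in stream:
        counts[word] += 1
        yield word, counts[word]


def run_topology(sentences):
    """Wire the steps together and collect everything the last step emits."""
    return list(count_bolt(split_bolt(sentence_spout(sentences))))


for word, running_count in run_topology(["big data", "big storm"]):
    print(word, running_count)
```

Each word is counted as soon as it arrives rather than after the whole input has been collected, which is the property that distinguishes stream processing from batch processing.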
Apache ZooKeeper
Role in Hadoop Ecosystem
- Apache ZooKeeper acts as a coordinator for Hadoop jobs, managing various services in a distributed environment.
- Before ZooKeeper, coordinating services was time-consuming and complex because of problems with synchronization and configuration maintenance.
Benefits of ZooKeeper
- Simplifies synchronization, configuration maintenance, grouping, and naming, which saves time and improves efficiency.
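The kind of shared configuration and naming that a coordination service provides can be illustrated with a toy in-memory stand-in. The class `ToyCoordinator` and all of its methods are hypothetical and are not the ZooKeeper client API; the sketch only assumes the general idea that entries live under hierarchical paths and that entries registered by a worker's session disappear when that session ends.

```python
class ToyCoordinator:
    """In-memory stand-in for a coordination service (not the real ZooKeeper API)."""

    def __init__(self):
        # Maps path -> (value, owning session or None for permanent entries).
        self._nodes = {}

    def create(self, path, value, session=None):
        """Register a value under a path; refuse to overwrite an existing entry."""
        if path in self._nodes:
            raise KeyError(f"{path} already exists")
        self._nodes[path] = (value, session)

    def get(self, path):
        """Read the value stored at a path."""
        return self._nodes[path][0]

    def list_under(self, prefix):
        """Return all paths below a prefix, sorted."""
        prefix = prefix.rstrip("/") + "/"
        return sorted(path for path in self._nodes if path.startswith(prefix))

    def close_session(self, session):
        """Drop every entry owned by the given session."""
        self._nodes = {
            path: entry for path, entry in self._nodes.items() if entry[1] != session
        }


coordinator = ToyCoordinator()
coordinator.create("/config/batch_size", "128")              # shared configuration
coordinator.create("/workers/w1", "host-a", session="s1")   # naming and grouping
coordinator.create("/workers/w2", "host-b", session="s2")
coordinator.close_session("s1")                              # worker w1 goes away
print(coordinator.list_under("/workers"))
```

Every service reads configuration and discovers live workers from one place instead of each implementing its own synchronization logic.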
Apache Sqoop
Data Import and Export
- Sqoop is an Apache project that lets users import data from relational databases into Hadoop for processing and export the analysis results back to the database.
Import Process
- The import process runs a MapReduce job that connects to the database (e.g., MySQL) and reads data from its tables.
- By default, it runs four map tasks to speed up the import, with each task writing to a separate file within the same directory.
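How an import might be divided among four map tasks can be sketched as follows. This is a simplified illustration rather than Sqoop's actual code: it assumes the table has a numeric key whose minimum and maximum are known, and the helper names `split_ranges` and `output_files` are hypothetical. The `part-m-` file naming follows the usual convention for map-only job output.

```python
def split_ranges(min_id, max_id, num_mappers=4):
    """Divide the inclusive key range [min_id, max_id] into contiguous
    sub-ranges, one per map task, with sizes differing by at most one."""
    total = max_id - min_id + 1
    base, extra = divmod(total, num_mappers)
    ranges = []
    start = min_id
    for i in range(num_mappers):
        size = base + (1 if i < extra else 0)
        if size == 0:
            continue  # fewer rows than map tasks: skip the empty split
        end = start + size - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


def output_files(ranges):
    """Each map task writes its own file in the same target directory."""
    return [f"part-m-{i:05d}" for i in range(len(ranges))]


ranges = split_ranges(1, 1000)
for (low, high), name in zip(ranges, output_files(ranges)):
    print(f"{name}: WHERE id >= {low} AND id <= {high}")
```

Because the ranges do not overlap, the map tasks can read from the database in parallel without duplicating rows.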
Hadoop Overview
Introduction to Hadoop
- Hadoop is an open-source framework overseen by the Apache Software Foundation. It is written in Java and is used for storing and processing large datasets on clusters of commodity hardware.
Components of Hadoop
- Consists of two main components:
  - Hadoop Distributed File System (HDFS), for storage
  - YARN (Yet Another Resource Negotiator), for resource management
History of Hadoop
| Year | Event |
|---|---|
| 2002 | Doug Cutting and Mike Cafarella start working on Apache Nutch and run into big data challenges. |
| 2003 | Google publishes a paper describing GFS (Google File System), its distributed file system for efficient data access. |
| 2004 | Google publishes a paper on MapReduce, which simplifies large-scale data processing. |
| 2005 | Nutch introduces NDFS (Nutch Distributed File System) together with a MapReduce implementation. |
| 2006 | Cutting joins Yahoo and introduces Hadoop with HDFS; the first version, 0.1.0, is released. |
| 2007 | Yahoo operates two clusters of 1000 machines. |
| 2008 | Hadoop sorts 1 terabyte of data on a cluster of roughly 900 nodes in 209 seconds. |
| 2013 | Release of Hadoop 2.2. |
| 2017 | Release of Hadoop 3.0. |
Hadoop Distributions
- Many companies have developed their own commercial distributions of Hadoop, including:
  - Cloudera Hadoop Distribution
  - Hortonworks Hadoop Distribution
  - MapR Hadoop Distribution
  - Pivotal HD
Criteria to Evaluate Hadoop Distributions
| Criteria | Description |
|---|---|
| Performance | Emphasis on low latency and raw performance. Early projects focused on throughput, but current needs require real-time capabilities. |
| Scalability | Ability to scale across nodes, tables, and files without heavy administrative burdens or excessive costs. |
| Reliability | Data replication makes Hadoop fault-tolerant, so data remains reliable even when nodes fail. |
Real-time Use Case
- Example: An online gaming company that uses Hadoop to track millions of users and billions of events can analyze the streaming data in real time and increase revenue by giving players timely advice.
Conclusion on Reliability
- Hadoop detects and handles faults reliably. It stores data across the machines of a cluster so that the data remains accessible even when nodes fail.
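Why replicated data stays accessible after node failures can be illustrated with a small sketch. It assumes a replication factor of 3, which is the HDFS default, and uses a simple round-robin placement rather than HDFS's real rack-aware policy. The node names, block names, and helper functions are hypothetical.

```python
def place_replicas(block_ids, nodes, replication=3):
    """Assign each block to `replication` distinct nodes in round-robin order."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for i, block in enumerate(block_ids):
        placement[block] = [
            nodes[(i + k) % len(nodes)] for k in range(replication)
        ]
    return placement


def readable_blocks(placement, failed_nodes):
    """A block stays readable while at least one of its replicas is on a live node."""
    failed = set(failed_nodes)
    return {
        block
        for block, replicas in placement.items()
        if any(node not in failed for node in replicas)
    }


nodes = ["n1", "n2", "n3", "n4"]
blocks = ["b0", "b1", "b2", "b3"]
placement = place_replicas(blocks, nodes)

# With two of four nodes down, every block still has a surviving replica.
print(sorted(readable_blocks(placement, ["n1", "n2"])))
# With three nodes down, the block stored only on those three is lost.
print(sorted(readable_blocks(placement, ["n1", "n2", "n3"])))
```

In a real cluster the system also re-replicates blocks from the surviving copies after a failure, which this sketch leaves out.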
🗄️ Big Data Concepts
Data Warehouse Software
Which of the following is data warehouse software for querying and managing large datasets stored in a distributed environment?
- A. Sqoop
- B. Hive (Correct Answer)
- C. ZooKeeper
- D. HBase
Activity 1.1: Understanding the DiDi Case Study
Questions and Answers
Who introduced Hadoop?
- A. James Gosling
- B. Bjarne Stroustrup
- C. Dennis MacAlistair Ritchie
- D. Doug Cutting (Correct Answer)
Which of the following is not a common processing tool for big data?
- A. Hive
- B. ZooKeeper (Correct Answer)
- C. HBase
- D. ETL
Which of the following statements about Hadoop is incorrect?
- A. The core design of the Hadoop framework is HDFS and MapReduce.
- B. Hadoop can dynamically move data between nodes and keep the load on each node balanced.
- C. Hadoop automatically saves multiple copies of data and automatically reassigns failed tasks.
- D. Hadoop can only store text files. (Correct Answer)
Apache ____________ coordinates with various services in a distributed environment.
- A. ZooKeeper (Correct Answer)
- B. HBase
- C. Hadoop
- D. Hive
Which of the following provides a database system between NoSQL and RDBMS?
- A. Hive
- B. HBase (Correct Answer)
- C. Sqoop
- D. Storm
Key Concepts of Big Data
- Definition of Big Data: “A collection of data that is huge in volume, yet growing exponentially with time.”
Examples of Big Data
- NYSE
- Social Media
- Travel
Types of Big Data
- Structured
- Unstructured
- Semi-structured
The Four Vs of Big Data
| V | Description |
|---|---|
| Volume | The scale of data |
| Variety | The different types of data |
| Value | The worth of the data stored |
| Velocity | The speed at which data is generated and processed |
Big Data Ecosystem Tools
The following tools are commonly used within the Big Data ecosystem:
| Tool | Purpose |
|---|---|
| Hadoop | Framework for distributed storage and processing |
| HBase | NoSQL database that runs on top of HDFS |
| Hive | Data warehouse software for querying and managing data |
| Storm | Real-time computation system |
| ZooKeeper | Coordination service for distributed applications |
| Sqoop | Tool for transferring data between Hadoop and relational databases |
| Mahout | Library for scalable machine learning algorithms |
Variants of Hadoop (Distributions)
- Cloudera Hadoop Distribution
- Hortonworks Hadoop Distribution
- MapR Hadoop Distribution
- Pivotal HD
Criteria to Evaluate Hadoop Distribution
When evaluating Hadoop distributions, consider the following criteria:
| Criteria | Description |
|---|---|
| Performance | Speed and efficiency in handling data |
| Scalability | Ability to grow with increasing data volumes |
| Reliability | Consistency and dependability of the system |