NoSQL Chapter 4 - NoSQL - Big Data | Dumpling's Blog = My Port

📊 MongoDB Aggregation Framework

📊 MongoDB 聚合框架

Purpose: Enables complex data analysis and transformation similar to SQL’s GROUP BY and JOIN.
目的：实现复杂的数据分析和转换，类似于 SQL 的 GROUP BY 和 JOIN。
Pipeline Approach: Data flows through multiple stages, each performing a specific operation.
管道方法：数据流经多个阶段，每个阶段执行特定的操作。

Common Stages:

常用阶段：

$match: Filters documents.
$match：过滤文档。
$group: Groups documents by a specific key.
$group：按特定键对文档进行分组。
$project: Reshapes documents.
$project：重塑文档。
$sort: Orders results.
$sort：对结果进行排序。
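The four stages above can be combined into a single pipeline. The following is a minimal sketch, assuming a `companies` collection whose documents carry `founded_year` and `number_of_employees` (field names used elsewhere in these notes); the year threshold and the output field name `avg_employees` are illustrative. The pipeline is a plain array, so it can be built and inspected before it is handed to the shell.

```javascript
// Pipeline sketch: filter, group, sort, then reshape.
const pipeline = [
  // Keep only companies founded in or after 2000 (illustrative threshold).
  { $match: { founded_year: { $gte: 2000 } } },
  // One output document per founding year, with the average headcount.
  { $group: { _id: "$founded_year", avg_employees: { $avg: "$number_of_employees" } } },
  // Largest average first.
  { $sort: { avg_employees: -1 } },
  // Expose the grouping key under a friendlier field name.
  { $project: { _id: 0, founded_year: "$_id", avg_employees: 1 } }
];

// In the MongoDB shell this would run as: db.companies.aggregate(pipeline)
const stageNames = pipeline.map((stage) => Object.keys(stage)[0]);
console.log(stageNames.join(" -> "));
```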

🔄 Transactions in MongoDB

🔄 MongoDB 中的事务

Transaction: A logical group of operations ensuring data integrity.

事务：确保数据完整性的逻辑操作组。

ACID Compliance: Ensures Atomicity, Consistency, Isolation, Durability.
ACID 合规性：确保原子性、一致性、隔离性、持久性。
Use Case: In e-commerce, when a customer places an order:
用例：在电子商务中，当客户下订单时：
- Deduct inventory.
- 扣减库存。
- Record order details.
- 记录订单详情。
- Update order history.
- 更新订单历史。
Rollback Mechanism: If any operation fails, the transaction is aborted and none of its changes are applied.
回滚机制：如果任何操作失败，事务将被中止，其所有更改都不会生效。
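The order workflow above might be sketched as one transaction like this. The sketch assumes a mongosh-style session object (`startTransaction`, `commitTransaction`, `abortTransaction`, `getDatabase`); the database name `shop`, the collection names, and the fields of `order` are all hypothetical.

```javascript
// Sketch of the order workflow as a single transaction.
// "shop", "inventory", "orders", "order_history" and the order fields are hypothetical.
function placeOrder(session, order) {
  const shop = session.getDatabase("shop");
  session.startTransaction();
  try {
    // 1. Deduct inventory.
    shop.getCollection("inventory").updateOne(
      { sku: order.sku },
      { $inc: { qty: -order.qty } }
    );
    // 2. Record order details.
    shop.getCollection("orders").insertOne(order);
    // 3. Update order history.
    shop.getCollection("order_history").updateOne(
      { customer_id: order.customer_id },
      { $push: { orders: order.order_id } },
      { upsert: true }
    );
    session.commitTransaction();
  } catch (error) {
    // Any failure aborts the transaction, so none of the three writes is kept.
    session.abortTransaction();
    throw error;
  }
}
```

In the shell, a session would typically come from `db.getMongo().startSession()` and be closed with `session.endSession()` afterwards.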

⚙️ Replication

⚙️ 复制

Replication: Keeping identical data copies across multiple servers for high availability and safety.

复制：在多个服务器上保留相同的数据副本，以实现高可用性和安全性。

Recommended: For all production deployments.
建议：用于所有生产部署。
Replica Set: A configuration (e.g., ecommerceReplicaSet) with multiple members to ensure data redundancy and failover support.
副本集：一种配置（例如 ecommerceReplicaSet），包含多个成员以确保数据冗余和故障转移支持。

📈 Aggregation Pipeline Details

📈 聚合管道详情

Concept: A sequence of processing stages; each stage receives documents, applies one operation, and passes its output to the next stage.
概念：一系列处理阶段；每个阶段接收文档、执行一项操作，并将输出传递给下一个阶段。
Tunable Stages: Each stage can be parameterized with operators to modify fields or perform arithmetic operations.
可调阶段：每个阶段都可以使用操作符进行参数化，以修改字段或执行算术运算。

Example of Aggregation Pipeline:

聚合管道示例：

Initial Filter: Use $match to filter documents.
初始过滤：使用 $match 过滤文档。
Further Processing: Apply additional filters or transformations later in the pipeline.
进一步处理：在管道的后续阶段应用额外的过滤器或转换。

Sample Query:

示例查询：

To find companies founded in 2004:

查找成立于 2004 年的公司：


db.companies.aggregate([
  { $match: { founded_year: 2004 } }
])

📋 Example Output Transformation:

📋 示例输出转换：

Adding a project stage to limit output fields:

添加一个 project 阶段来限制输出字段：

db.companies.aggregate([
  { $match: { founded_year: 2004 } },
  { $project: { _id: 0, name: 1, founded_year: 1 } }
])

📚 Key Takeaways:

📚 关键要点：

Aggregation Framework: Essential for complex data processing and analytics.
聚合框架：对于复杂数据处理和分析至关重要。
Transactions: Crucial for maintaining data integrity in multi-document operations.
事务：对于在多文档操作中维护数据完整性至关重要。
Replication: Vital for ensuring data availability and fault tolerance in production environments.
复制：对于确保生产环境中的数据可用性和容错性至关重要。

🛠️ Aggregation Framework Overview

🛠️ 聚合框架概述

Aggregation — Process of transforming data into a summary format
聚合 — 将数据转换为摘要格式的过程
Pipeline — A sequence of data processing stages
管道 — 数据处理阶段的序列

📋 Aggregation Pipeline Stages

📋 聚合管道阶段

Match Stage
Match 阶段
- Filters documents based on criteria.
- 根据条件过滤文档。
- Example: {$match: {founded_year: 2004}}
- 示例：{$match: {founded_year: 2004}}
Project Stage
Project 阶段
- Reshapes documents and selects fields.
- 重塑文档并选择字段。
- Example: {$project: {_id: 0, name: 1}}
- 示例：{$project: {_id: 0, name: 1}}
Limit Stage
Limit 阶段
- Restricts the number of documents returned.
- 限制返回的文档数量。
- Example: {$limit: 5}
- 示例：{$limit: 5}
Sort Stage
Sort 阶段
- Orders documents based on specified fields.
- 根据指定字段对文档进行排序。
- Example: {$sort: {name: 1}} (ascending order)
- 示例：{$sort: {name: 1}} (升序)
Skip Stage
Skip 阶段
- Skips a specified number of documents.
- 跳过指定数量的文档。
- Example: {$skip: 10}
- 示例：{$skip: 10}
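Sort, skip, and limit are often used together for paging. A small sketch, assuming results are ordered by `name`; the helper `pageStages` and its parameters are hypothetical, not part of MongoDB.

```javascript
// Build the stages for one page of results, ordered by name.
// pageNumber is 1-based; both parameters are illustrative.
function pageStages(pageNumber, pageSize) {
  return [
    { $sort: { name: 1 } },                  // fix the order first
    { $skip: (pageNumber - 1) * pageSize },  // drop the earlier pages
    { $limit: pageSize }                     // keep one page
  ];
}

// Page 3 with 5 documents per page skips the first 10 documents.
console.log(JSON.stringify(pageStages(3, 5)));
```

The returned array could be appended to a `$match` stage and passed to `db.companies.aggregate(...)`.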

🔍 Aggregating Data Effectively

🔍 有效聚合数据

Order of Stages Matters:
阶段顺序很重要：
- Place the limit stage before the project stage so that $project reshapes only the documents that will be returned.
- 将 limit 阶段放在 project 阶段之前，这样 $project 只需重塑最终返回的文档。
- Sorting should occur before limiting if order is important.
- 如果顺序很重要，则应在限制之前进行排序。

Example Pipeline to Retrieve Company Names:

检索公司名称的示例管道：

db.companies.aggregate([
  {$match: {founded_year: 2004}},
  {$limit: 5},
  {$project: {_id: 0, name: 1}}
])
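To see why sorting must come before limiting when order matters, here is an in-memory illustration using plain JavaScript arrays rather than MongoDB itself; the company names are made up.

```javascript
// In-memory illustration (plain arrays, not MongoDB) of why stage order matters.
const companies = [
  { name: "Delta" }, { name: "Alpha" }, { name: "Charlie" }, { name: "Bravo" }
];
const byName = (a, b) => a.name.localeCompare(b.name);

// Sort, then limit: the first two names alphabetically.
const sortThenLimit = [...companies].sort(byName).slice(0, 2).map((c) => c.name);

// Limit, then sort: whichever two documents happened to come first, then sorted.
const limitThenSort = companies.slice(0, 2).sort(byName).map((c) => c.name);

console.log(sortThenLimit); // [ 'Alpha', 'Bravo' ]
console.log(limitThenSort); // [ 'Alpha', 'Delta' ]
```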

📊 Types of Expressions in Aggregation

📊 聚合中的表达式类型

Boolean Expressions: Use AND, OR, NOT.
布尔表达式：使用 AND、OR、NOT。
Set Expressions: Work with arrays (intersection, union).
集合表达式：处理数组（交集、并集）。
Comparison Expressions: Range filters.
比较表达式：范围过滤器。
Arithmetic Expressions: Basic math operations.
算术表达式：基本数学运算。
String Expressions: Text manipulation.
字符串表达式：文本操作。
Array Expressions: Handle and manipulate array data.
数组表达式：处理和操作数组数据。
Variable Expressions: Use literals and conditionals.
变量表达式：使用字面量和条件。
Accumulators: Calculate sums, averages, and statistics.
累加器：计算总和、平均值和统计数据。
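Several of these expression types can appear side by side. The sketch below assumes the `companies` fields named in these notes (`name`, `founded_year`, `category_code`, `funding_rounds`, `number_of_employees`); the output field names, the reference year 2024, the category value "web", and the headcount threshold are illustrative. Set expressions are not shown.

```javascript
// A $project stage mixing several expression types.
const projectStage = {
  $project: {
    _id: 0,
    // String expression: upper-case the company name.
    name_upper: { $toUpper: "$name" },
    // Arithmetic expression: company age relative to an illustrative year.
    age_in_2024: { $subtract: [2024, "$founded_year"] },
    // Comparison + boolean expressions: founded in 2004 or later AND in the "web" category.
    recent_web: {
      $and: [
        { $gte: ["$founded_year", 2004] },
        { $eq: ["$category_code", "web"] }
      ]
    },
    // Array expression: number of funding rounds (0 when the field is missing or null).
    round_count: { $size: { $ifNull: ["$funding_rounds", []] } },
    // Conditional expression: label by headcount.
    size_label: { $cond: [{ $gte: ["$number_of_employees", 100] }, "large", "small"] }
  }
};

// Accumulators live in $group: total and average headcount per category.
const groupStage = {
  $group: {
    _id: "$category_code",
    total_employees: { $sum: "$number_of_employees" },
    avg_employees: { $avg: "$number_of_employees" }
  }
};

console.log(Object.keys(projectStage.$project).length, Object.keys(groupStage.$group).length);
```

Each stage object could be placed in its own pipeline, for example `db.companies.aggregate([projectStage])`.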

🔧 Deep Dive: Project Stage Operations

🔧 深入探讨：Project 阶段操作

Can promote nested fields using dot notation:
可以使用点表示法提升嵌套字段：

db.companies.aggregate([
  {$match: {"funding_rounds.investments.financial_org.permalink": "greylock"}},
  {$project: {
    _id: 0,
    name: 1,
    ipo: "$ipo.pub_year",
    valuation: "$ipo.valuation_amount",
    funders: "$funding_rounds.investments.financial_org.permalink"
  }}
])

Example Document Structure:

示例文档结构：

Company Document:
公司文档：
- Fields: _id, name, category_code, founded_year, ipo, funding_rounds
字段：_id、name、category_code、founded_year、ipo、funding_rounds

📈 Key Takeaways

📈 关键要点

Optimize Aggregation Pipelines: Place limiting stages strategically to reduce processing load.
优化聚合管道：策略性地放置限制阶段以减少处理负载。
Understand Each Stage: Know the function of match, project, limit, sort, and skip for effective data query construction.
理解每个阶段：了解 match、project、limit、sort 和 skip 的功能，以有效地构建数据查询。
Use Expressions Wisely: Leverage various expressions to enhance querying capabilities and data manipulation.
明智地使用表达式：利用各种表达式来增强查询能力和数据操作。

📊 Aggregation Framework

📊 聚合框架

Introduction to Aggregation

聚合简介

Aggregation is a framework used to process data and return computed results.
聚合是一个用于处理数据并返回计算结果的框架。
It allows operations such as filtering, transforming, and combining data.
它允许进行诸如过滤、转换和组合数据等操作。

Key Components of Aggregation

聚合的关键组件

Match Stage: Filters documents based on specified criteria.
Match 阶段：根据指定条件过滤文档。
- Example: {$match: {"funding_rounds.investments.financial_org.permalink": "greylock"}}
- 示例：{$match: {"funding_rounds.investments.financial_org.permalink": "greylock"}}
Project Stage: Reshapes documents to include only specified fields.
Project 阶段：重塑文档以仅包含指定字段。
- Example: {$project: {name: 1, amount: "$funding_rounds.raised_amount"}}
- 示例：{$project: {name: 1, amount: "$funding_rounds.raised_amount"}}
Unwind Stage: Deconstructs an array field into separate documents, allowing each element to be processed individually.
Unwind 阶段：将数组字段分解为单独的文档，允许单独处理每个元素。
- Example: {$unwind: "$funding_rounds"}
- 示例：{$unwind: "$funding_rounds"}

Using the Unwind Stage

使用 Unwind 阶段

The Unwind Stage creates a document for each element in the specified array field.

Unwind 阶段为指定数组字段中的每个元素创建一个文档。

Example Aggregation Pipeline:

聚合管道示例：

db.companies.aggregate([
    {$match: {"funding_rounds.investments.financial_org.permalink": "greylock"}},
    {$unwind: "$funding_rounds"},
    {$project: {name: 1, amount: "$funding_rounds.raised_amount", year: "$funding_rounds.funded_year"}}
])
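For intuition, the effect of the $unwind and $project stages on a single document can be imitated in plain JavaScript. This is an in-memory illustration, not how MongoDB implements the stage; the helper `unwind` and the sample document are hypothetical.

```javascript
// In-memory illustration (not MongoDB itself) of what $unwind does to one document.
function unwind(doc, field) {
  const values = doc[field];
  // Like $unwind's default behaviour, a missing or null field yields no output.
  if (values === undefined || values === null) return [];
  // A non-array value is treated as a single element.
  if (!Array.isArray(values)) return [{ ...doc }];
  // One copy of the document per array element (an empty array yields nothing).
  return values.map((value) => ({ ...doc, [field]: value }));
}

const company = {
  name: "ExampleCo", // hypothetical sample document
  funding_rounds: [
    { raised_amount: 1000000, funded_year: 2005 },
    { raised_amount: 5000000, funded_year: 2007 }
  ]
};

// Equivalent of the $unwind + $project stages above, applied to one document.
const rows = unwind(company, "funding_rounds").map((d) => ({
  name: d.name,
  amount: d.funding_rounds.raised_amount,
  year: d.funding_rounds.funded_year
}));
console.log(rows);
```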

Array Expressions

数组表达式

Filter Expression: A way to select a subset of elements in an array based on specified criteria.

过滤器表达式：一种根据指定条件选择数组中元素子集的方法。

Example of usage:
用法示例：

{ $filter: {
    input: "$funding_rounds",
    as: "round",
    cond: { $gte: ["$$round.raised_amount", 100000000] }
}}
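A $filter expression is normally placed inside a stage such as $project. The sketch below shows one possible placement, followed by a plain-JavaScript analogue of the same condition for intuition only; the output field name `rounds`, the helper `bigRounds`, and the sample document are hypothetical.

```javascript
// $filter placed inside a $project stage (field names follow the notes above).
const projectBigRounds = {
  $project: {
    _id: 0,
    name: 1,
    rounds: {
      $filter: {
        input: "$funding_rounds",
        as: "round",
        cond: { $gte: ["$$round.raised_amount", 100000000] }
      }
    }
  }
};

// Plain-JavaScript analogue of the same condition.
function bigRounds(company) {
  return company.funding_rounds.filter((round) => round.raised_amount >= 100000000);
}

const sample = {
  name: "ExampleCo",
  funding_rounds: [{ raised_amount: 25000000 }, { raised_amount: 150000000 }]
};
console.log(bigRounds(sample).length); // 1
```

Unlike $unwind, $filter keeps one document per company and only trims the array inside it.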

Understanding the Output

理解输出

Output documents can have fields like name, amount, and year.
输出文档可以包含 name、amount 和 year 等字段。
After $unwind, each funding round yields its own output document.
经过 $unwind 之后，每个融资轮次都会产生一个单独的输出文档。

🚀 Key Terms

🚀 关键术语

Aggregation: Process of computing results from data.
聚合：从数据计算结果的过程。
Match: Filters documents.
Match：过滤文档。
Project: Reshapes output documents.
Project：重塑输出文档。
Unwind: Breaks down arrays into individual documents.
Unwind：将数组分解为单个文档。
Filter: Selects specific elements from an array.
Filter：从数组中选择特定元素。

📊 Aggregation Framework

📊 聚合框架

Overview of Aggregation

聚合概述

Aggregation is a way to process data and return computed results.
聚合是一种处理数据并返回计算结果的方法。
It is similar to SQL’s GROUP BY command, allowing for the combination of multiple documents to perform aggregate operations.
它类似于 SQL 的 GROUP BY 命令，允许组合多个文档以执行聚合操作。

Key Operators

关键操作符

$match

Filters documents based on specified criteria.
根据指定条件过滤文档。
Example: { $match: { "founded_year": 2010 } } selects documents whose founded_year is 2010.
示例：{ $match: { "founded_year": 2010 } } 选择 founded_year 为 2010 的文档。

$group

Groups documents by specified field(s) and performs aggregation.
按指定字段对文档进行分组并执行聚合。
Example:

示例：

{
  $group: {
    _id: { founded_year: "$founded_year" },
    average_number_of_employees: { $avg: "$number_of_employees" }
  }
}
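What this $group stage computes can be imitated in memory. The following is a simplified plain-JavaScript illustration, not MongoDB's implementation; the helper name and sample data are hypothetical.

```javascript
// In-memory illustration of grouping by founded_year and averaging number_of_employees.
// Simplification: groups with no numeric values are omitted here, whereas $group
// would still emit them with a null average.
function averageEmployeesByYear(companies) {
  const totals = new Map(); // founded_year -> { sum, count }
  for (const c of companies) {
    // $avg ignores non-numeric values, so skip them here as well.
    if (typeof c.number_of_employees !== "number") continue;
    const t = totals.get(c.founded_year) || { sum: 0, count: 0 };
    t.sum += c.number_of_employees;
    t.count += 1;
    totals.set(c.founded_year, t);
  }
  return [...totals].map(([year, t]) => ({
    _id: { founded_year: year },
    average_number_of_employees: t.sum / t.count
  }));
}

const sampleCompanies = [
  { founded_year: 2004, number_of_employees: 10 },
  { founded_year: 2004, number_of_employees: 30 },
  { founded_year: 2010, number_of_employees: 5 },
  { founded_year: 2010, number_of_employees: null }
];
console.log(averageEmployeesByYear(sampleCompanies));
```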

Using Array Operators

使用数组操作符

$arrayElemAt

Selects an element from an array at a specified index.
从数组中选择指定索引处的元素。
Example:

示例：

{ $project: {
    first_round: { $arrayElemAt: ["$funding_rounds", 0] },
    last_round: { $arrayElemAt: ["$funding_rounds", -1] }
  }
}

Example Output

输出示例

Output from an aggregation might resemble:

聚合的输出可能类似于：

{
  "name": "vufind",
  "founded_year": 2010,
  "first_round": { ... },
  "last_round": { ... }
}

🔄 Relationships and Aggregation

🔄 关系和聚合

Relationship Field

关系字段

Contains data about individuals associated with companies.
包含与公司相关的个人的数据。
Structure:

结构：

"relationships": [
  {
    "is_past": false,
    "title": "Founder and CEO",
    "person": { "first_name": "Mark", "last_name": "Zuckerberg" }
  },
  ...
]

Counting Relationships

计算关系数量

Example aggregation to count relationships:

计算关系数量的聚合示例：

db.companies.aggregate([
  { $match: { "relationships.person": { $ne: null } } },
  { $unwind: "$relationships" },
  { $group: {
      _id: "$relationships.person",
      count: { $sum: 1 }
    }
  },
  { $sort: { count: -1 } }
])

Sample Output

示例输出

The output lists persons and the count of their relationships:

输出列出人员及其关系计数：

{
  "_id": { "first_name": "Tim", "last_name": "H" },
  "count": 5
}

🗂️ Practical Applications

🗂️ 实际应用

Aggregation allows for valuable insights such as:
聚合可以提供有价值的见解，例如：
- Average metrics by group (e.g., average number of employees by founding year).
- 按组划分的平均指标（例如，按成立年份划分的平均员工人数）。
- Relationship dynamics (who is connected to many companies).
- 关系动态（谁与许多公司有联系）。

🗂️ MongoDB Aggregation Framework

🗂️ MongoDB 聚合框架

🧩 Aggregation Basics

🧩 聚合基础

Aggregation: Process of transforming data into a summary form.
聚合：将数据转换为摘要形式的过程。
Purpose: Analyze and report on data, such as sales and customer behavior.
目的：分析和报告数据，例如销售和客户行为。

🔍 Key Aggregation Stages

🔍 关键聚合阶段

$match: Filters documents based on specified criteria.
$match：根据指定条件过滤文档。
$group: Groups documents by specified fields, allowing for calculations.
$group：按指定字段对文档进行分组，允许进行计算。
$sort: Orders documents based on specified fields.
$sort：根据指定字段对文档进行排序。
$project: Reshapes documents by including or excluding fields.
$project：通过包含或排除字段来重塑文档。

⚙️ Transactions in MongoDB

⚙️ MongoDB 中的事务

📜 Definition of a Transaction

📜 事务的定义

A transaction is a logical unit of processing that includes one or more database operations, which either all succeed or all fail.

事务是一个逻辑处理单元，包含一个或多个数据库操作，确保完全完成或完全失败。

🔑 ACID Properties

🔑 ACID 属性

Atomicity: All operations in a transaction are completed or none are.
原子性：事务中的所有操作要么全部完成，要么全部不完成。
Consistency: Database moves from one valid state to another.
一致性：数据库从一个有效状态转换到另一个有效状态。
Isolation: Transactions run independently without interference.
隔离性：事务独立运行，互不干扰。
Durability: Once committed, changes persist despite failures.
持久性：一旦提交，即使发生故障，更改也会持久存在。

ACID Compliance:

ACID 合规性：

A database is ACID-compliant when it adheres to these properties, ensuring data integrity.

当数据库遵守这些属性时，即为 ACID 合规，从而确保数据完整性。

🛠️ Using Transactions in MongoDB

🛠️ 在 MongoDB 中使用事务

Transaction APIs:

事务 API：

API	Core API 核心 API	Callback API 回调 API
Transaction Start	Requires explicit start call	Automatically starts with a callback function
事务启动	需要显式启动调用	使用回调函数自动启动
Error Handling	Requires manual error handling	Automatically includes error-handling logic
错误处理	需要手动错误处理	自动包含错误处理逻辑
Session Handling	Requires explicit session parameter	Requires explicit session parameter
会话处理	需要显式会话参数	需要显式会话参数

🛒 Example Usage:

🛒 用法示例：

Core API Example:
核心 API 示例：
- Define operations for placing an order and updating inventory.
- 定义下订单和更新库存的操作。
Callback API Example:
回调 API 示例：
- Pass a function that includes transaction operations.

传递一个包含事务操作的函数。

🔄 Retry Logic in Transactions

🔄 事务中的重试逻辑

Implement retry logic to handle transient errors during transactions.
实现重试逻辑以处理事务期间的瞬时错误。
Key Functions:
关键函数：
- commit_with_retry(session): Handles commit attempts.
- commit_with_retry(session)：处理提交尝试。
- run_transaction_with_retry(txn_func, session): Runs transactions with retries on errors.
- run_transaction_with_retry(txn_func, session)：在出错时带重试运行事务。
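The two helpers named above might look roughly like this in JavaScript (camelCase versions of the same names). This is a sketch under assumptions: errors are assumed to expose an `errorLabels` array carrying the labels `TransientTransactionError` and `UnknownTransactionCommitResult`, `session` is assumed to offer a synchronous `commitTransaction()` as in the shell, and `hasErrorLabel` is a hypothetical helper. Production code would usually cap the number of attempts.

```javascript
// Hypothetical helper: does this error carry the given label?
function hasErrorLabel(error, label) {
  return error != null && Array.isArray(error.errorLabels) && error.errorLabels.includes(label);
}

// Retry only the commit when its outcome is unknown.
function commitWithRetry(session) {
  while (true) {
    try {
      session.commitTransaction();
      return;
    } catch (error) {
      if (hasErrorLabel(error, "UnknownTransactionCommitResult")) {
        console.log("Commit result unknown, retrying commit ...");
        continue;
      }
      throw error; // any other error is not retried
    }
  }
}

// Retry the whole transaction function on transient errors.
function runTransactionWithRetry(txnFunc, session) {
  while (true) {
    try {
      txnFunc(session); // starts the transaction, performs the writes, and commits
      return;
    } catch (error) {
      if (hasErrorLabel(error, "TransientTransactionError")) {
        console.log("Transient error, retrying transaction ...");
        continue;
      }
      throw error; // any other error is not retried
    }
  }
}
```

A transaction function passed as `txnFunc` would typically call `session.startTransaction()`, perform its writes, and finish with `commitWithRetry(session)`.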

🛠️ Transactions in MongoDB

🛠️ MongoDB 中的事务

Purpose of Transactions:
事务的目的：

Transactions ensure data integrity and atomicity for multiple operations.

事务确保多个操作的数据完整性和原子性。
Key Features:
主要特性：
- Provide consistency across multiple operations.
- 在多个操作之间提供一致性。
- Should be used sparingly, given the flexibility of MongoDB’s document model.
- 鉴于 MongoDB 文档模型的灵活性，应谨慎使用。

🔄 Replication in MongoDB

🔄 MongoDB 中的复制

Definition:
定义：

Replication is the process of keeping identical copies of data across multiple servers.

复制是在多个服务器上保留相同数据副本的过程。
Benefits:
优点：
- Enhances data availability and safety.
- 提高数据可用性和安全性。
- Allows continued access to data even if one or more servers fail.
- 即使一个或多个服务器发生故障，也允许继续访问数据。
Replica Set:
副本集：
- A configuration of multiple MongoDB servers, including one primary and several secondaries.
- 多个 MongoDB 服务器的配置，包括一个主服务器和几个辅助服务器。
- The primary handles write operations, while secondaries maintain copies of the primary’s data.
- 主服务器处理写操作，而辅助服务器维护主服务器数据的副本。

Setting Up a Replica Set:

设置副本集：

Create Data Directories:
创建数据目录：
- Linux/Mac: mkdir -p ~/data/rs{1,2,3}
- Linux/Mac：mkdir -p ~/data/rs{1,2,3}
- Windows: md c:\data\rs1 c:\data\rs2 c:\data\rs3
- Windows：md c:\data\rs1 c:\data\rs2 c:\data\rs3
Start MongoDB Instances:

启动 MongoDB 实例：

Run the following commands in separate terminals:

在单独的终端中运行以下命令：

Linux/Mac:



# --smallfiles applies only to the MMAPv1 storage engine; omit it on MongoDB 4.2 or later, where the option was removed.
mongod --replSet mdbDefGuide --dbpath ~/data/rs1 --port 27017 --smallfiles --oplogSize 200
mongod --replSet mdbDefGuide --dbpath ~/data/rs2 --port 27018 --smallfiles --oplogSize 200
mongod --replSet mdbDefGuide --dbpath ~/data/rs3 --port 27019 --smallfiles --oplogSize 200

Windows:



REM --smallfiles applies only to the MMAPv1 storage engine; omit it on MongoDB 4.2 or later, where the option was removed.
mongod --replSet mdbDefGuide --dbpath c:\data\rs1 --port 27017 --smallfiles --oplogSize 200
mongod --replSet mdbDefGuide --dbpath c:\data\rs2 --port 27018 --smallfiles --oplogSize 200
mongod --replSet mdbDefGuide --dbpath c:\data\rs3 --port 27019 --smallfiles --oplogSize 200

Initiate the Replica Set:

初始化副本集：

Connect to one instance:
连接到一个实例：
mongo --port 27017
Create and initiate config:

创建并初始化配置：

rsconf = {
  _id: "mdbDefGuide",
  members: [
    {_id: 0, host: "localhost:27017"},
    {_id: 1, host: "localhost:27018"},
    {_id: 2, host: "localhost:27019"}
  ]
}
rs.initiate(rsconf)
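When the members differ only by port, the config document can also be generated. A small sketch; `buildReplicaSetConfig` is a hypothetical helper, not a MongoDB function.

```javascript
// Hypothetical helper: build a replica set config document from a list of ports.
function buildReplicaSetConfig(setName, ports, host = "localhost") {
  return {
    _id: setName,
    // Member _id values are assigned from the position in the list.
    members: ports.map((port, index) => ({ _id: index, host: `${host}:${port}` }))
  };
}

const generatedConf = buildReplicaSetConfig("mdbDefGuide", [27017, 27018, 27019]);
console.log(JSON.stringify(generatedConf, null, 2));
// In the shell, the result would be passed on as rs.initiate(generatedConf).
```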

📊 Observing Replication

📊 观察复制

Check Replica Set Status:
检查副本集状态：
- Use rs.status() to view the status of the replica set, including primary and secondary members.
- 使用 rs.status() 查看副本集的状态，包括主成员和辅助成员。
Writing Data:
写入数据：
- Connect to the primary and perform write operations to test replication:
- 连接到主服务器并执行写操作以测试复制：
  use test
  for (let i = 0; i < 1000; i++) { db.coll.insertOne({count: i}) }
  db.coll.countDocuments({}) // Should return 1000
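The rs.status() result is long, so it can help to condense it to one line per member. A sketch, assuming the result has a `members` array whose entries include `name`, `stateStr`, and `health`; `summarizeMembers` is a hypothetical helper and the sample object is a trimmed, hand-written stand-in for real output.

```javascript
// Hypothetical helper: condense an rs.status() result to one line per member.
function summarizeMembers(status) {
  return status.members.map((m) => `${m.name} ${m.stateStr} health=${m.health}`);
}

// Hand-written stand-in for rs.status() output (real output has many more fields).
const sampleStatus = {
  set: "mdbDefGuide",
  members: [
    { name: "localhost:27017", stateStr: "PRIMARY", health: 1 },
    { name: "localhost:27018", stateStr: "SECONDARY", health: 1 },
    { name: "localhost:27019", stateStr: "SECONDARY", health: 1 }
  ]
};

summarizeMembers(sampleStatus).forEach((line) => console.log(line));
// In the shell: summarizeMembers(rs.status())
```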

📊 MongoDB: Aggregation Framework, Transactions, and Replication

📊 MongoDB：聚合框架、事务和复制

🧩 Key Concepts

🧩 关键概念

Aggregation Framework:
聚合框架：
- Utilizes a pipeline approach for data analysis.
- 利用管道方法进行数据分析。
- Common stages include:
- 常用阶段包括：
  - $match: Filters documents based on criteria.
  - $match：根据条件过滤文档。
  - $group: Groups documents together.
  - $group：将文档分组。
  - $project: Reshapes documents by including/excluding fields.
  - $project：通过包含/排除字段来重塑文档。
  - $sort: Orders documents based on specified fields.
  - $sort：根据指定字段对文档进行排序。
  - $limit: Restricts the number of documents passing through the pipeline.
  - $limit：限制通过管道的文档数量。
  - $skip: Skips a specified number of documents.
  - $skip：跳过指定数量的文档。
Transactions:
事务：
- Ensure ACID compliance for operations across multiple documents and collections.
- 确保跨多个文档和集合的操作符合 ACID。
- Maintain data integrity during multi-document operations.
- 在多文档操作期间维护数据完整性。
Replication:
复制：
- Provides high availability and data redundancy.
- 提供高可用性和数据冗余。
- A replica set consists of multiple servers maintaining identical data copies for failover support.
- 副本集由多个服务器组成，这些服务器维护相同的数据副本以支持故障转移。

🔍 Important Commands and Usages

🔍 重要命令和用法

Check Primary Status:
检查主节点状态：

Use db.isMaster() (newer releases also provide db.hello()) to determine the primary and secondary members of a replica set.

使用 db.isMaster()（较新的版本还提供 db.hello()）来确定副本集的主节点和从节点成员。
Reading from Secondaries:
从从节点读取：
- By default, clients cannot read from secondaries. To allow this, use:
- 默认情况下，客户端无法从从节点读取。要允许这样做，请使用：
  secondaryConn.setSlaveOk()
Error Handling:
错误处理：
- Attempting to read from a secondary without permission will return:
- 尝试在没有权限的情况下从从节点读取将返回：
  {
    "ok": 0,
    "errmsg": "not master and slaveOk=false",
    "code": 13435
  }
Writing to Secondaries:
向从节点写入：
- Clients cannot perform write operations directly on secondaries. Writes are only accepted through replication.
- 客户端不能直接在从节点上执行写操作。写入只能通过复制来接受。
Automatic Failover:
自动故障转移：
- If the primary goes down, one of the secondaries is automatically elected as primary.
- 如果主节点宕机，其中一个从节点将自动被选为主节点。