FLUME NOTES

Apache Flume is a distributed, highly reliable, and highly available tool for collecting, aggregating, and moving large amounts of log data from many different sources to a centralized data store.

It is a top-level project of the Apache Software Foundation (ASF).

Flume lets you customize the data senders in a logging system to collect data; at the same time, it can perform simple processing on the data and write it to a variety of (customizable) data receivers.

Flume currently has two version lines: the 0.9.x releases are collectively called Flume-og.

The 1.x releases are collectively called Flume-ng.

Advantages

Flume can store data generated by applications in any centralized store, such as HDFS or HBase.

When data is collected faster than it can be written out, that is, when incoming traffic peaks and the volume of collected data exceeds the write capacity of the destination system, Flume mediates between the data producer and the data receiver to keep a steady flow of data between the two.

Flume provides contextual routing.

Flume's pipelines are transaction-based, which ensures consistency between the sending and receiving sides.

Flume is reliable, fault-tolerant, scalable, easy to manage, and customizable.

Features

Flume can efficiently collect log data from many web servers and store it in HDFS or HBase.

With Flume, we can quickly move data obtained from multiple servers into Hadoop.

Besides log data, Flume can also ingest large-scale event data from social network nodes such as Facebook and Twitter, and from e-commerce sites such as Amazon and Flipkart.

It supports many types of incoming source data and outgoing destination data.

It supports multi-hop flows, fan-in (multiple sources feeding one channel), fan-out (one source feeding multiple channels), contextual routing, and more.

It can be scaled horizontally.

Structure

Event: a unit of data with an optional message header; it is the basic unit of Flume data transfer, and data travels from the source to the destination in the form of events.

Agent: an independent Flume process responsible for data collection; it contains the Source, Channel, and Sink components.

Source: the data-source component, which consumes the events delivered to it; each agent can have one or more sources.

Channel: connects the Source and the Sink; it works like a queue, temporarily storing events in transit.

Sink: the output end; it reads events from the channel, removes them, and passes them on to the next agent (if any) in the flow pipeline.

At its core, Flume runs as an agent: the agent receives data at the Source and transfers it through the Channel to the Sink.

When there is big data to process, the downstream system pulls it directly from the Sink; Flume simply collects data from the Source and delivers it to the Sink.

To define a flow within a single agent, you connect a source and a sink through a channel.

In the configuration file, you list all of the agent's sources, sinks, and channels, and then point each source and sink at its channel(s).

A source can connect to multiple channels, but a sink can read from only one channel, as shown in the sketch below.
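
A minimal single-agent configuration might look like the following sketch. The agent name a1, the netcat source, and the port number are illustrative assumptions, not values taken from these notes.

    # Name the components of hypothetical agent "a1"
    a1.sources  = r1
    a1.channels = c1
    a1.sinks    = k1

    # A simple netcat source listening on a local port (illustrative)
    a1.sources.r1.type = netcat
    a1.sources.r1.bind = localhost
    a1.sources.r1.port = 44444

    # An in-memory channel acts as the buffer between source and sink
    a1.channels.c1.type = memory

    # A logger sink writes events to the agent's log (useful for testing)
    a1.sinks.k1.type = logger

    # Wiring: a source may feed several channels, a sink reads from exactly one
    a1.sources.r1.channels = c1
    a1.sinks.k1.channel = c1

Such an agent would typically be started with something like bin/flume-ng agent --conf conf --conf-file example.conf --name a1.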

Flume Core Component: Source

Exec Source: this source runs a given Unix command at startup and expects the process to continuously produce data on standard output (stderr is discarded unless the logStdErr property is set to true). If the process exits for any reason, the source also exits and produces no further data.
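
A sketch of an Exec Source configuration, assuming an agent named a1 and a log file at /var/log/app.log (both illustrative):

    a1.sources = r1
    a1.sources.r1.type = exec
    # Tail a file and emit each new line as an event (hypothetical path)
    a1.sources.r1.command = tail -F /var/log/app.log
    # Keep stderr output instead of discarding it
    a1.sources.r1.logStdErr = true
    a1.sources.r1.channels = c1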

Spooling Directory Source: This source lets you ingest data by placing files to be ingested into a “spooling” directory on disk.

This source will watch the specified directory for new files, and will parse events out of new files as they appear.

The event parsing logic is pluggable.

After a given file has been fully read into the channel, completion is by default indicated by renaming the file; alternatively the file can be deleted, or a trackerDir can be used to keep track of processed files.

Unlike the Exec source, this source is reliable and will not miss data, even if Flume is restarted or killed.
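
A sketch of a Spooling Directory Source, assuming /var/flume/spool as the watched directory (an illustrative path):

    a1.sources = r1
    a1.sources.r1.type = spooldir
    # Directory watched for new, fully written files (hypothetical path)
    a1.sources.r1.spoolDir = /var/flume/spool
    # Mark completed files by renaming them with this suffix (the default behaviour)
    a1.sources.r1.fileSuffix = .COMPLETED
    a1.sources.r1.channels = c1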

Taildir Source: watches the specified files and tails them in near real time once it detects new lines appended to each file.

If a new line is still being written, this source retries reading it while waiting for the write to complete.

This source is reliable and will not miss data even when the tailed files are rotated.

It periodically writes the last read position of each file to a given position file in JSON format.

If Flume is stopped or goes down for any reason, it can resume tailing from the position recorded in the existing position file.
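
A sketch of a Taildir Source, with assumed log paths and position-file location (all illustrative):

    a1.sources = r1
    a1.sources.r1.type = TAILDIR
    # JSON file recording the last read offset of every tailed file
    a1.sources.r1.positionFile = /var/flume/taildir_position.json
    # One or more file groups; each is a regex describing the files to tail
    a1.sources.r1.filegroups = f1
    a1.sources.r1.filegroups.f1 = /var/log/app/.*\.log
    a1.sources.r1.channels = c1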

Kafka Source: Kafka Source is an Apache Kafka consumer that reads messages from Kafka topics.

If you run multiple Kafka sources, you can configure them with the same consumer group so that each will read a unique set of partitions for the topics.
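
A sketch of a Kafka Source, assuming a broker at kafka1:9092 and a topic named app-logs (both illustrative):

    a1.sources = r1
    a1.sources.r1.type = org.apache.flume.source.kafka.KafkaSource
    a1.sources.r1.kafka.bootstrap.servers = kafka1:9092
    a1.sources.r1.kafka.topics = app-logs
    # Sources sharing this group id split the topic's partitions among themselves
    a1.sources.r1.kafka.consumer.group.id = flume-consumers
    a1.sources.r1.channels = c1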

HTTP Source: a source that accepts Flume events via HTTP POST and GET.

HTTP requests are converted into Flume events by a pluggable “handler”, which must implement the HTTPSourceHandler interface.

The handler takes an HttpServletRequest and returns a list of Flume events.
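
A sketch of an HTTP Source using the JSON handler, on an assumed port 5140:

    a1.sources = r1
    a1.sources.r1.type = http
    a1.sources.r1.bind = 0.0.0.0
    a1.sources.r1.port = 5140
    # Handler that parses a JSON array of events from the request body
    a1.sources.r1.handler = org.apache.flume.source.http.JSONHandler
    a1.sources.r1.channels = c1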

Flume Core Component: Channel

The Channel is the buffer pool on an agent where events are staged.

Events are added by the Source and deleted once they have been consumed by the Sink.

Currently, Flume supports the following channel types:

Channel Type: Explanation
Memory Channel: event data is stored in memory.
JDBC Channel: events are persisted in a database; currently only Derby is supported.
File Channel: event data is stored in files on disk.
Kafka Channel: events are stored in a Kafka cluster (which must be installed separately).
Spillable Memory Channel: events are stored in an in-memory queue and on disk; the memory queue is the primary store, and events spill to disk when it fills up.
Pseudo Transaction Channel: used only for unit testing; not intended for production.
Custom Channel: a user-defined channel created by implementing the Channel interface.

Memory Channel: the Memory Channel stores the event queue in memory.

The maximum size of the queue is the configured capacity value.

It is well suited to scenarios that demand high throughput, but this comes at a cost.

When a failure occurs, all events held in memory at that moment are lost.
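
A sketch of a Memory Channel with assumed capacity settings (the numbers are illustrative):

    a1.channels = c1
    a1.channels.c1.type = memory
    # Maximum number of events held in the in-memory queue
    a1.channels.c1.capacity = 10000
    # Maximum number of events per transaction with a source or sink
    a1.channels.c1.transactionCapacity = 1000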

Flume Core Component: Sink

HDFS Sink: this sink writes events to the Hadoop Distributed File System (HDFS).

It currently supports creating text files and SequenceFiles, and supports compression for both file types.
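
A sketch of an HDFS Sink, with an assumed NameNode address and path template (both illustrative):

    a1.sinks = k1
    a1.sinks.k1.type = hdfs
    # Destination directory; escape sequences such as %Y-%m-%d are expanded per event
    a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/events/%Y-%m-%d
    # Write plain text instead of the default SequenceFile format
    a1.sinks.k1.hdfs.fileType = DataStream
    # Use the agent's local time to resolve the escape sequences above
    a1.sinks.k1.hdfs.useLocalTimeStamp = true
    a1.sinks.k1.channel = c1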

File Roll Sink: stores events on the local file system.
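
A sketch of a File Roll Sink, with an assumed output directory:

    a1.sinks = k1
    a1.sinks.k1.type = file_roll
    # Local directory where event files are written (hypothetical path)
    a1.sinks.k1.sink.directory = /var/flume/out
    # Roll to a new output file every 30 seconds
    a1.sinks.k1.sink.rollInterval = 30
    a1.sinks.k1.channel = c1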

Null Sink: discards all events read from the channel.

HBase2 Sink: This Sink writes data to HBase.

The HBase configuration is picked up from the first hbase-site.xml encountered on the classpath.

A class implementing the HBase2EventSerializer interface, specified in the configuration, converts each event into HBase puts and/or increments.

These puts and increments are then written to HBase.
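
A sketch of an HBase2 Sink, assuming a table named flume_events, a column family cf, and the simple built-in serializer (names are illustrative assumptions):

    a1.sinks = k1
    a1.sinks.k1.type = hbase2
    a1.sinks.k1.table = flume_events
    a1.sinks.k1.columnFamily = cf
    # Serializer that turns each event into HBase puts/increments
    a1.sinks.k1.serializer = org.apache.flume.sink.hbase2.SimpleHBase2EventSerializer
    a1.sinks.k1.channel = c1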

Kafka Sink: with this sink, Flume sends data to Kafka.

Kafka is the destination for the data pipeline. Flume writes messages into a Kafka topic.

Flume does not consume from Kafka in this case.
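
A sketch of a Kafka Sink, assuming a broker at kafka1:9092 and a topic named flume-out (both illustrative):

    a1.sinks = k1
    a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
    a1.sinks.k1.kafka.bootstrap.servers = kafka1:9092
    a1.sinks.k1.kafka.topic = flume-out
    # Number of events to batch into a single producer request
    a1.sinks.k1.flumeBatchSize = 100
    a1.sinks.k1.channel = c1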