Ying

1. DDIA: Foundations of Data Systems (Notes)

2024-07-20T12:16:21+00:00

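This book is well worth reading. For me there were two main takeaways:

Systematic view: I used to dig into individual topics one at a time. The book lays out an N-dimensional structure that ties those points together, so my theoretical picture is more complete. Applying that theory to a new system afterwards is faster, and my intuition is more accurate.
Continuous questioning and answers: the book makes you think further, and as you keep reading it repeatedly confirms or overturns your own ideas. For example, after learning Raft I still did not know how a client finds out which node is the leader. With quorums, when w + r > n a read is guaranteed to reach at least one replica holding the latest value, but how do you identify that value? The book gradually walks through the history and the answers.

These notes cover my takeaways from "Part I: Foundations of Data Systems", which mainly addresses:

What the goals of system design are
How data should be stored and queried logically
How data should be stored and queried physically
How the bytes of the data should be encoded physically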

1. Reliable, Scalable, and Maintainable Applications

1.1. Reliability

Whenever reliability comes up, the first thing I think of is page 4 of Jeff Dean's stanford-295-talk:

The Joys of Real Hardware
Typical first year for a new cluster:
~0.5 overheating (power down most machines in <5 mins, ~1-2 days to recover)
~1 PDU failure (~500-1000 machines suddenly disappear, ~6 hours to come back)
~1 rack-move (plenty of warning, ~500-1000 machines powered down, ~6 hours)
~1 network rewiring (rolling ~5% of machines down over 2-day span)
~20 rack failures (40-80 machines instantly disappear, 1-6 hours to get back)
~5 racks go wonky (40-80 machines see 50% packetloss)
~8 network maintenances (4 might cause ~30-minute random connectivity losses)
~12 router reloads (takes out DNS and external vips for a couple minutes)
~3 router failures (have to immediately pull traffic for an hour)
~dozens of minor 30-second blips for dns
~1000 individual machine failures
~thousands of hard drive failures
slow disks, bad memory, misconfigured machines, flaky machines, etc.

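As an architect, you should keep these numbers in mind.

The difficulty of reliability comes from the problems above.
Guaranteeing reliability requires, beyond handling those problems, a disciplined release process, test/sandbox environments, and monitoring/alerting/logging systems.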

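1.2. Scalability

The author uses a classic Twitter scenario: "When a user views the timeline, first look up everyone they follow, list all of those people's tweets, and finally merge them sorted by time."

There are two ways to design this system:

Update on read:
Update on write:

The choice depends on the actual load figures (current and projected). There are two core interactions:

Posting a tweet: 4.6k qps, peaking at 12k qps
Viewing the home timeline: 300k qps

Taken alone these qps numbers are not large. The challenge is the huge fan-out (each user follows many people and is followed by many people). Assuming each user has 75 followers on average, both scenarios put a load on the storage layer that is one to two orders of magnitude higher.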
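
The fan-out arithmetic can be checked with a few lines. This is only a back-of-the-envelope sketch: the rates and the 75-follower average are the figures quoted above, and the function name is made up for illustration.

```python
# Figures quoted in the notes above.
TWEETS_PER_SEC_AVG = 4_600
TWEETS_PER_SEC_PEAK = 12_000
AVG_FOLLOWERS = 75


def fanout_writes_per_sec(tweets_per_sec: int, avg_followers: int) -> int:
    """Timeline writes per second if every tweet is pushed to each follower."""
    return tweets_per_sec * avg_followers


avg_writes = fanout_writes_per_sec(TWEETS_PER_SEC_AVG, AVG_FOLLOWERS)
peak_writes = fanout_writes_per_sec(TWEETS_PER_SEC_PEAK, AVG_FOLLOWERS)
print(avg_writes)   # 345000
print(peak_writes)  # 900000
```

So 4.6k posts per second turn into roughly 345k timeline writes per second once each post is delivered to every follower.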

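Comparing the pros and cons of push and pull:

push: triggered by writes. Every post is written to the recipient (inbox) of each subscriber. Pro: each recipient is independent. Cons: write load is amplified by the number of followers, and some writes are wasted (read much later, or never).
pull: triggered by reads. Every view runs a complex query and sort. Pros: the data flow is simple and no computation is wasted. Con: read load is amplified by the number of followees.

Twitter's actual approach is a hybrid of push and pull: ordinary users' posts are pushed, celebrities' posts are pulled, and the two are merged when a user views the timeline.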
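
The hybrid can be sketched in memory as follows. This is a toy model, not Twitter's implementation: the data structures, the function names, and the `CELEBRITY_THRESHOLD` cutoff are all assumptions made for illustration, and the follower graph is assumed to be fixed before anyone posts.

```python
from collections import defaultdict

CELEBRITY_THRESHOLD = 2  # hypothetical cutoff; a real system would tune this

followers = defaultdict(set)          # author -> users following the author
following = defaultdict(set)          # user -> authors the user follows
tweets_by_author = defaultdict(list)  # author -> [(ts, text)]
inbox = defaultdict(list)             # user -> [(ts, author, text)], filled on write


def follow(user, author):
    followers[author].add(user)
    following[user].add(author)


def is_celebrity(author):
    return len(followers[author]) >= CELEBRITY_THRESHOLD


def post(author, ts, text):
    tweets_by_author[author].append((ts, text))
    if not is_celebrity(author):
        # push: fan out on write into each follower's inbox
        for user in followers[author]:
            inbox[user].append((ts, author, text))


def home_timeline(user):
    merged = list(inbox[user])
    # pull: fetch celebrity tweets at read time and merge them in
    for author in following[user]:
        if is_celebrity(author):
            merged.extend((ts, author, text) for ts, text in tweets_by_author[author])
    return sorted(merged, reverse=True)  # newest first


follow("alice", "star")
follow("bob", "star")      # "star" now has 2 followers and counts as a celebrity
follow("alice", "carol")   # "carol" has 1 follower and stays an ordinary user
post("carol", 1, "hi")
post("star", 2, "hello")
alice_timeline = home_timeline("alice")
bob_timeline = home_timeline("bob")
```

Only carol's tweet is written into alice's inbox; star's tweet is stored once and merged in at read time. The sketch sidesteps the hard part: if an author crosses the threshold later, tweets that were already pushed would also be pulled, so a real system needs a rule for switching modes.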
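The answers to practical problems like this deserve further thought. A few things that occurred to me while reading:

Read cache: the final sorted content differs for every reader, so a read cache can only be placed on the tweets table.
Isolation: a benefit of push is that it is easy to isolate. If a recipient goes down, only some users are affected, not all. Can pull achieve the same? For example, celebrities and ordinary users could be stored separately. That seems feasible in theory, but probably at a higher cost than push.
Both the push/pull hybrid and isolation raise another problem: what criterion decides who is a celebrity, and how to apply changes dynamically.

Scalability requires assumptions; you cannot scale along arbitrary dimensions. Take a task scheduling system: if we assume future requirements will revolve around task types, data sources, and scheduling performance, then those are the dimensions that need to be scalable.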

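1.3. Maintainability

(Empty)

2. Data Models and Query Languages

2.1. Relational Model Versus Document Model

A data model addresses how best to represent data and its relationships.

Historically, several approaches have been explored:

Hierarchical model: one big tree in which each record has exactly one parent. Expressing many-to-many relationships and joins is very difficult, so duplicate records have to be maintained by hand.
Network model: a record may have multiple parents, and the access path has to be chosen manually when querying.
Relational model: data is organized into relations (called tables in SQL), where each relation is an unordered collection of tuples (rows in SQL). This is the most widely accepted model today; like object-oriented thinking, it matches how we naturally think.
Document model: data is stored in a one-to-many fashion, which to me resembles the hierarchical model. The book's typical example is storing a résumé: the whole record looks like one big JSON document.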

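Models 1 and 2 (hierarchical and network) have largely disappeared; at the time they existed mostly to fit hardware constraints.
Model 3 (relational, i.e. SQL) has been very successful.
Still, there are real-world needs that relational databases cannot meet:

A need for greater scalability than relational databases offer, including very large datasets or very high write throughput (the strongest need in my own work)
A widespread preference for free and open source software over commercial database products
Specialized query operations that are not well supported by the relational model
Frustration with the restrictiveness of relational schemas, and a desire for a more dynamic and expressive data model

As a result, many databases later labeled themselves NoSQL. The author is blunt about it:

It does not actually refer to any particular technology; it originally just kept showing up as a catchy Twitter hashtag for a 2009 meetup on open source, distributed, nonrelational databases.

I really like how the author cuts straight to the point, which keeps us from getting lost in the terms peddled by database and cloud vendors (lake-warehouse integration, unified stream and batch, LakeHouse/LakeWareHouse/DataLake/DeltaLake, etc.).

Back to NoSQL itself: a typical example of model 4, the document model, is MongoDB.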

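Comparing the relational and document models:

Aspect	Relational model	Document model
Application code	1. Uses multiple tables to represent one-to-many relationships 2. Supports joins (many-to-many relationships)	1. Expresses one-to-many relationships naturally, e.g. a résumé: name-positions-education-contact_info 2. Referencing nested items is more awkward, e.g. "the 2nd education entry for a given name" 3. Joins require the application to issue multiple requests and join in memory
Schema flexibility	Strict validation on write (schema-on-write), so data conformity is guaranteed	Interpreted on read (schema-on-read); convenient and natural to use
Data locality for queries	Content is spread over multiple tables, so reading everything costs more disk IO and time	The whole document is stored together, so reading it all is easy; reading only a part or updating is inconvenient

Jiang Xiaowei's recent talk "Distributed Data Warebase: Letting Intelligence Emerge from Data" makes the same point:

A data model is a language for expressing information. With such a language, data is upgraded from bits to table records or documents.

My understanding: a model is about how to express relationships between entities, but it also shapes the implementation. Although PostgreSQL and MySQL both support JSON documents now, their implementations differ considerably. Meanwhile, systems such as Redis, HBase, and ElasticSearch do not seem to fit any of the models above; perhaps databases are all evolving toward multi-model.
No single model can express every scenario either.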

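2.2. Query Languages for Data

SQL follows the structure of relational algebra, which I covered in the note "Calcite-2: Relational Algebra, Architecture, and Processing Flow".

Elasticsearch, although it follows the document model, also supports SQL syntax. The first request below uses the native query DSL and the second uses SQL: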

GET bank/_search
{
    "query": {
        "bool": {
            "must": [
                {"match": {"gender": "F"}},
                {"match": {"age": 28}}
            ]
        }
    }
}

GET _xpack/sql
{
  "query":  "select * from online_trace_2021_02_20 where datatype='U' limit 1"
}

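The author compares declarative and imperative queries. I think declarative is better: it reuses a general-purpose parser and optimizer to produce the physical execution plan, so application developers can focus on using the database well.
In big data, HiveSQL/SparkSQL can express Spark/MapReduce jobs, and FlinkSQL can implement real-time jobs.
Of course SQL's expressive power is limited, so in practice mixing the two is probably the most common.
A diagram from Flink illustrates the levels of query languages clearly: the higher the level, the more concise the expression and the less it can express:

2.3. Graph-Like Data Models

(Empty)

3. Storage and Retrieval

3.1. Data Structures That Power Your Database

Consider the simplest possible database: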

#!/bin/bash

db_set() {
  echo "$1,$2" >> database
}

db_get() {
  grep "^$1," database | sed -e "s/^$1,//" | tail -n 1
}

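This made me smile 🤪: when writing the opening of my leveldb notes, I had a similar beginning in mind.

The core technique of Bitcask (the default storage engine in Riak) is a hash index: an in-memory HashMap whose keys are the database keys and whose values are byte offsets into the data file. It sounds too simple, but it does work. The drawbacks are the high memory requirement (all keys must fit in RAM), random disk reads, and no support for range queries.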
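
To make the idea concrete, here is a minimal sketch of a hash-indexed append-only log. It is not Bitcask's actual file format or API: the class name and the `key,value` line format (borrowed from the bash example above) are assumptions, keys must not contain commas or newlines, and the index is not rebuilt after a restart.

```python
import os
import tempfile


class HashIndexedLog:
    """Append-only data file plus an in-memory dict of key -> byte offset."""

    def __init__(self, path):
        self.path = path
        self.index = {}
        open(path, "ab").close()  # make sure the file exists

    def set(self, key, value):
        record = f"{key},{value}\n".encode("utf-8")
        with open(self.path, "ab") as f:
            offset = f.seek(0, os.SEEK_END)  # position where this record starts
            f.write(record)
        self.index[key] = offset  # a later write simply replaces the offset

    def get(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None
        with open(self.path, "rb") as f:
            f.seek(offset)  # one seek, one read: no scan of the file
            line = f.readline().decode("utf-8").rstrip("\n")
        return line.split(",", 1)[1]


with tempfile.TemporaryDirectory() as tmp:
    db = HashIndexedLog(os.path.join(tmp, "database"))
    db.set("42", "San Francisco")
    db.set("7", "London")
    db.set("42", "Berlin")  # old record stays in the file, index points to the new one
    latest = db.get("42")
    other = db.get("7")
    missing = db.get("nope")
```

Compared with the `grep | tail` lookup, a read here costs one dictionary lookup and one seek. The stale record for key 42 still occupies disk space, which is why such a log needs compaction.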

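The two index structures more commonly recommended are the LSM-Tree and the B-Tree.

LSM-Tree:

A typical implementation is leveldb. SSTables hold sorted data, and the LSM-Tree manages the MemTable and the SSTables. It makes full use of sequential disk writes and suits workloads that read recently written data. A lookup for a key that does not exist has to search all the way down to the last level, so a BloomFilter is used to filter such lookups out early.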
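
The read and write paths described above can be modeled in a few lines. This is a toy sketch under heavy simplification, not leveldb's design: segments live in memory, there are no levels, no compaction, and no Bloom filter, and the class name `TinyLSM` is made up.

```python
import bisect


class TinyLSM:
    def __init__(self, memtable_limit=2):
        self.memtable = {}
        self.segments = []  # oldest first; each is a sorted list of (key, value)
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            # flush: write the memtable out as one sorted, immutable segment
            self.segments.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for segment in reversed(self.segments):  # newest segment wins
            i = bisect.bisect_left(segment, (key,))
            if i < len(segment) and segment[i][0] == key:
                return segment[i][1]
        return None  # a missing key costs a probe of every segment


db = TinyLSM(memtable_limit=2)
db.put("a", "1")
db.put("b", "2")   # memtable is full: flushed as the first segment
db.put("a", "3")
from_memtable = db.get("a")
db.put("c", "4")   # flushed as the second segment
from_segment = db.get("a")
old_value = db.get("b")
missing = db.get("z")
```

The last lookup shows why a Bloom filter pays off: without one, a key that does not exist forces a search of every segment before the answer is "not found".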

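B-Tree:

The basic write operation of a B-tree is to overwrite an old page on disk with new data, i.e. modify in place. It assumes the overwrite does not change the page's location on disk, so all references to the page remain intact when it is overwritten. This is in sharp contrast to leveldb's append-only writes.

The number of references to child pages in one page of a B-tree is called the branching factor; in the book's figure it is 6. Most databases fit into a B-tree three or four levels deep, so you do not need to follow many page references to find the page you want. A four-level tree of 4 KB pages with a branching factor of 500 can store up to 256 TB (500**4 leaf pages * 4096 bytes per page = 2.56 * 10**14 bytes).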
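
The capacity figure is easy to verify. A small worked calculation, assuming that only the pages at the fourth level hold data and that a 4 KB page means 4096 bytes:

```python
BRANCHING_FACTOR = 500
LEVELS = 4
PAGE_SIZE_BYTES = 4096  # 4 KB pages

# Each level multiplies the number of reachable pages by the branching factor.
leaf_pages = BRANCHING_FACTOR ** LEVELS
leaf_bytes = leaf_pages * PAGE_SIZE_BYTES

print(leaf_pages)            # 62500000000
print(leaf_bytes / 10**12)   # 256.0 (decimal terabytes)
```

Counting the inner levels as well (500 + 500**2 + 500**3 pages) adds only about 0.2% on top, so the leaf level dominates.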

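LSM-Tree versus B-Tree:

Aspect	LSM-Tree	B-Tree
Write load	No random writes, so write pressure is lower, but compaction causes significant write amplification; there is also a WAL (which the book seems to gloss over)	WAL (sequential write) + page write (random write)
Compression	Easy to compress as a whole after merging	More fragmentation and fixed-size pages with reserved space, so harder to compress
Transactional semantics	Harder: copies of the same key may exist in several segments	Easier: each key exists in exactly one place, so locks can be attached to the tree

LSM-Trees and B-Trees are two classic index structures.
Secondary indexes, full-text indexes, and fuzzy indexes build further complexity on top of key-value indexes.
Also note that data structures evolve closely with hardware performance.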

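3.2. Transaction Processing or Analytics?

Data storage serves roughly two purposes:

Backend interaction: e.g. blog comments, orders in a trading system
Business analysis: e.g. breaking comments down by city of origin, today's transaction volume

The former is called OLTP (online transaction processing) and the latter OLAP (online analytic processing). A comparison:

Property	OLTP	OLAP
Main read pattern	Key-based; each query returns a small number of records	Aggregates over a large number of records
Main write pattern	Random access, low-latency writes from user input	Bulk import (ETL) or event stream
Primarily used by	End users, via a web application	Internal analysts, for decision support
What data represents	Latest state of data (current point in time)	History of all events over time
Dataset size	GB to TB	TB to PB

Originally databases served both OLTP and OLAP workloads, but queries grew more and more complex. For example:

Analysis needs to join data across different mysql instances, or across different storage systems (e.g. mysql and tidb)
Querying the production mysql instance directly creates too much load, and querying a replica/standby still performs poorly because of the large scans
Log data is not written to mysql but still needs to be analyzed
Historical comparisons, such as today versus yesterday, when the database does not support snapshots
Routine analyses are better generated once a day on a schedule than by running the SQL every time
...

For all these reasons, the data warehouse gradually emerged as a separate branch dedicated to business analysis.
Data warehouses in turn differ in freshness requirements (daily, hourly, per minute, per second, and so on), which leads to different architectures and OLAP engine choices, typically Hive and ClickHouse. So in terms of freshness, OLTP is online, while OLAP is a mix of near-online and offline scenarios. These two terms exist, fundamentally, because writing and analysis have different requirements.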

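3.3. Column-Oriented Storage

In most OLTP databases, storage is laid out in a row-oriented fashion: all the values from one row of a table are stored next to each other. Document databases are similar: an entire document is typically stored as one contiguous sequence of bytes.

The idea of column storage comes from an observation: most of the time a query reads only a few fields of each row, not all of them.

Storing by column rather than by row also has storage advantages:

Reading only the needed columns instead of whole rows reduces IO
Values in one column share a type, and after sorting adjacent values are often identical or share prefixes, so the compression ratio is high
If the column's values are enumerable, representing them uniformly as bitmaps has big advantages for both storage and computation, where number of bitmaps = number of distinct values and bits per bitmap = number of rows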
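
The bitmap idea from the last point can be sketched with plain Python integers used as bit sets. The column contents and helper names are invented for illustration; real engines additionally compress the bitmaps (for example with run-length encoding).

```python
def bitmap_encode(column):
    """One bitmap per distinct value; bit i is set when row i holds that value."""
    bitmaps = {}
    for row, value in enumerate(column):
        bitmaps[value] = bitmaps.get(value, 0) | (1 << row)
    return bitmaps


def rows_matching(bitmap, n_rows):
    """Row numbers whose bit is set."""
    return [row for row in range(n_rows) if (bitmap >> row) & 1]


column = ["cn", "us", "cn", "de", "us", "cn"]
bitmaps = bitmap_encode(column)

# WHERE country IN ('cn', 'de') becomes a bitwise OR of two bitmaps.
cn_or_de = bitmaps["cn"] | bitmaps["de"]
matching_rows = rows_matching(cn_or_de, len(column))
```

Three distinct values give three bitmaps of six bits each, and a filter over several values reduces to bitwise operations instead of a per-row string comparison.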

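Beyond storage engines, big data file formats also favor columnar storage. I once put together an introductory note on columnar file formats for big data.

Vectorization builds on columnar storage. Since values of the same column are stored adjacently in the same format, the CPU's SIMD instructions can be used to speed up computation. SparkSQL, Presto, and others are gradually adding support (Gluten, Velox), and the area is moving fast. I am very interested in it, but limited staffing has kept me from trying it in practice.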

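4. Encoding and Evolution

Why do we need data encoding formats? In memory, data lives in objects, structs, lists, arrays, hash tables, trees, and so on, and non-adjacent pieces are linked with pointers.

When data is written to disk or sent over the network, however, it has to be a self-contained sequence of bytes. Hence encoding/decoding, i.e. serialization/deserialization.

Encodings should be language-independent
JSON/XML/CSV are textual formats with good readability, suitable for some scenarios (for example, the task scheduling system I am responsible for delivers output as csv/txt by default, while DolphinScheduler natively uses json). Textual formats also have drawbacks: large size, low efficiency, and no support for binary data
Thrift/Protobuf are widely used binary formats. Some features the author mentions, such as Map and required, were revised in PB3. For a while I was fascinated by PB's encoding and interface design and wrote several notes on it
Avro I did not fully understand. It seems the file includes the schema, so it is "self-describing"; and since it is used on hdfs, the extra schema bytes in a file are negligible in size

Why do we need dataflow? Fundamentally, to exchange data.

Dataflow takes several forms:

Via databases: data is written to a database and later read back by the same or another program
Via services (REST, RPC)
Via message passing

With services, you need to decide between REST and RPC. RPC frameworks mainly target requests between services owned by the same organization, typically within the same datacenter. Between services inside an organization I also prefer RPC over REST; I once tried to unify the proto definitions within my team, but met a lot of resistance. Note that despite the name, RPC differs from a local call: a local function call has only two outcomes, success or failure, whereas once an RPC times out, the outcome is simply unknown.

Message passing has many advantages:

Acts as a buffer
The recipient crashing does not lose messages
One message can go to multiple recipients, which only need to subscribe
Decouples sender from recipient

It suits scenarios where the sender only sends, rather than request/response data exchange.

Textual and binary encodings suit different scenarios.
A binary encoding needs to be compact and fast, and to maintain forward and backward compatibility.
For example, protobuf's handling of unknown fields: given three modules A -> B -> C, even if A and C use v2 while B uses v1, no data is lost.
In big data, the Avro format is used for writing hdfs files.
Data can be exchanged via databases, REST or RPC, or messages. The message-based approach reminds me of real-time data warehouse architectures in big data, where event-driven stages form pipelines.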
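
The A -> B -> C case with unknown fields can be simulated without any protobuf library. The sketch below is only a model of the idea: messages are plain dicts keyed by field tag, and the schemas and the `encode`/`decode` helpers are hypothetical, not the protobuf API or wire format.

```python
SCHEMA_V1 = {1: "id"}                # B only knows field 1
SCHEMA_V2 = {1: "id", 2: "email"}    # A and C also know field 2


def decode(wire, schema):
    """Split a tagged message into fields this schema knows and fields it does not."""
    known = {schema[tag]: v for tag, v in wire.items() if tag in schema}
    unknown = {tag: v for tag, v in wire.items() if tag not in schema}
    return known, unknown


def encode(known, unknown, schema):
    tag_of = {name: tag for tag, name in schema.items()}
    wire = {tag_of[name]: v for name, v in known.items()}
    wire.update(unknown)  # re-emit fields this version did not understand
    return wire


# A (v2) -> B (v1) -> C (v2)
wire_from_a = encode({"id": 7, "email": "a@example.com"}, {}, SCHEMA_V2)
known_b, unknown_b = decode(wire_from_a, SCHEMA_V1)
wire_from_b = encode(known_b, unknown_b, SCHEMA_V1)
known_c, unknown_c = decode(wire_from_b, SCHEMA_V2)
```

B never learns what tag 2 means, yet C still receives the email because B carried the unknown field along when re-encoding. If B dropped unknown fields instead, the email would be lost on the way through.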

Flink - Timers

2024-07-06T07:30:04+00:00

1. Timer

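The system receives 1 record, computes, and emits 0~N records: this event-driven style is the simplest and most natural.

In practice, however, because of out-of-order data, lost data, and business needs for periodic updates, computation also depends on time-based triggers, as in the scenarios in Section 3, IMPLEMENTATION & DESIGN, of the Dataflow Model paper.

These scenarios depend on timers, and Flink's windows are built on timers as well.

A timer is more than a simple timed callback. The Process Function#Timers documentation describes timers; the main points are:

Both ProcessingTime and EventTime timers are maintained by the TimerService, which invokes the user's onTimer method when they fire
The TimerService deduplicates timers per key + timestamp
onTimer and processElement are invoked serially, so users just implement the logic without worrying about inconsistencies caused by concurrency
Timers are stored in state, so they are fault tolerant

These features are all important design choices for usability and stability.

Implementing this ourselves would be quite complex. For example: EventTime timers are driven by watermarks/data while ProcessingTime timers are driven by the system clock, so how do both end up calling back into onTimer? How are onTimer callbacks serialized with processElement? How does the TimerService manage multiple timers, guarantee ordering, and store them? After rescaling, can timer state still be read correctly under a different parallelism, and how is it guaranteed that no timer is lost?

This note tries to explain how timers are used and how they work (based on flink 1.14.5).

2. KeyedProcessFunction

An example of KeyedProcessFunction using a timer: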

  val sourceStream = SourceUtils.generateKafkaSensorReadingStream(env, sensorReadingTopic, bootStrapServers, groupId)

  sourceStream.keyBy(_.id)
    .process(new KeyedProcessFunction[String, SensorReading, String] {
      val logger: Logger = LoggerFactory.getLogger(this.getClass)

      override def processElement(value: SensorReading, ctx: KeyedProcessFunction[String, SensorReading, String]#Context, out: Collector[String]): Unit = {
        logger.info(s"processElement:${value} ctx.timestamp:${ctx.timestamp} ctx.timerService:${ctx.timerService} ctx.getCurrentKey:${ctx.getCurrentKey}")
        val nextProcessTimestamp = System.currentTimeMillis() + 300000L
        logger.info(s"nextProcessTimestamp:${nextProcessTimestamp}")

        ctx.timerService().registerProcessingTimeTimer(nextProcessTimestamp)
      }

      override def onTimer(timestamp: Long, ctx: KeyedProcessFunction[String, SensorReading, String]#OnTimerContext, out: Collector[String]): Unit = {
        logger.info(s"onTimer:${timestamp} ctx.getCurrentKey:${ctx.getCurrentKey}")
      }
    })

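When a record arrives, a timer is registered for the current time + 5 min; at that time the timer fires and onTimer runs.
With registerEventTimeTimer the effect is the same, once the watermark passes that timestamp.

KeyedProcessFunction#Context.timerService() returns a TimerService, an interface that supports registering and deleting timers and reading the current ProcessingTime and the current watermark: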

public interface TimerService {
    String UNSUPPORTED_REGISTER_TIMER_MSG = "Setting timers is only supported on a keyed streams.";
    String UNSUPPORTED_DELETE_TIMER_MSG = "Deleting timers is only supported on a keyed streams.";

    long currentProcessingTime();
    long currentWatermark();

    void registerProcessingTimeTimer(long time);
    void registerEventTimeTimer(long time);

    void deleteProcessingTimeTimer(long time);
    void deleteEventTimeTimer(long time);
}

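Besides KeyedProcessFunction, the triggers described in Flink - Windows: Theory and Implementation can also use timers. The difference is that a trigger does not expose the TimerService; it offers the register/delete methods directly: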

    public interface TriggerContext {
        long getCurrentProcessingTime();
        long getCurrentWatermark();

        void registerProcessingTimeTimer(long time);
        void registerEventTimeTimer(long time);

        void deleteProcessingTimeTimer(long time);
        void deleteEventTimeTimer(long time);
    }

The firing mechanism is the same.

What KeyedProcessFunction#Context.timerService() actually returns is a SimpleTimerService:

public class SimpleTimerService implements TimerService {

    private final InternalTimerService<VoidNamespace> internalTimerService;

    public SimpleTimerService(InternalTimerService<VoidNamespace> internalTimerService) {
        this.internalTimerService = internalTimerService;
    }

}

This is the class exposed to users. Its methods all delegate to internalTimerService, which is Flink's internal implementation.

ProcessFunction#Context does not support the TimerService; the code throws an exception:

public class ProcessOperator<IN, OUT>
        extends AbstractUdfStreamOperator<OUT, ProcessFunction<IN, OUT>>
        implements OneInputStreamOperator<IN, OUT> {

    private class ContextImpl extends ProcessFunction<IN, OUT>.Context implements TimerService {

        @Override
        public void registerEventTimeTimer(long time) {
            throw new UnsupportedOperationException(UNSUPPORTED_REGISTER_TIMER_MSG);
        }
    }
}

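The benefit of this design is that the interface and implementation users see stay very simple, while the complexity lives in internalTimerService.

2. KeyedProcessOperator And WindowOperator

In the ExecGraph, KeyedProcessFunction corresponds to the operator KeyedProcessOperator, and ProcessWindowFunction to WindowOperator.

The previous note showed the call stack of WindowOperator.onEventTime; the two are in fact very similar:

Comparing the two call stacks, we can conclude:

For either operator,

onEventTime is called along with processElement, i.e. a watermark is extracted while processing input, which then triggers the method: InternalTimeServiceManagerImpl.advanceWatermark -> InternalTimerServiceImpl.advanceWatermark
onProcessingTime, like onEventTime, runs inside MailboxProcessor.runMailboxLoop: StreamTask.invokeProcessingTimeCallback -> InternalTimerServiceImpl.onProcessingTime
Both paths go through one very important class, InternalTimerServiceImpl, calling its advanceWatermark and onProcessingTime methods respectively

InternalTimerServiceImpl is the implementation of the InternalTimerService interface from the previous section.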

3. KeyedProcessOperator Source Code Walkthrough

3.1. KeyedProcessOperator.open

In KeyedProcessOperator.open, a SimpleTimerService is constructed with the internalTimerService passed in:

public class KeyedProcessOperator<K, IN, OUT>
        extends AbstractUdfStreamOperator<OUT, KeyedProcessFunction<K, IN, OUT>>
        implements OneInputStreamOperator<IN, OUT>, Triggerable<K, VoidNamespace> {
    public void open() throws Exception {
        // ...
        InternalTimerService<VoidNamespace> internalTimerService =
                getInternalTimerService("user-timers", VoidNamespaceSerializer.INSTANCE, this);

        TimerService timerService = new SimpleTimerService(internalTimerService);

        context = new ContextImpl(userFunction, timerService);
    }

WindowOperator.open is similar:

public class WindowOperator<K, IN, ACC, OUT, W extends Window>
        extends AbstractUdfStreamOperator<OUT, InternalWindowFunction<ACC, OUT, K, W>>
        implements OneInputStreamOperator<IN, OUT>, Triggerable<K, W> {
    protected transient InternalTimerService<W> internalTimerService;

    public void open() throws Exception {
        // ...
        internalTimerService = getInternalTimerService("window-timers", windowSerializer, this);

getInternalTimerService is a method of their common base class AbstractStreamOperator; only the arguments differ, and the third argument is the operator itself.

                AbstractStreamOperator            
                           ▲                      
                           │                                         
               AbstractUdfStreamOperator          
                  ▲                  ▲            
                  │                  │                    
KeyedProcessOperator                WindowOperator

The method returns an InternalTimerService.

3.2. AbstractStreamOperator.getInternalTimerService

This method shows that InternalTimerService instances are managed by the InternalTimeServiceManager:

abstract class AbstractStreamOperator
    // Introduces the concepts of namespace and key
    // name: "user-timers", "window-timers"
    // namespaceSerializer: serializer for the namespace type N
    // triggerable: the timer callback target, i.e. the operator itself
    public <K, N> InternalTimerService<N> getInternalTimerService(
            String name, TypeSerializer<N> namespaceSerializer, Triggerable<K, N> triggerable) {
        if (timeServiceManager == null) {
            throw new RuntimeException("The timer service has not been initialized.");
        }
        @SuppressWarnings("unchecked")
        InternalTimeServiceManager<K> keyedTimeServiceHandler =
                (InternalTimeServiceManager<K>) timeServiceManager;
        KeyedStateBackend<K> keyedStateBackend = getKeyedStateBackend();
        checkState(keyedStateBackend != null, "Timers can only be used on keyed operators.");
        return keyedTimeServiceHandler.getInternalTimerService(
                name, keyedStateBackend.getKeySerializer(), namespaceSerializer, triggerable);
    }

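3.3. InternalTimerServiceImpl

Compared with the user-facing TimerService, InternalTimerService adds the concept of a namespace. InternalTimerServiceImpl is the real implementation: it manages the registered timestamps, the ProcessingTimeService, and persistence.

Here we see the two important methods from Section 2: onProcessingTime and advanceWatermark. Given a timestamp, each keeps taking due timers off the registration queue, sets up the context the user function needs (the current key), and invokes the onProcessingTime or onEventTime method of the triggerTarget.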

// Holds the queues of registered timers and the timer service that actually fires the callbacks
InternalTimerServiceImpl implements InternalTimerService
    private final ProcessingTimeService processingTimeService

    KeyGroupedInternalPriorityQueue<TimerHeapInternalTimer<K, N>> processingTimeTimersQueue;
    KeyGroupedInternalPriorityQueue<TimerHeapInternalTimer<K, N>> eventTimeTimersQueue;

    startTimerService
        // Peek at the earliest timer in the queue
        InternalTimer<K, N> headTimer = processingTimeTimersQueue.peek()
        // Register a callback at that timer's timestamp with onProcessingTime as the function; note that processingTimeService wraps the callback to ensure serial execution
        nextTimer = processingTimeService.registerTimer(headTimer.getTimestamp(), this::onProcessingTime);

     private void onProcessingTime(long time)
        InternalTimer<K, N> timer;

        // Poll timers from processingTimeTimersQueue until the head is not yet due,
        // calling triggerTarget.onProcessingTime for each one
        // InternalTimerServiceImpl.triggerTarget here is the KeyedProcessOperator
        // Afterwards, register the next nearest timer
        while ((timer = processingTimeTimersQueue.peek()) != null && timer.getTimestamp() <= time) {
            processingTimeTimersQueue.poll();
            keyContext.setCurrentKey(timer.getKey());
            triggerTarget.onProcessingTime(timer);
        }
        // Register the next timer (with this scheme, adding a new timer presumably needs the same logic?)
        if (timer != null && nextTimer == null) {
            nextTimer =
                    processingTimeService.registerTimer(
                            timer.getTimestamp(), this::onProcessingTime);
        }

    public void advanceWatermark(long time) throws Exception {
        currentWatermark = time;

        InternalTimer<K, N> timer;

        while ((timer = eventTimeTimersQueue.peek()) != null && timer.getTimestamp() <= time) {
            eventTimeTimersQueue.poll();
            keyContext.setCurrentKey(timer.getKey());
            triggerTarget.onEventTime(timer);
        }
    }

InternalTimeServiceManager mainly returns the InternalTimerServiceImpl for a given name; this cache layer is needed to avoid creating duplicates:

InternalTimeServiceManagerImpl implements InternalTimeServiceManager
    private final Map<String, InternalTimerServiceImpl<K, ?>> timerServices;

    getInternalTimerService
        // Look up name in timerServices and return the existing service if present; otherwise create an InternalTimerServiceImpl, add it to timerServices, and return it
        InternalTimerServiceImpl<K, N> timerService =
                registerOrGetTimerService(name, timerSerializer);

        // 
        timerService.startTimerService(
                timerSerializer.getKeySerializer(),
                timerSerializer.getNamespaceSerializer(),
                triggerable);

        return timerService;

3.4. KeyGroupedInternalPriorityQueue and TimerHeapInternalTimer

TimerHeapInternalTimer holds four fields (key, namespace, timestamp, timerHeapIndex); the smaller the timestamp, the earlier the timer sorts:

public final class TimerHeapInternalTimer<K, N>
        implements InternalTimer<K, N>, HeapPriorityQueueElement {
    @Nonnull private final K key;
    @Nonnull private final N namespace;
    private final long timestamp;
    private transient int timerHeapIndex;    

    public int hashCode() {
        int result = (int) (timestamp ^ (timestamp >>> 32));
        result = 31 * result + key.hashCode();
        result = 31 * result + namespace.hashCode();
        return result;
    }

    public int comparePriorityTo(@Nonnull InternalTimer<?, ?> other) {
        return Long.compare(timestamp, other.getTimestamp());
    }

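Note that its hashCode depends on timestamp, key, and namespace.

The namespace concept is the hardest to grasp. My understanding:

For KeyedProcessFunction there is nothing to distinguish, so there is a single namespace, VoidNamespace (serialized with VoidNamespaceSerializer.INSTANCE)
For ProcessWindowFunction, callbacks are tied to a window. Several windows may exist at the same moment and their data does not affect each other, so each window is its own namespace

In other words, it provides grouping and isolation.

KeyGroupedInternalPriorityQueue is implemented as a binary heap and organizes timers by keyGroup (which makes rescaling easier)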
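
The combination of "ordered by timestamp" and "deduplicated by (key, namespace, timestamp)" can be modeled roughly as a heap plus a set. This is a simplified sketch, not Flink's data structure: the class name is made up, key groups and persistence are left out, and keys and namespaces are assumed to be comparable values such as strings.

```python
import heapq


class TimerQueue:
    """Min-heap ordered by timestamp, with a set that drops duplicate timers."""

    def __init__(self):
        self._heap = []
        self._registered = set()

    def register(self, key, namespace, timestamp):
        timer = (timestamp, key, namespace)
        if timer in self._registered:
            return False              # same key + namespace + timestamp: ignored
        self._registered.add(timer)
        heapq.heappush(self._heap, timer)
        return True

    def advance_watermark(self, watermark, on_event_time):
        # Fire every timer whose timestamp is <= the watermark, earliest first.
        while self._heap and self._heap[0][0] <= watermark:
            timer = heapq.heappop(self._heap)
            self._registered.discard(timer)
            timestamp, key, namespace = timer
            on_event_time(key, namespace, timestamp)

    def __len__(self):
        return len(self._heap)


timers = TimerQueue()
first = timers.register("sensor_a", "w1", 5000)
duplicate = timers.register("sensor_a", "w1", 5000)
timers.register("sensor_b", "w1", 3000)
timers.register("sensor_a", "w1", 9000)

fired = []
timers.advance_watermark(5000, lambda key, ns, ts: fired.append((key, ns, ts)))
```

After the watermark reaches 5000, the timers at 3000 and 5000 have fired in timestamp order and the one at 9000 is still pending. The loop has the same shape as advanceWatermark in the excerpt above.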

3.5. ProcessingTimeService

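Firing on ProcessingTime requires a separate trigger thread, and the wrapping here is quite convoluted (focus on the idea; in open source projects like this, code-level details change too often).

First, look at StreamTask: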

class StreamTask
    // created by createTimerService (new SystemProcessingTimeService), or passed in
    TimerService timerService

    public ProcessingTimeServiceFactory getProcessingTimeServiceFactory() {
        return mailboxExecutor ->
                new ProcessingTimeServiceImpl(
                        timerService,
                        callback -> deferCallbackToMailbox(mailboxExecutor, callback));
    }

    ProcessingTimeCallback deferCallbackToMailbox(
            MailboxExecutor mailboxExecutor, ProcessingTimeCallback callback) {
        return timestamp -> {
            mailboxExecutor.execute(
                    () -> invokeProcessingTimeCallback(callback, timestamp),
                    "Timer callback for %s @ %d",
                    callback,
                    timestamp);
        };
    }

    private void invokeProcessingTimeCallback(ProcessingTimeCallback callback, long timestamp) {
        try {
            callback.onProcessingTime(timestamp);
        } catch (Throwable t) {
            handleAsyncException("Caught exception while processing timer.", new TimerException(t));
        }
    }

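This constructs the ProcessingTimeServiceImpl that is passed to InternalTimerServiceImpl.processingTimeService from Section 3.3.

The most important job of ProcessingTimeServiceImpl is to add a wrapper layer; the implementation itself delegates everything to its internal timerService member: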

// All timer operations go to the timerService member; this class adds callback wrapping and counts the registered timers
class ProcessingTimeServiceImpl implements ProcessingTimeService
    private final TimerService timerService

    // What gets registered is the wrapped callback
    // The wrapper passed in is StreamTask.deferCallbackToMailbox, which runs invokeProcessingTimeCallback -> callback.onProcessingTime inside the mailboxExecutor
    // This is what achieves single-threaded processing
    processingTimeCallbackWrapper

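The class that actually manages the timer callback thread is SystemProcessingTimeService:

// Single-threaded timerService: given a timestamp, it calls the callback when the time arrives
// registerTimer (timestamp + callback) can be called many times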
class SystemProcessingTimeService implements TimerService extends ProcessingTimeService
    ScheduledThreadPoolExecutor timerService;

    ScheduledFuture registerTimer(long timestamp, ProcessingTimeCallback callback)
        // wrapOnTimerCallback packs callback + timestamp into a ScheduledTask
        timerService.schedule(wrapOnTimerCallback(callback, timestamp), delay, TimeUnit.MILLISECONDS)

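Do not underestimate this wrapper. It is exactly this wrapping that puts timer callbacks, like record processing, onto the mailboxExecutor, which is what makes them run serially.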
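
The effect of the wrapper can be reproduced with a queue and one extra thread. This is a model of the idea only, not Flink code: `mailbox`, `defer_to_mailbox`, and the two handlers are hypothetical names, and a `threading.Timer` stands in for the scheduled executor.

```python
import queue
import threading

mailbox = queue.Queue()   # stand-in for the task's mailbox
events = []               # appended to only by the thread that drains the mailbox


def process_element(value):
    events.append(("element", value, threading.get_ident()))


def on_timer(timestamp):
    events.append(("timer", timestamp, threading.get_ident()))


def defer_to_mailbox(callback):
    """Wrap a callback so the timer thread only enqueues it and never runs it."""
    def wrapped(timestamp):
        mailbox.put(lambda: callback(timestamp))
    return wrapped


mailbox.put(lambda: process_element("a"))

# A separate thread plays the role of the scheduled executor.
timer = threading.Timer(0.01, defer_to_mailbox(on_timer), args=(1000,))
timer.start()
timer.join()              # the timer thread has now enqueued its mail

mailbox.put(lambda: process_element("b"))

# The mailbox loop: a single thread runs records and timer callbacks alike.
loop_thread = threading.get_ident()
while not mailbox.empty():
    mail = mailbox.get()
    mail()
```

The timer thread decides when the callback becomes due, but the callback body runs on the loop thread, between two records. User code in the callback and in record processing therefore never runs concurrently and needs no locking.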

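4. Summary

Timers are indispensable in data processing, and Flink put a lot of ingenuity into timer management and interfaces:

Record processing and timer callbacks are serial, mainly because both run on the mailboxExecutor
Both process functions and windows need timers (or rather, windows are implemented with timers), so timers introduce the concepts of name and namespace
InternalTimerServiceImpl manages all keys and their timers. Is the number of instances tied to the number of operators? ("user-timers" and "window-timers")
The hashCode of TimerHeapInternalTimer depends on (key, timestamp, namespace), which explains the documentation's statement that timers with the same key + timestamp are deduplicated
The timer callback function is fixed, which reduces implementation complexity
InternalTimeServiceManagerImpl caches name -> InternalTimerServiceImpl. If there are two KeyedProcessOperators, could that cause an error? Or does it never happen because they are not chained together?
Timer persistence is implemented in InternalTimerServiceImpl
SystemProcessingTimeService brings us back to timers as we naturally think of them: register, call back, and so on

I have never read much Flink code. Large projects like this are better suited to a study group in which members share regularly; that would be far more efficient. Often, after a burst of effort, I find the new version has already changed a lot and lose interest. So corrections to anything wrong in these notes are welcome.

More important than the code itself is understanding the purpose, approach, and bottlenecks of the implementation, so that when necessary you can solve problems the open source version cannot, or anticipate likely bottlenecks in your own scenario.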

Flink - Windows: Theory and Implementation

2024-06-30T06:36:47+00:00

1. Theory - Dataflow Model

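The Dataflow Model paper summarizes the processing paradigms of big data and proposes a processing model.

1.1. Window

There are two kinds of data transformation:

ParDo: 1->N, e.g. map/flatmap/filter. These operations work the same on unbounded and bounded datasets
GroupByKey: aggregation is natural on a bounded dataset; on an unbounded dataset the data never ends, so the question of when to aggregate has to be solved.

As mentioned in my earlier note on batch versus stream processing, a bounded dataset is really just a slice of an unbounded one, usually one day or one hour of data. The paper takes a more abstract view and introduces the window, which carves a bounded set out of an unbounded dataset, turning GroupByKey into GroupByKeyAndWindow.

With windows in place, two further questions need to be defined and answered:

Where in event time they are being computed: which data is computed
When in processing time they are materialized: when the data is computed

1.2. Where - assign and merge

Which window should a record belong to:

Set<Window> AssignWindows(T datum): when a record arrives, which windows it should be assigned to. Take SlidingWindow as an example:
Set<Window> MergeWindows(Set<Window> windows): typical for session windows, where the window range cannot be determined from the current record alone. Only after later data shows that a session has ended are windows merged to produce the windows to compute. Take SessionWindow as an example: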
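
Both operations can be sketched with windows as `(start, end)` tuples. This follows the paper's definitions in spirit only: the function names, the half-open `[start, end)` convention, and the restriction to non-negative timestamps are assumptions of the sketch.

```python
def assign_sliding_windows(timestamp, size, slide):
    """Every [start, end) window of length `size` that contains the timestamp."""
    windows = []
    start = timestamp - timestamp % slide   # latest window start <= timestamp
    while start > timestamp - size:
        windows.append((start, start + size))
        start -= slide
    return windows


def merge_session_windows(windows):
    """Merge windows that overlap or touch into one session window each."""
    merged = []
    for start, end in sorted(windows):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Sliding: size 10, slide 5; timestamp 7 falls into two windows.
sliding = assign_sliding_windows(7, size=10, slide=5)

# Session: each record first gets its own window [ts, ts + gap), then they merge.
GAP = 30
proto_windows = [(ts, ts + GAP) for ts in (10, 25, 100)]
sessions = merge_session_windows(proto_windows)
```

With a sliding window one record maps to size / slide windows, whereas with sessions the final window is only known after merging: records at 10 and 25 collapse into one session, and the record at 100 starts a new one.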

1.3. When - triggers and incremental processing

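When does a window end so that computation over its data can start? A watermark that advances too fast or too slowly both cause problems.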

A useful insight in addressing the completeness problem is that the Lambda Architecture effectively sidesteps the issue…

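The lesson from the Lambda Architecture: perhaps fire as early as possible, while still being able to process data that arrives later (late data for an already-triggered window), to reach eventual consistency.

Note that watermarks come up a lot here, but windows are not tied to watermarks, e.g. CountWindow, SessionWindow, and so on.

2. Implement - Flink

Flink's semantics are largely consistent with the definitions in the paper. This section covers the code (examples and source both use version 1.14).

2.1. Example

An example that uses a window to compute the minimum temperature within 5s windows: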

  // case class SensorReading(id: String, temperature: Double, eventTime: Long = -1L)

  env.fromElements(
      SensorReading("sensor_a", 2.1, 1000L),
      SensorReading("sensor_a", 1.1, 2000L),
      SensorReading("sensor_b", 0.1, 2500L),
      SensorReading("sensor_a", 3.6, 3000L),
      SensorReading("sensor_a", 4.5, 4000L),
      SensorReading("sensor_a", 5.0, 5000L),
    ).assignAscendingTimestamps(_.eventTime)
    .keyBy(_.id)
    .window(TumblingEventTimeWindows.of(Time.seconds(5)))
    .process(new ProcessWindowFunction[SensorReading, (String, Double, Long), String, TimeWindow] {
      override def process(key: String, context: Context, elements: Iterable[SensorReading], out: Collector[(String, Double, Long)]): Unit = {
        println(s"now we process with key: $key and window: ${context.window}")
        println(s"elements: \n\t${elements.toList.mkString("\n\t")}")
        out.collect(key, elements.minBy(_.temperature).temperature, context.window.getEnd)
      }
    })
    //    .reduce((a, b) => SensorReading(a.id, a.temperature.min(b.temperature), 0L))
    .print("min temperature ")

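SensorReading's third field is used as the event time, the window type is a tumbling window with a 5s period, and process computes the minimum temperature per window. To keep the example simple, all but one record share the id sensor_a.

Output: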

now we process with key: sensor_a and window: TimeWindow{start=0, end=5000}
elements: 
    SensorReading(sensor_a,2.1,1000)
    SensorReading(sensor_a,1.1,2000)
    SensorReading(sensor_a,3.6,3000)
    SensorReading(sensor_a,4.5,4000)
min temperature > (sensor_a,1.1,5000)
now we process with key: sensor_b and window: TimeWindow{start=0, end=5000}
elements: 
    SensorReading(sensor_b,0.1,2500)
min temperature > (sensor_b,0.1,5000)
now we process with key: sensor_a and window: TimeWindow{start=5000, end=10000}
elements: 
    SensorReading(sensor_a,5.0,5000)
min temperature > (sensor_a,5.0,10000)

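In Flink's implementation the actual flow consists of several parts:
Note: the code uses a tumbling window, so each record belongs to exactly 1 window

They are: timestamp/watermark generation, WindowAssigner, Trigger, Evictor, and Process. By combining them we can build very complex windows and computations. Flink also ships a number of built-in implementations, which are valuable both to use and as source code to learn from.

Let's take these parts one at a time.

2.2. TimeDomains and Watermark

Time in data processing comes in two kinds, EventTime and ProcessingTime; later we will see how they are handled differently.

EventTime can be generated at two points: at the source or during processing.
SourceFunction is easy to understand, but I rarely use it in practice (EventTime is tied to the data and seldom to the source; Flink officially recommends this form, which puzzles me)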
public interface SourceFunction<T> extends Function, Serializable {

    interface SourceContext<T> {
        void collectWithTimestamp(T element, long timestamp);

        void emitWatermark(Watermark mark);

        // ...
    }
}

The second option is during processing, e.g. assignAscendingTimestamps in the example of the previous subsection, where extractAscendingTimestamp extracts the timestamp:

  def assignAscendingTimestamps(extractor: T => Long): DataStream[T] = {
    val cleanExtractor = clean(extractor)
    val extractorFunction = new AscendingTimestampExtractor[T] {
      def extractAscendingTimestamp(element: T): Long = {
        cleanExtractor(element)
      }
    }
    asScalaStream(stream.assignTimestampsAndWatermarks(extractorFunction))
  }

AscendingTimestampExtractor has a built-in watermark generation strategy:

@Deprecated
@PublicEvolving
public abstract class AscendingTimestampExtractor<T> implements AssignerWithPeriodicWatermarks<T> {
    @Override
    public final long extractTimestamp(T element, long elementPrevTimestamp) {
        final long newTimestamp = extractAscendingTimestamp(element);
        if (newTimestamp >= this.currentTimestamp) {
            this.currentTimestamp = newTimestamp;
            return newTimestamp;
        } else {
            violationHandler.handleViolation(newTimestamp, this.currentTimestamp);
            return newTimestamp;
        }
    }

    @Override
    public final Watermark getCurrentWatermark() {
        return new Watermark(
                currentTimestamp == Long.MIN_VALUE ? Long.MIN_VALUE : currentTimestamp - 1);
    }
}

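As we can see, since timestamps are ascending, the current watermark can be defined as the latest event time (ms) - 1.

Flink provides the commonly used WatermarkStrategy.forMonotonousTimestamps and WatermarkStrategy.forBoundedOutOfOrderness, for monotonically increasing timestamps and for a bounded maximum delay respectively. The built-in implementations and interfaces change a lot between versions, but in essence they do two things: assign timestamps to records, and emit watermarks at the right moments.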
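
Both strategies boil down to "largest timestamp seen so far, minus the allowed delay, minus 1 ms", with a delay of 0 for ascending timestamps. The class below is a sketch of that rule, not Flink's class: its name and methods are invented, and periodic emission is left out.

```python
class BoundedOutOfOrderness:
    """Watermark = max timestamp seen - max_delay_ms - 1."""

    def __init__(self, max_delay_ms):
        self.max_delay_ms = max_delay_ms
        self.max_timestamp = None

    def on_event(self, timestamp):
        if self.max_timestamp is None or timestamp > self.max_timestamp:
            self.max_timestamp = timestamp

    def current_watermark(self):
        if self.max_timestamp is None:
            return None   # nothing seen yet, so no watermark can be derived
        return self.max_timestamp - self.max_delay_ms - 1

    def is_late(self, timestamp):
        watermark = self.current_watermark()
        return watermark is not None and timestamp <= watermark


# Ascending timestamps are the special case with zero delay.
ascending = BoundedOutOfOrderness(max_delay_ms=0)
for ts in (1000, 2000):
    ascending.on_event(ts)
ascending_watermark = ascending.current_watermark()

# Up to 500 ms of disorder is tolerated here.
bounded = BoundedOutOfOrderness(max_delay_ms=500)
for ts in (1000, 3000, 2600):
    bounded.on_event(ts)
bounded_watermark = bounded.current_watermark()
```

With a 500 ms bound the out-of-order record at 2600 is still ahead of the watermark (2499), while a record at 2400 would count as late. A larger bound tolerates more disorder at the price of later results.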

2.3. WindowAssigner

TumblingEventTimeWindows in the example is one kind of WindowAssigner. Focus on the implementation of assignWindows, which returns a single window with its start time determined:

public class TumblingEventTimeWindows extends WindowAssigner<Object, TimeWindow> {
    private static final long serialVersionUID = 1L;

    private final long size;

    private final long globalOffset;

    private Long staggerOffset = null;

    private final WindowStagger windowStagger;

    protected TumblingEventTimeWindows(long size, long offset, WindowStagger windowStagger) {
        if (Math.abs(offset) >= size) {
            throw new IllegalArgumentException(
                    "TumblingEventTimeWindows parameters must satisfy abs(offset) < size");
        }

        this.size = size;
        this.globalOffset = offset;
        this.windowStagger = windowStagger;
    }

    @Override
    public Collection<TimeWindow> assignWindows(
            Object element, long timestamp, WindowAssignerContext context) {
        if (timestamp > Long.MIN_VALUE) {
            if (staggerOffset == null) {
                staggerOffset =
                        windowStagger.getStaggerOffset(context.getCurrentProcessingTime(), size);
            }
            // Long.MIN_VALUE is currently assigned when no timestamp is present
            long start =
                    TimeWindow.getWindowStartWithOffset(
                            timestamp, (globalOffset + staggerOffset) % size, size);
            return Collections.singletonList(new TimeWindow(start, start + size));
        } else {
            throw new RuntimeException(
                    "Record has Long.MIN_VALUE timestamp (= no timestamp marker). "
                            + "Is the time characteristic set to 'ProcessingTime', or did you forget to call "
                            + "'DataStream.assignTimestampsAndWatermarks(...)'?");
        }
    }

    @Override
    public Trigger<Object, TimeWindow> getDefaultTrigger(StreamExecutionEnvironment env) {
        return EventTimeTrigger.create();
    }
}
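
The window start comes from modular arithmetic on the timestamp. The sketch below mirrors the formula that TimeWindow.getWindowStartWithOffset uses in this Flink version as far as I understand it; treat it as an assumption and restrict it to non-negative timestamps, since Python and Java differ in how `%` treats negative numbers.

```python
def window_start_with_offset(timestamp, offset, size):
    """Start of the tumbling window of length `size` that contains `timestamp`."""
    return timestamp - (timestamp - offset + size) % size


def assign_tumbling_window(timestamp, size, offset=0):
    start = window_start_with_offset(timestamp, offset, size)
    return (start, start + size)   # [start, end); the last timestamp inside is end - 1


SIZE_MS = 5000
assigned = [assign_tumbling_window(ts, SIZE_MS) for ts in (1000, 2000, 4000, 5000)]
shifted = assign_tumbling_window(5000, SIZE_MS, offset=1000)
```

This matches the output of the example above: timestamps 1000 to 4000 land in [0, 5000) and 5000 opens [5000, 10000). With an offset of 1000 ms the same record would fall into [1000, 6000) instead.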

To see the mergeWindows mentioned earlier, look at the source of EventTimeSessionWindows.withGap/withDynamicGap and ProcessingTimeSessionWindows.withGap/withDynamicGap.

2.4. Trigger

The example does not specify a trigger, so the default one is used:

    public WindowedStream(KeyedStream<T, K> input, WindowAssigner<? super T, W> windowAssigner) {

        this.input = input;

        this.builder =
                new WindowOperatorBuilder<>(
                        windowAssigner,
                        windowAssigner.getDefaultTrigger(input.getExecutionEnvironment()),
                        input.getExecutionConfig(),
                        input.getType(),
                        input.getKeySelector(),
                        input.getKeyType());
    }

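windowAssigner is the TumblingEventTimeWindows from the previous subsection, whose getDefaultTrigger was shown above: it returns an EventTimeTrigger (windowAssigner and trigger are decoupled; EventTimeSessionWindows uses this trigger too).

In EventTimeTrigger's implementation, focus on:

onElement: called for every record added to the window. Possible results are FIRE, CONTINUE, PURGE, and FIRE_AND_PURGE. Here, if the watermark has already passed the end of the window it returns FIRE; otherwise it registers an event-time timer for the window's end time (window.maxTimestamp()) and returns CONTINUE
onEventTime: the callback invoked when a registered event-time timer fires
clear: cleanup callback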

public class EventTimeTrigger extends Trigger<Object, TimeWindow> {
    private static final long serialVersionUID = 1L;

    private EventTimeTrigger() {}

    @Override
    public TriggerResult onElement(
            Object element, long timestamp, TimeWindow window, TriggerContext ctx)
            throws Exception {
        if (window.maxTimestamp() <= ctx.getCurrentWatermark()) {
            // if the watermark is already past the window fire immediately
            return TriggerResult.FIRE;
        } else {
            ctx.registerEventTimeTimer(window.maxTimestamp());
            return TriggerResult.CONTINUE;
        }
    }

    @Override
    public TriggerResult onEventTime(long time, TimeWindow window, TriggerContext ctx) {
        return time == window.maxTimestamp() ? TriggerResult.FIRE : TriggerResult.CONTINUE;
    }

    @Override
    public void clear(TimeWindow window, TriggerContext ctx) throws Exception {
        ctx.deleteEventTimeTimer(window.maxTimestamp());
    }
}

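When I first saw this, the official explanation of FIRE and PURGE puzzled me: the code above never returns PURGE, so is the data never cleaned up? That is not the case; when the window's period ends, the data is cleaned up as well.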
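
The difference between FIRE and FIRE_AND_PURGE is easiest to see by replaying one window by hand. The simulation below is a deliberately simplified model, not the WindowOperator: it handles a single key and window, fires early once `max_count` elements have arrived, fires again at the end of the window, and ignores allowed lateness.

```python
def run_window(events, max_count, purge_on_count):
    """Replay (kind, value) events for one window and return every emitted pane."""
    buffer, emitted, count = [], [], 0
    for kind, value in events:
        if kind == "element":
            buffer.append(value)
            count += 1
            if count >= max_count:        # early firing on count
                emitted.append(list(buffer))
                count = 0
                if purge_on_count:        # FIRE_AND_PURGE drops the contents
                    buffer.clear()
        elif kind == "window_end":        # timer at the end of the window: FIRE
            if buffer:
                emitted.append(list(buffer))
            buffer.clear()                # window state is cleaned up at the end
    return emitted


events = [("element", 1), ("element", 2), ("element", 3),
          ("element", 4), ("window_end", None)]

fire_only = run_window(events, max_count=3, purge_on_count=False)
fire_and_purge = run_window(events, max_count=3, purge_on_count=True)
```

With FIRE the first three elements are evaluated twice, once at the early firing and once at the end of the window. With FIRE_AND_PURGE each element is evaluated exactly once. In both cases the buffer is empty after the window ends, which is the cleanup described above.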

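We can also write a custom trigger that fires on either the number of elements or EventTime, to observe the call stack and the meaning of the enum values (full example):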

class EventTimeAndCountTrigger(maxCount: Long = 3) extends Trigger[Any, TimeWindow] {
  val logger: Logger = LoggerFactory.getLogger(this.getClass)
  val curCountDescriptor = new ReducingStateDescriptor[Long]("counter", (a, b) => a + b , classOf[Long])

  override def onElement(t: Any, l: Long, w: TimeWindow, triggerContext: Trigger.TriggerContext): TriggerResult = {
    val curCount = triggerContext.getPartitionedState(curCountDescriptor)
    curCount.add(1L)
    val result = if (curCount.get() >= maxCount || w.maxTimestamp <= triggerContext.getCurrentWatermark) {
      curCount.clear()
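      // Suppose the trigger fires early, inside the window's time range, because maxCount was reached:
      // FIRE: the elements received so far are computed on the maxCount firing; when maxTimestamp is reached, the same elements are computed again
      // FIRE_AND_PURGE: the elements received so far are computed on the maxCount firing; when maxTimestamp is reached, they are not computed again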
//      TriggerResult.FIRE
      TriggerResult.FIRE_AND_PURGE
    } else {
      triggerContext.registerEventTimeTimer(w.maxTimestamp)
      TriggerResult.CONTINUE
    }

    logger.info(s"onElement t:${t} l:${l} w:${w} ${Integer.toHexString(System.identityHashCode(w))} result:${result}")

    result
  }

  override def onProcessingTime(l: Long, w: TimeWindow, triggerContext: Trigger.TriggerContext): TriggerResult = {
    logger.info(s"onProcessingTime l:${l} w:${w} ${Integer.toHexString(System.identityHashCode(w))}")
    TriggerResult.CONTINUE
  }

  override def onEventTime(l: Long, w: TimeWindow, triggerContext: Trigger.TriggerContext): TriggerResult = {
    val stack = Thread.currentThread().getStackTrace.map(_.toString)
      .mkString("\n\t")
    logger.info(s"stack:\n${stack}")

    val result = if (l == w.maxTimestamp) TriggerResult.FIRE
    else TriggerResult.CONTINUE
    logger.info(s"onEventTime l:${l} w:${w} ${Integer.toHexString(System.identityHashCode(w))} result:${result}")

    result
  }

  override def clear(w: TimeWindow, triggerContext: Trigger.TriggerContext): Unit = {
    logger.info(s"clear w:${w} ${Integer.toHexString(System.identityHashCode(w))}")

    triggerContext.deleteEventTimeTimer(w.maxTimestamp)
    triggerContext.getPartitionedState(curCountDescriptor).clear()
  }
}

For example, the trigger of KeyedStream.countWindow uses FIRE_AND_PURGE (via PurgingTrigger) to clear the data:

class KeyedStream
    public WindowedStream<T, KEY, GlobalWindow> countWindow(long size) {
        return window(GlobalWindows.create()).trigger(PurgingTrigger.of(CountTrigger.of(size)));
    }

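onElement, onProcessingTime and onEventTime correspond to different ways of triggering a window. They can be combined so that a window fires on event time, processing time, the data itself, or any mix of these.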

2.5. Evictor

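Evictor is Flink's own addition. It removes elements from a window before and/or after the window function is applied.

Note, however, that for methods such as reduce/sum/min applied to a window, Flink optimizes storage: it keeps only the aggregated value rather than all the raw elements.

An Evictor semantically needs all elements to be retained, so watch out for oversized state.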

2.6. ReduceFunction/AggregateFunction/ProcessWindowFunction

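The contents of a window can be handed to one of these functions:

ReduceFunction/AggregateFunction: store and emit only the aggregated result; the function is invoked as each element arrives
ProcessWindowFunction: stores all elements; the function is invoked after the window's trigger fires and receives every element in the window, with access to the window's metadata

The two kinds can also be combined, which gives incremental aggregation together with access to window metadata, for example:

.reduce(reduceFunction, new ProcessWindowFunction{...})
.aggregate(aggregateFunction, new ProcessWindowFunction{...})

This part is simple to implement. Some test cases are in Bigdata-Systems, so I will not go into more detail.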

3. Source - WindowOperator

The operator underlying the .window method is WindowOperator. The call stack in Flink is:

    flowchart TB
    A("StreamTask.processInput") --> B["StreamOneInputProcessor.processInput"] --> C("AbstractStreamTaskNetworkInput.emitNext") --> D(" AbstractStreamTaskNetworkInput.processElement")
    D --> E("OneInputStreamTask$StreamTaskNetworkOutput.emitRecord") --> F("WindowOperator.processElement")

The entry point for processing elements is WindowOperator.processElement

class WindowOperator {
    private transient InternalAppendingState<K, W, IN, ACC, ACC> windowState;

    @Override
    public void processElement(StreamRecord<IN> element) throws Exception {
        // hand the element to the windowAssigner, which returns the N windows it belongs to
        final Collection<W> elementWindows =
                windowAssigner.assignWindows(
                        element.getValue(), element.getTimestamp(), windowAssignerContext);

        // if element is handled by none of assigned elementWindows
        boolean isSkippedElement = true;

        // the current key, i.e. the partition key specified by keyBy
        final K key = this.<K>getKeyedStateBackend().getCurrentKey();
        logger.info("YING element:{} getCurrentKey:{}", element, key);

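        // handling of merging windows; note that being a MergingWindowAssigner has no necessary relation to the number of elementWindows
        if (windowAssigner instanceof MergingWindowAssigner) {
            // get the merged (larger) window
            W stateWindow = mergingWindows.getStateWindow(actualWindow);
            // the rest of the handling is very similar to the else branch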
        } else {
            for (W window : elementWindows) {

                // set the namespace of windowState and add the element; this way every window (which includes the key) has its own windowState
                windowState.setCurrentNamespace(window); 
                windowState.add(element.getValue());

                triggerContext.key = key;
                triggerContext.window = window;

                // call the trigger and decide from its result whether to fire the computation
                TriggerResult triggerResult = triggerContext.onElement(element);

                if (triggerResult.isFire()) {
                    ACC contents = windowState.get();
                    if (contents == null) {
                        continue;
                    }
                    // process the contents of the window
                    emitWindowContents(window, contents);
                }

                if (triggerResult.isPurge()) {
                    windowState.clear();
                }
                registerCleanupTimer(window);
            }
        }

        // ...
    }

    public void onEventTime(InternalTimer<K, W> timer) throws Exception {
        // ...
        // call the built-in or custom trigger.onEventTime and decide from its result whether to fire the computation
        TriggerResult triggerResult = triggerContext.onEventTime(timer.getTimestamp());

        if (triggerResult.isFire()) {
            ACC contents = windowState.get();
            if (contents != null) {
                emitWindowContents(triggerContext.window, contents);
            }
        }
        // ...
    }

    public void onProcessingTime(InternalTimer<K, W> timer) throws Exception {
        // ...
        // call the built-in or custom trigger.onProcessingTime and decide from its result whether to fire the computation
        TriggerResult triggerResult = triggerContext.onProcessingTime(timer.getTimestamp());

        if (triggerResult.isFire()) {
            ACC contents = windowState.get();
            if (contents != null) {
                emitWindowContents(triggerContext.window, contents);
            }
        }
        // ...
    }

    private void emitWindowContents(W window, ACC contents) throws Exception {
        timestampedCollector.setAbsoluteTimestamp(window.maxTimestamp());
        processContext.window = window;

        // call the user-implemented ProcessWindowFunction
        userFunction.process(
                triggerContext.key, window, processContext, contents, timestampedCollector);
    }
}

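emitWindowContents is where the user function is actually invoked. It can be reached from 3 entry points: processElement, onEventTime and onProcessingTime.

onEventTime is very similar to processElement: both are driven by incoming input (records or watermarks) and both start from StreamTask.processInput: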

    flowchart TB
    A("StreamTask.processInput") --> B["StreamOneInputProcessor.processInput"] --> C("AbstractStreamTaskNetworkInput.emitNext") --> D(" AbstractStreamTaskNetworkInput.processElement")
    D --> E("StatusWatermarkValve.inputWatermark") --> F("StatusWatermarkValve.findAndOutputNewMinWatermarkAcrossAlignedChannels") --> G("OneInputStreamTask$StreamTaskNetworkOutput.emitWatermark") --> H("AbstractStreamOperator.processWatermark") --> I("InternalTimeServiceManagerImpl.advanceWatermark") --> J("InternalTimerServiceImpl.advanceWatermark") --> K("WindowOperator.onEventTime") --> L("WindowOperator$Context.onEventTime")

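onProcessingTime is slightly different. Because it is not driven by incoming data, it needs a separate thread (a ScheduledThreadPoolExecutor) to fire it; the implementation is in SystemProcessingTimeService.

This code also matches the flow shown in the figure in section 2.1.

A few more takeaways:

The window itself only holds metadata. The data lives in state, which also stores the key, keyGroupRange, serializer and so on
If an element is assigned to multiple windows, it is copied into each of them. A sliding window with size=1day and slide=1second therefore produces very large state (each element is stored 86400 times)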
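The 86400 figure can be checked with a small sketch of sliding-window assignment in plain Java. The loop mirrors the idea behind a sliding window assigner (walk back from the latest window start in steps of `slide`); `SlidingWindowCopies` and `windowStartsFor` are hypothetical names, and the sketch assumes a zero window offset.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java sketch of sliding-window assignment; names are made up and the offset is assumed to be 0. */
final class SlidingWindowCopies {

    /** Start timestamps of every window [start, start + size) that contains the given timestamp. */
    static List<Long> windowStartsFor(long timestampMs, long sizeMs, long slideMs) {
        List<Long> starts = new ArrayList<>();
        // latest window start that is not after the timestamp
        long lastStart = timestampMs - Math.floorMod(timestampMs, slideMs);
        for (long start = lastStart; start > timestampMs - sizeMs; start -= slideMs) {
            starts.add(start);
        }
        return starts;
    }

    public static void main(String[] args) {
        long oneDayMs = 24L * 60 * 60 * 1000;
        long oneSecondMs = 1000L;
        int copies = windowStartsFor(1_700_000_000_123L, oneDayMs, oneSecondMs).size();
        System.out.println("windows per element: " + copies);
    }
}
```

Each window start corresponds to one namespace in window state, so the number of starts is the number of copies kept for a single element unless the window function aggregates incrementally.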

4. Timer

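Having covered Window, I want to close with a few words on Timer.

ProcessFunction introduces timers. Taking KeyedProcessFunction as an example:

ctx.timerService gives access to the TimerService, which can register timers (event time or processing time) and return the current processing time, the current watermark, and so on
The onTimer method can be implemented; it is called when a registered timer fires
processElement and onTimer are never called concurrently, so there is no need to worry about synchronization. This also means that the logic in onTimer blocks the processing of elements.

KeyedProcessFunction can therefore be used to build window-like behavior, for example summing the values of each key over a period of time: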

object UseTimerAsWindowApp extends App {
  val env = StreamExecutionEnvironment.getExecutionEnvironment

  val sourceStream = SourceUtils.generateLocalSensorReadingStream(env)
    .assignAscendingTimestamps(r => r.eventTime)

  useTimerAsWindow(sourceStream)

  env.execute("UseTimerAsWindowApp")

  private def useTimerAsWindow(sourceStream: DataStream[SensorReading]): Unit = {
    sourceStream.keyBy(_.id)
      .process(new OneMinuteWindowProcessFunction)
  }

  private class OneMinuteWindowProcessFunction extends KeyedProcessFunction[String, SensorReading, String] {
    val logger = LoggerFactory.getLogger(classOf[OneMinuteWindowProcessFunction])

    private lazy val sumState = getRuntimeContext.getState(new ValueStateDescriptor[Double]("sum", classOf[Double]))
    override def processElement(value: SensorReading, ctx: KeyedProcessFunction[String, SensorReading, String]#Context, out: Collector[String]): Unit = {
      logger.info(s"processElement: ${value} i: ${value.id} timestamp:${ctx.timestamp()} currentProcessingTime:${ctx.timerService().currentProcessingTime()} currentWatermark:${ctx.timerService().currentWatermark()} getCurrentKey:${ctx.getCurrentKey}")

      if (sumState.value() == 0) {
        if (value.id.equals("sensor_a")) {
          val windowEndTimer: Long = (ctx.timestamp() / 60000L + 1) * 60000L - 1
          logger.info(s"register windowEndTimer: $windowEndTimer")
          ctx.timerService().registerEventTimeTimer(windowEndTimer)
        } else {
          val windowEndTimer: Long = (ctx.timestamp() / 120000L + 1) * 120000L - 1
          logger.info(s"register windowEndTimer: $windowEndTimer")
          ctx.timerService().registerEventTimeTimer(windowEndTimer)
        }

        sumState.update(value.temperature)
      } else {
        sumState.update(sumState.value() + value.temperature)
      }
    }

    override def onTimer(timestamp: Long, ctx: KeyedProcessFunction[String, SensorReading, String]#OnTimerContext, out: Collector[String]): Unit = {
      logger.info(s"collect getCurrentKey:${ctx.getCurrentKey} sumState.value:${sumState.value()} ctx.timestamp:${ctx.timestamp()} timestamp:${timestamp}")
      out.collect(s"${ctx.getCurrentKey} ${sumState.value()}")

      sumState.clear()
    }
  }
}

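With custom timers you can design more flexible logic, such as a different aggregation period per key, or a period chosen from the key's value. Alibaba Cloud's article on best practices for DataStream timers also mentions using them to emit heartbeats when no data arrives.
That said, I find this kind of cloud-vendor documentation not rigorous enough. The example happens to echo a concern raised in the paper: when no data arrives, is the upstream broken or is there genuinely no data? Should we emit a heartbeat as soon as possible to trigger computation, or keep waiting for the watermark? Should the heartbeat be sent by the source, or can it be sent from the processing function? These are all design questions worth further thought.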

5. Summary

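Flink's windows follow Google's The Dataflow Model in both design and semantics. The implementation relies on Timer, and the window mechanism is decoupled from time and from the type of time. By combining the stages of a window you can build complex business logic. For even more complex scenarios you can use timers directly, but then you have to take more care with state handling.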

What I cannot create, I do not understand - Reading 《深度学习入门》 (Deep Learning from Scratch)

2024-06-22T01:31:50+00:00

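Although it is an introductory book, I still read it slowly, spending a dozen or so hours in total. While reading, I could not recall a thing about matrix multiplication and rank from college linear algebra, gradient descent from optimization theory, or wavelet analysis 🙈.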

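1. Perceptron

The idea of a perceptron is straightforward:

\[y = \begin{cases} 0 & (w_1 x_1 + w_2 x_2 \le \theta) \\ 1 & (w_1 x_1 + w_2 x_2 > \theta) \end{cases}\]

Here x is the input, w is the weight, and θ is the threshold. y is the output: 1 when the weighted sum exceeds the threshold, 0 otherwise.

By choosing different weights and thresholds, a perceptron can implement an AND gate (outputs 1 only when both inputs are 1, otherwise 0), a NAND gate (the opposite of AND), and an OR gate (outputs 1 when at least one input is 1).

An XOR gate, however, cannot be separated by a single linear boundary; it needs a nonlinear function or a multi-layer perceptron.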
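The gates above can be sketched in a few lines of Java. The specific weights and thresholds (0.5, 0.7, 0.2 and their negatives) are one possible choice assumed for illustration; many other values work equally well. XOR is built as a two-layer combination of NAND, OR and AND.

```java
/** Perceptron logic gates; the weights and thresholds are one illustrative choice among many. */
final class PerceptronGates {

    /** Outputs 1 when the weighted sum w1*x1 + w2*x2 exceeds the threshold theta, else 0. */
    static int perceptron(int x1, int x2, double w1, double w2, double theta) {
        return (w1 * x1 + w2 * x2 > theta) ? 1 : 0;
    }

    static int and(int x1, int x2) {
        return perceptron(x1, x2, 0.5, 0.5, 0.7);
    }

    static int nand(int x1, int x2) {
        return perceptron(x1, x2, -0.5, -0.5, -0.7);
    }

    static int or(int x1, int x2) {
        return perceptron(x1, x2, 0.5, 0.5, 0.2);
    }

    /** Two layers: the first computes NAND and OR, the second combines them with AND. */
    static int xor(int x1, int x2) {
        return and(nand(x1, x2), or(x1, x2));
    }

    public static void main(String[] args) {
        int[][] inputs = {{0, 0}, {0, 1}, {1, 0}, {1, 1}};
        for (int[] in : inputs) {
            System.out.printf("x=(%d,%d) and=%d nand=%d or=%d xor=%d%n",
                    in[0], in[1], and(in[0], in[1]), nand(in[0], in[1]), or(in[0], in[1]), xor(in[0], in[1]));
        }
    }
}
```

No single choice of (w1, w2, theta) makes `perceptron` behave like `xor`, which is why the extra layer is needed.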

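Note: at this point I was reminded of a passage in the book 《智慧的疆界》:

At the end of his book, Minsky gave his assessment and conclusion on multi-layer perceptrons: "Studying perceptrons with two or more layers is of no value." As a result, before anyone had the chance to explore multi-layer perceptrons in depth, Minsky had effectively sentenced them to death.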

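2. Neural Networks

Implementing an AND gate requires setting the weights by hand. Compared with a perceptron, the core goal of a neural network is to learn the weights automatically.

This introduces several concepts:
Activation function: y=σ(W⋅x+b) gives the output of a neuron; a common choice is the Sigmoid function. Compared with the step function, it has the same output range on the y axis and the same overall shape. The difference is that it is smooth, so a small change in the weights also changes the function value.
Definition of the Sigmoid function: $\sigma(x) = \frac{1}{1 + \exp(-x)}$
The activation function of the output layer depends on the nature of the problem. In general, regression uses the identity function, binary classification uses sigmoid, and multi-class classification uses softmax.

Training data (supervised data) and test data: training data is used for learning, that is, finding the best parameters; test data is used to evaluate what the trained model can actually do, with the aim of a model that generalizes.

Loss function: the metric used when a neural network learns is called the loss function. Any function can serve as one, but mean squared error and cross-entropy error are the usual choices. Take handwritten digit recognition: why not use recognition accuracy directly as the metric? Because accuracy barely changes under small parameter changes, so its derivative is 0 almost everywhere and fine-tuning would have no effect. The idea is similar to the one behind the activation function.

Gradient method: to lower the loss step by step, stochastic gradient descent (SGD) is used to search for the minimum (or a value as small as possible), updating the weight parameters along the way. The gradient method is a bit like a greedy algorithm: the negative gradient is the direction in which the function value decreases fastest at the current point, so each iteration reduces the value locally. There is no guarantee, however, that this direction points to the function's minimum or is the direction we should really move in. In complex functions, the direction indicated by the gradient is in fact rarely where the minimum lies.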

\[\mathbf{W} \leftarrow \mathbf{W} - \eta \frac{\partial L}{\partial \mathbf{W}}\]

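Here $\mathbf{W}$ is the weight parameter being updated, $\frac{\partial L}{\partial \mathbf{W}}$ is the gradient of the loss function with respect to $\mathbf{W}$, and $\eta$ is the learning rate. The value of the learning rate matters: too small and learning takes too long; too large and learning diverges and fails. Typical values are 0.01 or 0.001. Repeating this update is what searches for the "minimum".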
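The update rule can be tried out on a toy function. The sketch below is an assumption-laden illustration rather than anything from the book's code: it minimizes f(x0, x1) = x0^2 + x1^2 with a central-difference numerical gradient, and the names `GradientDescentSketch`, `numericalGradient` and `gradientDescent`, the step size h = 1e-4 and the starting point are all chosen for the example.

```java
import java.util.function.ToDoubleFunction;

/** Gradient descent with a numerical gradient on a toy function; all names and constants are illustrative. */
final class GradientDescentSketch {

    /** Central-difference approximation of the gradient of f at x. */
    static double[] numericalGradient(ToDoubleFunction<double[]> f, double[] x) {
        double h = 1e-4;
        double[] grad = new double[x.length];
        for (int i = 0; i < x.length; i++) {
            double original = x[i];
            x[i] = original + h;
            double fPlus = f.applyAsDouble(x);
            x[i] = original - h;
            double fMinus = f.applyAsDouble(x);
            grad[i] = (fPlus - fMinus) / (2 * h);
            x[i] = original; // restore the coordinate before moving on
        }
        return grad;
    }

    /** Repeats x <- x - learningRate * grad(f)(x) for the given number of steps. */
    static double[] gradientDescent(ToDoubleFunction<double[]> f, double[] init, double learningRate, int steps) {
        double[] x = init.clone();
        for (int step = 0; step < steps; step++) {
            double[] grad = numericalGradient(f, x);
            for (int i = 0; i < x.length; i++) {
                x[i] -= learningRate * grad[i];
            }
        }
        return x;
    }

    public static void main(String[] args) {
        ToDoubleFunction<double[]> f = x -> x[0] * x[0] + x[1] * x[1];
        double[] start = {-3.0, 4.0};
        double[] good = gradientDescent(f, start, 0.1, 100);
        double[] tooSmall = gradientDescent(f, start, 1e-10, 100);
        System.out.println("lr=0.1   -> " + good[0] + ", " + good[1]);
        System.out.println("lr=1e-10 -> " + tooSmall[0] + ", " + tooSmall[1]);
    }
}
```

With a reasonable learning rate the point ends up very close to (0, 0); with a tiny learning rate it barely moves from the starting point within the same number of steps.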

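3. Error Backpropagation

Computing gradients by numerical differentiation is simple but slow, whereas error backpropagation is very efficient.

Chain rule: if a function is a composite function, its derivative can be expressed as the product of the derivatives of the functions that compose it.
Backpropagation: work backward from the output to the input, computing how much each input affects the output.

The backward-propagation formula for sigmoid: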
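A sketch of the derivation, using the sigmoid definition from section 2 and writing $y = \sigma(x)$ for the forward output and $L$ for the loss (the intermediate step is filled in here and is the standard one, not copied from the post):

\[\frac{\partial y}{\partial x} = \frac{\exp(-x)}{(1 + \exp(-x))^2} = y(1 - y)\]

\[\frac{\partial L}{\partial x} = \frac{\partial L}{\partial y} \cdot y(1 - y)\]

The backward pass therefore only needs the upstream gradient $\frac{\partial L}{\partial y}$ and the output $y$ saved during the forward pass.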

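The learning steps of a neural network:

mini-batch: randomly select a subset of the training data
Compute gradients: compute the gradient of the loss function with respect to each weight parameter
Update parameters: move the weight parameters a small step in the direction that reduces the loss (the negative gradient)
Repeat steps 1, 2 and 3

Compared with numerical differentiation, error backpropagation computes the gradients much more efficiently.

Note: I did not fully understand this section. Overall, my impression is that after a forward pass, the deviation of the result and the derivative at each node are used to work out which weights should go down or up, which is why an implementation also has to record the values produced by the forward pass.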

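4. Improving on SGD

On a function whose surface is stretched out in one direction, SGD is inefficient: the search moves toward the minimum at (0, 0) along a zigzag path.

To address this weakness of SGD, Momentum introduces the notion of velocity, AdaGrad introduces learning-rate decay, and Adam combines the two.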
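As a sketch, the first two update rules are usually written as follows, reusing $\mathbf{W}$, $\eta$ and $\frac{\partial L}{\partial \mathbf{W}}$ from section 2. The extra symbols are assumptions of this sketch: $\mathbf{v}$ is the velocity, $\alpha$ is the momentum coefficient (often around 0.9), $\mathbf{h}$ accumulates squared gradients, and $\odot$ is element-wise multiplication.

Momentum:

\[\mathbf{v} \leftarrow \alpha \mathbf{v} - \eta \frac{\partial L}{\partial \mathbf{W}}, \qquad \mathbf{W} \leftarrow \mathbf{W} + \mathbf{v}\]

AdaGrad:

\[\mathbf{h} \leftarrow \mathbf{h} + \frac{\partial L}{\partial \mathbf{W}} \odot \frac{\partial L}{\partial \mathbf{W}}, \qquad \mathbf{W} \leftarrow \mathbf{W} - \eta \frac{1}{\sqrt{\mathbf{h}}} \frac{\partial L}{\partial \mathbf{W}}\]

In Momentum, past gradients keep pushing in a consistent direction, which damps the zigzag. In AdaGrad, parameters that have already seen large gradients get a smaller effective learning rate.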

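These 4 methods each have their own characteristics, and each has problems it handles well and problems it handles poorly.

5. Summary

The book goes on to cover convolutional neural networks, reinforcement learning and more. Things such as initial weight values, the algorithm used to search for the minimum of the loss function, and hyperparameters all have to be worked out case by case. It also keeps referring to Stanford's CS231n course. My own goal in reading the book, though, was to be able to follow the terms that come up in discussions of algorithms, so I did not go deeper. I sometimes wonder whether, had I taken these courses in college, I might be an algorithm engineer today.

Back to the title: the author quotes "What I cannot create, I do not understand." I think this attitude of getting to the bottom of things is well worth learning and keeping.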

An environment-dependent TimeoutException when reading Kafka

2024-06-08T06:50:26+00:00

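I recently ran into a rather odd situation: the same Flink job would fail to read Kafka after being moved to a different runtime environment, yet connectivity between that environment and the Kafka source checked out fine.

I fixed it in a hurry in production. Today I reproduced it with simplified code, and it seems worth writing up.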

1. TimeoutException

org.apache.kafka.common.errors.TimeoutException: Timeout expired while fetching topic metadata

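This error is common when reading Kafka. It is usually caused by a network problem between the client and bootstrap.servers, or by the server itself being unavailable.

In my case, however, the job code and configuration were identical and the error depended on the environment. The only suspicious point was that the bootstrap.servers configured for the job's KafkaConsumer included several nodes that had already been decommissioned.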

2. Kafka and Flink code analysis

In the Kafka code, the error is thrown in Fetcher.getTopicMetadata:

public class Fetcher<K, V> implements Closeable {
    public Map<String, List<PartitionInfo>> getTopicMetadata(MetadataRequest.Builder request, Timer timer) {
        do {
            RequestFuture<ClientResponse> future = sendMetadataRequest(request);
            client.poll(future, timer);

            ...

            timer.sleep(retryBackoffMs);
        } while (timer.notExpired());

        throw new TimeoutException("Timeout expired while fetching topic metadata");
    }
}

In the Flink code, the call stack leading to Fetcher.getTopicMetadata is:

    flowchart TB
    A("FlinkKafkaConsumerBase.open") --> B["partitionDiscoverer.discoverPartitions"] --> C("getAllPartitionsForTopics") --> D(" kafkaConsumer.partitionsFor")
    D --> E("fetcher.getTopicMetadata")

    F("FlinkKafkaConsumerBase.run") --> G["runWithPartitionDiscovery"] --> H("createAndStartDiscoveryLoop") --> B

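open: initializes the offsets to read from (for example when the job is configured with SPECIFIC_OFFSETS but does not specify every partition)
run: detects TopicPartition changes (when flink.partition-discovery.interval-millis is configured)

At first glance the code is simple to use and has nothing environment-specific in it.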

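3. Simplifying the reproduction

At this point I was stuck: digging further into the Kafka source would take a long time, and debugging inside Flink is tedious. So I reduced the code to a bare call to KafkaConsumer.partitionsFor and was pleased to find that it behaved just like the Flink job: fine in some environments, the same error in others.

Next I looked at the Kafka TRACE log of a successful run (x and y are configured broker addresses; x is a decommissioned node):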

[INFO] 2024-06-08 10:14:26.382 org.apache.kafka.clients.consumer.ConsumerConfig:[347] - ConsumerConfig values:
...
[DEBUG] 2024-06-08 10:14:26.432 org.apache.kafka.clients.consumer.KafkaConsumer:[699] - [Consumer clientId=consumer-g-1, groupId=g] Initializing the Kafka consumer
[DEBUG] 2024-06-08 10:14:26.604 org.apache.kafka.clients.consumer.KafkaConsumer:[815] - [Consumer clientId=consumer-g-1, groupId=g] Kafka consumer initialized
[TRACE] 2024-06-08 10:14:26.738 org.apache.kafka.clients.NetworkClient:[700] - [Consumer clientId=consumer-g-1, groupId=g] Found least loaded node x.x.x.x:9092 (id: -18 rack: null) with no active connection
[DEBUG] 2024-06-08 10:14:26.742 org.apache.kafka.clients.NetworkClient:[950] - [Consumer clientId=consumer-g-1, groupId=g] Initiating connection to node x.x.x.x:9092 (id: -18 rack: null) using address /x.x.x.x
[TRACE] 2024-06-08 10:14:26.751 org.apache.kafka.clients.NetworkClient:[697] - [Consumer clientId=consumer-g-1, groupId=g] Found least loaded connecting node x.x.x.x:9092 (id: -18 rack: null)
...
[TRACE] 2024-06-08 10:14:33.745 org.apache.kafka.clients.NetworkClient:[697] - [Consumer clientId=consumer-g-1, groupId=g] Found least loaded connecting node x.x.x.x:9092 (id: -18 rack: null)
[DEBUG] 2024-06-08 10:14:33.763 org.apache.kafka.common.network.Selector:[607] - [Consumer clientId=consumer-g-1, groupId=g] Connection with /x.x.x.x disconnected
java.net.ConnectException: Connection timed out
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:716)
    at org.apache.kafka.common.network.PlaintextTransportLayer.finishConnect(PlaintextTransportLayer.java:50)
    at org.apache.kafka.common.network.KafkaChannel.finishConnect(KafkaChannel.java:216)
    at org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:531)
    at org.apache.kafka.common.network.Selector.poll(Selector.java:483)
    at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:547)
    at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:262)
    at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:233)
    at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:212)
    at org.apache.kafka.clients.consumer.internals.Fetcher.getTopicMetadata(Fetcher.java:368)
    at org.apache.kafka.clients.consumer.KafkaConsumer.partitionsFor(KafkaConsumer.java:1926)
    at org.apache.kafka.clients.consumer.KafkaConsumer.partitionsFor(KafkaConsumer.java:1894)
    at cn.izualzhy.SimpleKafkaConsumer.listPartitions(SimpleKafkaConsumer.java:34)
    at cn.izualzhy.SimpleKafkaConsumer.main(SimpleKafkaConsumer.java:59)
[DEBUG] 2024-06-08 10:14:33.764 org.apache.kafka.clients.NetworkClient:[891] - [Consumer clientId=consumer-g-1, groupId=g] Node -18 disconnected.
[WARN] 2024-06-08 10:14:33.765 org.apache.kafka.clients.NetworkClient:[756] - [Consumer clientId=consumer-g-1, groupId=g] Connection to node -18 (/x.x.x.x:9092) could not be established. Broker may not be available.
[WARN] 2024-06-08 10:14:33.766 org.apache.kafka.clients.NetworkClient:[1024] - [Consumer clientId=consumer-g-1, groupId=g] Bootstrap broker x.x.x.x:9092 (id: -18 rack: null) disconnected
[DEBUG] 2024-06-08 10:14:33.813 org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient:[593] - [Consumer clientId=consumer-g-1, groupId=g] Cancelled request with header RequestHeader(apiKey=METADATA, apiVersion=9, clientId=consumer-g-1, correlationId=0) due to node -18 being disconnected
[TRACE] 2024-06-08 10:14:33.914 org.apache.kafka.clients.NetworkClient:[700] - [Consumer clientId=consumer-g-1, groupId=g] Found least loaded node y.y.y.y:9092 (id: -56 rack: null) with no active connection

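The log contains one very important piece of information: after trying to connect to node x for a few seconds, the connection fails and the client moves on to node y.

In the failing case, the client keeps trying to connect to x.x.x.x until the overall timeout expires.

Connecting to node x with telnet from the two environments: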

# time telnet x.x.x.x 9092
Trying x.x.x.x...
telnet: connect to address x.x.x.x: Connection timed out

real    0m7.091s
user    0m0.001s
sys 0m0.000s

% time telnet x.x.x.x 9092
Trying x.x.x.x...
telnet: connect to address x.x.x.x: Connection timed out
telnet x.x.x.x 9092  0.00s user 0.00s system 0% cpu 1:03.14 total

As with the code above, the elapsed times differ widely (about 7s versus about 63s). So the likely cause is related to socket configuration.

4. tcp_syn_retries

The following code illustrates the timeout of a socket connection:

public class NonBlockingSocketChannelWithRetry {
    public static void main(String[] args) {
        String host = "192.0.2.1";  // an unreachable IP address, used to simulate a connect timeout
        int port = 9092;            // the port Kafka usually uses
        DateTimeFormatter dtf = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

        try (Selector selector = Selector.open();
             SocketChannel socketChannel = SocketChannel.open()) {

            socketChannel.configureBlocking(false); // switch to non-blocking mode
            System.out.println(LocalDateTime.now().format(dtf) + " - Attempting to connect to " + host + ":" + port);

            if (!socketChannel.connect(new InetSocketAddress(host, port))) {
                socketChannel.register(selector, SelectionKey.OP_CONNECT);
                while (selector.select() > 0) {  // no timeout: block until an event occurs
//              while (selector.select(10000) > 0) {  // time out after 10s
                    Iterator<SelectionKey> keyIterator = selector.selectedKeys().iterator();
                    while (keyIterator.hasNext()) {
                        SelectionKey key = keyIterator.next();
                        keyIterator.remove();
                        if (key.isConnectable()) {
                            if (socketChannel.finishConnect()) {
                                ...
    }
}

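The full code is in NonBlockingSocketChannelWithRetry

When select is given a timeout, the connection attempt gives up after that time;
when select is called without a timeout, the time depends on tcp_syn_retries

When establishing a TCP connection, if no SYN+ACK is received, the client keeps retransmitting the SYN until it has retried tcp_syn_retries times, doubling the wait between retransmissions each time (RFC 6298)¹. On the test machine: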

net.ipv4.tcp_syn_retries = 5

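This explains why the telnet above timed out and exited after 63s: with an initial retransmission timeout of 1s, the waits add up to 1 + 2 + 4 + 8 + 16 + 32 = 63s.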
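The arithmetic can be written down as a small helper. This is a sketch under two assumptions that are not stated in the logs above: the initial retransmission timeout is 1 second, and every retransmission doubles the previous wait. `SynRetryTimeout` and `totalTimeoutSeconds` are names made up for the example.

```java
/** Estimates how long a connect attempt to a silent host lasts; assumes a doubling backoff. */
final class SynRetryTimeout {

    /**
     * Total wait in seconds: one wait after the initial SYN plus one after each of the
     * synRetries retransmissions, with every wait twice as long as the previous one.
     */
    static long totalTimeoutSeconds(int synRetries, long initialRtoSeconds) {
        long total = 0;
        long wait = initialRtoSeconds;
        for (int attempt = 0; attempt <= synRetries; attempt++) {
            total += wait;
            wait *= 2;
        }
        return total;
    }

    public static void main(String[] args) {
        for (int retries = 1; retries <= 6; retries++) {
            System.out.println("tcp_syn_retries=" + retries + " -> " + totalTimeoutSeconds(retries, 1) + "s");
        }
    }
}
```

Under the same assumptions, a value of 2 would give 1 + 2 + 4 = 7 seconds, which would be consistent with the faster telnet run; that the other environment is configured this way is a guess, not something verified here.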

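So changing the machine configuration solves the problem. In fact, newer Kafka versions introduced socket.connection.setup.timeout.ms and socket.connection.setup.timeout.max.ms³ to keep the timeout from depending so heavily on the machine environment. More generally, when we use RPC ourselves we should set timeouts explicitly; and when reading or writing Kafka, using an LB rather than a list of brokers is also a good habit.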

5. Ref

Reading《Stream Processing with Apache Flink》-2nd

2024-05-03T09:07:21+00:00

1 Chapter7: Stateful Operators And Applications

1.1 Implementing Stateful Functions

Keyed State:

ValueState[T]: single value. The value can be read using ValueState.value() and updated with ValueState.update(value: T)
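ListState[T]: list of elements. Commonly used methods are add, addAll, get and update
MapState[K, V]: map of keys and values. Commonly used methods are get, put, contains and remove
ReducingState[T]: similar to ListState, but instead of storing the whole list it immediately aggregates values using a ReduceFunction
AggregatingState[I, O]: a more general form of ReducingState, in the same way that aggregate generalizes reduce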

val sensorData: DataStream[SensorReading]  = ???
// partition and key the stream on the sensor ID
val keyedData: KeyedStream[SensorReading, String] = sensorData
  .keyBy(_.id)

// apply a stateful FlatMapFunction on the keyed stream which 
// compares the temperature readings and raises alerts
val alerts: DataStream[(String, Double, Double)] = keyedData
  .flatMap(new TemperatureAlertFunction(1.7))

class TemperatureAlertFunction(val threshold: Double)
    extends RichFlatMapFunction[SensorReading, (String, Double, Double)] {

  // the state handle object
  private var lastTempState: ValueState[Double] = _

  override def open(parameters: Configuration): Unit = {
    // create state descriptor
    val lastTempDescriptor = 
      new ValueStateDescriptor[Double]("lastTemp", classOf[Double])
    // obtain the state handle
    lastTempState = getRuntimeContext.getState[Double](lastTempDescriptor)
  }

  override def flatMap(
      reading: SensorReading, 
      out: Collector[(String, Double, Double)]): Unit = {
    // fetch the last temperature from state
    val lastTemp = lastTempState.value()
    // check if we need to emit an alert
    val tempDiff = (reading.temperature - lastTemp).abs
    if (tempDiff > threshold) {
      // temperature changed by more than the threshold
      out.collect((reading.id, reading.temperature, tempDiff))
    }
    // update lastTemp state
    this.lastTempState.update(reading.temperature)
  }
}

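State is stored using Flink's TypeInformation (for serialization and deserialization)
A StateDescriptor is the descriptor a function uses to obtain/register State from the StateBackend

Operator List State: a function can implement the ListCheckpointed interface: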

public interface ListCheckpointed<T extends Serializable> {
    List<T> snapshotState(long checkpointId, long timestamp) throws Exception;
    void restoreState(List<T> state) throws Exception;
}

Note: this interface is marked @Deprecated; CheckpointedFunction is recommended instead

Broadcast State: the typical scenario is a stream of rules and a stream of events on which the rules are applied.

val sensorData: DataStream[SensorReading] = ???
val thresholds: DataStream[ThresholdUpdate] = ???
val keyedSensorData: KeyedStream[SensorReading, String] = sensorData.keyBy(_.id)

// the descriptor of the broadcast state
val broadcastStateDescriptor =
  new MapStateDescriptor[String, Double](
    "thresholds", classOf[String], classOf[Double])

val broadcastThresholds: BroadcastStream[ThresholdUpdate] = thresholds
  .broadcast(broadcastStateDescriptor)

// connect keyed sensor stream and broadcasted rules stream
val alerts: DataStream[(String, Double, Double)] = keyedSensorData
  .connect(broadcastThresholds)
  .process(new UpdatableTemperatureAlertFunction())

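Note that broadcast events may arrive out of order.

CheckpointedFunction and CheckpointListener are closely tied to checkpointing. The former is called when a checkpoint is triggered and can define all kinds of State, such as ValueState and ListState; the latter registers a callback for when a checkpoint completes.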

1.2 Enabling Failure Recovery for Stateful Applications

1.3 Ensuring the Maintainability of Stateful Applications

Jobs change often: Bugs need to be fixed, functionality adjusted, added, or removed, or the parallelism of the operator needs to be adjusted to account for higher or lower data rates.

To keep a job maintainable, two things about state need attention:

Specifying Unique Operator Identifiers: best done for every operator from the very start of the program
Defining the Maximum Parallelism of Keyed State Operators: what setMaxParallelism really does here is closer to setCountOfKeyGroups
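The "count of key groups" reading can be sketched in plain Java. A key is first mapped to one of maxParallelism key groups, and each key group is then mapped to a parallel operator instance. The class and method names below are hypothetical, and the sketch uses the raw hashCode where Flink, as far as I know, applies an additional murmur hash; the operator-index formula is my understanding of the range-based assignment and should be treated as an assumption.

```java
/** Sketch of key-group based state partitioning; names are illustrative and hashing is simplified. */
final class KeyGroupSketch {

    /** Maps a key to one of maxParallelism key groups (simplified: no extra murmur hash). */
    static int assignToKeyGroup(Object key, int maxParallelism) {
        return Math.floorMod(key.hashCode(), maxParallelism);
    }

    /** Maps a key group to the index of the parallel operator instance that owns it. */
    static int operatorIndexForKeyGroup(int maxParallelism, int parallelism, int keyGroup) {
        return keyGroup * parallelism / maxParallelism;
    }

    public static void main(String[] args) {
        int maxParallelism = 128;
        int keyGroup = assignToKeyGroup("sensor_a", maxParallelism);
        System.out.println("key group: " + keyGroup);
        // rescaling changes the owning instance but never the key group of a key
        System.out.println("parallelism 4 -> instance " + operatorIndexForKeyGroup(maxParallelism, 4, keyGroup));
        System.out.println("parallelism 8 -> instance " + operatorIndexForKeyGroup(maxParallelism, 8, keyGroup));
    }
}
```

Because the key group of a key depends only on maxParallelism, rescaling moves whole key groups between instances instead of rehashing individual keys. It also means maxParallelism caps the usable parallelism and cannot be changed freely once state exists.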
1.4 Performance and Robustness of Stateful Applications

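StateBackend: MemoryStateBackend, the FsStateBackend, and the RocksDBStateBackend.
With RocksDBStateBackend, performance differs considerably between State types. For example, MapState[X, Y] performs better than ValueState[HashMap[X, Y]], and ListState[X] suits frequent appends better than ValueState[List[X]].
Misusing state leads to oversized state, for example KeyedStream.aggregate over an unbounded key space; a typical case is the sessionId when aggregating user behavior. Use timers to clean up state so that it does not become a problem. For example: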

class SelfCleaningTemperatureAlertFunction(val threshold: Double)
    extends KeyedProcessFunction[String, SensorReading, (String, Double, Double)] {

  // the keyed state handle for the last temperature
  private var lastTempState: ValueState[Double] = _
  // the keyed state handle for the last registered timer
  private var lastTimerState: ValueState[Long] = _

  override def open(parameters: Configuration): Unit = {
    // register state for last temperature
    val lastTempDesc = new ValueStateDescriptor[Double]("lastTemp", classOf[Double])
    lastTempState = getRuntimeContext.getState[Double](lastTempDesc)
    // register state for last timer
    val lastTimerDesc = new ValueStateDescriptor[Long]("lastTimer", classOf[Long])
    lastTimerState = getRuntimeContext.getState(lastTimerDesc)
  }

  override def processElement(
      reading: SensorReading,
      ctx: KeyedProcessFunction
        [String, SensorReading, (String, Double, Double)]#Context,
      out: Collector[(String, Double, Double)]): Unit = {

    // compute timestamp of new clean up timer as record timestamp + one hour
    val newTimer = ctx.timestamp() + (3600 * 1000)
    // get timestamp of current timer
    val curTimer = lastTimerState.value()
    // delete previous timer and register new timer
    ctx.timerService().deleteEventTimeTimer(curTimer)
    ctx.timerService().registerEventTimeTimer(newTimer)
    // update timer timestamp state
    lastTimerState.update(newTimer)

    // fetch the last temperature from state
    val lastTemp = lastTempState.value()
    // check if we need to emit an alert
    val tempDiff = (reading.temperature - lastTemp).abs
    if (tempDiff > threshold) {
      // temperature increased by more than the threshold
      out.collect((reading.id, reading.temperature, tempDiff))
    }

    // update lastTemp state
    this.lastTempState.update(reading.temperature)
  }

  override def onTimer(
      timestamp: Long,
      ctx: KeyedProcessFunction
        [String, SensorReading, (String, Double, Double)]#OnTimerContext,
      out: Collector[(String, Double, Double)]): Unit = {

    // clear all state for the key
    lastTempState.clear()
    lastTimerState.clear()
  }
}

1.5 Evolving Stateful Applications

Updating an Application without Modifying Existing State: compatible
Changing the Input Data Type of Built-in Stateful Operators: not compatible
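Removing State from an Application: by default Flink avoids losing state and will not restore; this check can be disabled
Modifying the State of an Operator: for example changing ValueState[String] to ValueState[Double]. Compatibility is only partial, so avoid it where possible.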

1.6 Queryable State

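Supports point lookups and reads of state, and depends on flink-queryable-state-client-java. Note: when I evaluated Flink earlier this feature looked very powerful, but the related documentation can no longer be found on the official site.

2 Chapter8: Reading from and Writing to External Systems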

2.1 Application Consistency Guarantees

To avoid losing data, the source needs to be resettable, for example the offset of a File ByteStream when reading files, or the TopicPartition offset when reading Kafka. For end-to-end exactly-once, the sink connectors additionally need to support idempotent writes or transactional writes. Examples of the latter are the write-ahead-log (WAL) sink and the two-phase-commit (2PC) sink.

	Nonresettable source	Resettable source
Any Sink	At-most-once	At-least-once
Idempotent sink	At-most-once	Exactly-once* (temporary inconsistencies during recovery)
WAL sink	At-most-once	At-least-once
2PC sink	At-most-once	Exactly-once

Note that a WAL sink cannot achieve exactly-once even if it only writes to the sink when a checkpoint completes.
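The "resettable source" rows of the table can be illustrated with a toy replay in plain Java. It is a simulation, not connector code: `ReplaySinkSketch`, the event values and the checkpoint position are all made up. A job processes three events, fails before its next checkpoint, and the source rewinds to the last checkpointed offset and replays.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy replay after a failure: an append-only sink versus an idempotent (keyed upsert) sink. */
final class ReplaySinkSketch {

    /** Writes events[from..to) to both sinks. */
    static void write(String[][] events, int from, int to,
                      List<String> appendSink, Map<String, String> upsertSink) {
        for (int i = from; i < to; i++) {
            appendSink.add(events[i][0] + "=" + events[i][1]); // plain append: replays create duplicates
            upsertSink.put(events[i][0], events[i][1]);        // keyed upsert: replays overwrite the same row
        }
    }

    /** Returns {rows in the append sink, rows in the upsert sink} after a failure and replay. */
    static int[] simulate() {
        String[][] events = {{"k1", "10"}, {"k2", "20"}, {"k3", "30"}, {"k4", "40"}};
        List<String> appendSink = new ArrayList<>();
        Map<String, String> upsertSink = new HashMap<>();

        int checkpointedOffset = 2;                                  // last completed checkpoint covers events 0 and 1
        write(events, 0, 3, appendSink, upsertSink);                 // events 0..2 written, then the job fails
        write(events, checkpointedOffset, 4, appendSink, upsertSink); // restart: rewind to offset 2 and replay

        return new int[]{appendSink.size(), upsertSink.size()};
    }

    public static void main(String[] args) {
        int[] sizes = simulate();
        System.out.println("append sink rows: " + sizes[0] + ", upsert sink rows: " + sizes[1]);
    }
}
```

The append sink ends up with one duplicated row (at-least-once), while the upsert sink converges to one row per key. This only works when the key and value written for an event are deterministic across replays.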

2.2 Provided Connectors

Kafka, Filesystem, etc. The official docs are now more detailed than the book.

2.3 Implementing a Custom Source Function

SourceFunction and RichSourceFunction can be used to define nonparallel source connectors—sources that run with a single task.
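ParallelSourceFunction and RichParallelSourceFunction can be used to define source connectors that run with multiple parallel task instances. Note: these interfaces have since changed

While a checkpoint is being taken, the current offset has to be recorded, so SourceFunction.run() must not emit data at the same time.
In other words, CheckpointedFunction.snapshotState and that method must never run concurrently; only one of them can execute at a time.

Note that when one parallel instance of a source function is idle it does not emit watermarks, which can leave the whole job waiting.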

2.4 Implementing a Custom Sink Function

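Idempotent Sink Connectors: require the result data to have a deterministic (composite) key, and the storage system to support idempotent writes
Transactional Sink Connectors:
- GenericWriteAheadSink: writes to a WAL first, then writes to storage on receiving CheckpointCompleted. It sounds perfect, but in practice it only achieves at-least-once, because of two cases: the batch write to storage is not atomic; or the write to storage succeeds but committing the checkpoint fails.
- TwoPhaseCommitSinkFunction
  - the sink operator receives a checkpoint barrier: persists its state, prepares the current transaction for committing, and acknowledges the checkpoint at the JobManager.
  - the JobManager receives successful checkpoint notifications from all task instances
  - the sink operator receives the checkpoint completed message: commits all open transactions of previous checkpoints.
  - My understanding is that commit is what makes the data durable. If commit fails, the preCommit work is rolled back so that nothing leaks into the storage system, which is what guarantees exactly-once semantics. The book has a TransactionalFileSink example that makes this very concrete. The costs of these semantics also deserve attention: first, data only becomes visible after the checkpoint completes; second, the Kafka transaction timeout needs tuning so that repeated commit failures do not end in data loss.

2.5 Asynchronously Accessing External Systems

The scenario here is asynchronously looking up a dictionary.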

3 Chapter9: Setting Up Flink for Streaming Applications

3.1 Deployment Modes

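Standalone Cluster:
- Start:
- Submit:
Docker
YARN:
- JobMode:
- SessionMode:
  - Start:
  - Submit:
- Note: ApplicationMode
Kubernetes: the target state for production should still be containerized deployment

3.2 Highly Available Setups

The goal of HA is as little downtime as possible. A TaskManager failure can be recovered by the ResourceManager, while recovery from a JobManager failure depends on an HA deployment. The storage HA needs to cover: JAR file, the JobGraph, and pointers to completed checkpoints.

The book covers ZooKeeper HA Services; today there are also Kubernetes HA Services. In practice there are still some pitfalls, especially around the YARN-related parameters.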

3.3 Integration with Hadoop Components

3.4 Filesystem Configuration

3.5 System Configuration

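Java and Classloading: Flink provides user-code class loaders; pay attention to the classloader.resolve-order related settings.
CPU: tasks run in threads of the TaskManager, which exposes its capacity as slots.
Main Memory and Network Buffers: JM and TM memory have different focuses; pay extra attention to network buffers and the RocksDB backend.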
Disk Storage
Checkpointing and State Backends
Security

4 Chapter10: Operating Flink and Streaming Applications

4.1 Running and Managing Streaming Applications

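This section states that "maintaining streaming applications is more challenging than maintaining batch applications". Personally I think that for streaming applications, maintaining is more challenging than developing. maintaining = start, stop, pause and resume, scale, and upgrade.

Flink jobs can be operated through the CLI or the REST API. For savepoint-related features it is best to define Unique Operator IDs with uid().

The Kubernetes material is out of date; refer to the official documentation directly.

4.2 Controlling Task Scheduling

Task Chaining turns network communication into direct method calls within a thread, so Flink enables it by default. If necessary it can be tuned with disableChaining and startNewChain.

Slot-Sharing Groups let users balance compute-intensive and IO-intensive tasks themselves: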

// slot-sharing group "green"
val a: DataStream[A] = env.createInput(...)
  .slotSharingGroup("green")
  .setParallelism(4)
val b: DataStream[B] = a.map(...)
  // slot-sharing group "green" is inherited from a
  .setParallelism(4)

// slot-sharing group "yellow"
val c: DataStream[C] = env.createInput(...)
  .slotSharingGroup("yellow")
  .setParallelism(2)

// slot-sharing group "blue"
val d: DataStream[D] = b.connect(c.broadcast(...)).process(...)
  .slotSharingGroup("blue")
  .setParallelism(4)
val e = d.addSink()
  // slot-sharing group "blue" is inherited from d
  .setParallelism(2)

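With the code above, the tasks are assigned to slots according to their slot-sharing groups.

4.3 Tuning Checkpointing and Recovery

Important settings:

Interval, minPause, timeout
Whether checkpoints are deleted when the program exits
Choosing a suitable backend
Restart strategies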

4.4 Monitoring Flink Clusters and Applications

The Flink WebUI is useful for a first look at a job: logs, metrics and so on. Deeper analysis relies on metrics systems.

4.5 Configuring the Logging Behavior

5 Chapter11: Where to Go from Here?

Batch processing, Table API, SQL, CEP, and so on.

Reading《Stream Processing with Apache Flink》-1st

2024-04-27T12:55:45+00:00

1 Chapter1: Introduction to Stateful Stream Processing

2 Chapter2: Stream Processing Fundamentals

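Introduces concepts such as parallelism, time, and state.

Processing Streams in Parallel
1. Latency and Throughput: how latency and throughput relate to each other
2. Operations on Data Streams: ingestion and egress, operators, aggregations, windows
Time Semantics: processing time fits scenarios that tolerate delayed and out-of-order data; event time fits scenarios that need accurate, deterministic results, and introduces watermarks to avoid waiting forever.
State and Consistency Models: a batch job can fail over by replaying its input, but a streaming job cannot. At-least-once is the guarantee used most in practice. How is it ensured? One approach is to retain the data until every task has returned an ACK.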

3 Chapter3: The Architecture of Apache Flink

3.1 System Architecture

Components of a Flink Setup:

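JobManager (builds and distributes the ExecutionGraph, coordinates tasks); ResourceManager (talks to the resource provider to acquire and release TaskManager resources); TaskManager (the actual worker process); Dispatcher (REST). Depending on the environment, some components may run in the same JVM process. Note that current Flink versions differ from this description.

Task Execution:

A TaskManager offers multiple slots. When upstream and downstream operators have different parallelism, data is exchanged between tasks.

Highly Available Setup:

TaskManager failures: the JobManager asks the ResourceManager for new slots
JobManager failures: data is persisted to remote storage and a pointer to it is stored in ZooKeeper; a new JobManager recovers the job from the latest completed checkpoint recorded in ZooKeeper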

3.2 Data Transfer In Flink

Task Chaining:

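As defined by the user: [figure]

Chained, the functions become direct calls from one to the next: [figure]

Sometimes it is preferable to run them on separate threads: [figure]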

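Suppose t1 = 0.1 s, t2 = 0.8 s, and t3 = 0.2 s. One thread running the whole chain takes about 1.1 s per record, so roughly 1 qps, and 10 threads give roughly 10 qps. Alternatively, f1 could get 1 thread, f2 8 threads, and f3 1 thread (though I have not worked out what the difference is).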
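The difference between the two layouts can be estimated with a back-of-the-envelope model. The sketch below is a simplification that ignores the hand-over cost between threads (serialization, queues); the function names are made up for illustration. It suggests that with the stage times above, giving f3 a single thread makes f3 the bottleneck.

```python
def chained_throughput(stage_seconds, threads):
    """Every thread runs all stages back to back for each record."""
    return threads / sum(stage_seconds)


def pipelined_throughput(stage_seconds, threads_per_stage):
    """Each stage has its own threads; the slowest stage limits the pipeline."""
    return min(n / t for t, n in zip(stage_seconds, threads_per_stage))


stages = [0.1, 0.8, 0.2]  # t1, t2, t3 in seconds, as in the note

chained = chained_throughput(stages, threads=10)
pipelined = pipelined_throughput(stages, threads_per_stage=[1, 8, 1])

print(f"chained, 10 threads:   {chained:.2f} records/s")
print(f"pipelined, 1/8/1 split: {pipelined:.2f} records/s")
```

Under this model the chained layout spreads all work evenly over the threads, while the split layout is only as fast as its slowest stage, so thread counts have to be proportional to stage cost to match it.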

3.3 Event-Time Processing

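More complex to implement than a stream system based on processing time.

Timestamps: metadata attached to each event
Watermarks: receiving a watermark with timestamp T means that all data with timestamps before T has arrived. A record that later violates this contract is a late record, analyzed in the Handling Late Data section. Choosing a watermark is a tradeoff between latency and completeness.
Watermark Propagation and Event Time: with multiple input streams, the watermarks of the individual streams advance at different speeds, which makes things more complex
Timestamp Assignment and Watermark Generation: timestamps and watermarks are set explicitly, in one of three ways: a source function, AssignerWithPeriodicWatermarks, or AssignerWithPunctuatedWatermarks. The timestamp is extracted from the record and combined with configuration to compute the current watermark. The latter two come with built-in subclass implementations.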
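The watermark contract can be illustrated without Flink. The following is a minimal sketch, not Flink's implementation: the class name and the per-event update are assumptions for illustration (Flink emits periodic watermarks on a timer). The watermark trails the maximum observed timestamp by a fixed bound, and a record whose timestamp is below the current watermark counts as late.

```python
class BoundedOutOfOrdernessWatermark:
    """Watermark = max observed timestamp - bound (hypothetical helper)."""

    def __init__(self, bound_ms):
        self.bound_ms = bound_ms
        self.max_ts = None  # no event observed yet

    def current(self):
        """Current watermark, or None before the first event."""
        return None if self.max_ts is None else self.max_ts - self.bound_ms

    def on_event(self, ts):
        """Record an event timestamp; return True if the event is late."""
        watermark = self.current()
        late = watermark is not None and ts < watermark
        if self.max_ts is None or ts > self.max_ts:
            self.max_ts = ts
        return late


tracker = BoundedOutOfOrdernessWatermark(bound_ms=1000)
events = [5000, 7000, 6500, 5500, 9000, 7500]
late_events = [ts for ts in events if tracker.on_event(ts)]
print("late records:", late_events)
```

A larger bound tolerates more disorder (better completeness) but delays every result that waits for the watermark (worse latency), which is the tradeoff named above.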

3.4 State Management:

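Operator State: comes as ListState, UnionListState, and BroadcastState
Keyed State: comes as ValueState, ListState, and MapState
State Backend: how fast state can be read and written affects latency
Scaling Stateful Operators: scale-out and scale-in both move keyed state in units of key groups rather than redistributing individual keys. The three kinds of operator state scale differently:
1. Operator list state: the list entries are redistributed evenly across the new tasks
2. Operator union list state: every task receives the complete list and decides what to keep
3. Operator broadcast state: the state is copied to the new tasks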
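Key-group based scaling can be sketched in a few lines. The two-step mapping below (key to key group, key group to subtask) follows the scheme Flink uses, but the hash is a stand-in: Flink applies a murmur hash to the key's hashCode, while this sketch uses CRC32 so that it is deterministic and self-contained.

```python
import zlib

MAX_PARALLELISM = 128  # number of key groups, fixed for the job's lifetime


def key_group(key, max_parallelism=MAX_PARALLELISM):
    # Stand-in hash (assumption); only stability matters for the illustration.
    return zlib.crc32(str(key).encode("utf-8")) % max_parallelism


def operator_index(group, parallelism, max_parallelism=MAX_PARALLELISM):
    # Each subtask owns one contiguous range of key groups.
    return group * parallelism // max_parallelism


keys = [f"sensor_{i}" for i in range(1000)]


def assign(parallelism):
    return {k: operator_index(key_group(k), parallelism) for k in keys}


before, after = assign(2), assign(3)
moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved} of {len(keys)} keys change subtask when scaling 2 -> 3")
```

Because a key never changes its key group, rescaling only reassigns whole ranges of key groups; state can be read back range by range instead of key by key.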

3.5 Checkpoints, Savepoints, and State Recovery

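Consistent Checkpoints: a naive mechanism has to pause the input, wait until all in-flight data has been processed, take the checkpoint, and then resume. Flink uses a more sophisticated approach.

In the example, the source produces 1, 2, 3, …; at the moment shown in the figure, the checkpoint records source offset = 5, and the sums of the odd and even numbers are 9 and 6 respectively.
Recovery from a Consistent Checkpoint: when a task fails and is restored from the checkpoint, consumption continues after 5, so the state is correct and consistent. Note that sink operators may receive some results more than once.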

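Flink’s Checkpointing Algorithm: Flink does not use the pause-checkpoint-resume approach; it builds on the Chandy-Lamport algorithm for distributed snapshots.

For example:

The source has two parallel instances, each producing increasing numbers; this is the starting state.
The JobManager now triggers checkpoint ID = 2 (the triangle).
Each source instance receives it, records its current offset (3, 4), inserts a checkpoint barrier (ID = 2) at the current position, and sends the barrier downstream like a regular record.
A downstream operator that receives the barrier waits for the ID = 2 barrier from every upstream instance. Meanwhile the upstream operators keep producing data, and the operator buffers records that arrive behind a barrier (for example, the blue circle 4 produced by Source 1).
Once all ID = 2 barriers have arrived, the operator writes its own checkpoint data (8, 8).
After the operator has forwarded the ID = 2 barrier, it processes the buffered records and emits the results.
When the sink operators have also acknowledged the checkpoint, checkpoint ID = 2 is considered complete.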
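The alignment step can be sketched as a toy operator. This is a single-threaded simulation under simplifying assumptions (two input channels, one checkpoint in flight, a running sum as state); the class and method names are made up and do not correspond to Flink classes.

```python
from collections import deque

BARRIER = "barrier"


class AligningSum:
    """Toy operator that sums its inputs and aligns checkpoint barriers."""

    def __init__(self, num_channels):
        self.num_channels = num_channels
        self.total = 0
        self.blocked = set()     # channels whose barrier has already arrived
        self.buffered = deque()  # records held back while aligning
        self.snapshots = {}      # checkpoint id -> state captured at the barrier
        self.output = []

    def on_record(self, channel, value):
        if channel in self.blocked:
            self.buffered.append(value)  # arrived behind the barrier: wait
        else:
            self._process(value)

    def on_barrier(self, channel, checkpoint_id):
        self.blocked.add(channel)
        if len(self.blocked) == self.num_channels:
            self.snapshots[checkpoint_id] = self.total     # state before the barrier
            self.output.append((BARRIER, checkpoint_id))   # forward the barrier
            self.blocked.clear()
            while self.buffered:                           # then drain the buffer
                self._process(self.buffered.popleft())

    def _process(self, value):
        self.total += value
        self.output.append(self.total)


op = AligningSum(num_channels=2)
op.on_record(0, 1)
op.on_record(1, 3)
op.on_record(0, 2)
op.on_barrier(0, checkpoint_id=2)  # channel 0 is now blocked
op.on_record(0, 4)                 # buffered: belongs to the next checkpoint
op.on_record(1, 5)                 # still before channel 1's barrier
op.on_barrier(1, checkpoint_id=2)  # aligned: snapshot, forward, drain
print(op.snapshots, op.output)
```

The snapshot contains exactly the records that preceded the barrier on each channel (1 + 2 + 3 + 5), which is what makes the checkpoint consistent without pausing the sources.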

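Performance Implications of Checkpointing: the local snapshot is copied to remote storage asynchronously. An operator can also skip waiting for barrier alignment and keep processing and emitting data downstream (the cost is that recovery only gives at-least-once instead of exactly-once guarantees; and perhaps state grows as more unaligned data accumulates?)

Savepoints: checkpoints are mainly for failure recovery, but consistent snapshots have more uses. Using savepoints: fixing bugs and reprocessing, or A/B tests, provided the application versions stay compatible; also changing the parallelism, moving to another cluster, and pause-resume.

Starting an application from a savepoint: [figure]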

4 Chapter4: Setting Up a Development Environment for Apache Flink

Mainly covers running Flink jobs in an IDE. Note that some behavior, such as class loading, differs from a real deployment.

5 Chapter5: The DataStream API(v1.7)

Hello, Flink: building a Flink application takes five steps:

Set Up the Execution Environment
Read An Input Stream
Apply Transformations
Output the result
Execute

Transformations：

Basic Transformations: on individual events, Map/Filter/FlatMap
KeyedStream Transformations: in context of a key
1. keyBy: convert DataStream into KeyedStream
2. Rolling aggregations: sum/min/max/minBy/maxBy
3. Reduce

MultiStream Transformations: merge into one or split into multiple

Union:
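Connect, coMap, and coFlatMap: events of plain DataStreams are assigned to operator instances arbitrarily, so ConnectedStreams is usually applied to two KeyedStreams, or to a DataStream plus a broadcast stream, to make the result deterministic; this is where keyed state comes in.
Split and select: split is the inverse of union. It returns a SplitStream, and the select method returns the individual DataStreams.

Distribution Transformations: normally the partitioning is determined by the operation semantics and the parallelism, but shuffle, rebalance, rescale, broadcast, global, and partitionCustom (user-defined) are also supported. [figure: rebalance vs. rescale]

Setting the Parallelism: at the application level and at the operator level

Types: types are used for network transfer and for reading and writing the state backend. The type system also unifies type differences across languages, such as Scala and Java tuples, and covers language-specific types such as Scala case classes.

Supported Data Types include primitives, Java and Scala tuples, Scala case classes, POJOs (including classes generated by Apache Avro), and some special types; all other types fall back to the Kryo serialization framework.
Creating Type Information for Data Types: the core class of the type system is TypeInformation. Flink provides a subclass for each supported data type; for example, NumericTypeInfo wraps the Integer, Long, Double, Byte, Short, Float, and Character classes.
Explicitly Providing Type Information: when automatic extraction of TypeInformation fails (for example, because of generic type erasure in Java), the returned TypeInformation has to be specified explicitly.

Defining Keys and Referencing Fields: keys can be given by position, by field name (string literal), or with a KeySelector

Implementing Functions: a function must be Serializable. If it has a non-serializable field, initialize that field in an overridden RichFunction.open method.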

6 Chapter6: Time-Based And Window Operators

6.1 Configuring Time Characteristics

6.1.1 Assigning Timestamp and Generating Watermarks

The argument of DataStream.assignTimestampsAndWatermarks can be an AssignerWithPeriodicWatermarks or an AssignerWithPunctuatedWatermarks; both extend TimestampAssigner.
Specifically:

TimestampAssigner.extractTimestamp: the interface method for extracting the timestamp
Watermark checkAndGetNextWatermark / Watermark getCurrentWatermark(): the interface methods for producing the watermark

AssignerWithPeriodicWatermarks: Watermark getCurrentWatermark(), for cases where the watermark is derived from timestamps

// the timestamp is SensorReading.timestamp
// the watermark is the maximum timestamp received so far minus 1 min
// env.getConfig.setAutoWatermarkInterval(5000) => getCurrentWatermark is called every 5 s
class PeriodicAssigner
    extends AssignerWithPeriodicWatermarks[SensorReading] {

  val bound: Long = 60 * 1000     // 1 min in ms
  // the maximum observed timestamp; offset by bound so that maxTs - bound cannot underflow
  var maxTs: Long = Long.MinValue + bound

  override def getCurrentWatermark: Watermark = {
    // generated watermark with 1 min tolerance
    new Watermark(maxTs - bound)
  }

  override def extractTimestamp(
      r: SensorReading,
      previousTS: Long): Long = {
    // update maximum timestamp
    maxTs = maxTs.max(r.timestamp)
    // return record timestamp
    r.timestamp
  }
}

When event timestamps are known to be ascending, this simplifies to assignAscendingTimestamps, which amounts to using the built-in AscendingTimestampExtractor (an AssignerWithPeriodicWatermarks). Another commonly used built-in AssignerWithPeriodicWatermarks is BoundedOutOfOrdernessTimestampExtractor.

AssignerWithPunctuatedWatermarks: Watermark checkAndGetNextWatermark, for cases where the watermark is derived from the event itself

// readings from sensor_1 carry the watermark
class PunctuatedAssigner
    extends AssignerWithPunctuatedWatermarks[SensorReading] {

  val bound: Long = 60 * 1000 // 1 min in ms

  override def checkAndGetNextWatermark(
      r: SensorReading,
      extractedTS: Long): Watermark = {
    if (r.id == "sensor_1") {
      // emit watermark if reading is from sensor_1
      new Watermark(extractedTS - bound)
    } else {
      // do not emit a watermark
      null
    }
  }

  override def extractTimestamp(
      r: SensorReading,
      previousTS: Long): Long = {
    // assign record timestamp
    r.timestamp
  }
}

6.1.2 Watermarks, Latency and Completeness

Watermarks are used to balance latency and result completeness.

6.2 Process Functions

Compared with the MapFunction introduced earlier, process functions are a family of low-level transformations that can read timestamps and watermarks and register timers. Examples: ProcessFunction, KeyedProcessFunction, CoProcessFunction, ProcessJoinFunction, BroadcastProcessFunction, KeyedBroadcastProcessFunction, ProcessWindowFunction, and ProcessAllWindowFunction.

KeyedProcessFunction, for instance, gives access to a TimerService, which exposes the current time and watermark and registers time-based callbacks:

processElement(v: IN, ctx: Context, out: Collector[OUT])
onTimer(timestamp: Long, ctx: OnTimerContext, out: Collector[OUT])

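Because timers are callbacks, keep an eye on thread and CPU usage.

Example TempIncreaseAlertFunction: if the incoming temperature keeps rising for more than 1 s, the timer fires and emits an alert; if the temperature drops in the meantime, the timer is cancelled.
Example FreezingMonitor: uses OutputTag[X] to emit to multiple streams
Example ReadingFilter: uses a CoProcessFunction to handle two streams that work together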

6.3 Window Operator

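A window assigns events to a bucket, and the computation then runs over the finite data in that bucket.

Tumbling Windows: TumblingEventTimeWindows.of, TumblingProcessingTimeWindows.of; aligned to the epoch by default, with an optional offset parameter.
Sliding Windows: SlidingEventTimeWindows.of, SlidingProcessingTimeWindows.of
Session Windows: EventTimeSessionWindows.withGap, ProcessingTimeSessionWindows.withGap

Three kinds of functions can be applied to a window:

ReduceFunction: for example, computing the maximum or minimum of a window
AggregateFunction: more flexible than ReduceFunction because the data types are no longer restricted; subclasses override the methods that create the initial accumulator, add an element, get the result, and merge accumulators. Both ReduceFunction and AggregateFunction need little state, since only the aggregate value is stored.
ProcessWindowFunction: computing something like the median of a window requires iterating over all of the window's elements. All elements are stored in ListState under the hood, so the state can grow very large; it is best to find a way to make the computation incrementally aggregated.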
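The state-size difference between the incremental and the full-window style can be made concrete. The sketch below mirrors the four-method shape of an AggregateFunction in plain Python; the class is illustrative and is not a Flink API. The average needs only a (sum, count) accumulator, while the median needs every element of the window.

```python
import statistics


class AverageAggregate:
    """Incremental aggregation: state is one (sum, count) pair per window."""

    def create_accumulator(self):
        return (0.0, 0)

    def add(self, value, acc):
        return (acc[0] + value, acc[1] + 1)

    def get_result(self, acc):
        return acc[0] / acc[1]

    def merge(self, a, b):
        # Needed when windows are merged, e.g. session windows.
        return (a[0] + b[0], a[1] + b[1])


readings = [20.0, 22.0, 27.0]

agg = AverageAggregate()
acc = agg.create_accumulator()
for r in readings:
    acc = agg.add(r, acc)  # only the accumulator is kept between events
incremental_avg = agg.get_result(acc)

# Full-window style: all elements must be retained until the window fires.
full_window_median = statistics.median(readings)

print(incremental_avg, full_window_median)
```

The accumulator stays constant in size no matter how many events the window receives, whereas the list behind the median grows with the window.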

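A custom window operator consists of three parts: assigner, trigger, and evictor.

incremental aggregation function (stores the aggregate value)
full window function (stores every event, using ListState)
mix of the two

Customization points:

extends WindowAssigner: defines custom window ranges;
extends Trigger: decides when the window is evaluated and results are emitted. onElement, onEventTime, and onProcessingTime return a TriggerResult, one of the enum values CONTINUE, FIRE, PURGE, and FIRE_AND_PURGE. See the built-in class EventTimeTrigger extends Trigger for a reference implementation.
Evictor: optional, and seems meaningful only with non-incremental aggregation functions. I did not fully understand it; see the TopSpeedWindowing example code for details.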

6.4 Joining Stream on Time

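Interval Join: INNER JOIN semantics, and only event time is supported.

A record of A joins the records of B that have the same key and whose timestamps fall within [-1 hour, +15 min] of A's timestamp; a record without a match is dropped. The corresponding code looks like:

input1
  .keyBy(…)
  .intervalJoin(input2.keyBy(…))
  .between(<lower-bound>, <upper-bound>) // bounds with respect to input1
  .process(ProcessJoinFunction) // process pairs of matched events

B behaves symmetrically, joining the A records in the corresponding time range. Given this behavior, the state has to hold:

A records with timestamp >= CurrentWatermark - 15 min (B may still join them)
B records with timestamp >= CurrentWatermark - 1 hour (A may still join them)
If the watermarks of the two streams are not aligned, retention is governed by the slower stream. Note: the state may become a bottleneck in read/write throughput and in size.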
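The retention rule above follows directly from the join condition. A minimal sketch, assuming timestamps in milliseconds and plain lists as buffers (the function names are hypothetical, and real state is kept per key):

```python
MINUTE = 60 * 1000
LOWER_MS = -60 * MINUTE  # B may be up to 1 hour older than A
UPPER_MS = 15 * MINUTE   # B may be up to 15 minutes newer than A


def matches(a_ts, b_ts):
    """Join condition: a_ts + lower <= b_ts <= a_ts + upper."""
    return a_ts + LOWER_MS <= b_ts <= a_ts + UPPER_MS


def prune(a_buffer, b_buffer, watermark):
    """Drop buffered records that no future record can join any more."""
    # A future B has ts >= watermark, so an A with a_ts + upper < watermark is dead.
    kept_a = [ts for ts in a_buffer if ts >= watermark - UPPER_MS]
    # A future A has ts >= watermark, so a B with b_ts < watermark + lower is dead.
    kept_b = [ts for ts in b_buffer if ts >= watermark + LOWER_MS]
    return kept_a, kept_b


a_buffer = [100 * MINUTE, 110 * MINUTE]
b_buffer = [50 * MINUTE, 70 * MINUTE]
kept_a, kept_b = prune(a_buffer, b_buffer, watermark=120 * MINUTE)
print(kept_a, kept_b)
```

With a watermark of 120 min, the A record at 100 min can only match B records up to 115 min, all of which have already arrived, so it can be dropped safely.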

Window Join:

The effect of a tumbling window join: [figure]

6.5 Handling Late Data

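Late data can be handled in three ways: dropping it, redirecting it, or updating results by including late events.

Redirecting relies mainly on the side-output feature:

Window operator with side output: .timeWindow.sideOutputLateData
In a ProcessFunction, by comparing the record timestamp with the current watermark: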

class LateReadingsFilter 
    extends ProcessFunction[SensorReading, SensorReading] {

  val lateReadingsOut = new OutputTag[SensorReading]("late-readings")

  override def processElement(
      r: SensorReading,
      ctx: ProcessFunction[SensorReading, SensorReading]#Context,
      out: Collector[SensorReading]): Unit = {

    // compare record timestamp with current watermark
    if (r.timestamp < ctx.timerService().currentWatermark()) {
      // this is a late reading => redirect it to the side output
      ctx.output(lateReadingsOut, r)
    } else {
      out.collect(r)
    }
  }
}

allowedLateness lets late data take part in the window computation again (the implication being that the window's data is retained longer)

Seeing Others, and Seeing Myself: Reading 《在工作中，看到中国》

2024-03-30T10:27:42+00:00

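I once read a collection of interviews from a livestream show, also spanning all walks of life, where most of the guests were celebrities. I gave up after a few chapters. It was not that the celebrities were wrong; the chicken-soup inspiration and lofty pronouncements between the lines were simply off-putting.

1. Seeing Others

Tides Come and Go covers four occupations: gold analyst, overseas property consultant, TV shopping, and brick-and-mortar retail. These occupations once rode the crest of the wave and are now in their twilight. Catch the wave of the times and everything goes smoothly; when the wave recedes, gloom follows. Those who lived through it always sigh, “How did the money just stop coming?”

The Soul of the Craftsman covers two occupations: locksmith and wind turbine maintenance. What does a locked door stand for? Behind it may be someone who has departed, someone having an affair, someone dealing drugs, or someone in pandemic quarantine. Wind turbine maintenance is hard and exhausting. Yet on the vast land, beneath a giant turbine, a person is like an ant, seeing heaven and earth, seeing all living things, and finding a simple, down-to-earth happiness. A craftsman is perhaps someone of whom it can be said, “He has ground countless keys, and has been grinding himself too.”

The Ends of the Earth covers three occupations: a station keeper in the Jinsha River valley, child acrobats, and a seaman.

The station keeper has stayed there for 20 years and watched people come and go. At the end of the story we finally hear the Yi song lyrics mentioned so many times, and the little dog, sensing something, chases the train for a long way.
One minute on stage takes ten years of practice off stage, and that is not only true of stars. Most of the acrobats who performed abroad as children left the profession as adults.
A seaman spends years drifting at sea to earn money. He married twice: the first wife took up selling insurance full time, the second took to beauty salons full time. What the two marriages have in common is that both wives met plenty of rich people, and the children are strangers to him.

Faces of the Younger Generation covers three occupations: girls working in sales, the “porn moderator”, and police dog trainer.

People often hold prejudices about girls who work in sales: “poor family, little education, could not find anything else, so sales it is.” Whether that is true is hard to say, but girls in sales probably do share one trait: they work extremely hard.
She joined a startup hoping to do new-media operations, but ended up as a “porn moderator” for online posts. Truer still: the boss, all bluster and empty promises, came out the winner.

Bitter and Sweet covers three occupations: credit card salesman, greenhouse farmer, and truck driver

Some people have little education and are not especially worldly, but they are not afraid of hardship or of being looked down on, and they earn their money honestly by selling credit cards. Some leave the business; some stay because of it.
“Over the years I have not seen many vegetable growers get rich, but nearly all the vegetable buyers have become tycoons.”
The truck driver, on the road, runs into highway officers, thieves, and roadblocks, sees both the warmth and the coldness of people, and tastes both bitter and sweet.

Unsung Heroes covers three occupations: grassroots community surveyor, anesthesiologist, and air traffic controller

We at Work covers four occupations: a young Russian drifting in Beijing, an AI education startup, a history teacher, and a 35-year-old executive.

2. Seeing Myself

None of the authors in this book are well known, and the writing and storytelling are uneven, but of all the tricks there are, only sincerity moves people.

For some reason I saw common ground across some of these occupations; that probably comes down to my own state of mind.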

Thoughts on Batch Processing and Stream Processing

2024-03-23T02:53:32+00:00

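1. The Boundary of Time

1.1. T+1

There are two kinds of time: processing time and event time. In most cases, data processing uses event time.

Taking a daily offline Hive table job as an example, let us look at how T+1 data is produced.

T+1 00:00 is a processing time. Assume that table A's data for day T has fully arrived at 00:05, and table B's data for day T at 01:05.

Once the data has fully arrived, the day-T increment is merged, and then a full snapshot table or a zipper (history) table is generated as needed. A possible timeline:

00:05 -> 00:30: merge table A's day-T increment into A-inc
00:30 -> 01:35: merge A-inc + A-base into table A's day-T partition
01:05 -> 01:10: merge table B's day-T increment into B-inc
01:10 -> 01:50: merge B-inc + B-base into table B's day-T partition
01:50: run the SQL A + B -> C to produce table C's day-T partition

This gives offline data warehouse development a very friendly model: the SQL operates on the complete data it needs.

Building on this easy-to-use, mature model, the infrastructure only has to guarantee two things:

How to tell whether the data has fully arrived? A production source may produce no data for long stretches, so this relies on heartbeat records from the FileAgent/CDC. The whole data flow preserves order, so when the heartbeat for day T+1 arrives, collection of day T can be considered complete.
How to orchestrate the tasks above? This relies on a job scheduling system. Compared with Kubernetes Jobs or Linux crontab, one of the two most powerful features of a big-data job scheduler is the ability to orchestrate a task DAG. With orchestration, an offline task splits simply into two phases: dependency check and execution.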

1.2. Micro Batch

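Offline processing handles one day of data; micro-batch uses a shorter interval, say 10 minutes. So when real-time requirements come up, moving from offline to micro-batch seems the more natural step.

Take the data for the interval [08:00, 08:10]: table A has fully arrived at 08:11 and table B at 08:16. Reusing the offline approach unchanged gives (A-base + A-inc) JOIN (B-base + B-inc), but the data volume is far too large, with a lot of wasted computation and storage, and what reaches downstream is the full dataset rather than a micro batch.

The goal should therefore be to emit only the data that changed within the micro batch.

A first idea is (A-inc JOIN B-base) UNION (B-inc JOIN A-base). However, rows in A-inc and B-inc may also match each other, so this is corrected to (A-inc JOIN (B-base + B-inc)) UNION (B-inc JOIN (A-base + A-inc)). The SQL then produces exactly the incremental data that has to go downstream.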
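The corrected formula can be checked on a tiny dataset. The sketch below assumes insert-only increments (no updates or deletes) and models tables as lists of (key, value) pairs; the sample rows are made up. It compares the incremental result with the difference between the full join before and after the batch, and shows the row the first idea misses.

```python
def join(a_rows, b_rows):
    """Inner join on key; returns a set of (key, a_value, b_value)."""
    return {(ak, av, bv) for ak, av in a_rows for bk, bv in b_rows if ak == bk}


a_base = [(1, "a1"), (2, "a2")]
a_inc = [(3, "a3"), (1, "a1b")]
b_base = [(1, "b1"), (3, "b3")]
b_inc = [(2, "b2"), (3, "b3b")]

full_before = join(a_base, b_base)
full_after = join(a_base + a_inc, b_base + b_inc)
expected_delta = full_after - full_before

# First idea: each increment against the other side's base only.
naive = join(a_inc, b_base) | join(a_base, b_inc)

# Corrected: each increment against the other side's base + increment.
delta = join(a_inc, b_base + b_inc) | join(a_base + a_inc, b_inc)

print("missed by the first idea:", expected_delta - naive)
print("corrected formula matches:", delta == expected_delta)
```

The UNION (set semantics) matters: a pair formed by two increment rows appears in both branches and has to be deduplicated.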

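Turning this idea into concrete requirements:

How to implement B-base + B-inc? This needs a storage system that supports upsert.
A-inc JOIN B-merged: the storage system must support point lookups.
Task orchestration again relies on the scheduling system.
How to obtain A-inc? Either through a storage-system field or a user field, or through windows/batch/binlog handling in the compute engine; the former is the better fit in terms of implementation complexity.

The benefit of event time is a clear external contract: once the current batch is done, all data <= 08:10 has been processed. In production, one also has to consider that with event time a single batch may have too much data to process at once.

Conversely, T+1 cares about yesterday's metrics (< today 00:00), but once we pursue micro-batch what we care about is freshness, not whether the data is strictly <= 08:10. In that case, processing time is also an option.

The handling of A-inc above also hides one thing from users by default: deleted data. In the current system, table A may only mark rows as deleted (markDel); physical deletes require reading the increment from the binlog. So under the micro-batch model, SQL users also have to think about how deletes are propagated.

If a row is updated several times within one period, both micro-batch and T+1 have the advantage that only the latest version triggers a computation, once.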

1.3. Stream

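Moving from micro-batch to streaming, I think the biggest change is the model.

Take computing users' historical orders as an example. With micro-batch, for users with new orders in the last 10 minutes, compute all of their historical orders:

-- current_time stands for the batch run time; interval syntax varies by engine
SELECT uid, COUNT(1)
FROM trade_table
WHERE update_time >= current_time - INTERVAL '10' MINUTE
  AND update_time < current_time
GROUP BY uid

trade_table lives in an OLAP engine, and the SQL the user writes differs little from T+1. If this SQL is handed to a stream processing engine, at least two points are ambiguous:

Which orders does the COUNT aggregate cover? Those since the streaming job started, or all the history in trade_table? And is trade_table stored in Kafka?
Is the time-range filter still needed at all?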

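We find that relying purely on the stream engine's state to solve this is not feasible at all. Going further, suppose trade_table_a is the stream of order updates, stored in Kafka, and table_table_b holds all historical orders, stored in HBase.

SELECT trade_table_a.uid, COUNT(1)
FROM trade_table_a
JOIN(Temporal) table_table_b 
	ON trade_table_a.uid = table_table_b.uid
GROUP BY trade_table_a.uid

From the full-data model's perspective, COUNT(1) here is the total number of historical orders matched by the JOIN. For a stream engine, though, the SQL semantics make it a COUNT over the trade_table_a stream: every update for a uid triggers another count of the history on top of the current result.

Processing a JOIN or AGGREGATE on a stream looks very much like a point lookup in a backend service, and the SQL differs considerably from how it is written offline on the full-data model.

So as the time boundary changes, both the model and the way of thinking change a great deal.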

2. 调度

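T+1 scheduling is premised on “all the data”, that is, on the task DAG. A scheduling platform is therefore indispensable for offline jobs. For streaming jobs the notion of “scheduling” is much lighter; I think “hosting” is a better word.

For both batch and streaming jobs, what is demanded of scheduling is mostly stability.

Batch jobs care whether a job starts on time, so one core metric of the scheduling system is scheduling delay.
Streaming jobs care about recovery time after a failure, so HA and restart-strategy design get special attention.
When I first worked on real-time computing, batch scheduling looked simple: every job is scheduled daily, and a scheduling change takes effect the next day. Streaming jobs are far more complicated because they are long-running, and a scheduling change requires a restart. Over time, the underlying configuration and dependencies of older jobs and newly started jobs drift far apart.
Later, working on offline scheduling, I found that precisely because everything is scheduled every day, database load and version changes have a larger impact. Ensuring that jobs are not restarted when the scheduling system itself restarts is also a very big challenge.
A peculiar experience, all in all.

Scheduling also affects correctness, which is easy to overlook:

For batch, correctness hinges on variable substitution. For example, when a user specifies $[yyyyMMddHH-1/24] in a task, the SQL actually executed must have it replaced with the right time.
For streaming, correctness hinges on checkpoints. Starting from the wrong checkpoint can lose data.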
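To make the substitution concrete, here is a minimal sketch of a resolver for placeholders of the form shown above. It is an assumption-laden illustration, not any scheduler's actual implementation: the offset is taken to be in days (so -1/24 means one hour earlier), only the yyyy, MM, dd, and HH tokens are supported, and the function name is made up.

```python
import re
from datetime import datetime, timedelta
from fractions import Fraction

# Java-style date tokens mapped to strftime directives (assumed subset).
_TOKENS = [("yyyy", "%Y"), ("MM", "%m"), ("dd", "%d"), ("HH", "%H")]
_PLACEHOLDER = re.compile(r"\$\[([A-Za-z]+)(?:([+-])(\d+)(?:/(\d+))?)?\]")


def resolve(text, schedule_time):
    """Replace $[format±N] or $[format±N/M] using the scheduled time."""

    def _substitute(match):
        fmt, sign, num, den = match.groups()
        offset_days = Fraction(0)
        if sign:
            offset_days = Fraction(int(num), int(den or 1))
            if sign == "-":
                offset_days = -offset_days
        moment = schedule_time + timedelta(seconds=int(offset_days * 86400))
        for java_token, directive in _TOKENS:
            fmt = fmt.replace(java_token, directive)
        return moment.strftime(fmt)

    return _PLACEHOLDER.sub(_substitute, text)


scheduled = datetime(2024, 3, 23, 0, 0)
sql = resolve("SELECT * FROM t WHERE dt_hour = '$[yyyyMMddHH-1/24]'", scheduled)
print(sql)
```

The important design point is that the substitution uses the scheduled time of the instance, not the wall-clock time at execution; otherwise a delayed or re-run instance would read the wrong partition.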

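Similarities and differences:

Similarities: managing job status and logs, the job submission flow, retrying on failure, alerting, syntax checking, and so on.
Differences: scheduled execution versus long-running, monitoring (qps, latency, CPU, memory, GC), automatic resource tuning, and so on; managing Flink jobs also involves fairly heavy checkpoint handling.

Feature-wise, a development platform has to cover drafts, permissions, job groups, cluster registration, job release and rollback flows, approval and audit, run logs, and so on.

For a company, a single platform offers uniform usage habits, permissions, alerting, and so on, so both real-time and offline jobs should be developed and managed on one platform. Unifying the underlying scheduling system across one-off (batch) and long-running (streaming) jobs is still somewhat difficult, but I think it should be unified as far as possible. In resource scheduling, Kubernetes is well ahead here, offering several resource objects such as Job, CronJob, and Deployment; it is worth learning from.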

Reading Notes on 《Kubernetes修炼手册》

2024-03-10T01:05:45+00:00

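I need to roll out Native Flink on Kubernetes soon, but the company's container team can offer little support. So I started reading Kubernetes books. I began this one before the Spring Festival and read it in the mornings and on weekends, on and off for about a month, and learned a lot. It is suitable for beginners and highly recommended.

1. The Kubernetes System

A Kubernetes cluster consists of master nodes and worker nodes:

Master: the control plane, which contains
1. API Server: how components communicate; exposes a RESTful API over HTTPS
2. Cluster store: etcd
3. Controller manager: the manager of controllers such as the node controller, the endpoints controller, and the replication controller; it ensures that the current state of the cluster matches the desired state
4. Scheduler: watches the API Server for new work and assigns it to suitable, healthy nodes
5. Cloud controller manager: integrates with the underlying public cloud services such as instances, load balancers, and storage
Node (worker): contains
1. Kubelet: registers the node with the cluster, so that the cluster's resource pool learns the node's CPU, memory, and storage and adds the node to the pool
2. CRI: the kubelet needs a container runtime to carry out container-related tasks such as pulling images and starting or stopping containers
3. kube-proxy: ensures that every worker node gets a unique IP address, and implements local iptables and IPVS rules to handle network routing and load balancing between Pods
DNS: every Kubernetes cluster has its own internal DNS service for communication inside the cluster.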

2. Pod/Deployment/Service

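The VM is the atomic unit of scheduling in VMware, the container is the atomic unit in Docker, and the Pod is the atomic unit in Kubernetes. The simplest usage is one container per Pod; when containers need to share an IPC namespace, memory, disk, network, and so on, several containers can run in one Pod.

A Pod has no self-healing, cannot scale, and does not support convenient upgrades and rollbacks; a Deployment does. Most of the time users deploy Pods through higher-level controllers, which include Deployment, DaemonSet, and StatefulSet.
The Pod is the basic unit for deploying a microservice, and the Deployment adds scaling, self-healing, and rolling upgrades. But we cannot rely on individual Pod IPs to reach them. A Service object provides stable, reliable networking for a dynamic set of Pods.
A Service matches Pods by label: a Pod is selected as long as it carries all the labels in the selector. The Pod's labels need not be identical to the selector; extra labels on the Pod make no difference. There are three Service types, ClusterIP, NodePort, and LoadBalancer; LoadBalancer uses the cloud provider's load-balancing service, so use it with care.

An example from Native Flink on Kubernetes:

The JobManager is deployed as a Deployment with replicas=1, meaning exactly one Pod. Both the Deployment and the Pod carry the labels type=flink-native-kubernetes, component=jobmanager, and app=<Deployment name>.
TaskManagers are deployed as bare Pods, also labeled with type and app, plus component=taskmanager

RestService: the JobManager also creates a Service that exposes the REST endpoint; in non-HA mode it additionally defines a headless service with clusterIP=None for communication inside the Flink cluster. The Service's spec.selector field defines the set of labels used to select Pods. For example: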

% kubectl describe svc my-first-application-cluster-rest
Name:                     my-first-application-cluster-rest
Namespace:                default
Labels:                   app=my-first-application-cluster
                       type=flink-native-kubernetes
Annotations:              <none>
Selector:                 app=my-first-application-cluster,component=jobmanager,type=flink-native-kubernetes
Type:                     NodePort
IP Family Policy:         SingleStack
IP Families:              IPv4
IP:                       x
IPs:                      x
Port:                     rest  8081/TCP
TargetPort:               8081/TCP
NodePort:                 rest  31492/TCP
Endpoints:                x:8081
Session Affinity:         None
External Traffic Policy:  Cluster
Events:
  Type    Reason           Age   From                Message
  ----    ------           ----  ----                -------
  Normal  EnsuringService  13m   service-controller  Deleted Loadbalancer
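The matching rule behind the Selector line in the output above can be written down in one expression. This is a sketch of equality-based selection only (the kind a Service uses); the extra pod-template-hash label is a made-up example of a label the selector does not mention.

```python
def selects(selector, pod_labels):
    """A Pod matches when it carries every selector label with the same value."""
    return all(pod_labels.get(key) == value for key, value in selector.items())


rest_selector = {
    "app": "my-first-application-cluster",
    "component": "jobmanager",
    "type": "flink-native-kubernetes",
}

jobmanager_pod = {
    "app": "my-first-application-cluster",
    "component": "jobmanager",
    "type": "flink-native-kubernetes",
    "pod-template-hash": "6647497579",  # extra label, ignored by the selector
}
taskmanager_pod = {
    "app": "my-first-application-cluster",
    "component": "taskmanager",
    "type": "flink-native-kubernetes",
}

print(selects(rest_selector, jobmanager_pod))   # JobManager receives REST traffic
print(selects(rest_selector, taskmanager_pod))  # TaskManagers do not
```

This is why the component label matters: without it in the selector, the REST Service would also route requests to TaskManager Pods.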

3. 服务发现

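Every Service object maintains an Endpoints object, which holds the changing list of Pod IPs; the list is filtered by the selector.

A kube-proxy runs on every node. It creates IPVS rules for new Services and Endpoints so that traffic to a Service's ClusterIP is forwarded to one of the Pods matching the label selector.

Kubernetes uses the cluster DNS as its service registry. The important pieces:

Pod: managed by the coredns Deployment
Service: a ClusterIP Service named kube-dns, listening on TCP/UDP port 53
Endpoint: also named kube-dns. All objects related to the cluster DNS carry the label k8s-app=kube-dns, which is handy for filtering kubectl output.

An FQDN has the form $object-name.$namespace.svc.cluster.local, for example ent.prod.svc.cluster.local

For example, from a Pod in the dev namespace:

curl env:8080 reaches the env service in ns=dev
curl env.prod.svc.cluster.local reaches the env service in ns=prod

This is because DNS resolves the two names to different IPs. Likewise, the name cannot be resolved from outside the cluster, but it can from inside a container: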

% curl my-first-application-cluster-rest:8081
curl: (6) Could not resolve host: my-first-application-cluster-rest; Unknown error

% kubectl exec -it -n default my-first-application-cluster-taskmanager-1-1 -- /bin/bash
root@my-first-application-cluster-taskmanager-1-1:/opt/flink# curl my-first-application-cluster-rest:8081

Cluster DNS is wired in through each Pod's /etc/resolv.conf. For example, the JobManager and TaskManager of the same Flink job:

root@my-first-application-cluster-6647497579-79dr5:/opt/flink# cat /etc/resolv.conf
search default.svc.cluster.local svc.cluster.local cluster.local
nameserver x.y.z
options ndots:5

root@my-first-application-cluster-taskmanager-1-1:/opt/flink# cat /etc/resolv.conf
search default.svc.cluster.local svc.cluster.local cluster.local
nameserver x.y.z
options ndots:5

This x.y.z is exactly the CLUSTER-IP of the kube-dns Service.
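How a short name such as my-first-application-cluster-rest becomes resolvable follows from the search and ndots lines. The sketch below models the usual resolver behavior (names with fewer dots than ndots go through the search list first); it only computes the candidate names and performs no DNS query, and the function name is made up.

```python
def candidate_names(name, search, ndots):
    """Order in which a stub resolver would try names for a lookup."""
    if name.endswith("."):
        return [name]  # already fully qualified, no search list
    expanded = [f"{name}.{domain}" for domain in search]
    if name.count(".") >= ndots:
        return [name] + expanded  # enough dots: try the name as-is first
    return expanded + [name]      # otherwise walk the search list first


search = ["default.svc.cluster.local", "svc.cluster.local", "cluster.local"]

for candidate in candidate_names("my-first-application-cluster-rest", search, ndots=5):
    print(candidate)
```

The first candidate is the Service's FQDN in the Pod's own namespace, which is why the short name works inside the cluster. With ndots:5, even a name like env.prod.svc.cluster.local (four dots) is tried against the search list before being tried as-is, which costs a few extra lookups.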

4. 存储

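kind=PersistentVolume kind=PersistentVolumeClaim
These mount external storage. I expect little use for them at this stage and did not read closely. Later on, I plan to use them to mount files that a user's Flink job has not packaged into the image.

5. ConfigMap

Decoupling the application from its configuration has these advantages:

Reusable application images
Easier testing
Simpler, less disruptive changes

A ConfigMap holds multiple key/value entries. The flow for creating and using one:

Importing a ConfigMap through a volume:

1. Create the ConfigMap, named multimap (details omitted)
2. Create a volume named volmap backed by the ConfigMap multimap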

spec:
  volumes:
    - name: volmap
      configMap:
        name: multimap

3. Mount volmap at /etc/name

spec:
  containers:
    - name: ctr
      image: nginx
      volumeMounts:
        - name: volmap
          mountPath: /etc/name

The effect is that files appear under /etc/name: each file is named after a key of multimap, and its content is the corresponding value

Now a real-world case, how the Flink JobManager uses a ConfigMap:

spec:
  containers:
  - args:
    - native-k8s
    - ...
    volumeMounts:
    - mountPath: /opt/flink/conf
      name: flink-config-volume
  volumes:
  - configMap:
      defaultMode: 420
      items:
      - key: logback-console.xml
        path: logback-console.xml
      - key: log4j-console.properties
        path: log4j-console.properties
      - key: flink-conf.yaml
        path: flink-conf.yaml
      name: flink-config-zlink-202402201833
    name: flink-config-volume

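A volume named flink-config-volume is defined and mounted at /opt/flink/conf
The flink-config-volume volume references 3 entries of the ConfigMap flink-config-zlink-202402201833
The effect: the file /opt/flink/conf/$key has the content $value, where key and value come from the entries referenced in the previous point; when the ConfigMap is updated, the file content is updated asynchronously as well.
The mount point is read-only; files are changed by updating the ConfigMap object.

6. Cluster Operations

Cluster operations are the most complex part and the one that most tests familiarity:

kubectl get/describe/apply/delete work similarly across resource types and need to become second nature.
Few commands are available inside a Pod, so start a dedicated troubleshooting Pod with the common network tools installed (ping, traceroute, curl, dig, nslookup, and so on). For example, a common way to test DNS resolution is to use nslookup to resolve the kubernetes.default Service that fronts the API Server; the request should return an IP address and the name kubernetes.default.svc.cluster.local.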