Apache Beam KafkaIO batch mode OOM problem

I have a use case where I want to read data from Kafka with Apache Beam in batch mode, using the Spark runner.

Using the withMaxNumRecords(long) method of KafkaIO, the unbounded Kafka source can be treated as a bounded source. But I found that in batch mode, the data is first read from each partition and held in memory, and only then passed on to the next operations (map, filter, etc.).
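For context, a stripped-down version of the read I am describing looks roughly like this (the broker address, topic name, and record cap below are placeholders, not my real values):

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.kafka.common.serialization.StringDeserializer;

public class KafkaBatchRead {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply(KafkaIO.<String, String>read()
        .withBootstrapServers("kafka:9092")            // placeholder broker
        .withTopic("my-topic")                         // placeholder topic
        .withKeyDeserializer(StringDeserializer.class)
        .withValueDeserializer(StringDeserializer.class)
        // withMaxNumRecords bounds the otherwise unbounded Kafka source,
        // which is what allows the pipeline to run as a batch job.
        .withMaxNumRecords(5_000_000L));               // placeholder cap

    p.run().waitUntilFinish();
  }
}
```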

There is a large amount of data in each partition, and when reading it in batch mode I get an OOM error. I have tried increasing the executor memory, but I cannot configure a value for this parameter that works for every run. Another thing is that I am able to read the same data in streaming mode.

I think this is happening because, in batch mode, all records from each partition are assigned to the GlobalWindow (part of the ProcessContext), which only fires once all the data has been read. This may be what is causing the OOM.
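To check which window each element lands in, I used a small DoFn along these lines (just a diagnostic sketch; with a bounded read and no explicit Window.into(...), it prints the GlobalWindow):

```java
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.windowing.BoundedWindow;

// Logs the window each element was assigned to. Beam injects the
// BoundedWindow parameter into @ProcessElement automatically.
class LogWindowFn<T> extends DoFn<T, T> {
  @ProcessElement
  public void processElement(ProcessContext c, BoundedWindow window) {
    System.out.println("element in window: " + window);
    c.output(c.element());
  }
}
```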

If this is the cause, how can I change the GlobalWindow in the ProcessContext to a window per partition?

If this is not the cause, how can I read large amounts of data from Kafka in batch mode with Apache Beam, without increasing the executor memory for every run?

Answer by lz0738:

From the documentation:

You can use windowing with fixed-size data sets in bounded PCollections. However, note that windowing considers only the implicit timestamps attached to each element of a PCollection, and data sources that create fixed data sets (such as TextIO) assign the same timestamp to every element. This means that all the elements are by default part of a single, global window.

To use windowing with fixed data sets, you can assign your own timestamps to each element. To assign timestamps to elements, use a ParDo transform with a DoFn that outputs each element with a new timestamp (for example, the WithTimestamps transform in the Beam SDK for Java).

Here is an example of how to define windows for a bounded data set: https://beam.apache.org/get-started/wordcount-example/#unbounded-and-bounded-datasets
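A minimal, self-contained sketch of that idea, using Create as a stand-in for your Kafka read (the timestamp field and the one-minute window size are illustrative assumptions, not requirements):

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.WithTimestamps;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;
import org.joda.time.Instant;

public class WindowedBoundedRead {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create();

    // Stand-in for the bounded Kafka read: (timestampMillis, payload) pairs.
    PCollection<KV<Long, String>> records = p.apply(Create.of(
        KV.of(1_700_000_000_000L, "a"),
        KV.of(1_700_000_060_000L, "b")));

    records
        // Re-stamp each element with the timestamp carried in the data,
        // since bounded sources give every element the same default timestamp.
        .apply(WithTimestamps.of(kv -> new Instant(kv.getKey())))
        // Break the single global window into fixed one-minute windows so
        // downstream transforms see bounded chunks instead of the whole set.
        .apply(Window.into(FixedWindows.of(Duration.standardMinutes(1))));

    p.run().waitUntilFinish();
  }
}
```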

The documentation also notes that a bounded data set is read once in its entirety and processed as a finite batch job, as described below:

The bounded (or unbounded) nature of your PCollection affects how Beam processes your data. A bounded PCollection can be processed using a batch job, which might read the entire data set once, and perform processing in a job of finite length. An unbounded PCollection must be processed using a streaming job that runs continuously, as the entire collection can never be available for processing at any one time.

I believe that to solve your problem, you can try setting a window on the bounded collection and then run it on Cloud Dataflow to see whether that works. You can also refer to the Beam Capability Matrix to see which features the Spark runner supports.
