如何使用Google Dataflow Python按数据字段中的字段执行分区gsink（镶木地板）

2024-05-19 • 问答

我正在尝试从GS存储桶中读取数据，并将其输出到另一个GS存储桶（按自定义列，arrival_date划分）到另一个存储桶中。数据为实木复合地板格式。寻找基于Apache Beam Python的源代码以在数据流中运行。

源数据具有三列：组织，名称，到达时间预期输出：组织，名称，到达时间到组织明智的文件夹中。

# Instantiate a pipeline with all the pipeline options
p = beam.Pipeline(options=options)


#Processing and structure of pipeline 
p \
| 'Input: QueryTable' >> beam.io.Read(beam.io.BigQuerySource(
    query=known_args.bql,use_standard_sql=True)) \
| 'Output: Export to Parquet' >> beam.io.parquetio.WriteToParquet(
        file_path_prefix=known_args.output,schema=parquet_schema,file_name_suffix='.parquet'
    )

我正在寻找有关如何实现对按自定义列（在这种情况下为org）分区的GS文件夹进行写入的答案

gengwenshuang 回答：如何使用Google Dataflow Python按数据字段中的字段执行分区gsink（镶木地板）

暂时没有好的解决方案，如果你有好的解决方案，请发邮件至：iooj@foxmail.com

apache-beam dataflow google-cloud-dataflow google-cloud-storage

本文链接：https://www.f2er.com/3167492.html

如何使用Google Dataflow Python按数据字段中的字段执行分区gsink（镶木地板）

gengwenshuang 回答：如何使用Google Dataflow Python按数据字段中的字段执行分区gsink（镶木地板）

大家都在问