I loaded a PySpark DataFrame from a 3 GB json.gz file. It has the following schema:
root
 |-- _id: long (nullable = false)
 |-- quote: string (nullable = true)
 |-- occurrences: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- articleID: string (nullable = true)
 |    |    |-- title: string (nullable = true)
 |    |    |-- date: string (nullable = true)
 |    |    |-- author: string (nullable = true)
 |    |    |-- source: string (nullable = true)
I need to drop the title, author, and date fields, or create a new DataFrame that does not contain them.
So far I have managed to get the following schema:
root
 |-- _id: long (nullable = false)
 |-- quote: string (nullable = true)
 |-- occurrences: array (nullable = false)
 |    |-- element: struct (containsNull = false)
 |    |    |-- articleID: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- source: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
using
from pyspark.sql.functions import array, col, struct

df.select(df._id, df.quote, array(
    struct(
        col("occurrences.articleID"), col("occurrences.source")
    )
).alias("occurrences"))
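But this turns articleID and source into separate arrays inside a single struct. I need each element of occurrences to stay a struct that holds one articleID together with its source. How can I do that?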
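One possible direction, offered as a sketch rather than a confirmed solution: `col("occurrences.articleID")` applied to an array of structs pulls that field out of every element and returns an array of strings, which is why the attempt above ends up with two arrays. Rebuilding each element individually avoids this. The example below assumes Spark 2.4 or later, where the SQL higher-order function `transform` is available, and uses `named_struct` to keep the field names. The helper names `keep_fields_expr` and `prune_array_structs` are hypothetical and exist only for illustration.

```python
def keep_fields_expr(array_col, keep):
    """Build a Spark SQL expression that rebuilds every struct in
    `array_col`, keeping only the fields listed in `keep`."""
    # Each kept field becomes a "'name', x.name" pair for named_struct.
    pairs = ", ".join(f"'{name}', x.{name}" for name in keep)
    return f"transform({array_col}, x -> named_struct({pairs}))"


def prune_array_structs(df, array_col, keep):
    """Return `df` with `array_col` replaced by the pruned array of structs."""
    # Imported here so the string helper above can be inspected without Spark.
    from pyspark.sql.functions import expr
    return df.withColumn(array_col, expr(keep_fields_expr(array_col, keep)))


# Hypothetical usage with the DataFrame from the question:
# pruned = prune_array_structs(df, "occurrences", ["articleID", "source"])
# pruned.printSchema()
```

Because the struct is rebuilt per element, occurrences remains an array of structs with articleID and source as plain string fields, and _id and quote are left untouched.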