root
|-- address: string (nullable = true)
|-- attributes: map (nullable = true)
|    |-- key: string
|    |-- value: string (valueContainsNull = true)
|-- business_id: string (nullable = true)
|-- categories: string (nullable = true)
|-- city: string (nullable = true)
|-- hours: map (nullable = true)
|    |-- key: string
|    |-- value: string (valueContainsNull = true)
|-- is_open: long (nullable = true)
|-- latitude: double (nullable = true)
|-- longitude: double (nullable = true)
|-- name: string (nullable = true)
|-- postal_code: string (nullable = true)
|-- review_count: long (nullable = true)
|-- stars: double (nullable = true)
|-- state: string (nullable = true)
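I'm currently working with the Yelp dataset, and my goal is to find the total number of hours a business is open per day/week. From the data, I can extract the time range for each day of the week, which looks like [9:0,0:0]. How can I use PySpark to get two columns, one showing the opening time [9:0] and the other showing the closing time [0:0]?
Here is the code I'm using to simply display the business hours in the dataset.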
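import pyspark.sql.functions as f
# Explode the hours map into one row per (day, time range) pair
df_hours = (
    df_MappedBusiness
    .select("business_id", "name", f.explode("hours").alias("hourDay", "hourValue"))
    # Split in a separate step so that hourValue already exists as a column
    .withColumn("split_hours", f.split("hourValue", "-"))
)
df_hours.show(50, truncate=False)  # show() returns None, so keep it out of the assignment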
Expected Output
---------------
+-------+---------+-----------+----------+-----------+
|hourDay|hourValue|split_hours|open_hours|close_hours|
+-------+---------+-----------+----------+-----------+
|Monday |9:0-0:0  |[9:0,0:0]  |[9,0]     |[0,0]      |
+-------+---------+-----------+----------+-----------+
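To make the intended result concrete, here is a minimal plain-Python sketch of the logic behind the expected output, plus the daily/weekly totals that are the stated goal. The helper names (`split_range`, `hours_open`, `weekly_hours`) are hypothetical and not part of the dataset or of PySpark. It also assumes that a closing time at or before the opening time (such as 0:0) means the business closes after midnight, which may not hold for every record. In PySpark, the same split could presumably be expressed per column with `f.split("hourValue", "-").getItem(0)` and `.getItem(1)`.

```python
from typing import Dict, List, Tuple


def split_range(hour_value: str) -> Tuple[List[int], List[int]]:
    """Split a range such as '9:0-0:0' into ([9, 0], [0, 0])."""
    open_part, close_part = hour_value.split("-")
    open_hm = [int(x) for x in open_part.split(":")]
    close_hm = [int(x) for x in close_part.split(":")]
    return open_hm, close_hm


def hours_open(hour_value: str) -> float:
    """Return the number of hours covered by one 'H:M-H:M' range."""
    open_hm, close_hm = split_range(hour_value)
    open_min = open_hm[0] * 60 + open_hm[1]
    close_min = close_hm[0] * 60 + close_hm[1]
    # Assumption: a close time at or before the open time wraps past midnight.
    if close_min <= open_min:
        close_min += 24 * 60
    return (close_min - open_min) / 60


def weekly_hours(hours: Dict[str, str]) -> float:
    """Sum the daily totals over a day -> range mapping (the shape of the hours map)."""
    return sum(hours_open(value) for value in hours.values())


if __name__ == "__main__":
    print(split_range("9:0-0:0"))
    print(hours_open("9:0-0:0"))
    print(weekly_hours({"Monday": "9:0-0:0", "Tuesday": "9:0-17:30"}))
```

Note that under the wrap-around assumption a value of 0:0-0:0 counts as 24 hours; if such records actually mean "closed" in your data, that case would need separate handling.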