Cannot import parse_url in pyspark

I have this SQL query, written in HiveQL and run from pyspark:

spark.sql('SELECT split(parse_url(page.viewed_page,"PATH"),"/")[1] as path FROM df')

I would like to translate it into the equivalent DataFrame (functional) query, something like this:

df.select(split(parse_url(col('page.viewed_page'),'HOST')))

But when I try to import the parse_url function, I get:

----> 1 from pyspark.sql.functions import split,parse_url

ImportError: cannot import name 'parse_url' from 'pyspark.sql.functions' (/usr/local/opt/apache-spark/libexec/python/pyspark/sql/functions.py)

Can you point me in the right direction for importing the parse_url function?

Cheers

woaizhaokai2009 answered:

parse_url is a Hive UDF, so you need to enable Hive support when you create the SparkSession object.

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

Then your query should work as written:

spark.sql('SELECT split(parse_url(page.viewed_page,"PATH"),"/")[1] as path FROM df')

If your Spark version is < 2.2:

from pyspark import SparkContext
from pyspark.sql import HiveContext,SQLContext

sc = SparkContext()

sqlContext = SQLContext(sc)
hiveContext = HiveContext(sc)

query = 'SELECT split(parse_url(page.viewed_page,"PATH"),"/")[1] as path FROM df'

hiveContext.sql(query) # this will work
sqlContext.sql(query) # this will not work

Edit:

parse_url is a built-in Spark SQL function as of Spark v2.3. As of 7/2019, it is not yet available in pyspark.sql.functions.
