PySpark equivalent of a SQL query

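I have just started diving into PySpark.

I have a dataset containing some values, shown below, and there is a query on it that I have not been able to write.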


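This is a sample of the actual dataset, which contains about 20,000 rows. I am reading this CSV file as a DataFrame in the pyspark shell and trying to run some basic SQL-style queries on it. The following is one such query that I could not get to work: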

1. Which country has the least number of Government Type (4th column)?

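I have written a few other queries by hand that could be done in SQL, but I mainly want to understand this one. Once I get the idea, I can work out the others myself.

This is the only line I managed to write after a lot of trial and error: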

df.filter(df.Government=='Democratic').select('Country').show()

I am not sure how to approach this problem. Any ideas?

jun576 answered: PySpark equivalent of a SQL query

Here is how you can do it:

from pyspark.sql import Row

# Row factory with the four columns of the sample data
Demography = Row("City", "Country", "Population", "Government")

demo1 = Demography("a", "AD", 1.2, "Democratic")
demo2 = Demography("b", "AD", 1.2, "Democratic")
demo3 = Demography("c", "AD", 1.2, "Democratic")
demo4 = Demography("m", "XX", 1.2, "Democratic")
demo5 = Demography("n", "XX", 1.2, "Democratic")
demo6 = Demography("o", "XX", 1.2, "Democratic")
demo7 = Demography("q", "XX", 1.2, "Democratic")

demographic_data = [demo1, demo2, demo3, demo4, demo5, demo6, demo7]

# `spark` is the SparkSession that the pyspark shell provides
demographic_data_df = spark.createDataFrame(demographic_data)
demographic_data_df.show(10)

+----+-------+----------+----------+
|City|Country|Population|Government|
+----+-------+----------+----------+
|   a|     AD|       1.2|Democratic|
|   b|     AD|       1.2|Democratic|
|   c|     AD|       1.2|Democratic|
|   m|     XX|       1.2|Democratic|
|   n|     XX|       1.2|Democratic|
|   o|     XX|       1.2|Democratic|
|   q|     XX|       1.2|Democratic|
+----+-------+----------+----------+

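from pyspark.sql import functions as f

# Count the rows per country and rename the count column to n
new = demographic_data_df.groupBy('Country').count().select('Country', f.col('count').alias('n'))
# Largest count over all countries (max_n avoids shadowing the built-in max)
max_n = new.agg(f.max('n').alias('n'))
# Keep only the countries whose count equals that maximum
new.join(max_n, on="n", how="inner").show()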

+---+-------+                                                                   
|  n|Country|
+---+-------+
|  4|     XX|
+---+-------+

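Another option is to register the DataFrame as a temporary view and run regular SQL queries against it. To register it, you can do the following (createOrReplaceTempView replaces registerTempTable, which has been deprecated since Spark 2.0):

demographic_data_df.createOrReplaceTempView("demographic_data_table")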

Hope that helps.
