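Remove rows whose value is a string from a PySpark DataFrame

2024-05-04 • Q&A

I am trying to run KMeans with Apache Spark on geospatial data stored in a MongoDB database. The data has the following format: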

DataFrame[decimalLatitude: double,decimalLongitude: double,features: vector]

The code is as follows, where inputdf is the DataFrame.

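from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler

# Combine the two coordinate columns into a single vector column
vecAssembler = VectorAssembler(
    inputCols=["decimalLatitude", "decimalLongitude"], outputCol="features")
inputdf = vecAssembler.transform(inputdf)
kmeans = KMeans(k=10, seed=123)
model = kmeans.fit(inputdf.select("features"))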

I get the following error, which suggests that the dataset contains some empty strings:

com.mongodb.spark.exceptions.MongoTypeConversionException: Cannot cast STRING into a Integertype (value: BsonString{value=''})

I tried to find such rows with:

issuedf = inputdf.where(inputdf.decimalLatitude == '')
issuedf.show()

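But I get the same type conversion error as above. I also tried df.replace, with the same error. How can I remove all rows that contain this value?

jsj07123 answered: Remove rows whose value is a string from a PySpark DataFrame

This can be solved by supplying the data types explicitly when loading the data: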

from pyspark.sql.types import StructType, StructField, DoubleType
inputdf = my_spark.read.format("mongo").load(schema=StructType(
    [StructField("decimalLatitude", DoubleType(), True),
     StructField("decimalLongitude", DoubleType(), True)]))

This ensures that all values are DoubleType. Null values can then be removed with inputdf.dropna().

apache-spark pyspark pyspark-dataframes

Article link: https://www.f2er.com/3108958.html
