PySpark job aborted when calling o803.showString

I am running a PySpark script on AWS Glue, and my program errors out when it calls the .show() function. The program has been running fine for the past 3 months without any issues, and the code has not changed.

Here is the error from the log file:

111111111111111111111111
df5 Application final count = 359363
222222222222222222222222222222
    ******** Error -- An error occurred while calling o803.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 76.0 failed 4 times, most recent failure: Lost task 0.3 in stage 76.0 (TID 2065, 10.166.17.229, executor 2): com.microsoft.sqlserver.jdbc.SQLServerException: Subquery returned more than 1 value. This is not permitted when the subquery follows =, !=, <, <=, >, >= or when the subquery is used as an expression.
    at com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDatabaseError(SQLServerException.java:232)
    at com.microsoft.sqlserver.jdbc.SQLServerResultSet$FetchBuffer.nextRow(SQLServerResultSet.java:6254)
    at com.microsoft.sqlserver.jdbc.SQLServerResultSet.fetchBufferNext(SQLServerResultSet.java:1799)
    at com.microsoft.sqlserver.jdbc.SQLServerResultSet.next(SQLServerResultSet.java:1060)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:352)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:338)
    at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage19.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:148)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
    at org.apache.spark.scheduler.Task.run(Task.scala:121)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

I may be wrong, but I think the SQL exception is misleading. Since I am only printing the first 3 rows of the DataFrame with df5.show(3), how could it be a subquery returning more than one value? It is a DataFrame; I should be able to print its first 3 rows without any problem.
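
One possible explanation is that Spark evaluates lazily: rows are only fetched from SQL Server over JDBC when an action such as .show() or .count() actually runs, so an error raised by the server while streaming rows surfaces at the action even though the action itself is harmless. Below is a minimal sketch of the idea; the connection URL, table names, and query are all hypothetical, and the scalar subquery is just one example of SQL that fails with this exact SQL Server error as soon as it matches more than one row:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical source query whose SELECT list contains a scalar subquery.
# SQL Server raises "Subquery returned more than 1 value" the moment the
# inner SELECT matches two rows for some LogNumber -- i.e. when the data
# changes, even though the code has not.
src = """(SELECT a.LogNumber,
                 (SELECT b.applicationid
                    FROM apps b
                   WHERE b.LogNumber = a.LogNumber) AS applicationid
            FROM logs a) AS src"""

df = (spark.read.format("jdbc")
      .option("url", "jdbc:sqlserver://myhost:1433;databaseName=mydb")
      .option("dbtable", src)
      .option("user", "user")
      .option("password", "secret")
      .load())   # no error here: the JDBC read is lazy

df.show(3)       # the SQLServerException is raised here, while rows are fetched

If a source query (or a view it reads from) contains such a subquery, a job can run unchanged for months and then start failing the day the underlying data produces a duplicate.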

Here is my code:

from pyspark.sql.functions import coalesce, substring, trim

# First pass: exact match between LOG_NO and LogNumber
df5 = df5.join(df_app, trim(df5.LOG_NO) == trim(df_app.LogNumber), "left")\
         .select(df5["*"], df_app["applicationid"], df_app["CreatedDate"])

df5 = df5.withColumnRenamed("applicationid", "applicationid_1")\
         .withColumnRenamed("CreatedDate", "CreatedDate_1")

# Second pass: fall back to matching on the first 8 characters of LOG_NO
df5 = df5.join(df_app, substring(trim(df5.LOG_NO), 1, 8) == trim(df_app.LogNumber), "left")\
         .withColumnRenamed("applicationid", "applicationid_2")\
         .withColumnRenamed("CreatedDate", "CreatedDate_2")

df5 = df5.withColumn("PRMIS_APPLICATION_ID", coalesce(df5["applicationid_1"], df5["applicationid_2"]))\
         .withColumn("FILE_PRMIS_CREATION_DATE", coalesce(df5["CreatedDate_1"], df5["CreatedDate_2"]))

df5 = df5.drop("applicationid_1", "applicationid_2", "CreatedDate_1", "CreatedDate_2")
print("111111111111111111111111")
print("df5 Application final count = " + str(df5.count()))
print("222222222222222222222222222222")
df5.show(3)
print("33333333333333333333333333333")

As you can see from the code, it prints the 1s, the count, and the 2s, but fails on df5.show(3). Any help would be greatly appreciated, as I have been trying to debug this all day. Thanks in advance.
