PySpark job aborted when calling o803.showString

I am running a PySpark script on AWS Glue, and my program errors out when it calls the .show() function. The program has been running fine for the past 3 months without any issues, and the code has not changed.

Here is the error from the log file:

111111111111111111111111
df5 Application final count = 359363
222222222222222222222222222222
    ******** Error -- An error occurred while calling o803.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 76.0 failed 4 times, most recent failure: Lost task 0.3 in stage 76.0 (TID 2065, 10.166.17.229, executor 2): com.microsoft.sqlserver.jdbc.SQLServerException: Subquery returned more than 1 value. This is not permitted when the subquery follows =, !=, <, <=, >, >= or when the subquery is used as an expression.
    at com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDatabaseError(SQLServerException.java:232)
    at com.microsoft.sqlserver.jdbc.SQLServerResultSet$FetchBuffer.nextRow(SQLServerResultSet.java:6254)
    at com.microsoft.sqlserver.jdbc.SQLServerResultSet.fetchBufferNext(SQLServerResultSet.java:1799)
    at com.microsoft.sqlserver.jdbc.SQLServerResultSet.next(SQLServerResultSet.java:1060)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:352)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:338)
    at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage19.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:148)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
    at org.apache.spark.scheduler.Task.run(Task.scala:121)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

I may be wrong, but I think the SQL exception is misleading. Since I am only printing the first 3 rows of the DataFrame with df5.show(3), how could it be a subquery returning more than one value? It is a DataFrame; I should be able to print its first 3 rows without any problem.
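
One possible explanation is that Spark evaluates lazily: rows are only fetched from SQL Server over JDBC when an action such as .show() or .count() actually runs, so an error raised by the server while streaming rows surfaces at the action even though the action itself is harmless. Below is a minimal sketch of the idea; the connection URL, table names, and query are all hypothetical, and the scalar subquery is just one example of SQL that fails with this exact SQL Server error as soon as it matches more than one row:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical source query whose SELECT list contains a scalar subquery.
# SQL Server raises "Subquery returned more than 1 value" the moment the
# inner SELECT matches two rows for some LogNumber -- i.e. when the data
# changes, even though the code has not.
src = """(SELECT a.LogNumber,
                 (SELECT b.applicationid
                    FROM apps b
                   WHERE b.LogNumber = a.LogNumber) AS applicationid
            FROM logs a) AS src"""

df = (spark.read.format("jdbc")
      .option("url", "jdbc:sqlserver://myhost:1433;databaseName=mydb")
      .option("dbtable", src)
      .option("user", "user")
      .option("password", "secret")
      .load())   # no error here: the JDBC read is lazy

df.show(3)       # the SQLServerException is raised here, while rows are fetched

If a source query (or a view it reads from) contains such a subquery, a job can run unchanged for months and then start failing the day the underlying data produces a duplicate.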

Here is my code:

from pyspark.sql.functions import coalesce, substring, trim

# First pass: exact match between LOG_NO and LogNumber
df5 = df5.join(df_app, trim(df5.LOG_NO) == trim(df_app.LogNumber), "left")\
         .select(df5["*"], df_app["applicationid"], df_app["CreatedDate"])

df5 = df5.withColumnRenamed("applicationid", "applicationid_1")\
         .withColumnRenamed("CreatedDate", "CreatedDate_1")

# Second pass: fall back to matching on the first 8 characters of LOG_NO
df5 = df5.join(df_app, substring(trim(df5.LOG_NO), 1, 8) == trim(df_app.LogNumber), "left")\
         .withColumnRenamed("applicationid", "applicationid_2")\
         .withColumnRenamed("CreatedDate", "CreatedDate_2")

df5 = df5.withColumn("PRMIS_APPLICATION_ID", coalesce(df5["applicationid_1"], df5["applicationid_2"]))\
         .withColumn("FILE_PRMIS_CREATION_DATE", coalesce(df5["CreatedDate_1"], df5["CreatedDate_2"]))

df5 = df5.drop("applicationid_1", "applicationid_2", "CreatedDate_1", "CreatedDate_2")
print("111111111111111111111111")
print("df5 Application final count = " + str(df5.count()))
print("222222222222222222222222222222")
df5.show(3)
print("33333333333333333333333333333")

As you can see from the code, it prints the 1s, the count, and the 2s, but fails on df5.show(3). Any help would be greatly appreciated, as I have been trying to debug this all day. Thanks in advance.
