当集群上的人口规模增加时,Spark应用程序将卡住

我创建了一个Scala Spark应用程序,并在远程集群上运行实验。我的集群有2个具有8个核心的节点。

在我的应用程序中,我必须创建大小为N和D的总体。如果我为10,000次迭代设置N = 700和D = 700,则实验成功进行。但是,当我创建总体大小并将其设置为N = 1000和D = 1000时,Cluster陷入了困境。

INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on XXX.XX.XX.XXX:XXXX (size: 2.1 KB,free: 343.2 MB)

总体执行输出在这里

    Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
19/11/23 12:53:41 INFO SparkContext: Running Spark version 2.2.0
19/11/23 12:53:41 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
19/11/23 12:53:41 WARN Utils: Your hostname,localhost resolves to a loopback address: SOME_IP; using SOME_IP instead (on interface eno3)
19/11/23 12:53:41 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
19/11/23 12:53:41 INFO SparkContext: Submitted application:  APPLICATION_NAME
19/11/23 12:53:41 INFO SecurityManager: Changing view acls to: USER
19/11/23 12:53:41 INFO SecurityManager: Changing modify acls to: USER
19/11/23 12:53:41 INFO SecurityManager: Changing view acls groups to: 
19/11/23 12:53:41 INFO SecurityManager: Changing modify acls groups to: 
19/11/23 12:53:41 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(USER); groups with view permissions: Set(); users  with       modify permissions: Set(username); groups with modify permissions: Set()
19/11/23 12:53:41 INFO Utils: Successfully started service 'sparkDriver' on port XXXX.
19/11/23 12:53:41 INFO SparkEnv: Registering MapOutputTracker
19/11/23 12:53:41 INFO SparkEnv: Registering BlockManagerMaster
19/11/23 12:53:41 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
19/11/23 12:53:41 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
19/11/23 12:53:41 INFO DiskBlockManager: Created local directory at /tmp/XXXXXXXXXXXXXXXXX
19/11/23 12:53:41 INFO MemoryStore: MemoryStore started with capacity 366.3 MB
19/11/23 12:53:41 INFO SparkEnv: Registering OutputCommitCoordinator
19/11/23 12:53:42 INFO Utils: Successfully started service 'SparkUI' on port XXXX.
19/11/23 12:53:42 INFO SparkUI: Bound SparkUI to 0.0.0.0,and started at http://XXXXXXXX
19/11/23 12:53:42 INFO SparkContext: Added JAR file:/home/PATH/FILENAME.jar at spark://MASTER_IP:PORT/jars/APPLICATION.jar with timestamp 1574495622195
19/11/23 12:53:42 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://IP ADDRESS
19/11/23 12:53:42 INFO TransportClientFactory: Successfully created connection to MASTER  after 24 ms (0 ms spent in bootstraps)
19/11/23 12:53:42 INFO StandaloneschedulerBackend: Connected to Spark cluster with app ID 
19/11/23 12:53:42 INFO StandaloneAppClient$ClientEndpoint: Executor added:  with 4 cores
19/11/23 12:53:42 INFO StandaloneschedulerBackend: Granted executor ID  with 4 cores,1024.0 MB RAM
19/11/23 12:53:42 INFO StandaloneAppClient$ClientEndpoint: Executor added: with 4 cores
19/11/23 12:53:42 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port XXXX.
19/11/23 12:53:42 INFO StandaloneschedulerBackend: Granted executor ID on hostPort PORT with 4 cores,1024.0 MB RAM
19/11/23 12:53:42 INFO NettyBlockTransferService: Server created on MASTER
19/11/23 12:53:42 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
19/11/23 12:53:42 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver,MASTER,PORT,None)
19/11/23 12:53:42 INFO BlockManagerMasterEndpoint: Registering block manager MATER:PORT with 366.3 MB RAM,BlockManagerId(driver,None)
19/11/23 12:53:42 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver,None)
19/11/23 12:53:42 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver,None)
19/11/23 12:53:42 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-XXXX/1 is now RUNNING
19/11/23 12:53:42 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-XXXX/0 is now RUNNING
19/11/23 12:53:42 INFO StandaloneschedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0

19/11/23 12:53:43 INFO SparkContext: Starting job: collect at CODE_FILE_NAME.scala:001
19/11/23 12:53:43 INFO DAGScheduler: Got job 0 (collect at CODE_FILE_NAME.scala:001) with 3 output partitions
19/11/23 12:53:43 INFO DAGScheduler: Final stage: ResultStage 0 (collect at CODE_FILE_NAME.scala:001)
19/11/23 12:53:43 INFO DAGScheduler: Parents of final stage: List()
19/11/23 12:53:43 INFO DAGScheduler: Missing parents: List()
19/11/23 12:53:43 INFO DAGScheduler: Submitting ResultStage 0 (ParallelCollectionRDD[0] at parallelize at CODE_FILE_NAME.scala:0010),which has no missing parents
19/11/23 12:53:43 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 1560.0 B,free 366.3 MB)
19/11/23 12:53:43 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 1065.0 B,free 366.3 MB)
19/11/23 12:53:43 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on MASTER:IP (size: 1065.0 B,free: 366.3 MB)
19/11/23 12:53:43 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1006
19/11/23 12:53:43 INFO DAGScheduler: Submitting 3 missing tasks from ResultStage 0 (ParallelCollectionRDD[0] at parallelize at bat1.scala:42) (first 15 tasks are for partitions Vector(0,1,2))
19/11/23 12:53:43 INFO TaskSchedulerImpl: Adding task set 0.0 with 3 tasks
19/11/23 12:53:43 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (MASTER:PORT) with ID 1
19/11/23 12:53:43 INFO BlockManagerMasterEndpoint: Registering block manager MASTER:PORT with 366.3 MB RAM,BlockManagerId(1,None)
19/11/23 12:53:43 WARN TaskSetManager: Stage 0 contains a task of very large size (7849 KB). The maximum recommended task size is 100 KB.
19/11/23 12:53:43 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0,executor 1,partition 0,PROCESS_LOCAL,8037809 bytes)
19/11/23 12:53:43 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1,partition 1,8037809 bytes)
19/11/23 12:53:43 INFO TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2,partition 2,8061931 bytes)
19/11/23 12:53:43 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (MASTER:PORT) with ID 0
19/11/23 12:53:43 INFO BlockManagerMasterEndpoint: Registering block manager MASTER:PORT with 366.3 MB RAM,BlockManagerId(0,None)
19/11/23 12:53:46 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on MASTER:PORT (size: 1065.0 B,free: 366.3 MB)
19/11/23 12:53:46 INFO BlockManagerInfo: Added rdd_0_1 in memory on IP ADDRESS:PORT (size: 7.7 MB,free: 358.6 MB)
19/11/23 12:53:46 INFO BlockManagerInfo: Added rdd_0_0 in memory on IP ADDRESS:PORT (size: 7.7 MB,free: 350.9 MB)
19/11/23 12:53:46 INFO BlockManagerInfo: Added rdd_0_2 in memory on IP ADDRESS:PORT (size: 7.7 MB,free: 343.2 MB)
19/11/23 12:53:46 INFO BlockManagerInfo: Added taskresult_2 in memory on IP ADDRESS:PORT (size: 7.7 MB,free: 335.5 MB)
19/11/23 12:53:46 INFO BlockManagerInfo: Added taskresult_0 in memory on IP ADDRESS:PORT (size: 7.7 MB,free: 327.8 MB)
19/11/23 12:53:46 INFO BlockManagerInfo: Added taskresult_1 in memory on IP ADDRESS:PORT (size: 7.7 MB,free: 320.1 MB)
19/11/23 12:53:46 INFO TransportClientFactory: Successfully created connection to /IP ADDRESS:PORT after 1 ms (0 ms spent in bootstraps)
19/11/23 12:53:47 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 3774 ms on 172.18.19.203 (executor 1) (1/3)
19/11/23 12:53:47 INFO BlockManagerInfo: Removed taskresult_0 on IP ADDRESS:PORT in memory (size: 7.7 MB,free: 327.8 MB)
19/11/23 12:53:48 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 4400 ms on 172.18.19.203 (executor 1) (2/3)
19/11/23 12:53:48 INFO BlockManagerInfo: Removed taskresult_1 on IP ADDRESS:PORT in memory (size: 7.7 MB,free: 335.5 MB)
19/11/23 12:53:48 INFO TaskSetManager: Finished task 2.0 in stage 0.0 (TID 2) in 5076 ms on 172.18.19.203 (executor 1) (3/3)
19/11/23 12:53:48 INFO TaskSchedulerImpl: Removed TaskSet 0.0,whose tasks have all completed,from pool 
19/11/23 12:53:48 INFO DAGScheduler: ResultStage 0 (collect at bat1.scala:43) finished in 5.411 s
19/11/23 12:53:48 INFO BlockManagerInfo: Removed taskresult_2 on IP ADDRESS:PORT in memory (size: 7.7 MB,free: 343.2 MB)
19/11/23 12:53:48 INFO DAGScheduler: Job 0 finished: collect at CODE_FILE_NAME.scala:001,took 5.644525 s

19/11/23 12:53:48 INFO SparkContext: Starting job: collect at Executor.scala:13
19/11/23 12:53:48 INFO DAGScheduler: Got job 1 (collect at Executor.scala:13) with 8 output partitions
19/11/23 12:53:48 INFO DAGScheduler: Final stage: ResultStage 1 (collect at Executor.scala:13)
19/11/23 12:53:48 INFO DAGScheduler: Parents of final stage: List()
19/11/23 12:53:48 INFO DAGScheduler: Missing parents: List()
19/11/23 12:53:48 INFO DAGScheduler: Submitting ResultStage 1 (ParallelCollectionRDD[1] at parallelize at Executor.scala:12),which has no missing parents
19/11/23 12:53:48 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 1352.0 B,free 366.3 MB)
19/11/23 12:53:48 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 895.0 B,free 366.3 MB)
19/11/23 12:53:48 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on IP ADDRESS:PORT (size: 895.0 B,free: 366.3 MB)
19/11/23 12:53:48 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1006
19/11/23 12:53:48 INFO DAGScheduler: Submitting 8 missing tasks from ResultStage 1 (ParallelCollectionRDD[1] at parallelize at Executor.scala:12) (first 15 tasks are for partitions Vector (0,2,3,4,5,6,7))
19/11/23 12:53:48 INFO TaskSchedulerImpl: Adding task set 1.0 with 8 tasks
19/11/23 12:53:48 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 3,Ip,executor 0,4827 bytes)
19/11/23 12:53:48 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 4,ip,4827 bytes)
19/11/23 12:53:48 INFO TaskSetManager: Starting task 2.0 in stage 1.0 (TID 5,IP,4827 bytes)
19/11/23 12:53:48 INFO TaskSetManager: Starting task 3.0 in stage 1.0 (TID 6,partition 3,4827 bytes)
19/11/23 12:53:48 INFO TaskSetManager: Starting task 4.0 in stage 1.0 (TID 7,partition 4,4827 bytes)
19/11/23 12:53:48 INFO TaskSetManager: Starting task 5.0 in stage 1.0 (TID 8,partition 5,4827 bytes)
19/11/23 12:53:48 INFO TaskSetManager: Starting task 6.0 in stage 1.0 (TID 9,partition 6,4827 bytes)
19/11/23 12:53:48 INFO TaskSetManager: Starting task 7.0 in stage 1.0 (TID 10,partition 7,29363 bytes)
19/11/23 12:53:48 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on IP:PORT (size: 895.0 B,free: 343.2 MB)
19/11/23 12:53:48 INFO TaskSetManager: Finished task 3.0 in stage 1.0 (TID 6) in 35 ms on IP (executor 1) (1/8)
19/11/23 12:53:48 INFO TaskSetManager: Finished task 7.0 in stage 1.0 (TID 10) in 33 ms on IP (executor 1) (2/8)
19/11/23 12:53:48 INFO TaskSetManager: Finished task 1.0 in stage 1.0 (TID 4) in 35 ms on IP (executor 1) (3/8)
19/11/23 12:53:48 INFO TaskSetManager: Finished task 5.0 in stage 1.0 (TID 8) in 35 ms on IP (executor 1) (4/8)
19/11/23 12:53:49 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on IP:PORT (size: 895.0 B,free: 366.3 MB)
19/11/23 12:53:49 INFO TaskSetManager: Finished task 6.0 in stage 1.0 (TID 9) in 479 ms on IP (executor 0) (5/8)
19/11/23 12:53:49 INFO TaskSetManager: Finished task 2.0 in stage 1.0 (TID 5) in 480 ms on IP (executor 0) (6/8)
19/11/23 12:53:49 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 3) in 484 ms on IP (executor 0) (7/8)
19/11/23 12:53:49 INFO TaskSetManager: Finished task 4.0 in stage 1.0 (TID 7) in 481 ms on IP (executor 0) (8/8)
19/11/23 12:53:49 INFO TaskSchedulerImpl: Removed TaskSet 1.0,from pool 
19/11/23 12:53:49 INFO DAGScheduler: ResultStage 1 (collect at Executor.scala:13) finished in 0.486 s
19/11/23 12:53:49 INFO DAGScheduler: Job 1 finished: collect at Executor.scala:13,took 0.495059 s
19/11/23 12:53:49 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 23.7 KB,free 366.3 MB)
19/11/23 12:53:49 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 24.2 KB,free 366.2 MB)
19/11/23 12:53:49 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on IP:PORT (size: 24.2 KB,free: 366.3 MB)
19/11/23 12:53:49 INFO SparkContext: Created broadcast 2 from broadcast at BroadcastWrapper.scala:13
starting execution of main function
19/11/23 12:53:49 INFO SparkContext: Starting job: collect at Executor.scala:65
19/11/23 12:53:49 INFO DAGScheduler: Got job 2 (collect at Executor.scala:65) with 3 output partitions
19/11/23 12:53:49 INFO DAGScheduler: Final stage: ResultStage 2 (collect at Executor.scala:65)
19/11/23 12:53:49 INFO DAGScheduler: Parents of final stage: List()
19/11/23 12:53:49 INFO DAGScheduler: Missing parents: List()
19/11/23 12:53:49 INFO DAGScheduler: Submitting ResultStage 2 (MapPartitionsRDD[2] at mapPartitionsWithIndex at Executor.scala:17),which has no missing parents
19/11/23 12:53:49 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 3.5 KB,free 366.2 MB)
19/11/23 12:53:49 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 2.1 KB,free 366.2 MB)
19/11/23 12:53:49 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on IP:PORT (size: 2.1 KB,free: 366.3 MB)
19/11/23 12:53:49 INFO SparkContext: Created broadcast 3 from broadcast at DAGScheduler.scala:1006
19/11/23 12:53:49 INFO DAGScheduler: Submitting 3 missing tasks from ResultStage 2 (MapPartitionsRDD[2] at mapPartitionsWithIndex at Executor.scala:17) (first 15 tasks are for partitions  Vector(0,2))
19/11/23 12:53:49 INFO TaskSchedulerImpl: Adding task set 2.0 with 3 tasks
19/11/23 12:53:49 WARN TaskSetManager: Stage 2 contains a task of very large size (7849 KB). The maximum recommended task size is 100 KB.
19/11/23 12:53:49 INFO TaskSetManager: Starting task 0.0 in stage 2.0 (TID 11,8037809 bytes)
19/11/23 12:53:49 INFO TaskSetManager: Starting task 1.0 in stage 2.0 (TID 12,8037809 bytes)
19/11/23 12:53:49 INFO TaskSetManager: Starting task 2.0 in stage 2.0 (TID 13,8061931 bytes)
19/11/23 12:53:51 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on IP_ADDRESS:PORT (size: 2.1 KB,free: 343.2 MB)

如何解决此问题?

liyinjie86 回答:当集群上的人口规模增加时,Spark应用程序将卡住

暂时没有好的解决方案,如果你有好的解决方案,请发邮件至:iooj@foxmail.com
本文链接:https://www.f2er.com/3046516.html

大家都在问