As I keep reading online resources about Spark architecture and scheduling, I'm getting more and more confused. One resource says: "The number of tasks in a stage is the same as the number of partitions in the last RDD in the stage." Another says: "Spark maps the number of tasks on a particular Executor to the number of cores allocated to it." So the first resource says that if I have 1000 partitions, I will have 1000 tasks no matter what machine I run on. Per the second, if I have a 4-core machine and 1000 partitions, what then? Will I have 4 tasks? Then how does the data get processed?
Another point of confusion: "each worker can process one task at a time" versus "Executors can run multiple tasks over its lifetime, both in parallel and sequentially." So are tasks sequential or parallel?
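In case it helps frame the question: my current guess at how the two statements might reconcile is that partition count fixes the *total* number of tasks while cores cap *concurrency*, so tasks run in waves. This is just plain arithmetic with the numbers from my example above (not a Spark API call), and the interpretation is my assumption:

```python
# Assumed model (my reading of the docs, not verified against Spark source):
#   total tasks in a stage = partitions of the stage's last RDD
#   tasks running at once  = total executor cores
num_partitions = 1000        # -> 1000 tasks in the stage
num_executors = 1            # e.g. my single 4-core machine
cores_per_executor = 4

total_tasks = num_partitions
concurrent_tasks = num_executors * cores_per_executor

# Remaining tasks queue up and run in successive "waves" of 4,
# so waves = ceil(total_tasks / concurrent_tasks).
waves = -(-total_tasks // concurrent_tasks)  # ceiling division

print(total_tasks)       # 1000
print(concurrent_tasks)  # 4
print(waves)             # 250
```

If this model is right, it would also answer my second question: tasks on one executor are parallel up to its core count, and sequential beyond that. But I'd like confirmation.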