What is the difference between the following transformations when an RDD is written out to a file?
> coalesce(1, shuffle = true)
> coalesce(1, shuffle = false)
Code example:
    val input = sc.textFile(inputFile)
    val filtered = input.filter(doSomeFiltering)
    val mapped = filtered.map(doSomeMapping)

    mapped.coalesce(1, shuffle = true).saveAsTextFile(outputFile)

vs.

    mapped.coalesce(1, shuffle = false).saveAsTextFile(outputFile)
How does this compare to collect()? I'm fully aware that Spark's save methods store the result in an HDFS-style directory structure, but I'm more interested in the data-partitioning aspects of collect() versus shuffled/non-shuffled coalesce().
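For reference, here is a minimal, self-contained sketch of the kind of experiment I have in mind (the object name, the toy data, and the local[4] master are placeholders I made up; my real job reads from HDFS as above):

    import org.apache.spark.{SparkConf, SparkContext}

    // Toy probe: observe partition counts directly instead of inferring
    // them from the output files. Names and data are made up for illustration.
    object CoalescePartitionProbe {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("coalesce-probe").setMaster("local[4]"))

        val rdd = sc.parallelize(1 to 1000, numSlices = 8) // 8 input partitions

        // shuffle = true adds a shuffle stage: the upstream work still runs
        // with 8-way parallelism, then the data is repartitioned down to 1.
        val shuffled = rdd.coalesce(1, shuffle = true)

        // shuffle = false merges partitions without a shuffle, so the single
        // output partition pulls the whole upstream lineage into one task.
        val merged = rdd.coalesce(1, shuffle = false)

        println(s"shuffle=true  -> ${shuffled.getNumPartitions} partition(s)")
        println(s"shuffle=false -> ${merged.getNumPartitions} partition(s)")

        // collect() leaves the RDD's partitioning alone and simply returns
        // every element to the driver as a local Array[Int].
        val local: Array[Int] = rdd.collect()
        println(s"collect()     -> ${local.length} elements on the driver")

        sc.stop()
      }
    }

Both coalesce variants end up reporting a single partition; what I'm trying to pin down is how the execution (and the partitioning of the intermediate data) differs on the way there.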