Converting a Pandas DataFrame to a PySpark DataFrame drops the index

2024-05-16 • Q&A

I have a Pandas DataFrame called data_clean. It looks like this:

I want to convert it to a Spark DataFrame, so I used the createDataFrame() method: sparkDF = spark.createDataFrame(data_clean)

However, this seems to drop the index column (the one holding the names ali, anthony, bill, etc.) from the original DataFrame.

The output of

sparkDF.printSchema()
sparkDF.show()

is

root
 |-- transcript: string (nullable = true)

+--------------------+
|          transcript|
+--------------------+
|ladies and gentle...|
|thank you thank y...|
| all right thank ...|
|                    |
|this is dave he t...|
|                    |
|   ladies and gen...|
|   ladies and gen...|
|armed with boyish...|
|introfade the mus...|
|wow hey thank you...|
|hello hello how y...|
+--------------------+

The docs say createDataFrame() can take a pandas.DataFrame as input. I'm using Spark version '3.0.1'.

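Other related questions on SO don't cover the index column disappearing:

This one about converting Pandas to PySpark doesn't mention the index column disappearing.
Same with this one.
And this one relates to data dropping during conversion, but is more about window functions.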

I'm probably missing something obvious, but how do I keep the index column when converting from a Pandas DataFrame to a PySpark DataFrame?


Answer by wxt112:
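
Spark DataFrames have no index concept, so createDataFrame() only carries over the regular columns of a pandas.DataFrame. One commonly used workaround, sketched here under assumptions rather than as a definitive fix, is to turn the index into an ordinary column with reset_index() before converting. The helper names index_to_column and to_spark_with_index, the column name "speaker", and the two sample rows are hypothetical and only stand in for the real data_clean.

```python
import pandas as pd


def index_to_column(pdf: pd.DataFrame, name: str = "speaker") -> pd.DataFrame:
    """Return a copy of pdf whose index has become an ordinary column."""
    # rename_axis gives the index a name; reset_index then moves it into
    # a regular column with that name. The input frame is left unchanged.
    return pdf.rename_axis(name).reset_index()


def to_spark_with_index(spark, pdf: pd.DataFrame, name: str = "speaker"):
    """Convert pdf to a Spark DataFrame, keeping the old index as a column.

    `spark` is assumed to be an existing SparkSession.
    """
    return spark.createDataFrame(index_to_column(pdf, name))


# Hypothetical stand-in for data_clean; the values are made up.
data_clean = pd.DataFrame(
    {"transcript": ["ladies and gentlemen", "thank you thank you"]},
    index=["ali", "anthony"],
)

# The former index now shows up next to the transcript column.
print(index_to_column(data_clean).columns.tolist())
```

With a running session, sparkDF = to_spark_with_index(spark, data_clean) should then produce a schema that has the name column alongside transcript.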
