How do I read a CSV file in PySpark?

I am trying to read a CSV file with PySpark, but it raises some errors. Could you tell me the correct way to read a CSV file?

Python code:

from pyspark.sql import *
df = spark.read.csv("D:\Users\SPate233\Downloads\iMedical\query1.csv",inferSchema = True,header = True)

I also tried the following approach:

sqlContext = SQLContext
df = sqlContext.load(source="com.databricks.spark.csv",header="true",path = "D:\Users\SPate233\Downloads\iMedical\query1.csv")

Errors:

Traceback (most recent call last):
  File "<pyshell#18>",line 1,in <module>
    df = spark.read.csv("D:\Users\SPate233\Downloads\iMedical\query1.csv",header = True)
NameError: name 'spark' is not defined

and

Traceback (most recent call last):
  File "<pyshell#26>", in <module>
    df = sqlContext.load(source="com.databricks.spark.csv", path="D:\Users\SPate233\Downloads\iMedical\query1.csv")
AttributeError: type object 'SQLContext' has no attribute 'load'
qwedqw answered: How do I read a CSV file in PySpark?

First, you need to create a SparkSession, as shown below:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("yarn").appName("MyApp").getOrCreate()

Your CSV must be on HDFS; then you can read it with spark.read.csv:

df = spark.read.csv('/tmp/data.csv',header=True)

where /tmp/data.csv is on HDFS.


The simplest way to read a CSV in PySpark is with Databricks' spark-csv module.

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)  # sc is an existing SparkContext

df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('file.csv')

You can also read the file as plain text and split each line on the delimiter:

reader = sc.textFile("file.csv").map(lambda line: line.split(","))  # RDD of lists of strings
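One caveat with the split approach above: a naive line.split(",") breaks on quoted fields such as "Doe, John". A hedged sketch of the usual fix is to parse each line with Python's standard csv module instead (the helper name parse_csv_line is made up for this example); the same function can be passed to map, or applied per partition with mapPartitions:

```python
import csv
import io

def parse_csv_line(line):
    """Parse one CSV line, honoring quoting, unlike str.split(",")."""
    return next(csv.reader(io.StringIO(line)))

# A quoted field containing a comma stays intact:
print(parse_csv_line('1,"Doe, John",NY'))  # ['1', 'Doe, John', 'NY']
```

Also remember that sc.textFile gives you the header row as an ordinary line; you have to filter it out yourself, which spark.read.csv(..., header=True) does for you.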
