Unable to read a file from an Azure Blob Storage mount with Apache Spark via Databricks Connect

I have set up Databricks Connect to run my Spark programs against a cluster in the Azure cloud. As a dry run I tested a word-count program, but it fails with the error below.

Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:

I am running the program from IntelliJ, and I have the permissions needed to access the cluster, yet I still hit this error.
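
Since the job reads from a mounted folder, I also want to rule out the mount itself. A quick check is to list it with dbutils on the cluster side (e.g. from a notebook); the mount name here is only a placeholder for whatever --raw-reads points at:

// Run on the cluster to confirm the mounted path actually resolves.
// "tejatest" is a placeholder for the real mount folder.
dbutils.fs.ls("/mnt/tejatest").foreach(f => println(f.path))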

The program below is a wrapper that takes the arguments and publishes the result:

package com.spark.scala
import com.spark.scala.demo.{Argument,WordCount}
import org.apache.spark.sql.SparkSession

import com.databricks.dbutils_v1.DBUtilsHolder.dbutils

import scala.collection.mutable.Map

object Test {
  def main(args: Array[String]): Unit = {
    val argumentMap: Map[String,String] = Argument.parseArgs(args)
    val spark = SparkSession
      .builder()
      .master("local")

      .getOrCreate()
    println(spark.range(100).count())


    // Build the mounted read path from the --raw-reads argument.
    val rawread = String.format("/mnt/%s", argumentMap("--raw-reads"))
    val data = spark.sparkContext.textFile(rawread)

    println(data.count())

    // Build the write path from the --raw-write argument.
    val rawwrite = String.format("/dbfs/mnt/%s", argumentMap("--raw-write"))
    WordCount.executeWordCount(spark, rawread, rawwrite)
    // The Spark code will execute on the Databricks cluster.
    spark.stop()
  }
}
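
For comparison, the Databricks Connect getting-started example builds the session without overriding the master, since the cluster connection comes from the databricks-connect configuration; a minimal sketch of that setup (assuming databricks-connect has already been configured):

// Minimal session setup as shown in the Databricks Connect docs:
// the cluster details come from the databricks-connect configuration,
// so no .master(...) override is set here.
val spark = SparkSession
  .builder()
  .getOrCreate()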

The following code performs the word-count logic:

package com.spark.scala.demo


import org.apache.spark.sql.SparkSession

object WordCount {

  def executeWordCount(sparkSession: SparkSession, read: String, write: String): Unit = {
    println("starting word count process")

    //val path = String.format("/mnt/%s","tejatest\wordcount.txt")

    // Read the input file into an RDD of lines.
    val bookRDD = sparkSession.sparkContext.textFile(read)

    // Strip every character that is not a word character, whitespace or '$'.
    val pat = """[^\w\s\$]"""
    val cleanBookRDD = bookRDD.map(line => line.replaceAll(pat, ""))

    // Split each cleaned line into words.
    val wordsRDD = cleanBookRDD.flatMap(line => line.split(" "))

    // Pair every word with 1 and sum the pairs per word.
    val wordMapRDD = wordsRDD.map(word => (word, 1))
    val wordCountMapRDD = wordMapRDD.reduceByKey(_ + _)



    // Write the word counts out as text part files under the target path.
    wordCountMapRDD.saveAsTextFile(write)
  }
}
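
To rule out the counting logic itself (as opposed to the mounted path), the same transformations can be run against an in-memory RDD; a small local sanity check, assuming a SparkSession named spark is in scope:

// Local sanity check of the word-count logic, no DBFS paths involved.
val sample = spark.sparkContext.parallelize(Seq("to be or not to be"))
val counts = sample
  .flatMap(_.split(" "))
  .map(_ -> 1)
  .reduceByKey(_ + _)
counts.collect().foreach(println)   // e.g. (to,2), (be,2), (or,1), (not,1)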

I have written a mapper that resolves the given paths, and I pass the read and write locations on the command line. My pom.xml is as follows:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>ex-com.spark.scala</groupId>
    <artifactId>ex-demo</artifactId>
    <version>1.0-SNAPSHOT</version>


    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>2.1.1</version>
            <scope>compile</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>2.1.1</version>
            <scope>compile</scope>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-mllib -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-mllib_2.11</artifactId>
            <version>2.1.1</version>
        </dependency>

        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-api</artifactId>
            <version>1.7.5</version>
        </dependency>
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-log4j12</artifactId>
            <version>1.7.25</version>
        </dependency>
        <dependency>
            <groupId>org.clapper</groupId>
            <artifactId>grizzled-slf4j_2.11</artifactId>
            <version>1.3.1</version>
        </dependency>
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-log4j12</artifactId>
            <version>1.7.25</version>
        </dependency>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>2.11.8</version>
        </dependency>
        <dependency>
            <groupId>com.databricks</groupId>
            <artifactId>dbutils-api_2.11</artifactId>
            <version>0.0.3</version>
        </dependency>
        <!-- Test -->
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.11</version>
            <scope>test</scope>
        </dependency>
    </dependencies>


</project>
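
For completeness, the Argument helper referenced in the wrapper is not shown above; it only turns --key value pairs into a map. Roughly this kind of sketch (the exact implementation may differ):

package com.spark.scala.demo

import scala.collection.mutable.Map

object Argument {
  // Hypothetical sketch: pair up "--key value" arguments into a mutable Map.
  def parseArgs(args: Array[String]): Map[String, String] = {
    val parsed = Map[String, String]()
    args.sliding(2, 2).foreach {
      case Array(key, value) if key.startsWith("--") => parsed += (key -> value)
      case other => println(s"Ignoring malformed arguments: ${other.mkString(" ")}")
    }
    parsed
  }
}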