Read local/linux files in Spark Scala code executing in Yarn Cluster Mode

How can I access and read data from a local file in Spark when executing in Yarn cluster mode?

local/linux file: /home/test_dir/test_file.csv

spark-submit --class "" --master yarn --deploy-mode cluster --files /home/test_dir/test_file.csv test.jar

Spark code to read csv:

val test_data = spark.read.option("inferSchema", "true").option("header", "true").csv("/home/test_dir/test_file.csv")
val test_file_data = spark.read.option("inferSchema", "true").option("header", "true").csv("file:///home/test_dir/test_file.csv")

The sample spark-submit above fails with a file-not-found error for the local file (/home/test_dir/test_file.csv).

By default Spark looks for the file in hdfs://, but my file is on the local file system. It should not be copied into HDFS and should be read only from the local file system.

Any suggestions to resolve this error?

CodePudding user response：

Using the file:// prefix reads from the filesystem of the YARN NodeManager host where the code runs, not from the machine where you submitted the job.

To access your --files use csv("#test_file.csv")

"should not be copied into hdfs"

Using --files will copy the files into a temporary location that's mounted by the YARN executor, and you can see them in the YARN UI.
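
To see which of these paths actually resolves on the node running your code, a small diagnostic along these lines can help. This is only a sketch: the object name WhereIsMyFile and the helper describe are made up for illustration, and the two candidate paths are simply the ones from the question.

```scala
import java.io.File
import java.net.InetAddress

object WhereIsMyFile {
  // Describes how `path` resolves on the machine running this JVM.
  def describe(path: String): String = {
    val f = new File(path)
    s"$path -> ${f.getAbsolutePath} (exists: ${f.exists})"
  }

  def main(args: Array[String]): Unit = {
    // In cluster mode this is the host of the YARN container, not the submit host.
    println(s"host: ${InetAddress.getLocalHost.getHostName}")
    println(s"working dir: ${new File(".").getAbsolutePath}")
    // Candidate paths taken from the question (illustrative only).
    Seq("test_file.csv", "/home/test_dir/test_file.csv")
      .map(describe)
      .foreach(println)
  }
}
```

Calling WhereIsMyFile.describe(...) from the driver code and checking the driver log in the YARN UI should show whether the bare file name or the absolute path exists on that node.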

CodePudding user response：

Below solution worked for me:

local/linux file: /home/test_dir/test_file.csv

spark-submit --class "" --master yarn --deploy-mode cluster --files /home/test_dir/test_file.csv test.jar

To access the file passed to spark-submit:

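import scala.io.Source
// Open the file by its bare name; --files places a copy in the container's working directory
val source = Source.fromFile("test_file.csv")
val lines = try source.getLines().mkString("\n") finally source.close()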

Instead of specifying the complete path, specify only the name of the file you want to read. Since Spark already copies the file to the nodes, the data can be accessed using just the file name.
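
If you want a DataFrame like in the question rather than a plain string, one possible approach is to read the lines on the driver and hand them to the csv reader as a Dataset[String] (this overload of csv exists in Spark 2.2 and later). The sketch below assumes the file is small enough to fit in driver memory; the object name ShippedCsv and the helper loadCsv are made up for illustration.

```scala
import scala.io.Source
import org.apache.spark.sql.{DataFrame, SparkSession}

object ShippedCsv {
  // Reads a small file from the driver's local filesystem and parses it as CSV.
  // Assumption: `fileName` resolves on the driver, e.g. a bare name shipped via --files.
  def loadCsv(spark: SparkSession, fileName: String): DataFrame = {
    import spark.implicits._
    val source = Source.fromFile(fileName)
    // Materialize all lines before closing the file.
    val lines = try source.getLines().toList finally source.close()
    spark.read
      .option("inferSchema", "true")
      .option("header", "true")
      .csv(lines.toDS())
  }
}
```

In the driver this would be called as val test_data = ShippedCsv.loadCsv(spark, "test_file.csv"). Because every line passes through the driver, this is not a good fit for large files.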