FileNotFound error when running spark-submit-CodePudding

I am trying to run the spark-submit command on my Hadoop cluster Here is a summary of my Hadoop Cluster:

The cluster is built using 5 VirtualBox VM's connected on an internal network
There is 1 namenode and 4 datanodes created.
All the VM's were built from the Bitnami Hadoop Stack VirtualBox image

When I run the following command:

spark-submit --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples_2.12-3.0.3.jar 10

I receive the following error:

java.io.FileNotFoundException: File file:/home/bitnami/sparkStaging/bitnami/.sparkStaging/application_1658417340986_0002/__spark_conf__.zip does not exist

I also get a similar error when trying to create a sparkSession using PySpark:

spark = SparkSession.builder.appName('appName').getOrCreate()

I have tried/verified the following

environment variables: HADOOP_HOME, SPARK_HOME AND HADOOP_CONF_DIR have been set in my .bashrc file
SPARK_DIST_CLASSPATH and HADOOP_CONF_DIR have been defined in spark-env.sh
Added spark.master yarn, spark.yarn.stagingDir file:///home/bitnami/sparkStaging and spark.yarn.jars file:///opt/bitnami/hadoop/spark/jars/ in spark-defaults.conf

CodePudding user response：

Since the spark job is supposed to be submitted to the Hadoop cluster managed by YARN, master and deploy-mode has to be set. From the spark 3.3.0 docs:

# Run on a YARN cluster in cluster deploy mode
export HADOOP_CONF_DIR=XXX
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode cluster \
  --executor-memory 20G \
  --num-executors 50 \
  /path/to/examples.jar \
  1000

Or programatically:

spark = SparkSession.builder().appName('appName').master("yarn").config("spark.submit.deployMode","cluster").getOrCreate()

CodePudding user response：

I believe spark.yarn.stagingDir needs to be an HDFS path.

More specifically, the "YARN Staging directory" needs to be available on all Spark executors, not just a local file path from where you run spark-submit

The path that isn't found is being reported from the YARN cluster, where /home/bitnami might not exist, or the Unix user running the Spark executor containers does not have access to that path.

Similarly, spark.yarn.jars (or spark.yarn.archive) should be HDFS paths because these will get downloaded, in parallel, across all executors.