gcloud dataproc jobs submit spark \
--cluster=cluster \
--region=region \
--files=config.txt \
--class=class \
--jars=gs://abc.jar
We need to access config.txt on the driver node. How can I access the file there, and how do I get the path where it is stored?
In the HDFS world, with a similar --files option, I can access the file in the driver using new java.io.File("config.txt").
CodePudding user response:
I don't have easy access to a GCP account to test this (sorry about that), but you could try the static method org.apache.spark.SparkFiles.get(String filename), which returns the file's absolute path.
Doc: https://spark.apache.org/docs/latest/api/java/index.html?org/apache/spark/SparkFiles.html
I hope it helps. See you.
CodePudding user response:
Dataproc sets the current working directory of the driver process to a temporary directory. Files provided through the --files flag are available in that directory.
For example (list-dir.py):
import os
print(os.getcwd())
print(os.listdir('.'))
Then run:
gcloud dataproc jobs submit pyspark \
--cluster=<cluster> --files=test.json list-dir.py
...
/tmp/d50faeccc7e94c299cc1e7f257cc542c
['.test.json.crc', '.list-dir.py.crc', 'test.json', 'list-dir.py']
You can see test.json is in the current dir of the Spark driver process.
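Building on that, here is a minimal sketch (in Python, to match the example above) of how a driver script might locate such a file. The helper name find_staged_file is made up for illustration. It checks the working directory first and only then falls back to SparkFiles.get, which needs pyspark and an already-created SparkContext; whether that fallback resolves --files on a Dataproc driver is an untested assumption.

```python
import os


def find_staged_file(name):
    """Return an absolute path to a file passed via --files, or None.

    Hypothetical helper, not part of Spark or Dataproc.
    """
    # First choice: the driver's current working directory, where the
    # listing above shows the --files entries.
    candidate = os.path.join(os.getcwd(), name)
    if os.path.isfile(candidate):
        return candidate

    # Fallback (assumption): ask Spark for the path. This requires pyspark
    # and an active SparkContext; any failure is treated as "not found".
    try:
        from pyspark import SparkFiles
        candidate = SparkFiles.get(name)
    except Exception:
        return None
    return candidate if os.path.isfile(candidate) else None
```

In a driver script you would call find_staged_file("config.txt") and open the returned path as usual. For a Java or Scala job, the equivalent first step would be to resolve "config.txt" against the working directory, as in the java.io.File approach from the question.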
