How to access files on Dataproc that were passed using --files

Time:02-05

gcloud dataproc jobs submit spark \
    --cluster=cluster \
    --region=region \
    --files=config.txt \
    --class=class \
    --jars=gs://abc.jar

I need to access config.txt on the driver node. How can I access the file there, and how do I get the path where config.txt is stored?

In the HDFS world, with a similar --files option, I can access the file on the driver using java.io.File("config.txt").

CodePudding user response:

I don't have easy access to a GCP account to test this (sorry about that), but you could try the static method org.apache.spark.SparkFiles.get(String filename), which returns the file's absolute path.

Doc: https://spark.apache.org/docs/latest/api/java/index.html?org/apache/spark/SparkFiles.html

I hope it helps. See you.

CodePudding user response:

Dataproc sets the current working directory of the driver process to a temporary directory. Files passed through the --files flag are available in that directory.

For example (list-dir.py):

import os

print(os.getcwd())
print(os.listdir('.'))

Then run:

gcloud dataproc jobs submit pyspark \
  --cluster=<cluster> --files=test.json list-dir.py

...
/tmp/d50faeccc7e94c299cc1e7f257cc542c
['.test.json.crc', '.list-dir.py.crc', 'test.json', 'list-dir.py']

You can see test.json is in the current dir of the Spark driver process.
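
Building on the listing above, here is a minimal sketch of how a driver script might resolve the absolute path of a staged file. It assumes, as the output above suggests, that the file sits in the driver's current working directory. The helper name find_staged_file and its search_dirs parameter are hypothetical and only for illustration; they are not part of Dataproc or Spark.

```python
import os


def find_staged_file(name, search_dirs=None):
    """Return the absolute path of `name`, or None if it is not found.

    By default only the current working directory is searched, which is
    where the example above shows files passed via --files on the driver.
    `search_dirs` (hypothetical, for illustration) overrides that list.
    """
    dirs = search_dirs if search_dirs is not None else [os.getcwd()]
    for directory in dirs:
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate):
            return os.path.abspath(candidate)
    return None


if __name__ == "__main__":
    # "test.json" matches the file name used in the example above.
    path = find_staged_file("test.json")
    if path is None:
        print("test.json not found in", os.getcwd())
    else:
        print("test.json is at", path)
```

Returning None rather than raising lets the caller decide whether a missing file is an error or whether to fall back to another lookup, such as the SparkFiles approach suggested in the other answer.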
