Can 2 Spark jobs use a single HDFS/S3 storage simultaneously?

I'm a beginner in Spark. I have a scenario where the data for an analysis comes from multiple sources at different points in time. Can I have 2 Spark jobs use a single HDFS/S3 storage at the same time? One job will write the latest data to S3/HDFS, and the other will read that data, along with input data from another source, for analysis.

CodePudding user response：

Yes, you can write to and read from the same data source at the same time. On both HDFS and S3, the data only becomes available to the reader once the write has completed.
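One way the reading job can tell that a write has completed is to look for the empty _SUCCESS marker file that Hadoop's file output committer writes into the output directory by default when a job commits. The sketch below is only an illustration under that assumption: it uses a local directory as a stand-in for the HDFS/S3 path, and wait_for_success and demo are hypothetical helper names, not part of Spark. On a real cluster the existence check would go through an HDFS or S3 client instead of os.path.

```python
import os
import tempfile
import time


def wait_for_success(output_dir, timeout_s=60.0, poll_s=1.0):
    """Return True once output_dir contains a _SUCCESS marker, False on timeout."""
    marker = os.path.join(output_dir, "_SUCCESS")
    deadline = time.monotonic() + timeout_s
    while True:
        if os.path.exists(marker):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)


def demo():
    # Local temp directory standing in for an HDFS/S3 output path (assumption).
    with tempfile.TemporaryDirectory() as out:
        # No marker yet: the writer has not committed, so the reader should wait.
        before = wait_for_success(out, timeout_s=0.0, poll_s=0.0)
        # Simulate the writer job committing: the marker is created last.
        open(os.path.join(out, "_SUCCESS"), "w").close()
        after = wait_for_success(out, timeout_s=0.0, poll_s=0.0)
        return before, after
```

The reader job would call wait_for_success on the writer's output path and start reading only when it returns True. This relies on the marker being enabled, which is the default but can be turned off in the committer configuration.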

CodePudding user response：

To use both file systems in the same job, include the scheme (protocol) in each file path.

e.g. spark.read.load("s3a://bucket/file") and/or df.write.save("hdfs:///tmp/data"), where df is the DataFrame being written

However, you can also use S3 directly in place of HDFS by setting the Hadoop property fs.defaultFS to an S3 URI.
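As a rough sketch of the scheme-qualified paths described in this answer, the function below reads from one URI and writes to another using the standard DataFrameReader and DataFrameWriter calls. The name copy_dataset, the parquet default and the overwrite mode are assumptions made for illustration. The s3a:// scheme also assumes the hadoop-aws connector and credentials are configured on the cluster.

```python
def copy_dataset(spark, src_uri, dst_uri, fmt="parquet"):
    """Read src_uri and write it to dst_uri; each URI carries its own scheme."""
    # The scheme in src_uri (e.g. s3a://) selects the file system to read from.
    df = spark.read.format(fmt).load(src_uri)
    # The scheme in dst_uri (e.g. hdfs://) selects the file system to write to.
    df.write.format(fmt).mode("overwrite").save(dst_uri)
    return df


# Usage with a live session (requires pyspark plus S3/HDFS access):
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.appName("copy-example").getOrCreate()
#   copy_dataset(spark, "s3a://bucket/file", "hdfs:///tmp/data")
```

Because each path names its file system explicitly, the same job can mix S3 and HDFS regardless of what fs.defaultFS is set to.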