I'm a beginner in Spark. I have a scenario where data for an analysis arrives from multiple sources at different points in time. Can two Spark jobs use the same HDFS/S3 storage at the same time? One job would write the latest data to S3/HDFS, and the other would read that data, along with input from another source, for analysis.
CodePudding user response:
Yes, one job can write to a storage location while another job reads from it. On both HDFS and S3, the new data only becomes visible to the reader once the write has completed.
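One way the reading job could check that a write has finished is to look for the empty `_SUCCESS` marker file that Spark's default file output committer places in the output directory when the job commits. Below is a minimal sketch of that check, assuming the writer puts each batch into its own sub-directory. The function name `completed_batches` and the directory layout are made up for illustration, and the sketch uses the local file system as a stand-in; on HDFS or S3 the same check would go through the Hadoop FileSystem API or an S3 client.

```python
from pathlib import Path
from typing import List


def completed_batches(base_dir: str) -> List[str]:
    """Return the batch directories under base_dir that contain a _SUCCESS marker.

    Hypothetical helper: assumes one sub-directory per write job, and that the
    writer leaves a _SUCCESS file behind only after the job has committed.
    """
    base = Path(base_dir)
    return sorted(
        str(child)
        for child in base.iterdir()
        # Directories without the marker are still being written (or failed).
        if child.is_dir() and (child / "_SUCCESS").exists()
    )
```

The reader would then pass only the returned directories to Spark, so a batch that is still in progress is skipped until its next run.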
CodePudding user response:
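To use both file systems in the same job, include the scheme (protocol) in each file path.
e.g. spark.read.load("s3a://bucket/file") and/or df.write.save("hdfs:///tmp/data") (note that writes go through a DataFrame's write, not through spark.write)
However, you can also use S3 in place of HDFS as the default file system by setting fs.defaultFS to an s3a:// URI; paths without a scheme are then resolved against that bucket.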
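As a rough PySpark sketch of the above: the helper names `copy_dataset` and `build_session` and the app name are invented for this example, and Parquet is assumed as the file format since the answer does not name one. The `spark.hadoop.` prefix is the usual way to pass a Hadoop setting such as fs.defaultFS through the Spark configuration.

```python
def copy_dataset(spark, source_uri, target_uri):
    """Read from one file system and write to another in the same job.

    The scheme in each URI selects the file system, e.g.
    source_uri="s3a://bucket/file" and target_uri="hdfs:///tmp/data".
    Parquet is an assumption; use the reader/writer for your own format.
    """
    df = spark.read.parquet(source_uri)
    df.write.parquet(target_uri)


def build_session(default_fs=None):
    """Create a SparkSession, optionally overriding the default file system."""
    # Imported here so copy_dataset can be used without importing pyspark first.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName("mixed-filesystems")  # hypothetical name
    if default_fs is not None:
        # spark.hadoop.* options are copied into the Hadoop configuration,
        # so this sets fs.defaultFS (e.g. to "s3a://bucket").
        builder = builder.config("spark.hadoop.fs.defaultFS", default_fs)
    return builder.getOrCreate()
```

Calling `copy_dataset(build_session(), "s3a://bucket/file", "hdfs:///tmp/data")` would use both file systems explicitly, while `build_session("s3a://bucket")` would make scheme-less paths resolve against the bucket. Accessing s3a:// paths also requires the hadoop-aws module and credentials to be available to Spark.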
