In the past, the general consensus was that you should not use S3 as a checkpointing location for Spark Structured Streaming applications.
However, now that S3 offers strong read-after-write consistency, is it safe to use S3 as a checkpointing location? If it is not safe, why?
In my experiments, I continue to see checkpointing-related exceptions in my Spark Structured Streaming applications, but I am uncertain where the problem actually lies.
CodePudding user response:
You have largely answered your own question. You do not say whether you are on Databricks or EMR, so I am going to assume plain EC2.
Use HDFS on local EC2 disk as the checkpoint location.
Where I work now we have HDFS (via HDP) and IBM S3; HDFS is still used for checkpointing.
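For what it's worth, pointing a query at an HDFS checkpoint directory might look roughly like the sketch below. The helper name, the console sink, and the path in the usage note are my own placeholders, not something from the question; `stream_df` is assumed to be a streaming DataFrame you have already built.

```python
def start_with_hdfs_checkpoint(stream_df, checkpoint_dir):
    """Start a streaming query whose checkpoint lives at checkpoint_dir.

    stream_df is assumed to be an existing PySpark streaming DataFrame;
    the console sink is used purely for illustration.
    """
    return (
        stream_df.writeStream
        .format("console")
        # The checkpoint location is a per-query option on the writer.
        .option("checkpointLocation", checkpoint_dir)
        .start()
    )
```

You would call it as `start_with_hdfs_checkpoint(df, "hdfs:///checkpoints/my_app")`, where the path is hypothetical and should be a directory on your own cluster.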
CodePudding user response:
Not really. You get consistency of listings and updates, but rename is still emulated with copy and delete, and I think the standard checkpoint algorithm depends on rename.
Hadoop 3.3.1 added a new API, Abortable, to aid with a custom S3 stream checkpoint committer. The idea is that the checkpointer would write straight to the destination but abort the write when aborting the checkpoint. A normal close() would finish the write and manifest the file. See https://issues.apache.org/jira/browse/HADOOP-16906
AFAIK nobody has written the actual committer. There is an opportunity for you to contribute there...
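To make the rename point concrete, here is a toy in-memory model of a store that has no native rename. It is only a sketch: `ToyObjectStore`, the key names, and the `fail_after_copy` switch are all made up for illustration, and it does not reproduce how the S3A connector is actually implemented.

```python
class ToyObjectStore:
    """In-memory stand-in for an object store with no native rename."""

    def __init__(self):
        self.objects = {}

    def put(self, key, data):
        self.objects[key] = data

    def rename(self, src, dst, fail_after_copy=False):
        # Step 1: copy the bytes to the destination key.
        self.objects[dst] = self.objects[src]
        if fail_after_copy:
            # Simulate a crash between the two steps.
            raise RuntimeError("crashed between copy and delete")
        # Step 2: delete the source key.
        del self.objects[src]


store = ToyObjectStore()
store.put("checkpoint/.1.tmp", b"offsets for batch 1")
try:
    store.rename("checkpoint/.1.tmp", "checkpoint/1", fail_after_copy=True)
except RuntimeError:
    pass

# Both keys now exist: the "rename" was left half-done.
leftover = sorted(store.objects)
print(leftover)  # ['checkpoint/.1.tmp', 'checkpoint/1']
```

On a real filesystem such as HDFS, rename is a single metadata operation, so a writer that uses write-to-temp-then-rename never exposes this in-between state.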
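The close()/abort() idea above can be sketched in a few lines. This is a toy model only: `ToyAbortableWriter` and the dict-backed store are hypothetical, and the code does not call Hadoop's Abortable API or any real committer.

```python
class ToyAbortableWriter:
    """Buffers bytes; close() publishes them, abort() throws them away."""

    def __init__(self, store, key):
        self._store = store          # plain dict acting as the object store
        self._key = key              # final destination key, no temp file
        self._buffer = bytearray()
        self._done = False

    def write(self, data):
        if self._done:
            raise ValueError("writer already closed or aborted")
        self._buffer.extend(data)

    def close(self):
        # Normal completion: the object becomes visible at its final key.
        if not self._done:
            self._store[self._key] = bytes(self._buffer)
            self._done = True

    def abort(self):
        # Cancelled checkpoint: nothing is ever published.
        self._buffer.clear()
        self._done = True


store = {}

committed = ToyAbortableWriter(store, "checkpoint/1")
committed.write(b"offsets for batch 1")
committed.close()

aborted = ToyAbortableWriter(store, "checkpoint/2")
aborted.write(b"offsets for batch 2")
aborted.abort()

print(sorted(store))  # ['checkpoint/1']
```

The point of the design is that no rename is needed at all: the data is addressed to its final key from the start, and the only decision left is whether the write is completed or discarded.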
