I have a use-case where i need to process data from files stored in s3 and write the processed data to local files. The s3 files are constantly added to the bucket.
Each time a file is added to the bucket, the full path is published to a kafka topic.
I want to achieve on a single job the following:
- To read the file names from kafka (unbounded stream).
- An evaluator that receives the file name, reads the content from s3 (second source) and creates a dataStream.
- Process the dataStream (adding some logic to each row).
- Sink to file.
I managed to do the first, third and forth part of the design.
Is there a way to achieve this? Thanks in advance.
CodePudding user response:
I don't believe there's any straightforward way to do this.
To do everything in a single job, maybe you could convince the FileSource to use a custom FileEnumerator that gets the paths from Kafka.
A simpler alternative would be to launch a new (bounded) job for every file to be ingested. The file to be read could be passed in as a parameter.
CodePudding user response:
This is possible to implement in general, but as David Anderson has already suggested, there is currently no straightforward way to this with the vanilla Flink connectors.
Other approach could be writing the pipeline in Apache Beam, that already supports this and can use Flink as a runtime (which is a proof that this can be implemented with the existing primitives).
I think this is a legitimate use case that Flink should eventually support out of the box.
