After generating about 90 gzip-compressed CSV files of roughly 100 MB each, I want to merge them all into a single file. With the built-in merge option of a copy activity, it looks like the operation would take well over a dozen hours.
https://i.stack.imgur.com/yymnW.png
How can I merge many files in blob/ADLS storage quickly with Data Factory/Synapse?
CodePudding user response:
You could try a two-step process:
- Merge all the CSV files into a single Parquet file.
- Copy that Parquet file into a CSV file.
Writes to Parquet are generally quick, provided the data is clean (for example, no spaces in column names), and the resulting files are smaller.
Edit: ADF Data Flow is another option. If that is still not fast enough, you might have to create a Spark notebook in Synapse and write Spark code. Choose a Spark pool size that balances run time against cost.
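For the Spark notebook route, here is a minimal sketch of what the code might look like. It is not tested against your data. It assumes every file has a header row and the same columns. The helper names `abfss_uri` and `merge_gzip_csv` are made up for illustration, and the container, account, and folder names in the usage comment are placeholders.

```python
def abfss_uri(container: str, account: str, path: str) -> str:
    """Build an ADLS Gen2 URI. All three arguments are placeholders."""
    return f"abfss://{container}@{account}.dfs.core.windows.net/{path.lstrip('/')}"


def merge_gzip_csv(spark, source_glob: str, target_dir: str) -> None:
    """Read every gzip'd CSV matching source_glob and write one gzip'd CSV.

    `spark` is the SparkSession (predefined in a Synapse notebook).
    Assumption: each input file starts with a header row.
    """
    # Spark decompresses .gz inputs transparently based on the file extension.
    df = spark.read.option("header", True).csv(source_glob)
    (
        df.coalesce(1)  # one partition, so Spark writes a single part file
        .write.mode("overwrite")
        .option("header", True)
        .option("compression", "gzip")
        .csv(target_dir)
    )


# Usage in a notebook (placeholder names):
# src = abfss_uri("mycontainer", "myaccount", "exports/*.csv.gz")
# dst = abfss_uri("mycontainer", "myaccount", "merged")
# merge_gzip_csv(spark, src, dst)
```

Note that Spark writes a folder at `target_dir` containing one `part-*.csv.gz` file plus marker files, so you would still need to rename or move that part file if you want a specific file name. Because `coalesce(1)` funnels the final write through a single task, the read runs in parallel but the write does not.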
