I have the following json of format:
{"year":"2020", "id":"1", "fruit":"Apple","cost": "100" }
{"year":"2020", "id":"2", "fruit":"Kiwi", "cost": "200"}
{"year":"2020", "id":"3", "fruit":"Cherry", "cost": "300"}
{"year":"2020", "id":"4", "fruit": "Apple","cost": "400" }
{"year":"2020", "id":"5", "fruit": "Mango", "cost": "500"}
{"year":"2020", "id":"6", "fruit": "Kiwi", "cost": "600"}
Its of type: pyspark.sql.dataframe.DataFrame
How can I split this json file into multiple json files and save it in a year directory using Pyspark? like:
directory: path.../2020/<all split json files>
Apple.json
{"year":"2020", "id":"1", "fruit":"Apple","cost": "100" }
{"year":"2020", "id":"4", "fruit": "Apple","cost": "400" }
Kiwi.json
{"year":"2020", "id":"2", "fruit":"Kiwi", "cost": "200"}
{"year":"2020", "id":"6", "fruit": "Kiwi", "cost": "600"}
Mango.json
{"year":"2020", "id":"5", "fruit": "Mango", "cost": "500"}
Cherry.json
{"year":"2020", "id":"3", "fruit":"Cherry", "cost": "300"}
Also if I encounter a different year, how do push the files in similar way like: path.../2021/<all split json files> ?
Initially I tried, finding all the unique fruits and create a list. Then tried creating multiple data frames & pushing the json values into it. Then converted every dataframe into a json format. But I find this inefficient.
Then I also checked this link. But issue here is it creates a key value pair in dict form, which is slightly different.
Then I also learned about Pyspark groupBy method. It seems to make sense because I could groupBy() the fruit values and then split the json file, but I feel I am missing something.
CodePudding user response:
Using the following JSON as an example
{"year":"2020", "id":"1", "fruit":"Apple","cost": "100" }
{"year":"2020", "id":"2", "fruit":"Kiwi", "cost": "200"}
{"year":"2020", "id":"3", "fruit":"Cherry", "cost": "300"}
{"year":"2021", "id":"10", "fruit": "Pear","cost": "1000" }
{"year":"2021", "id":"11", "fruit": "Mango", "cost": "1100"}
{"year":"2021", "id":"12", "fruit": "Banana", "cost": "1200"}
You can use partitionBy to partion the data by year and fruit. Note that I created a duplicate of the year column as the column that you partition on is dropped when you write the data to disk.
df = spark.read.json("./ex.json")
df = df.withColumn("Year", df["year"])
df = df.withColumn("Fruit", df["fruit"])
df.write.partitionBy("Year", "Fruit").json("result")
This results in a folder called RESULT with the following structure.
|-- RESULT
| |-- Year=2020
| | |-- Fruit=Apple
| | | |-- part0000-dcea0683...json
| | |-- Fruit=Cherry
| | | |-- part0000-dcea0683...json
| | |-- Fruit=Kiwi
| | | |-- part0000-dcea0683...json
| |-- Year=2021
| | |-- Fruit=Banana
| | | |-- part0000-dcea0683...json
| | |-- Fruit=Mango
| | | |-- part0000-dcea0683...json
| | |-- Fruit=Pear
| | | |-- part0000-dcea0683...json
