I have a dataframe that looks like this:
I want to flatten the barcode field of type StringType, instead of looking like this where there is more than one barcode for the same item, I want it to repeat, for example:
------------------------------------------------------------
|suborder| updated_on | suborder_item_id| barcode |
-------- ------------ ----------------- --------------------
|1 | 123 | 3333 |["123","456","789"] |
-------- ------------ ----------------- --------------------
I want it to repeat if there's more than one barcode in the same dataframe:
-------- ---------- ---------------- -------
|suborder|updated_on|suborder_item_id|barcode|
-------- ---------- ---------------- -------
| 1| 123| 3333| 123|
| 1| 123| 3333| 456|
| 1| 123| 3333| 789|
-------- ---------- ---------------- -------
Your help would be greatly appreciated, I've been struggling with this for a while now.
CodePudding user response:
You are looking for just the explode function.
import pyspark.sql.functions as F
……
df = df.select('suborder', 'updated_on', 'suborder_item_id', F.explode('barcode').alias('barcode'))
CodePudding user response:
Use from_json function to parse the stringified array into ArrayType then explode it:
import pyspark.sql.functions as F
df = spark.createDataFrame([
(1, 123, 3333, '["123","456","789"]')
], ["suborder", "updated_on", "suborder_item_id", "barcode"])
df1 = df.withColumn(
"barcode",
F.explode(F.from_json("barcode", "array<string>"))
)
df1.show()
# -------- ---------- ---------------- -------
#|suborder|updated_on|suborder_item_id|barcode|
# -------- ---------- ---------------- -------
#| 1| 123| 3333| 123|
#| 1| 123| 3333| 456|
#| 1| 123| 3333| 789|
# -------- ---------- ---------------- -------

