I have the following data.
data = [
[
"2022-12-12",
["IND", "u1", [["auction_1", [[1,20], [2,12]]], ["auction_2", [[1,5], [2,7]]]]],
],
[
"2022-12-12",
["USA", "u2", [["auction_1", [[1,8], [2,12]]], ["auction_2", [[1,11], [2,4]]]]],
],
]
I have the following schema
actionSchema = T.StructType([
T.StructField("amountSpent", T.LongType()),
T.StructField("timeSpent", T.LongType())
])
actionsSchema = T.StructType(
[T.StructField("action1", actionSchema), T.StructField("action2", actionSchema)]
)
userSchema = T.ArrayType(
T.StructType(
[
T.StructField("refferalId", T.StringType()),
T.StructField("actions", actionsSchema),
]
)
)
dataSchema = T.StructType(
[
T.StructField("country", T.StringType()),
T.StructField("userId", T.StringType()),
T.StructField("users", userSchema),
]
)
schema = T.StructType(
[T.StructField("date", T.StringType()), T.StructField("data", dataSchema)]
)
df = spark.createDataFrame(data=data, schema=schema)
df.printSchema()
# it has the following schema
root
|-- date: string (nullable = true)
|-- data: struct (nullable = true)
| |-- country: string (nullable = true)
| |-- userId: string (nullable = true)
| |-- users: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- refferalId: string (nullable = true)
| | | |-- actions: struct (nullable = true)
| | | | |-- action1: struct (nullable = true)
| | | | | |-- amountSpent: long (nullable = true)
| | | | | |-- timeSpent: long (nullable = true)
| | | | |-- action2: struct (nullable = true)
| | | | | |-- amountSpent: long (nullable = true)
| | | | | |-- timeSpent: long (nullable = true)
I would like to have to data in the below format for further analysis.
date, country, userId, refferalId, action, amountSpent, timeSpent
2022-12-31, IND, 123, 123213, action1, 5, 56
display(df.select(F.explode("data")))
# cannot resolve 'explode(data)' due to data type mismatch: input to function explode should be an array or map type
Any help would be really appreciated.
If we can not explode any StructType how can I achieve the above data format?
I went through these questions also, but didn’t get much help -> Error while exploding a struct column in Spark
CodePudding user response:
You must explode data.users :
df.select('date', 'data.country', 'data.userId', F.explode('data.users').alias('info'))
For the actions you need a query like below (after exploding data.users):
.select('date', 'country', 'userId', 'info.refferalId', F.explode('actions').alias('actionInfo'))
But because you defined actions as struct it can't be exploded. If you change its schema to list the code will works fine
CodePudding user response:
This is basically a task where you need to transform a bunch of your data to get it to the form you want. You'll need a combination of the pyspark.sql.functions to get where you want.
If we start from your df:
output = df.select("date", "data.country", "data.userId", explode(col("data.users")).alias("users")) \
.select("date", "country", "userId", "users.*") \
.withColumn("actions", explode(array(
struct("actions.action1.*", lit("action1").alias("action")),
struct("actions.action2.*", lit("action2").alias("action"))
)
)) \
.select("date", "country", "userId", "refferalId", "actions.*")
output.printSchema()
root
|-- date: string (nullable = true)
|-- country: string (nullable = true)
|-- userId: string (nullable = true)
|-- refferalId: string (nullable = true)
|-- amountSpent: long (nullable = true)
|-- timeSpent: long (nullable = true)
|-- action: string (nullable = false)
output.show()
---------- ------- ------ ---------- ----------- --------- -------
| date|country|userId|refferalId|amountSpent|timeSpent| action|
---------- ------- ------ ---------- ----------- --------- -------
|2022-12-12| IND| u1| auction_1| 1| 20|action1|
|2022-12-12| IND| u1| auction_1| 2| 12|action2|
|2022-12-12| IND| u1| auction_2| 1| 5|action1|
|2022-12-12| IND| u1| auction_2| 2| 7|action2|
|2022-12-12| USA| u2| auction_1| 1| 8|action1|
|2022-12-12| USA| u2| auction_1| 2| 12|action2|
|2022-12-12| USA| u2| auction_2| 1| 11|action1|
|2022-12-12| USA| u2| auction_2| 2| 4|action2|
---------- ------- ------ ---------- ----------- --------- -------
The operations, transformation per transformation:
- The first,
selectstatement unwraps thedatastruct and explodes thedata.usersarray - Second,
selectstatement unwraps theusersstruct - Third,
withColumnstatement is a bit more complicated. At this point we have 2 structs (action1andaction2) that have the same schema. What we're doing here is:- adding a literal column,
actionwith either valueaction1oraction2to ouractionscolumns - putting those 2 similar columns in an array using the
arrayfunction - exploding that array
- adding a literal column,
- Fourth, select statement is to unwrap the
actionsstruct that we created
Hope this helps!
CodePudding user response:
The problem is that you cannot explode structs. You can only explode arrays or maps. The first step you need to take is to explode data.users (and not just data). You can do it this way:
users = df\
.withColumn("s", F.explode("data.users"))\
.select("date", "data.country", "data.userId", "s.*")
users.show()
---------- ------- ------ ---------- ------------------
| date|country|userId|refferalId| actions|
---------- ------- ------ ---------- ------------------
|2022-12-12| IND| u1| auction_1|{{1, 20}, {2, 12}}|
|2022-12-12| IND| u1| auction_2| {{1, 5}, {2, 7}}|
|2022-12-12| USA| u2| auction_1| {{1, 8}, {2, 12}}|
|2022-12-12| USA| u2| auction_2| {{1, 11}, {2, 4}}|
---------- ------- ------ ---------- ------------------
From there, you want to explode the actions but as before, you cannot explode structs. To overcome that, you can convert it into an array of structs.
users\
.withColumn("actions", F.array(
[ F.struct(
F.lit(f"action{i}").alias("action"),
F.col("actions")[f"action{i}"].alias("meta")
) for i in [1, 2] ]
))\
.withColumn("action", F.explode("actions"))\
.select("date", "country", "userId", "refferalId", "action.action", "action.meta.*")\
.show()
---------- ------- ------ ---------- ------- ----------- ---------
| date|country|userId|refferalId| action|amountSpent|timeSpent|
---------- ------- ------ ---------- ------- ----------- ---------
|2022-12-12| IND| u1| auction_1|action1| 1| 20|
|2022-12-12| IND| u1| auction_1|action2| 2| 12|
|2022-12-12| IND| u1| auction_2|action1| 1| 5|
|2022-12-12| IND| u1| auction_2|action2| 2| 7|
|2022-12-12| USA| u2| auction_1|action1| 1| 8|
|2022-12-12| USA| u2| auction_1|action2| 2| 12|
|2022-12-12| USA| u2| auction_2|action1| 1| 11|
|2022-12-12| USA| u2| auction_2|action2| 2| 4|
---------- ------- ------ ---------- ------- ----------- ---------
