How I can convert a array (in a column) with a set of elements in a JSON dataset to multiple columns with python, spark or pandas? The data is structured in this form:
root
|-- items: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: string (nullable = true)
| | |-- idAccount: long (nullable = true)
| | |-- infractionType: string (nullable = true)
| | |-- responseTime: string (nullable = true)
| | |-- status: string (nullable = true)
| | |-- transactionCode: string (nullable = true)
I'm expecting some kind of this:
| id | idAccount |
|---|---|
| value | value |
| value | value |
CodePudding user response:
In Spark SQL, you can access the item in ArrayType or MapType column by using getItem. For example, you want to get the value of the id of first item, you can use df.select(func.getItem(0).getItem('id'))
CodePudding user response:
an array of structs can be exploded into columns using the inline sql function.
here's an example of how it works.
data_sdf = spark.createDataFrame([([(1234, 2345, 3456), (4321, 5432, 6543)],)],
'items array<struct<id: int, id_acc: int, foo: int>>'
)
# ----------------------------------------
# |items |
# ----------------------------------------
# |[{1234, 2345, 3456}, {4321, 5432, 6543}]|
# ----------------------------------------
# root
# |-- items: array (nullable = true)
# | |-- element: struct (containsNull = true)
# | | |-- id: integer (nullable = true)
# | | |-- id_acc: integer (nullable = true)
# | | |-- foo: integer (nullable = true)
# explode and create new columns using struct fields - using `inline`
data_sdf. \
selectExpr('inline(items)'). \
show()
# ---- ------ ----
# | id|id_acc| foo|
# ---- ------ ----
# |1234| 2345|3456|
# |4321| 5432|6543|
# ---- ------ ----
you can further just select() the required fields after the explosion.
