This is my code:
%spark.pyspark
df_05_body = spark.sql("""
select
gtin
, principalBody.constituents
from
v_df_04""")
df_05_body.createOrReplaceTempView("v_df_05_body")
df_05_body.printSchema()
This is the schema:
root
|-- gtin: array (nullable = true)
| |-- element: string (containsNull = true)
|-- constituents: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: struct (containsNull = true)
| | | |-- constituentCategory: struct (nullable = true)
| | | | |-- value: string (nullable = true)
| | | | |-- valueRange: string (nullable = true)
How to change the principalBody.constituents row in the SQL to read the fields constituentCategory.value and constituentCategory.valueRange?
CodePudding user response:
The column constituents is an array of arrays of structs. If your intent is to get a flat structure then you'll need to flatten the nested arrays, then explode:
df_05_body = spark.sql("""
WITH
v_df_04_exploded AS (
SELECT
gtin,
explode(flatten(principalBody.constituents)) AS constituent
FROM
v_df_04 )
SELECT
gtin,
constituent.constituentCategory.value,
constituent.constituentCategory.valueRange
FROM
v_df_04_exploded
""")
Or simply using inline after flatten like this:
df_05_body = spark.sql("""
SELECT
gtin,
inline(flatten(principalBody.constituents))
FROM
v_df_04_exploded
""")
