How to read a string value in JSON array struct?-CodePudding

This is my code:

%spark.pyspark
    
df_05_body = spark.sql("""
     select 
    gtin
    , principalBody.constituents
 from 
v_df_04""")

df_05_body.createOrReplaceTempView("v_df_05_body")

df_05_body.printSchema()

This is the schema:

root
 |-- gtin: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- constituents: array (nullable = true)
 |    |-- element: array (containsNull = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- constituentCategory: struct (nullable = true)
 |    |    |    |    |-- value: string (nullable = true)
 |    |    |    |    |-- valueRange: string (nullable = true)

How to change the principalBody.constituents row in the SQL to read the fields constituentCategory.value and constituentCategory.valueRange?

CodePudding user response：

The column constituents is an array of arrays of structs. If your intent is to get a flat structure then you'll need to flatten the nested arrays, then explode:

df_05_body = spark.sql("""
    WITH
      v_df_04_exploded AS (
      SELECT
        gtin,
        explode(flatten(principalBody.constituents)) AS constituent
      FROM
        v_df_04 )

    SELECT
      gtin,
      constituent.constituentCategory.value,
      constituent.constituentCategory.valueRange
    FROM
      v_df_04_exploded
""")

Or simply using inline after flatten like this:

df_05_body = spark.sql("""
    SELECT
      gtin,
      inline(flatten(principalBody.constituents))
    FROM
      v_df_04_exploded
""")