How to read field from nested json?-CodePudding

this is my test json file.

{
  "item" : {
    "fracData" : [ ],
    "fractimeData" : [ {
      "number" : "1232323232",
      "timePeriods" : [ {
        "validFrom" : "2021-08-03"
      } ]
    } ],
    "Module" : [ ]
  }
}

This is how I read the json file.

starhist_test_df = spark.read.json("/mapr/xxx/yyy/ttt/dev/rawdata/Test.json", multiLine=True)
starhist_test_df.createOrReplaceTempView("v_test_df")

This query works.

df_test_01 = spark.sql("""
 select item.fractimeData.number from v_test_df""")
df_test_01.collect();

Result

[Row(number=['1232323232'])]

But this query doesn't work.

df_test_01 = spark.sql("""
 select item.fractimeData.timePeriods.validFrom from v_test_df""")
df_test_01.collect();

Error

cannot resolve 'v_test_df.`item`.`fractimeData`.`timePeriods`['validFrom']' due to data type mismatch: argument 2 requires integral type, however, ''validFrom'' is of string type.; line 3 pos 0;

What do I have to change, to read the validFrom field?

CodePudding user response：

try built in function "transform".

It will resolve your query

CodePudding user response：

dot notation to access values works with struct or array<struct> types.

The schema for field number in item.fractimeData is string and accessing it via dot notation returns an array<string> since fractimeData is an array.

Similarly, the schema for field timePeriods in item.fractimeData is <array<struct<validFrom>>, and accessing it via dot notation wraps it into another array, resulting in final schema of array<array<struct<validFrom>>>.

The error you get is because the dot notation can work on array<struct> but not on array<array>.

Hence, flatten the result from item.fractimeData.timePeriods to get back an array<struct<validFrom>> and then apply the dot notation.

df_test_01 = spark.sql("""
 select flatten(item.fractimeData.timePeriods).validFrom as validFrom from v_test_df""")
df_test_01.collect()

"""
[Row(validFrom=['2021-08-03', '2021-08-03'])]
"""