this is my test json file.
{
"item" : {
"fracData" : [ ],
"fractimeData" : [ {
"number" : "1232323232",
"timePeriods" : [ {
"validFrom" : "2021-08-03"
} ]
} ],
"Module" : [ ]
}
}
This is how I read the json file.
starhist_test_df = spark.read.json("/mapr/xxx/yyy/ttt/dev/rawdata/Test.json", multiLine=True)
starhist_test_df.createOrReplaceTempView("v_test_df")
This query works.
df_test_01 = spark.sql("""
select item.fractimeData.number from v_test_df""")
df_test_01.collect();
Result
[Row(number=['1232323232'])]
But this query doesn't work.
df_test_01 = spark.sql("""
select item.fractimeData.timePeriods.validFrom from v_test_df""")
df_test_01.collect();
Error
cannot resolve 'v_test_df.`item`.`fractimeData`.`timePeriods`['validFrom']' due to data type mismatch: argument 2 requires integral type, however, ''validFrom'' is of string type.; line 3 pos 0;
What do I have to change, to read the validFrom field?
CodePudding user response:
try built in function "transform".
It will resolve your query
CodePudding user response:
dot notation to access values works with struct or array<struct> types.
The schema for field number in item.fractimeData is string and accessing it via dot notation returns an array<string> since fractimeData is an array.
Similarly, the schema for field timePeriods in item.fractimeData is <array<struct<validFrom>>, and accessing it via dot notation wraps it into another array, resulting in final schema of array<array<struct<validFrom>>>.
The error you get is because the dot notation can work on array<struct> but not on array<array>.
Hence, flatten the result from item.fractimeData.timePeriods to get back an array<struct<validFrom>> and then apply the dot notation.
df_test_01 = spark.sql("""
select flatten(item.fractimeData.timePeriods).validFrom as validFrom from v_test_df""")
df_test_01.collect()
"""
[Row(validFrom=['2021-08-03', '2021-08-03'])]
"""
