Home > Software design >  Use high order function to add calculated field in array of structs in Spark query
Use high order function to add calculated field in array of structs in Spark query

Time:02-08

I have a table with a column of array of structs with this schema:

root
 |-- id: long (nullable = true)
 |-- mainColumn: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- fieldA: string (nullable = true)
 |    |    |-- fieldB: string (nullable = true)
 |    |    |-- fieldC: string (nullable = true)
 |    |    |-- fieldD: string (nullable = true)
 |-- randomOtherColumn: short (nullable = true)

I need to create a query that returns the entire table but applies a function to the fieldA of every entry (an encryption function). I have tried using the transform() function but I also need to return the other fields (fieldB , fieldC, etc).

SELECT
  x.id,
  x.randomOtherColumn,
  transform(y -> ???)
FROM
  my_table x

Something like this would be simple in JavaScript with the spread operator:

df.map(x => (
  { 
    x.id, 
    x.randomOtherColumn, 
    x.mainColumn.map(y => ({ ...y, fieldA: encrypt(y.fieldA) })) 
  }
))

CodePudding user response:

You need to recreate the inner struct elements of the array like this:

SELECT
  id,
  randomOtherColumn,
  transform(mainColumn, x -> struct(encrypt(fieldA) as 'fieldA', fieldB as 'fieldB', fieldC as 'fieldC', fieldD as 'fieldD')) as mainColumn
FROM
  my_table

Or using DataFrame API with column method withField (since Spark 3.1 ):

val df1 = df.withColumn(
  "mainColumn",
  transform(col("mainColumn"), x => x.withField("fieldA", encrypt(x("fieldA"))))
)

As you also tagged pyspark:

from pyspark.sql import functions as F

df1 = df.withColumn(
  "mainColumn",
  F.transform("mainColumn", lambda x: x.withField("fieldA", encrypt(x["fieldA"])))
)
  •  Tags:  
  • Related