Add new element to nested array of structs pyspark-CodePudding

I have a dataframe with the following schema using pyspark:

|-- suborders: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- trackingStatusHistory: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- trackingStatusUpdatedAt: string (nullable = true)
 |    |    |    |    |-- trackingStatus: string (nullable = true)

What I want to do is create a new deliveredat element for each suborders array using conditions.

I need to find the date within the trackingStatusHistory array where trackingStatusHistory.trackingStatus = 'delivered'. If this trackingStatus exists, the new deliveredat element will receive the date in trackingStatusHistory.trackingStatusUpdatedAt. If doesn't exist, receive null.

How can I do this using pyspark?

CodePudding user response：

You can do that using higher-order functions transform filter on arrays. For each struct element of suborders array you add a new field by filtering the sub-array trackingStatusHistory and getting the delivery date, like this:

import pyspark.sql.functions as F

df = df.withColumn(
    "suborders",
    F.expr("""transform(
                suborders, 
                x -> struct(
                        filter(x.trackingStatusHistory, y -> y.trackingStatus = 'delivered')[0].trackingStatusUpdatedAt as deliveredAt,
                        x.trackingStatusHistory as trackingStatusHistory
                        )
                )
    """)
)