Home > Blockchain >  PySpark - Creating a UDF to Concatenate Two Columns of Lists Into a List of Lists
PySpark - Creating a UDF to Concatenate Two Columns of Lists Into a List of Lists

Time:02-05

Suppose I have a dataframe with columns A, B, C, D, E. Each of these can be comprised of a list of values or null. I would like to concatenate these values into a final column F that consists of a list of lists that ignores the null values and preserves the original columns' order.

Ex input: [a,b,c] | [b,c,d] | null | null | [z] Ex output: [[a,b,c], [b,c,d], [z]]

Unfortunately, concat_ws flattens everything, so I believe I must use a UDF. Does anyone have a solution to this problem?

CodePudding user response:

As you may know, it's common sense that UDF is generally not a good idea in pyspark and it seems to me that it's possible to solve it using sql functions.

You can start by creating your ArrayList of ArrayLists using array function:

df.withColumn('FWithNulls', array('A', 'B', 'C', 'D', 'E'))

And then remove the null values using array_except:

df.withColumn('F', array_except('FWithNulls', array(lit(None))))

Tested with pyspark 3.1.2:

from pyspark.sql.functions import lit, array, array_except
from pyspark.sql.types import StringType, ArrayType

(
  spark.createDataFrame(
    [
      {
        'A': ['a', 'b', 'c'],
        'B': ['b', 'c', 'd'],
        'E': ['z']
      }
    ]
  )
  .withColumn('C', lit(None).cast(ArrayType(StringType(), True)))
  .withColumn('D', lit(None).cast(ArrayType(StringType(), True)))
  .withColumn('FWithNulls', array('A', 'B', 'C', 'D', 'E'))
  .withColumn('F', array_except('FWithNulls', array(lit(None))))
  .show(vertical=True, truncate=False)
)
-RECORD 0---------------------------------------------
 A          | [a, b, c]                               
 B          | [b, c, d]                               
 E          | [z]                                     
 C          | null                                    
 D          | null                                    
 FWithNulls | [[a, b, c], [b, c, d], null, null, [z]] 
 F          | [[a, b, c], [b, c, d], [z]]         

It seems like a good idea to solve this with those two narrow transformations.

  •  Tags:  
  • Related