Looking to create column of "rank arrays" based on another column of Array(Float) type

Time:01-26

Here is my dataset:

score  
[0.3, 0.5]
[0.1, 0.6, 0.7]

Desired Dataset:

score            rank 
[0.3, 0.5]      [1, 2]
[0.1, 0.6, 0.7] [1, 2, 3]

This is my initial attempt:

df_upd = df.withColumn("rank", F.array([F.lit(i) for i in range(1, F.size("score") + 1)]))

I get this error:

TypeError: range() integer end argument expected, got Column.

I'm wondering if there is a concise way to do this, or will I have to explode df and then create a rank column using Window functions?

CodePudding user response:

It looks like you just want to create a sequence from 1 to size(score). You can use the sequence function for that:

from pyspark.sql import functions as F

df = spark.createDataFrame([([0.3, 0.5],), ([0.1, 0.6, 0.7],)], ["score"])

df.withColumn("rank", F.expr("sequence(1, size(score))")).show()

#+---------------+---------+
#|          score|     rank|
#+---------------+---------+
#|     [0.3, 0.5]|   [1, 2]|
#|[0.1, 0.6, 0.7]|[1, 2, 3]|
#+---------------+---------+