Here is my dataset:
score
[0.3, 0.5]
[0.1, 0.6, 0.7]
Desired dataset:
score            rank
[0.3, 0.5]       [1, 2]
[0.1, 0.6, 0.7]  [1, 2, 3]
This is my initial attempt:
df_upd = df.withColumn("rank", F.array([F.lit(i) for i in range(1, F.size("score") + 1)]))
I get this error:
TypeError: range() integer end argument expected, got Column.
Is there a concise way to do this, or will I have to explode df and then create a rank column using Window functions?
Answer:
It looks like you just want to create a sequence from 1 to size(score). You can use the sequence function for that:
from pyspark.sql import functions as F
df = spark.createDataFrame([([0.3, 0.5],), ([0.1, 0.6, 0.7],)], ["score"])
df.withColumn("rank", F.expr("sequence(1, size(score))")).show()
#+---------------+---------+
#|          score|     rank|
#+---------------+---------+
#|     [0.3, 0.5]|   [1, 2]|
#|[0.1, 0.6, 0.7]|[1, 2, 3]|
#+---------------+---------+
