PySpark: random number from range (based on a column)-CodePudding

I'm trying to generate a column with a random number per each row, but this number has to be in range between of already existing column and -1. If I have:

customer existing_value 
   A          -15
   B          -9
   C          -13

I would like to receive something like rand(existing_value, -1):

customer existing_value random_value
   A          -15            -3
   B          -9             -8
   C          -13            -6

I couldn't find a dedicated PySpark solution for this, the sql rand() function seems to be too limited. I tried to use the following code, but the function doesn't accept column as an input:

random_month.withColumn('random', randint(col('existing_value'), -1))

What would be a good solution here? The number of rows is about 100k, so if there is nothing appropriate in PySpark, pandas could be an option if necessary.

CodePudding user response：

You can use randint function with UDF:

from pyspark.sql import functions as F

df = spark.createDataFrame([("A", -15), ("B", -9), ("C", -13), ], ["customer", "existing_value"])

df1 = df.withColumn("random_value", F.udf(lambda x: randint(x, -1))("existing_value"))

df1.show()
# -------- -------------- ------------ 
#|customer|existing_value|random_value|
# -------- -------------- ------------ 
#|       A|           -15|          -5|
#|       B|            -9|          -7|
#|       C|           -13|          -3|
# -------- -------------- ------------

Another solution by generating a sequence of numbers from existing_value to -1 then pick randomly one element from the resulting array:

df1 = df.withColumn(
    "random_value",
    F.expr("sequence(existing_value, -1, 1)")
).withColumn(
    "random_value",
    F.col("random_value")[F.floor(F.rand() * F.size("random_value"))]
)

CodePudding user response：

Use pandas' apply function for such operations on columns: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html. One possible solution could be the following:

import pandas as pd
import random

df = pd.DataFrame([-15, -20], columns=['existing_value'])
df['random_value'] = df.existing_value.apply(lambda row: random.randint(row,-1))
print(df)