Home > Blockchain >  PySpark: random number from range (based on a column)
PySpark: random number from range (based on a column)

Time:01-22

I'm trying to generate a column with a random number per each row, but this number has to be in range between of already existing column and -1. If I have:

customer existing_value 
   A          -15
   B          -9
   C          -13

I would like to receive something like rand(existing_value, -1):

customer existing_value random_value
   A          -15            -3
   B          -9             -8
   C          -13            -6

I couldn't find a dedicated PySpark solution for this, the sql rand() function seems to be too limited. I tried to use the following code, but the function doesn't accept column as an input:

random_month.withColumn('random', randint(col('existing_value'), -1))

What would be a good solution here? The number of rows is about 100k, so if there is nothing appropriate in PySpark, pandas could be an option if necessary.

CodePudding user response:

You can use randint function with UDF:

from pyspark.sql import functions as F

df = spark.createDataFrame([("A", -15), ("B", -9), ("C", -13), ], ["customer", "existing_value"])

df1 = df.withColumn("random_value", F.udf(lambda x: randint(x, -1))("existing_value"))

df1.show()
# -------- -------------- ------------ 
#|customer|existing_value|random_value|
# -------- -------------- ------------ 
#|       A|           -15|          -5|
#|       B|            -9|          -7|
#|       C|           -13|          -3|
# -------- -------------- ------------ 

Another solution by generating a sequence of numbers from existing_value to -1 then pick randomly one element from the resulting array:

df1 = df.withColumn(
    "random_value",
    F.expr("sequence(existing_value, -1, 1)")
).withColumn(
    "random_value",
    F.col("random_value")[F.floor(F.rand() * F.size("random_value"))]
)

CodePudding user response:

Use pandas' apply function for such operations on columns: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html. One possible solution could be the following:

import pandas as pd
import random

df = pd.DataFrame([-15, -20], columns=['existing_value'])
df['random_value'] = df.existing_value.apply(lambda row: random.randint(row,-1))
print(df)
  •  Tags:  
  • Related