I'm trying to generate a column with a random number per each row, but this number has to be in range between of already existing column and -1. If I have:
customer existing_value
A -15
B -9
C -13
I would like to receive something like rand(existing_value, -1):
customer existing_value random_value
A -15 -3
B -9 -8
C -13 -6
I couldn't find a dedicated PySpark solution for this, the sql rand() function seems to be too limited. I tried to use the following code, but the function doesn't accept column as an input:
random_month.withColumn('random', randint(col('existing_value'), -1))
What would be a good solution here? The number of rows is about 100k, so if there is nothing appropriate in PySpark, pandas could be an option if necessary.
CodePudding user response:
You can use randint function with UDF:
from pyspark.sql import functions as F
df = spark.createDataFrame([("A", -15), ("B", -9), ("C", -13), ], ["customer", "existing_value"])
df1 = df.withColumn("random_value", F.udf(lambda x: randint(x, -1))("existing_value"))
df1.show()
# -------- -------------- ------------
#|customer|existing_value|random_value|
# -------- -------------- ------------
#| A| -15| -5|
#| B| -9| -7|
#| C| -13| -3|
# -------- -------------- ------------
Another solution by generating a sequence of numbers from existing_value to -1 then pick randomly one element from the resulting array:
df1 = df.withColumn(
"random_value",
F.expr("sequence(existing_value, -1, 1)")
).withColumn(
"random_value",
F.col("random_value")[F.floor(F.rand() * F.size("random_value"))]
)
CodePudding user response:
Use pandas' apply function for such operations on columns: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html. One possible solution could be the following:
import pandas as pd
import random
df = pd.DataFrame([-15, -20], columns=['existing_value'])
df['random_value'] = df.existing_value.apply(lambda row: random.randint(row,-1))
print(df)
