I'm understanding Spark pretty well conceptually but I'm still navigating the specific syntax options for expressions. I'm using Spark 3.3.0 with Java 17.
I have a table with a column foo that shouldn't have values much higher than 30. I want to add a column with the proportion of foo with 30, that is, the value of foo normalized to 0.0 <= proportion <= 1.0 with a cutoff of 30 for foo. Finally foo is a decimal type, represented in Java by BigDecimal.
I might make this calculation like this in Java:
BigDecimal maxVal = new BigDecimal(30);
BigDecimal proportion = foo.max(maxVal).divide(maxVal);
(Or I could first divide by maxVal and take the max of the result and 1.0, although the first approach above is probably more efficient and more direct semantically.)
How would I best do this in Spark? The following seems to work and gets me most of the way there:
BigDecimal maxVal = new BigDecimal(30);
df = df.withColumn("proportion", col("foo").divide(maxVal));
But what about the max() part so that I can cut off the value at 30? The only "max" function I could find offhand was the Spark SQL MAX() aggregate function.
CodePudding user response:
While min and max are aggregation functions that work on a group of rows, the functions least and greatest can compare values within a row or with a constant value.
import static org.apache.spark.sql.functions.*;
Dataset<Row> df = ...
BigDecimal maxVal=new BigDecimal(30);
df.withColumn("proportion", greatest(col("value"), lit(maxVal)).divide(maxVal)).show();
Output:
----- ----------
|value|proportion|
----- ----------
| 1| 1.000000|
| 10| 1.000000|
| 20| 1.000000|
| 30| 1.000000|
| 31| 1.033333|
| 40| 1.333333|
----- ----------
If the values should be capped at maxValue one would use least instead of greatest (this would correspond to using BigDecimal.min in the non-Spark code):
df.withColumn("proportion", least(col("value"), lit(maxVal)).divide(maxVal)).show();
Output now:
----- ----------
|value|proportion|
----- ----------
| 1| 0.033333|
| 10| 0.333333|
| 20| 0.666667|
| 30| 1.000000|
| 31| 1.000000|
| 40| 1.000000|
----- ----------
