The functions.expr("[SQL]") can be used as an alternative way to query in so many cases, for instance:
df2=df.withColumn("gender", expr("CASE WHEN gender = 'M' THEN 'Male' "
"WHEN gender = 'F' THEN 'Female' ELSE 'unknown' END"))
which is equal to
df2=df.withColumn("gender", when(col("gender") == "M", "Male")
.when(col("gender") == "F", "Female")
.otherwise("Unknown")
I am wondering, does it have a performance difference?
And what about the following example (which functions API doesn't have an out-of-box solution to add hours)?
df = df.withColumn('testing_time', df.testing_time expr('INTERVAL 2 HOURS'))
VS
df = df.withColumn("testing_time", (unix_timestamp("testing_time") 7200).cast('timestamp'))
Finally, do you suggest to use functions.expr where ever it could be?
CodePudding user response:
-
does it have a performance difference?
No, both versions are identic in every aspect, including performance.
from pyspark.sql import functions as F df = spark.createDataFrame([("M",), ("F",)], ["gender"]) df2 = df.withColumn("gender", F.when(F.col("gender") == "M", "Male") .when(F.col("gender") == "F", "Female") .otherwise("Unknown")) df3 = df.withColumn("gender", F.expr("CASE WHEN gender = 'M' THEN 'Male' " "WHEN gender = 'F' THEN 'Female' ELSE 'Unknown' END"))PySpark code doesn't directly make Spark run the algorithm. It creates logical and physical plans which actually run the algorithm. You can inspect them and compare - they are identic.
df2.explain() # == Physical Plan == # *(1) Project [CASE WHEN (gender#49 = M) THEN Male WHEN (gender#49 = F) THEN Female ELSE Unknown END AS gender#51] # - *(1) Scan ExistingRDD[gender#49] df3.explain() # == Physical Plan == # *(1) Project [CASE WHEN (gender#49 = M) THEN Male WHEN (gender#49 = F) THEN Female ELSE Unknown END AS gender#53] # - *(1) Scan ExistingRDD[gender#49] df2.sameSemantics(df3) # Available in Spark 3.1 # True Regarding the use of
expr, use it- when you don't have an equivalent in PySpark
- when your Spark version doesn't yet support PySpark equivalent
- when PySpark function expects a value, but you want to provide a column (e.g. this case)
Otherwise, it often looks cleaner when written in PySpark.
