from datetime import *
from pyspark.sql.functions import *
from pyspark.sql.types import *
columns = [ 'id1', 'id2', 'val' ]
ids = [ 'id1', 'id2' ]
vals1 = [
(1, 'a', 1),
(2, 'b', 4),
(4, None, 1),
]
df1 = spark.createDataFrame(data=vals1, schema=columns)
vals2 = [
(1, 'a', 5),
(3, 'c', 2),
(4, None, 2),
]
df2 = spark.createDataFrame(data=vals2, schema=columns)
df1 = df1.withColumnRenamed('val', 'val1')
df2 = df2.withColumnRenamed('val', 'val2')
res_df = df1.join(df2, ids, 'full')
res_df.sort(ids).show()
And the result is
--- ---- ---- ----
|id1| id2|val1|val2|
--- ---- ---- ----
| 1| a| 1| 5|
| 2| b| 4|null|
| 3| c|null| 2|
| 4|null| 1|null|
| 4|null|null| 2|
--- ---- ---- ----
i.e. there are 2 rows with the key [ 4, null ]
Is this behavior normal?
CodePudding user response:
This is expected behaviour. NULL does not equal NULL in comparisons as explained in the NULL Semantics section of the documentation:
Apache spark supports the standard comparison operators such as ‘>’, ‘>=’, ‘=’, ‘<’ and ‘<=’. The result of these operators is unknown or
NULLwhen one of the operands or both the operands are unknown orNULL.
(AFAIK, this is standard SQL behaviour and not specific to Spark)
Furthermore:
WHERE,HAVINGoperators filter rows based on the user specified condition. AJOINoperator is used to combine rows from two tables based on a join condition. For all the three operators, a condition expression is a boolean expression and can returnTrue,FalseorUnknown (NULL). They are “satisfied” if the result of the condition isTrue.
Comparing NULL to NULL produces NULL, which is not True, thus the (4, NULL) key in first table is not the same as the (4, NULL) key in the second table and so the join produces two distinct rows, one from joining the left row to a missing right row (| 4|null| 1|null|) and one from joining the right row to a missing left row (| 4|null|null| 2|).
