When I read articles on Sort Merge Join, it says this is the most preferred one in spark after Broadcast join, but only if joining keys are sortable. My question is when can a joining key be unsortable? Any datatype can be sorted. Could you help me understand a scenario where keys may not be sortable?
CodePudding user response:
See https://www.waitingforcode.com/apache-spark-sql/sort-merge-join-spark-sql/read. Excellent site.
Not all types can be sorted. E.g CalendarIntervalType.
Quoting:
"for not sortable keys the sort merge join" should "not be used" in {
import sparkSession.implicits._
// Here we explicitly define the schema. Thanks to that we can show
// the case when sort-merge join won't be used, i.e. when the key is not sortable
// (there are other cases - when broadcast or shuffle joins can be chosen over sort-merge
// but it's not shown here).
// Globally, a "sortable" data type is:
// - NullType, one of AtomicType
// - StructType having all fields sortable
// - ArrayType typed to sortable field
// - User Defined DataType backed by a sortable field
// The method checking sortability is org.apache.spark.sql.catalyst.expressions.RowOrdering.isOrderable
// As you see, CalendarIntervalType is not included in any of above points,
// so even if the data structure is the same (id login for customers, id customer id amount for orders)
// with exactly the same number of rows, the sort-merge join won't be applied here.
This is an old post, since v3 a comparison can be made. https://spark.apache.org/docs/3.0.0/api/scala/org/apache/spark/sql/types/CalendarIntervalType.html
But it demonstrates the point.
Also, what about non-equi joins?
