Why is union() a narrow transformation and intersection() is a wide transformation in spark?-CodePudding

I am trying to understand the underlying concept in Spark from here. As far as I have understood, narrow transformation produce child RDDs that are transformed from a single parent RDD (might be multiple partitions of the same RDD). However, union and intersection both require two or more RDDs for the transformations to be performed. Can someone please clear this theoretically?

CodePudding user response：

No, your understanding is incorrect. A a narrow transformation is the one that only requires a single partition from the source to compute all elements of one partition of the output. union is therefore a narrow transformation, because to create an output partition, you only need the single partition from the source data.

Intersection on the other hand is wide, because even for a single partition of the output, it requires access to the entire content of (at least) one of the source rdds.

CodePudding user response：

Narrow transformations are operations with dependencies on a known set(1 or more) of partition(s) in the parent RDD. They can be executed on a subset of the data without any information about the other partitions.

In contrast, transformations with wide dependencies cannot be executed on arbitrary rows and instead require the data to be partitioned in a particular way. Transformations with wide dependencies includes anything that calls for repartition.