I have a Spark Dataset containing a single column of ArrayType, where each array denotes the path from one user to another through their mutual friends:
| path |
|---|
| ["Amy","John","Wally"] |
| ["Beth","Sally","Tim","Jacob"] |
What I would like to end up with is a table that explicitly lists the edges in the paths, i.e. an edge list:
| src | dest |
|---|---|
| "Amy" | "John" |
| "John" | "Amy" |
| "John" | "Wally" |
| "Beth" | "Sally" |
| "Sally" | "Tim" |
| "Tim" | "Sally" |
| "Tim" | "Jacob" |
| "Jacob" | "Tim" |
How should I go about trying to transform the former table into the latter one?
CodePudding user response:
You can turn each list into a list of edges (pairs) by applying arrays_zip to two slices of the array: one without the last element and one without the first element. This creates an array of structs. Then explode the resulting array so that each struct ends up in a separate row, and split the struct column into two separate columns (e.g. with withColumn or select).
Then add the reversed edges (dest to src) and remove duplicates by using distinct.
I assume that you are working with a DataFrame and using the Spark SQL functions.
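The steps above could look roughly like this in Scala. This is only a sketch, assuming Spark 2.4 or later (for slice and arrays_zip), a column named path, and that every path holds at least two nodes. The names PathsToEdges, toEdgeList, src_nodes, dest_nodes and edge are made up for illustration, and the local SparkSession is only there to make the example runnable.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{arrays_zip, col, explode, expr}

object PathsToEdges {

  // Hypothetical helper: expects a DataFrame with an array<string> column "path".
  def toEdgeList(paths: DataFrame): DataFrame = {
    val edges = paths
      // Every node except the last one: the source of each edge.
      .withColumn("src_nodes", expr("slice(path, 1, size(path) - 1)"))
      // Every node except the first one: the destination of each edge.
      .withColumn("dest_nodes", expr("slice(path, 2, size(path) - 1)"))
      // Pair the two slices element-wise, then put each pair on its own row.
      .withColumn("edge", explode(arrays_zip(col("src_nodes"), col("dest_nodes"))))
      // The struct fields are assumed to be named after the zipped columns.
      .select(col("edge.src_nodes").as("src"), col("edge.dest_nodes").as("dest"))

    // Swap the columns to get the reversed edges; union matches columns by position.
    val reversed = edges.select(col("dest").as("src"), col("src").as("dest"))

    edges.union(reversed).distinct()
  }

  def main(args: Array[String]): Unit = {
    // Local session, for trying the sketch out only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("paths-to-edges")
      .getOrCreate()
    import spark.implicits._

    // Sample data from the question.
    val paths = Seq(
      Seq("Amy", "John", "Wally"),
      Seq("Beth", "Sally", "Tim", "Jacob")
    ).toDF("path")

    toEdgeList(paths).show(false)

    spark.stop()
  }
}
```

Note that this produces every edge in both directions, so on the sample data it also returns reversed pairs such as Wally to John that are not listed in the table in the question. If you only want the forward direction, skip the union with the reversed edges and keep just the distinct call.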
