I exploded a column and got the dataframe as below:
------------ ----------- --------------------
|serialnumber| roomname| devices|
------------ ----------- --------------------
|hello |Living Room| device1|
|hello |Living Room| device2|
|hello |Living Room| device3|
|hello |Living Room| device4|
|hello |Living Room| device5|
|hello | Kitchen| device1|
|hello | Kitchen| device2|
|hello | Kitchen| device3|
|hello | Kitchen| device4|
|hello | Kitchen| device5|
|hello | Bedroom1| device1|
|hello | Bedroom1| device2|
|hello | Bedroom1| device3|
|hello | Bedroom1| device4|
|hello | Bedroom1| device5|
|hello | Bedroom 2| device1|
|hello | Bedroom 2| device2|
|hello | Bedroom 2| device3|
|hello | Bedroom 2| device4|
|hello | Bedroom 2| device5|
|hello | Bedroom3| device1|
|hello | Bedroom3| device2|
|hello | Bedroom3| device3|
|hello | Bedroom3| device4|
|hello | Bedroom3| device5|
------------ ----------- --------------------
Now I want a dataframe as below, that means 1st of Living room, 2nd of Kitchen, 3rd of Bedroom1 and so on....
------------ ----------- --------------------
|serialnumber| roomname| devices|
------------ ----------- --------------------
|hello |Living Room| device1|
|hello | Kitchen| device2|
|hello | Bedroom1| device3|
|hello | Bedroom 2| device4|
|hello | Bedroom 3| device5|
------------ ----------- --------------------
CodePudding user response:
Here is how you can do it with groupBy and window function but you need to know an Order of the column.
import org.apache.spark.sql.functions._
val window = Window.partitionBy("serialnumber").orderBy("roomname")
df.groupBy("serialnumber", "roomname")
.agg(collect_list("devices").as("devices"))
.withColumn("index", rank().over(window))
.withColumn("devices", element_at($"devices", $"index"))
.drop("index")
.show(false)
Output:
------------ ----------- -------
|serialnumber|roomname |devices|
------------ ----------- -------
|hello |Bedroom 2 |device1|
|hello |Bedroom1 |device2|
|hello |Bedroom3 |device3|
|hello |Kitchen |device4|
|hello |Living Room|device5|
------------ ----------- -------
CodePudding user response:
From what I understand the problem you are having is you lose the order of roomname when you are using explode.
Assuming that roomname is of type Array[..], instead of using explode you can use posexplode
val df = Seq(
("hello", List[String]("room1", "room2")),
("hello1", List[String]("room1", "room2"))
).toDF("serial", "roomname")
df.select(posexplode($"roomname")).show()
will give you the following output
--- -----
|pos| col|
--- -----
| 0|room1|
| 1|room2|
| 0|room1|
| 1|room2|
--- -----
you can then select the "n-th" row as required by adding a filter
