Home > Software design >  Spark Scala - I need nth row of the group
Spark Scala - I need nth row of the group

Time:01-18

I exploded a column and got the dataframe as below:

 ------------ ----------- -------------------- 
|serialnumber|   roomname|             devices|
 ------------ ----------- -------------------- 
|hello       |Living Room|             device1|
|hello       |Living Room|             device2|
|hello       |Living Room|             device3|
|hello       |Living Room|             device4|
|hello       |Living Room|             device5|
|hello       |    Kitchen|             device1|
|hello       |    Kitchen|             device2|
|hello       |    Kitchen|             device3|
|hello       |    Kitchen|             device4|
|hello       |    Kitchen|             device5|
|hello       |   Bedroom1|             device1|
|hello       |   Bedroom1|             device2|
|hello       |   Bedroom1|             device3|
|hello       |   Bedroom1|             device4|
|hello       |   Bedroom1|             device5|
|hello       |  Bedroom 2|             device1|
|hello       |  Bedroom 2|             device2|
|hello       |  Bedroom 2|             device3|
|hello       |  Bedroom 2|             device4|
|hello       |  Bedroom 2|             device5|
|hello       |   Bedroom3|             device1|
|hello       |   Bedroom3|             device2|
|hello       |   Bedroom3|             device3|
|hello       |   Bedroom3|             device4|
|hello       |   Bedroom3|             device5|
 ------------ ----------- -------------------- 

Now I want a dataframe as below, that means 1st of Living room, 2nd of Kitchen, 3rd of Bedroom1 and so on....

    ------------ ----------- -------------------- 
    |serialnumber|   roomname|             devices|
     ------------ ----------- -------------------- 
    |hello       |Living Room|             device1|
    |hello       |    Kitchen|             device2|
    |hello       |   Bedroom1|             device3|
    |hello       |  Bedroom 2|             device4|
    |hello       |  Bedroom 3|             device5|
     ------------ ----------- -------------------- 

CodePudding user response:

Here is how you can do it with groupBy and window function but you need to know an Order of the column.

import org.apache.spark.sql.functions._

val window = Window.partitionBy("serialnumber").orderBy("roomname")

df.groupBy("serialnumber", "roomname")
  .agg(collect_list("devices").as("devices"))
  .withColumn("index", rank().over(window))
  .withColumn("devices", element_at($"devices", $"index"))
  .drop("index")
  .show(false)

Output:

 ------------ ----------- ------- 
|serialnumber|roomname   |devices|
 ------------ ----------- ------- 
|hello       |Bedroom 2  |device1|
|hello       |Bedroom1   |device2|
|hello       |Bedroom3   |device3|
|hello       |Kitchen    |device4|
|hello       |Living Room|device5|
 ------------ ----------- ------- 

CodePudding user response:

From what I understand the problem you are having is you lose the order of roomname when you are using explode.

Assuming that roomname is of type Array[..], instead of using explode you can use posexplode


 val df = Seq(
    ("hello", List[String]("room1", "room2")),
    ("hello1", List[String]("room1", "room2"))
  ).toDF("serial", "roomname")


df.select(posexplode($"roomname")).show()

will give you the following output

 --- ----- 
|pos|  col|
 --- ----- 
|  0|room1|
|  1|room2|
|  0|room1|
|  1|room2|
 --- ----- 

you can then select the "n-th" row as required by adding a filter

  •  Tags:  
  • Related