Home > Software design >  Spark Scala: How to pass multiple selected columns to a function?
Spark Scala: How to pass multiple selected columns to a function?

Time:01-23

I need to select a few columns (dynamically) from a dataframe to create an id out of it (maybe with sha256).

I don't know how to select a few columns, convert it to a map of column Name and Value to send to a function.

Here is the code that does not work

import org.apache.spark.sql._
val df = Seq(
    ("Java", "20000", "Compiled"), 
    ("Python", "100000", "Interpreted"), 
    ("Scala", "3000", "Compiled")).
    toDF("language","users_count","type")

val neededColumns = Array("language","type")

def GenerateId(map:Map[String,String]) =
{
    val builder = StringBuilder.newBuilder
    for((k,v) <- map) {
        builder.append(k)
        builder.append("=")
        builder.append(value)
    }
    sha2(builder.toString(), 256)
}

df.withColumn("id", GenerateId(df.columns.map(col)));

By selecting the neededColumns from the dataframe and sending to a function the desired output is

| language | users_count | type       | id    |
| -------- | ----------- |------------|-------|
| Java     | 20000       |Compiled    | hexid |
| Python   | 100000      |Interpreted | hexid |
| Scala    | 3000        |Compiled    | hexid |

CodePudding user response:

You can use contact function to build the string column you want to pass to sha2 like this:

def GenerateId(cols: Seq[String]) = {
  val concatExpr = concat_ws(",", cols.map(c => concat(lit(s"$c="), col(c))): _*)
  sha2(concatExpr, 256)
}

df.withColumn("id", GenerateId(neededColumns)).show(false)

// -------- ----------- ----------- ---------------------------------------------------------------- 
//|language|users_count|type       |id                                                              |
// -------- ----------- ----------- ---------------------------------------------------------------- 
//|Java    |20000      |Compiled   |5322d4f18b5f6b1330e2f2096bdfbdfdcc1dc7ea678982c35a424d3be3077e58|
//|Python  |100000     |Interpreted|cabdf1501324922d11f4281628c77cf983c219e5f123764598793baf1a96488a|
//|Scala   |3000       |Compiled   |4cc0da85338aa336c42dcd101daca71c6571aa12b1f752799fc7db1652a38d63|
// -------- ----------- ----------- ---------------------------------------------------------------- 

Behind the scenes:

val concatExpr = concat_ws(",", neededColumns.map(c => concat(lit(s"$c="), col(c))): _*)
df.withColumn("map", concatExpr).withColumn("id", sha2(col("map"), 256)).show(false)

// -------- ----------- ----------- -------------------------------- ---------------------------------------------------------------- 
//|language|users_count|type       |map                             |id                                                              |
// -------- ----------- ----------- -------------------------------- ---------------------------------------------------------------- 
//|Java    |20000      |Compiled   |language=Java,type=Compiled     |5322d4f18b5f6b1330e2f2096bdfbdfdcc1dc7ea678982c35a424d3be3077e58|
//|Python  |100000     |Interpreted|language=Python,type=Interpreted|cabdf1501324922d11f4281628c77cf983c219e5f123764598793baf1a96488a|
//|Scala   |3000       |Compiled   |language=Scala,type=Compiled    |4cc0da85338aa336c42dcd101daca71c6571aa12b1f752799fc7db1652a38d63|
// -------- ----------- ----------- -------------------------------- ---------------------------------------------------------------- 
  •  Tags:  
  • Related