I need to select a few columns (dynamically) from a dataframe to create an id out of it (maybe with sha256).
I don't know how to select a few columns, convert it to a map of column Name and Value to send to a function.
Here is the code that does not work
import org.apache.spark.sql._
val df = Seq(
("Java", "20000", "Compiled"),
("Python", "100000", "Interpreted"),
("Scala", "3000", "Compiled")).
toDF("language","users_count","type")
val neededColumns = Array("language","type")
def GenerateId(map:Map[String,String]) =
{
val builder = StringBuilder.newBuilder
for((k,v) <- map) {
builder.append(k)
builder.append("=")
builder.append(value)
}
sha2(builder.toString(), 256)
}
df.withColumn("id", GenerateId(df.columns.map(col)));
By selecting the neededColumns from the dataframe and sending to a function the desired output is
| language | users_count | type | id |
| -------- | ----------- |------------|-------|
| Java | 20000 |Compiled | hexid |
| Python | 100000 |Interpreted | hexid |
| Scala | 3000 |Compiled | hexid |
CodePudding user response:
You can use contact function to build the string column you want to pass to sha2 like this:
def GenerateId(cols: Seq[String]) = {
val concatExpr = concat_ws(",", cols.map(c => concat(lit(s"$c="), col(c))): _*)
sha2(concatExpr, 256)
}
df.withColumn("id", GenerateId(neededColumns)).show(false)
// -------- ----------- ----------- ----------------------------------------------------------------
//|language|users_count|type |id |
// -------- ----------- ----------- ----------------------------------------------------------------
//|Java |20000 |Compiled |5322d4f18b5f6b1330e2f2096bdfbdfdcc1dc7ea678982c35a424d3be3077e58|
//|Python |100000 |Interpreted|cabdf1501324922d11f4281628c77cf983c219e5f123764598793baf1a96488a|
//|Scala |3000 |Compiled |4cc0da85338aa336c42dcd101daca71c6571aa12b1f752799fc7db1652a38d63|
// -------- ----------- ----------- ----------------------------------------------------------------
Behind the scenes:
val concatExpr = concat_ws(",", neededColumns.map(c => concat(lit(s"$c="), col(c))): _*)
df.withColumn("map", concatExpr).withColumn("id", sha2(col("map"), 256)).show(false)
// -------- ----------- ----------- -------------------------------- ----------------------------------------------------------------
//|language|users_count|type |map |id |
// -------- ----------- ----------- -------------------------------- ----------------------------------------------------------------
//|Java |20000 |Compiled |language=Java,type=Compiled |5322d4f18b5f6b1330e2f2096bdfbdfdcc1dc7ea678982c35a424d3be3077e58|
//|Python |100000 |Interpreted|language=Python,type=Interpreted|cabdf1501324922d11f4281628c77cf983c219e5f123764598793baf1a96488a|
//|Scala |3000 |Compiled |language=Scala,type=Compiled |4cc0da85338aa336c42dcd101daca71c6571aa12b1f752799fc7db1652a38d63|
// -------- ----------- ----------- -------------------------------- ----------------------------------------------------------------
