How to get whole row's size in df using scala

Time:01-29

My DataFrame has multiple columns. I need to add a new column holding the size of the whole row, i.e. the sizes of all columns added together. Is there a simple and efficient way to do this? Thanks.

Here is the sample:

val DataFrame = Seq(("Alice", "He is girl"), ("Bob", "She is girl"), ("Ben", null)).toDF("name","string") 
display(DataFrame) 

I want to add a column to the df that sums the lengths of all columns in each row. This sample has only two columns, but the real df has on the order of a hundred.

CodePudding user response:

val df = Seq(("Alice", "He is girl"), 
   ("Bob", "She is girl"), ("Ben", null)).toDF("name","string")

scala> df.show
+-----+-----------+
| name|     string|
+-----+-----------+
|Alice| He is girl|
|  Bob|She is girl|
|  Ben|       null|
+-----+-----------+

Get rid of null values:

val dfNoNull = df.na.fill("")

scala> dfNoNull.show
+-----+-----------+
| name|     string|
+-----+-----------+
|Alice| He is girl|
|  Bob|She is girl|
|  Ben|           |
+-----+-----------+

Build a list of column expressions by applying the length function to each column:

import org.apache.spark.sql.functions.{col, length}
val cols = dfNoNull.columns.map(x => length(col(x)))

Select data based on these columns/expressions:

val dfColCounts = dfNoNull.select(cols:_*)

scala> dfColCounts.show
+------------+--------------+
|length(name)|length(string)|
+------------+--------------+
|           5|            10|
|           3|            11|
|           3|             0|
+------------+--------------+

Get the new column names as Column objects:

val countCols = dfColCounts.columns.map(x => col(x))

Apply reduce to sum all the column values, which are ints by now:

val dfPerRowCounts = dfColCounts
   .withColumn("countPerRow", countCols.reduce(_ + _))
   .select("countPerRow")

Result:

scala> dfPerRowCounts.show
+-----------+
|countPerRow|
+-----------+
|         15|
|         14|
|          3|
+-----------+
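
The steps above leave only the countPerRow column, while the question asks for the column to be added to the original df. A minimal sketch of one way to do that in a single pass follows. The RowSize object and withRowSize helper are hypothetical names, not part of the answer above. The sketch assumes column names without dots or backticks, and it swaps na.fill("") for coalesce so that a null in any column counts as length 0.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{coalesce, col, length, lit}

object RowSize {
  // Hypothetical helper (not from the answer above): appends a column holding
  // the summed character length of every column in the row.
  def withRowSize(df: DataFrame, name: String = "countPerRow"): DataFrame = {
    // length(null) is null, so coalesce maps missing values to 0.
    // Assumes plain column names; col(c) would misread a name containing a dot.
    val lengths = df.columns.map(c => coalesce(length(col(c)), lit(0)))
    // foldLeft with lit(0) also covers a DataFrame that has no columns.
    df.withColumn(name, lengths.foldLeft(lit(0))(_ + _))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("row-size")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(("Alice", "He is girl"), ("Bob", "She is girl"), ("Ben", null))
      .toDF("name", "string")

    // Keeps name and string, and adds countPerRow next to them.
    withRowSize(df).show()
    spark.stop()
  }
}
```

On the sample data this should give the same 15, 14 and 3 as the result above, with the original columns still in place. Because the sum is built as one column expression, it does not need the intermediate dfNoNull and dfColCounts DataFrames.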