My DataFrame has multiple columns. I need to add a new column holding the size of the whole row, that is, the sum of the sizes of all its columns. Is there a simple and efficient way to do this? Thanks.
Here is the sample:
val DataFrame = Seq(("Alice", "He is girl"), ("Bob", "She is girl"), ("Ben", null)).toDF("name","string")
display(DataFrame)
I want to add a column to df that holds the sum of the lengths of all columns. This sample has only two columns, but my actual df has a hundred.
CodePudding user response:
val df = Seq(("Alice", "He is girl"),
("Bob", "She is girl"), ("Ben", null)).toDF("name","string")
scala> df.show
+-----+-----------+
| name|     string|
+-----+-----------+
|Alice| He is girl|
|  Bob|She is girl|
|  Ben|       null|
+-----+-----------+
Replace null values with empty strings, so that their length counts as 0:
val dfNoNull = df.na.fill("")
scala> dfNoNull.show
+-----+-----------+
| name|     string|
+-----+-----------+
|Alice| He is girl|
|  Bob|She is girl|
|  Ben|           |
+-----+-----------+
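Create a list of column expressions by applying the length function to each column:
// spark-shell imports these automatically; a standalone program needs the import
import org.apache.spark.sql.functions.{col, length}
val cols = dfNoNull.columns.map(x => length(col(x)))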
Select data based on these columns/expressions:
val dfColCounts = dfNoNull.select(cols:_*)
scala> dfColCounts.show
+------------+--------------+
|length(name)|length(string)|
+------------+--------------+
|           5|            10|
|           3|            11|
|           3|             0|
+------------+--------------+
Get the names of these new columns:
val countCols = dfColCounts.columns.map(x => col(x))
Apply reduce to sum up all the column values, which are ints by now:
val dfPerRowCounts = dfColCounts
  .withColumn("countPerRow", countCols.reduce(_ + _))
  .select("countPerRow")
Result:
scala> dfPerRowCounts.show
+-----------+
|countPerRow|
+-----------+
|         15|
|         14|
|          3|
+-----------+
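Since the question asks for the new column to be added to the original df, the steps above could also be folded into a single expression that keeps the existing columns. This is only a sketch under some assumptions: all columns are strings, the names RowSizeSketch and withRowSize are made up for illustration, and coalesce(..., lit(0)) is used in place of na.fill("") to treat nulls as length 0. Column names containing dots would need backtick quoting, which is not handled here.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{coalesce, col, length, lit}

// Hypothetical helper object, not part of Spark.
object RowSizeSketch {

  // Adds "countPerRow" = sum of the lengths of all columns,
  // counting a null value as length 0. Assumes string columns.
  def withRowSize(df: DataFrame): DataFrame = {
    val rowSize = df.columns
      .map(c => coalesce(length(col(c)), lit(0))) // length(null) is null, so map it to 0
      .reduce(_ + _)                              // one Column expression summing all lengths
    df.withColumn("countPerRow", rowSize)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("row-size-sketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(("Alice", "He is girl"), ("Bob", "She is girl"), ("Ben", null))
      .toDF("name", "string")

    // Original columns are kept; countPerRow is appended.
    withRowSize(df).show()

    spark.stop()
  }
}
```

Because the sum is built as one Column expression, Spark evaluates it in a single projection over df, so there is no intermediate DataFrame to create even with a hundred columns.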
