Suppose I have a pyspark data frame as:
col1 col2 col3
1 2 -3
2 null 5
4 4 8
1 0 9
I want to add a column called "check" where it counts the number of values that are greater than 0.
The final output will be:
col1 col2 col3 check
1 2 -3 2
2 null 5 2
4 4 8 3
1 0 9 2
I was trying this. But, it didn't help and errors out as below:
df= df.withColumn("check", sum((df[col] > 0) for col in df.columns))
Invalid argument, not a string or column: <generator object <genexpr> at 0x7f0a866ae580> of type <class 'generator'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
CodePudding user response:
Don't know if there is a simpler SQL based solution or not, but it's pretty straight forward with a udf.
count_udf = udf(lambda arr: sum([1 for a in arr if a > 0]), IntegerType())
df.withColumn('check', count_udf(array('col1', 'col2', 'col3'))).show()
Not sure if it'll handle nulls. Add null check (if a and a > 0) in udf if needed.
Idea: https://stackoverflow.com/a/42540401/496289
Your code shows you doing a sum of non-zero columns, not count. If you need sum then
count_udf = udf(lambda arr: sum([a for a in arr if a > 0]), IntegerType())
CodePudding user response:
Create a new column array and filter the newly created column finally count the elements in the column.
Example:
df.show(10,False)
# ---- ---- ----
#|col1|col2|col3|
# ---- ---- ----
#|1 |2 |-3 |
#|2 |null|5 |
# ---- ---- ----
df.withColumn("check",expr("size(filter(array(col1,col2), x -> x > 0))")).show(10,False)
# ---- ---- ---- -----
#|col1|col2|col3|check|
# ---- ---- ---- -----
#|1 |2 |-3 |2 |
#|2 |null|5 |1 |
# ---- ---- ---- -----
