How do I create a new column has the count of all the row values that are greater than 0 in pyspark?-CodePudding

Suppose I have a pyspark data frame as:

col1 col2  col3
1     2    -3
2     null  5
4     4     8
1     0     9

I want to add a column called "check" where it counts the number of values that are greater than 0.

The final output will be:

col1 col2  col3     check
    1     2    -3    2
    2     null  5    2
    4     4     8    3
    1     0     9    2

I was trying this. But, it didn't help and errors out as below:

df= df.withColumn("check", sum((df[col] > 0) for col in df.columns))
Invalid argument, not a string or column: <generator object <genexpr> at 0x7f0a866ae580> of type <class 'generator'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.

CodePudding user response：

Don't know if there is a simpler SQL based solution or not, but it's pretty straight forward with a udf.

count_udf = udf(lambda arr: sum([1 for a in arr if a > 0]), IntegerType())
df.withColumn('check', count_udf(array('col1', 'col2', 'col3'))).show()

Not sure if it'll handle nulls. Add null check (if a and a > 0) in udf if needed.

Idea: https://stackoverflow.com/a/42540401/496289

Your code shows you doing a sum of non-zero columns, not count. If you need sum then

count_udf = udf(lambda arr: sum([a for a in arr if a > 0]), IntegerType())

CodePudding user response：

Create a new column array and filter the newly created column finally count the elements in the column.

Example:

df.show(10,False)
# ---- ---- ---- 
#|col1|col2|col3|
# ---- ---- ---- 
#|1   |2   |-3  |
#|2   |null|5   |
# ---- ---- ---- 

df.withColumn("check",expr("size(filter(array(col1,col2), x -> x > 0))")).show(10,False)
# ---- ---- ---- ----- 
#|col1|col2|col3|check|
# ---- ---- ---- ----- 
#|1   |2   |-3  |2    |
#|2   |null|5   |1    |
# ---- ---- ---- -----