How to count specific rows?-CodePudding

Home > Enterprise > How to count specific rows?

How to count specific rows?

Time：12-27

I have an example of pyspark dataframe:

X	Y	Z	DATE
23	41	63	2016-01-01
23	41	5	2016-01-01
23	41	75	2016-01-01
23	41	46	2016-12-01
23	41	23	2016-12-01
27	41	5	2016-01-01
27	41	75	2016-01-01
27	41	85	2016-01-01
27	41	71	2016-01-01

What I want is to count rows with the same X, Y and DATE columns and store the value in a new column.

Final dataframe should looks like this:

X	Y	Z	DATE	SUM
23	41	63	2016-01-01	3
23	41	5	2016-01-01	3
23	41	75	2016-01-01	3
23	41	46	2016-12-01	2
23	41	23	2016-12-01	2
27	41	5	2016-01-01	4
27	41	75	2016-01-01	4
27	41	85	2016-01-01	4
27	41	71	2016-01-01	4

CodePudding user response：

This might help (assuming you mistyped and wanted Count instead of sum):

from pyspark.sql.functions import count

df = df.withColumn("Count", count("*").over(Window.partitionBy("X", "Y", "Date")))

Page link：https//www.codepudding.com/Enterprise/654807.html

Prev:Why some rows disappear when adding new rows in specific place?

Next:How do I create a new dataframe from another existing one by applying a changing formula?

Tags：

pythondataframepysparkaggregate

Related

Links：
CodePudding