I have a dataframe with 2 columns
id Date
1 2022-01-01
1 2022-01-01
1 2022-01-01
1 2022-01-02
1 2022-01-02
1 2022-01-02
2 2022-01-01
2 2022-01-01
3 2022-01-01
3 2022-01-01
3 2022-01-01
3 2022-01-01
I need to find the duplicated rows (or duplicated IDs) when there are same values in all columns and those rows should appear more than twice.
The result should be like this
id Date
1 2022-01-01
1 2022-01-02
3 2022-01-01
CodePudding user response:
Here's a way to find rows duplicated more than twice:
df2 = df.assign(x=0).groupby(list(df.columns)).count().query('x>2').drop(columns='x').reset_index()
Ouput:
id Date
0 1 2022-01-01
1 1 2022-01-02
2 3 2022-01-01
Explanation:
- add a temporary column
x - use
groupby().count()to obtain unique combinations of theid, Datekey with the number of occurrences per group in columnx - use
query()to filter for rows that occur more than twice using the conditionx>2 - drop the temporary column
xand usereset_index()to restoreidandDateas columns.
CodePudding user response:
You can also use .value_counts instead of groupby, while the rest is similar to the answer of @constantstranger:
(df
# count duplicate rows
.value_counts()
# convert to DataFrame to query the number of occurrences later on
.to_frame()
# MultiIndex to column and renaming
.reset_index().rename(columns={0:'count'})
# Filter out only 3 duplicates
.query('count > 2')
# Remove the temporary columns
.drop(columns='count')
)
giving
id Date
0 3 2022-01-01
1 1 2022-01-01
2 1 2022-01-02
