I have the table below and would like to apply onde command to compare and eliminate duplicate values in row n and n 1 in multiple dataframes (df1, df2).
Comand sugestion: .diff().ne(0)
How to apply this command only to the elements of columns A ,C and D, using the commands def ,lambda or apply?
df1:
| A | B |
|---|---|
| 22 | 33 |
| 22 | 4 |
| 3 | 55 |
| 1 | 55 |
df2:
| C | D |
|---|---|
| 5 | 2.3 |
| 45 | 33 |
| 7 | 33 |
| 7 | 11 |
The expected output is:
df1:
| A | B |
|---|---|
| 22 | 33 |
| NaN | 4 |
| 3 | 55 |
| 1 | 55 |
df2:
| C | D |
|---|---|
| 5 | 2.3 |
| 45 | 33 |
| 7 | NaN |
| NaN | 11 |
The other desired option would be to delete the duplicated lines, keeping the first number.
df1:
| A | B |
|---|---|
| 22 | 33 |
| row deleted | row deleted |
| 3 | 55 |
| row deleted | row deleted |
df2:
| C | D |
|---|---|
| 5 | 2.3 |
| 45 | 33 |
| row deleted | row deleted |
| row deleted | row deleted |
CodePudding user response:
Based on this answer, you can create a mask for a single column in your dataframe (here for example for column A) with
mask1 = df['A'].shift() == df['A']
Since this shows True if there was a duplicate, you need to slice the DataFrame with the negation of the mask
df = df[~mask1]
To do this for multiple columns, make a mask for each column and use NumPy's logical_or to combine the masks. Then slice df with the final mask.
CodePudding user response:
With your suggested command: .diff().ne(0) (or .diff.eq(0))
Option 1: set NaN to duplicate values
# For 1 column
df1.loc[df1['A'].diff().eq(0), 'A'] = np.NaN
print(df1)
A B
0 22.0 33
1 NaN 4
2 3.0 55
3 1.0 55
# For multiple columns
df2 = df2.apply(lambda x: x[x.diff().ne(0)])
print(df2)
C D
0 5.0 2.3
1 45.0 33.0
2 7.0 NaN
3 NaN 11.0
Option 2: delete rows
>>> df1[df1.diff().ne(0).all(axis=1)]
A B
0 22 33
2 3 55
>>> df2[df2.diff().ne(0).all(axis=1)]
C D
0 5 2.3
1 45 33.0
