I have a dataframe of duplicated Email Addresses.
| ID | EmailAddress | Name | Country | Distance | IDLen | NonNAN |
|---|---|---|---|---|---|---|
| 39203920 | [email protected] | John | UK | 12 | 8 | 6 |
| 32323 | [email protected] | NaN | UK | 12 | 5 | 5 |
I have created two additional columns that calculate the length of the ID and how many non NaN fields are for each row. I would like to create 2 new dataframes:
df1. Where the duplicate row has either the higher
NonNANvalue; or if they're the same, pick the lowestIDLen.df2. The remaining rows
I was thinking of using the df.duplicated() function but it only looks at first or last and I need something more sophisticated.
Thanks in advance.
CodePudding user response:
You can absolutely use duplicated, you just have to sort your data according to your conditions.
Given df:
ID EmailAddress Name Country Distance IDLen NonNAN
0 39203920 [email protected] John UK 12 8 6
1 32323 [email protected] NaN UK 12 5 5
2 2423 [email protected] Bob AUS 32 2 4
3 24233 [email protected] Robert AUS 32 2 5
Doing:
df = df.sort_values(['NonNAN', 'ID'], ascending=[False, True])
mask = df.duplicated('EmailAddress')
df1 = df[~mask]
df2 = df[mask]
print(df1)
print(df2)
# Output df1:
ID EmailAddress Name Country Distance IDLen NonNAN
0 39203920 [email protected] John UK 12 8 6
3 24233 [email protected] Robert AUS 32 2 5
# Output df2:
ID EmailAddress Name Country Distance IDLen NonNAN
1 32323 [email protected] NaN UK 12 5 5
2 2423 [email protected] Bob AUS 32 2 4
CodePudding user response:
Will this work for you? The key is to sort the data, then apply df.duplicated(), which has very high efficiency rather than looping through each record like .apply(lambda) functions
import pandas as pd
import numpy as np
df = pd.DataFrame({
'ID': [39203920, 32323, 22222, 392999],
'EmailAddress': ['[email protected]', '[email protected]', '[email protected]', '[email protected]'],
'Name': ['John', np.nan, 'Jane', 'John'],
'Country': ['UK', 'UK', 'UK', 'UK'],
'Distance': [12, 12, 12, 12],
'IDLen': [8, 5, 5, 6],
'NonNAN': [6, 5, 6, 6] })
df = df.sort_values(['EmailAddress', 'NonNAN', 'IDLen'], ascending=[True, False, True])
ID EmailAddress Name Country Distance IDLen NonNAN
2 22222 [email protected] Jane UK 12 5 6
1 32323 [email protected] NaN UK 12 5 5
3 392999 [email protected] John UK 12 6 6
0 39203920 [email protected] John UK 12 8 6
Based on your rules, I have sorted the data so that the desired record is located first. When df.duplicated() is applied on EmailAddress, the first record will be kept
df1 = df[~df.duplicated('EmailAddress')]
ID EmailAddress Name Country Distance IDLen NonNAN
2 22222 [email protected] Jane UK 12 5 6
3 392999 [email protected] John UK 12 6 6
df2 = df[df.duplicated('EmailAddress')]
ID EmailAddress Name Country Distance IDLen NonNAN
1 32323 [email protected] NaN UK 12 5 5
0 39203920 [email protected] John UK 12 8 6
If your ID column is numerical (ie, not alphanumeric), you can sort based on ascending ID, and there is no need for the column IDLen (because you would like the shortest one if 'NonNAN' is the same)
CodePudding user response:
You can create a boolean mask of rows you want to select for df1 - then automatically the inverse of this mask will select the rows for df2.
For each group where EmailAddress is the same:
maskselects all the rows whereNonNANhas a maximum value.if
maskselects multiple values -it must also select the rows where
IDLenis the minimum.if
maskstill selects multiple values -just take the first one.
def f(df):
mask = df['NonNAN'] == df['NonNAN'].max()
if mask.sum() > 1:
mask = mask & (df['IDLen'] == df.loc[mask, 'IDLen'].min())
if mask.sum() > 1:
mask.iloc[mask.argmax() 1 : ] = False
return mask
mask = df.groupby('EmailAddress', group_keys=False).apply(f)
mask = mask.reindex(df.index)
df1 = df[mask]
df2 = df[~mask]
