I have a sample df
id  name
1   John Walter walter
2   Adam Smith Smith
3   Steve Rogers rogers
How can I check, for each row, whether a word in name is duplicated (ignoring case) and pop the duplicate out? Expected output:
id  name                 is_duplicated  poped_out_string  corrected_name
1   John Walter walter   1              walter            John walter
2   Adam Smith Smith     1              walter            Adam Smith
3   Steve Rogers rogers  1              walter            Steve Rogers
CodePudding user response:
df['dup'] = df.apply(lambda x: x.drop_duplicates().to_string(index=False), axis=1)
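Assuming your data frame is named df, this drops values that are repeated across the columns of each row and joins the remaining values into one string. It compares whole cell values, so a word repeated inside a single name string is not removed.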
CodePudding user response:
One way using more_itertools.unique_everseen
from more_itertools import unique_everseen
def unique(arr, key):
    return " ".join(unique_everseen(arr, key=key))
df["name"].str.split().apply(unique, key=str.lower)
Output:
0 John Walter
1 Adam Smith
2 Steve Rogers
Name: name, dtype: object
If you don't want more_itertools, you can still use unique_everseen from itertools recipes:
from itertools import filterfalse
def unique_everseen(iterable, key=None):
    "List unique elements, preserving order. Remember all elements ever seen."
    # unique_everseen('AAAABBBCCDAABBB') --> A B C D
    # unique_everseen('ABBCcAD', str.lower) --> A B C D
    seen = set()
    seen_add = seen.add
    if key is None:
        for element in filterfalse(seen.__contains__, iterable):
            seen_add(element)
            yield element
    else:
        for element in iterable:
            k = key(element)
            if k not in seen:
                seen_add(k)
                yield element
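The question also asks for an is_duplicated flag and the popped-out word, which unique_everseen alone does not return. Here is a minimal sketch that uses the same seen-set idea to collect both the kept and the dropped words. The helper name split_repeats and its return layout are assumptions made for illustration, not part of any library, and it keeps the first occurrence of each word (so "John Walter", not "John walter" as in the expected output).

```python
def split_repeats(name, key=str.lower):
    """Return (is_duplicated, popped_out, corrected) for one name string.

    Hypothetical helper: words are compared via `key`; the first occurrence
    of each word is kept and later repeats are collected as popped-out words.
    """
    seen = set()
    kept, popped = [], []
    for word in name.split():
        k = key(word)
        if k in seen:
            popped.append(word)   # a repeat: pop it out
        else:
            seen.add(k)
            kept.append(word)     # first time this word is seen
    return int(bool(popped)), " ".join(popped), " ".join(kept)


for row in ["John Walter walter", "Adam Smith Smith", "Steve Rogers rogers"]:
    print(split_repeats(row))
# (1, 'walter', 'John Walter')
# (1, 'Smith', 'Adam Smith')
# (1, 'rogers', 'Steve Rogers')
```

With pandas, one way to spread the tuples into columns would be pd.DataFrame(df["name"].map(split_repeats).tolist(), index=df.index, columns=["is_duplicated", "popped_out_string", "corrected_name"]).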
CodePudding user response:
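Another way is to use a set to de-duplicate and a collections.Counter to get the duplicated values. Note that a set does not preserve word order, so the words in corrected_name may come out reordered:
from collections import Counter
df['corrected_name'] = df['name'].str.split().apply(lambda x: ' '.join(set(map(str.lower, x)))).str.title()
# join with a space so that several duplicated words do not run together
df['popped_out_string'] = df['name'].str.split().apply(lambda x: ' '.join(k for k, v in Counter(map(str.lower, x)).items() if v > 1))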
Output:
   id  name                 corrected_name  popped_out_string
0  1   John Walter walter   John Walter     walter
1  2   Adam Smith Smith     Smith Adam      smith
2  3   Steve Rogers rogers  Rogers Steve    rogers
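In the output above, "Smith Adam" and "Rogers Steve" come out reordered because a set has no defined order. If the original word order matters, a possible variant is to de-duplicate with dict.fromkeys, which keeps insertion order on Python 3.7+. This is only a sketch: dedupe_title is a hypothetical helper name, and it works on plain strings so it can be passed to df['name'].apply.

```python
from collections import Counter


def dedupe_title(name):
    """Return (corrected_name, popped_out_string) for one name string.

    Hypothetical helper: lowercases the words, drops repeats while keeping
    the original word order, then title-cases the result.
    """
    words = [w.lower() for w in name.split()]
    # dict.fromkeys keeps the first occurrence of each key, in order
    corrected = " ".join(dict.fromkeys(words)).title()
    # words that occur more than once are the popped-out ones
    popped = " ".join(w for w, n in Counter(words).items() if n > 1)
    return corrected, popped


print(dedupe_title("Adam Smith Smith"))     # ('Adam Smith', 'smith')
print(dedupe_title("Steve Rogers rogers"))  # ('Steve Rogers', 'rogers')
```

As with the original answer, title-casing rewrites names with internal capitals ("McDonald" becomes "Mcdonald"), so it may not suit every data set.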
