Given the following dataframe
Number
148621
645121
456
645121-A
1234123
Is there a way to remove duplicates ignoring certain characters? '-A' in this case. The expected result would be:
Number
148621
456
645121-A
1234123
CodePudding user response:
You can create a temporary column and use that as the filter:
(
df.assign(tmp=df.Number.map(lambda s: s.rstrip("-A")))
.drop_duplicates("tmp", keep="last")
.drop("tmp", axis=1, inplace=False)
)
CodePudding user response:
Stripping it to make them duplicates to check uniqueness and then filtering:
>>> is_uniq_custom = ~df.Number.str.rstrip("-A").duplicated(keep="last")
>>> is_uniq_custom
0 True
1 False
2 True
3 True
4 True
Name: Number, dtype: bool
>>> df.Number[is_uniq_custom]
0 148621
2 456
3 645121-A
4 1234123
Name: Number, dtype: object
CodePudding user response:
For a general use case, you could use regex
df["clean_number"] = df['Number'].str.extract("([0-9] )")
This will extract the first instance of a number seen in the string.
For example:
s = pd.Series(["12345a", "12345-A", "abc12345", "1234abc3435"])
>>> s.str.extract("([0-9] )")
0
0 12345
1 12345
2 12345
3 1234
Now if you want to drop duplicates then run the following:
df.drop_duplicates(subset='clean_number', keep="last")
Set keep="first" if you want to keep the first row. Although it looks like you only gave a pandas series, so you could also run
df["clean_number"].unique()
This will give you a numpy array of the unique numbers in your list.
