Drop duplicates that share part of a string within a dataframe-CodePudding

Given the following dataframe

Is there a way to remove duplicates ignoring certain characters? '-A' in this case. The expected result would be:

Number
148621
456
645121-A
1234123

CodePudding user response：

You can create a temporary column and use that as the filter:

(
    df.assign(tmp=df.Number.map(lambda s: s.rstrip("-A")))
      .drop_duplicates("tmp", keep="last")
      .drop("tmp", axis=1, inplace=False)
)

CodePudding user response：

Stripping it to make them duplicates to check uniqueness and then filtering:

>>> is_uniq_custom = ~df.Number.str.rstrip("-A").duplicated(keep="last")
>>> is_uniq_custom 
0     True
1    False
2     True
3     True
4     True
Name: Number, dtype: bool

>>> df.Number[is_uniq_custom]
0      148621
2         456
3    645121-A
4     1234123
Name: Number, dtype: object

CodePudding user response：

For a general use case, you could use regex

df["clean_number"] = df['Number'].str.extract("([0-9] )")

This will extract the first instance of a number seen in the string.

For example:

s = pd.Series(["12345a", "12345-A", "abc12345", "1234abc3435"])
>>> s.str.extract("([0-9] )")
      0
0  12345
1  12345
2  12345
3   1234

Now if you want to drop duplicates then run the following:

df.drop_duplicates(subset='clean_number', keep="last")

Set keep="first" if you want to keep the first row. Although it looks like you only gave a pandas series, so you could also run

df["clean_number"].unique()

This will give you a numpy array of the unique numbers in your list.