Pandas: best way to remove rows where columns match any set of values in a list of tuples?

Time:02-02

I have a DataFrame with two columns, A and B, and a list of tuples. I want to remove every row whose (A, B) pair matches any tuple in the list. For example:

Input:

A B
A 1
A 4
B 2
A 3
[(A,1),(C,4),(A,3)]

Output:

A B
A 4
B 2

CodePudding user response:

You can use zip with a list comprehension:

tuples = [('A', 1), ('C', 4), ('A', 3)]
new_df = df[[x not in tuples for x in zip(df['A'], df['B'])]]

Output:

>>> new_df
   A  B
1  A  4
2  B  2
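
As a self-contained sketch of the same idea, the snippet below rebuilds the frame from the question and converts the list to a set first, so each `not in` check is a hash lookup rather than a scan of the whole list. The names `blocked` and `keep` are only illustrative and are not part of the answer above.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({"A": ["A", "A", "B", "A"], "B": [1, 4, 2, 3]})
tuples = [("A", 1), ("C", 4), ("A", 3)]

# Illustrative name: a set gives average O(1) membership tests
blocked = set(tuples)

# One boolean per row; True means the row is kept
keep = [pair not in blocked for pair in zip(df["A"], df["B"])]
new_df = df[keep]
print(new_df)
```

For a list of three tuples the difference is negligible; the set should only start to matter when the list of tuples is long.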

CodePudding user response:

I think the best you can do here is to put all the "blacklisted" tuples into a set (i.e. hash them) and perform a membership test on each row in your list. Each membership test takes constant time on average, so the overall time complexity of this algorithm is O(n + m), with n being the number of items in your list and m being the number of items in your blacklist.

def solve(arr, blacklist):
    # Hash the blacklist once so each lookup is O(1) on average
    S = set(blacklist)
    result = [None] * len(arr)  # preallocated; trimmed on return
    idx = 0
    for i in range(len(arr)):
        if arr[i] not in S:
            result[idx] = arr[i]
            idx += 1
    return result[:idx]
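
To tie this back to the DataFrame in the question, here is a minimal sketch that feeds the rows through the same set-based filter and rebuilds a frame from what is left. It assumes the rows are extracted with `itertuples(index=False, name=None)`, which yields plain tuples; the function body is a compact variant of the loop above, not a different algorithm.

```python
import pandas as pd

def solve(arr, blacklist):
    """Return the items of arr that are not in blacklist, preserving order."""
    blocked = set(blacklist)  # built once: O(m)
    return [item for item in arr if item not in blocked]  # n lookups: O(n)

df = pd.DataFrame({"A": ["A", "A", "B", "A"], "B": [1, 4, 2, 3]})
tuples = [("A", 1), ("C", 4), ("A", 3)]

# Each row becomes a plain (A, B) tuple
rows = list(df.itertuples(index=False, name=None))
kept = solve(rows, tuples)

# Rebuilding the frame this way produces a fresh 0..k-1 index
new_df = pd.DataFrame(kept, columns=df.columns)
print(new_df)
```

Note that this route discards the original row labels, so it only fits when the index carries no meaning.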

CodePudding user response:

A "pure" pandas solution (whatever that means):

df[~df.set_index(['A','B']).index.isin(tuples)]

Output:

    A   B
1   A   4
2   B   2
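
To make the intermediate mask visible, here is a small sketch of a closely related variant. It assumes `pd.MultiIndex.from_frame` in place of `set_index(...).index`; both produce a MultiIndex over the (A, B) pairs, and `isin` then checks each pair against the list.

```python
import pandas as pd

df = pd.DataFrame({"A": ["A", "A", "B", "A"], "B": [1, 4, 2, 3]})
tuples = [("A", 1), ("C", 4), ("A", 3)]

# Build a MultiIndex from the two columns and test each (A, B) pair
mask = pd.MultiIndex.from_frame(df[["A", "B"]]).isin(tuples)
print(mask)  # boolean NumPy array, one entry per row

# Invert the mask to keep only the rows that did not match
new_df = df[~mask]
print(new_df)
```

Because the mask is a plain NumPy array, it is applied by position, so the original row labels of `df` are kept in the result.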

CodePudding user response:

Zip the two columns into a pandas Series and call isin on it, which avoids an explicit Python for loop and should be faster. Note: this is based on "How to filter a pandas DataFrame according to a list of tuples".

tuples = [('A',1),('C',4),('A',3)]
new_df = df[~pd.Series(list(zip(df['A'], df['B']))).isin(tuples)] # no for loop
>>> new_df
    A   B
1   A   4
2   B   2
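
One caveat worth sketching: the temporary Series gets a fresh 0..n-1 index, and pandas aligns a boolean Series mask on index labels, so the one-liner relies on `df` having the default index. A minimal variant that passes `index=df.index` is shown below; the non-default index `[10, 11, 12, 13]` and the name `pairs` are assumptions made purely for illustration.

```python
import pandas as pd

# Same data as the question, but with a non-default index (illustrative)
df = pd.DataFrame(
    {"A": ["A", "A", "B", "A"], "B": [1, 4, 2, 3]},
    index=[10, 11, 12, 13],
)
tuples = [("A", 1), ("C", 4), ("A", 3)]

# Reuse df's index so the boolean mask lines up with the rows
pairs = pd.Series(list(zip(df["A"], df["B"])), index=df.index)
new_df = df[~pairs.isin(tuples)]
print(new_df)
```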