Check whether each element of a pandas DataFrame column is in a set of values

We have a pandas DataFrame df and a set of values set_vals.

For a particular column (let's say 'name'), I would now like to compute a new column which is True whenever the value of df['name'] is in set_vals and False otherwise.

One way to do this is to write:

df['name'].apply(lambda x : x in set_vals)

but when both df and set_vals become large, this method is very slow. Is there a more efficient way of creating this new column?

CodePudding user response：

The real problem is that df['name'].apply(lambda x : x in set_vals) has complexity O(M*N), where M is the length of df and N is the length of set_vals, if set_vals is a list (or another type whose membership test takes linear time).

The complexity improves to O(M) if set_vals is stored in a hash-based container (a set or a dict), because each membership test then takes O(1) time on average.
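As a minimal sketch of that idea, the snippet below converts the values to a set once and then builds the boolean column in two ways: with the original apply call, and with pandas' built-in Series.isin. The sample data, the vals_list name, and the new column names are made up for illustration; only df, 'name', and set_vals come from the question.

```python
import pandas as pd

# Hypothetical sample data; only df, 'name' and set_vals come from the question.
df = pd.DataFrame({"name": ["alice", "bob", "carol", "dave"]})
vals_list = ["bob", "dave", "erin"]  # a list: every `in` test scans it linearly

# Convert once to a hash-based container so each lookup is O(1) on average.
set_vals = set(vals_list)

# The original approach, now with constant-time lookups per row.
df["in_set_apply"] = df["name"].apply(lambda x: x in set_vals)

# Vectorized alternative: Series.isin returns a boolean Series in one call.
df["in_set"] = df["name"].isin(set_vals)

print(df)
```

Both columns hold the same values. On large frames the isin version is usually the one worth trying first, since it avoids calling a Python function for every row.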

CodePudding user response：

It is a complex problem with a simple solution. You can try splitting the rows into slices and running the check on each slice in its own thread:

let's say [0:i], [i+1:j], [j+1:k], etc.
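One possible way to sketch this chunked approach uses ThreadPoolExecutor from the standard library. The sample data, the n_chunks value, and the check_chunk helper are assumptions made for illustration, and the answer does not say which threading API it had in mind. Whether threads are actually faster than a single call depends on how much of the work releases Python's global interpreter lock, so it is worth timing both.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import pandas as pd

# Hypothetical sample data; only df, 'name' and set_vals come from the question.
df = pd.DataFrame({"name": ["alice", "bob", "carol", "dave", "erin", "frank"]})
set_vals = {"bob", "dave", "erin"}

n_chunks = 3  # assumed number of slices, one worker thread per slice

# Split the row positions into contiguous, non-overlapping ranges.
position_chunks = np.array_split(np.arange(len(df)), n_chunks)


def check_chunk(positions):
    """Run the membership test on one slice of the 'name' column."""
    return df["name"].iloc[positions].isin(set_vals)


# pool.map returns results in the same order as the input chunks.
with ThreadPoolExecutor(max_workers=n_chunks) as pool:
    parts = list(pool.map(check_chunk, position_chunks))

# Each part keeps its original index, so the pieces line up with df again.
df["in_set"] = pd.concat(parts)

print(df)
```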

Here is a very good explanation of how to use multiple threads

Also, if you are interested in more details about performance and efficiency, check this out.