I have a pandas DataFrame df and a set of values set_vals.
For a particular column (let's say 'name'), I would now like to compute a new column which is True whenever the value of df['name'] is in set_vals and False otherwise.
One way to do this is to write:
df['name'].apply(lambda x : x in set_vals)
but when both df and set_vals become large this method is very slow. Is there a more efficient way of creating this new column?
CodePudding user response:
The real problem is that the complexity of df['name'].apply(lambda x : x in set_vals) is O(M*N), where M is the length of df and N is the length of set_vals, if set_vals is a list (or another type whose membership test is linear).
The complexity can be improved to O(M) if set_vals is hashed (turned into a set, or used as the keys of a dict), because each membership test then takes O(1) time on average.
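As a minimal sketch of the point above, assuming the values start out as a list (the sample data and the names vals_list and in_set are made up for illustration): converting to a set once makes each lookup a hash lookup, and pandas' built-in Series.isin does the same membership test without a Python-level lambda.

```python
import pandas as pd

# Hypothetical sample data, only for illustration.
df = pd.DataFrame({"name": ["alice", "bob", "carol", "dave"]})
vals_list = ["bob", "dave", "erin"]

# Build the hash-based container once, outside the per-row work.
set_vals = set(vals_list)

# Same approach as in the question, but each `in` is now a hash lookup.
in_set_apply = df["name"].apply(lambda x: x in set_vals)

# Vectorized alternative: Series.isin accepts a set or any list-like.
in_set_isin = df["name"].isin(set_vals)

df["in_set"] = in_set_isin
```

Both versions should give the same boolean column; on large frames isin is usually the faster of the two, but it is worth timing on your own data.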
CodePudding user response:
It is a complex problem with a simple solution: you can split the rows into chunks and run the membership check on each chunk in a separate thread,
let's say [0:i], [i+1:j], [j+1:k], etc.
Here is a very good explanation of how to use multiple threads
Also, if you are interested in more details about performance and efficiency check this out.
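The chunking idea can be sketched roughly as follows. This is not a definitive implementation: the helper membership_in_chunks and its n_chunks parameter are hypothetical names, and it uses half-open slices ([0:i], [i:j], ...) so that no row is skipped between chunks. Whether threads actually help depends on how much of the work releases Python's global interpreter lock, so measure before relying on it.

```python
from concurrent.futures import ThreadPoolExecutor

import pandas as pd


def membership_in_chunks(series, values, n_chunks=4):
    """Check membership chunk by chunk, one task per chunk (hypothetical helper)."""
    lookup = set(values)  # hash once so every lookup is O(1) on average
    if series.empty:
        return series.isin(lookup)
    size = -(-len(series) // n_chunks)  # ceiling division
    # Half-open slices: [0:size], [size:2*size], ... cover every row exactly once.
    chunks = [series.iloc[start:start + size]
              for start in range(0, len(series), size)]
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        # pool.map returns results in the same order as the input chunks.
        parts = list(pool.map(lambda chunk: chunk.isin(lookup), chunks))
    return pd.concat(parts)


# Hypothetical usage with made-up data.
df = pd.DataFrame({"name": list("abcdefghij")})
result = membership_in_chunks(df["name"], {"b", "e", "j"}, n_chunks=3)
df["in_set"] = result
```

Because pd.concat keeps each chunk's original index, the result lines up with df and can be assigned directly as a new column.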
