Home > Back-end >  Write a function that filters a dataset for rows that contains all of the words in a list of words
Write a function that filters a dataset for rows that contains all of the words in a list of words

Time:01-09

I want to get a sub-dataframe that contains all elements in a list. Let's take the DataFrame as an example.

my_dict = {
    'Job': ['Painting', 'Capentry', 'Teacher', 'Farming'],
    'Job_Detail': ['all sort of painting', 
                  'kitchen utensils, all types of roofing etc.',\
                  'skill and practical oriented teaching',\
                  'all agricultural practices']
          }
df = pd.DataFrame(my_dict)

Output looks thus:

    Job         Job_Detail
0   Painting    all sort of painting
1   Capentry    kitchen utensils, all types of roofing etc.
2   Teacher     skill and practical oriented teaching
3   Farming     all agricultural practices

my_lst = ['of', 'all']

I want to filter df with mylst to get a sub_DataFrame that looks like this:

    Job         Job_Detail
0   Painting    all sort of painting
1   Capentry    kitchen utensils, all types of roofing etc.

I've tried df[df.Job_Detail.isin(['of', 'all']) but it returns an empty DataFrame.

CodePudding user response:

I'm no pandas expert, but the best function to use here seems to be str.contains

From the docs:

Series.str.contains(pat, case=True, flags=0, na=None, regex=True)


Test if pattern or regex is contained within a string of a Series or Index.

Return boolean Series or Index based on whether a given pattern or regex is contained within a string of a Series or Index.

Edit: This masks using or, not and

import pandas as pd
my_dict = {
    'Job': ['Painting', 'Capentry', 'Teacher', 'Farming'],
    'Job_Detail': ['all sort of painting', 
                  'kitchen utensils, all types of roofing etc.',
                  'skill and practical oriented teaching',
                  'all agricultural practices']
          }

my_lst = ['of', 'all']

df = pd.DataFrame(my_dict)

print(df)


mask = df.Job_Detail.str.contains('|'.join(my_lst), regex=True)

print(df[mask])

Here's a solution that masks uing and:

import pandas as pd
my_dict = {
    'Job': ['Painting', 'Capentry', 'Teacher', 'Farming'],
    'Job_Detail': ['all sort of painting', 
                  'kitchen utensils, all types of roofing etc.',
                  'skill and practical oriented teaching',
                  'all agricultural practices']
          }

my_lst = ['of', 'all']

df = pd.DataFrame(my_dict)

print(df)

print("------")

masks = [df.Job_Detail.str.contains(word) for word in my_lst]
mask = pd.concat(masks, axis=1).all(axis=1)


print(df[mask])

CodePudding user response:

@Lone Your code answered a different question, but it helped me arrive at the answer. Thank you, appreciated.

Here's the closest to what I needed:

df[(df.Job_Detail.str.contains('of')) & (df.Job_Detail.str.contains('all'))]

  •  Tags:  
  • Related