Home > Software design >  Compare list column to a second list and return item in third list
Compare list column to a second list and return item in third list

Time:02-08

Have a dataframe with a list column like and two reference lists:

d = {'col1': [1,2,3,4,5], 'col2':[['a','b'],['a','b'],['a','b','c'],['a','b','d'], ['a','b','e']]}
df = pd.DataFrame(data=d)

ref_a = ['c','d','e','f']
ref_b = ['option1', 'option2', 'option3', 'option4']

    col1    col2
0   1   [a, b]
1   2   [a, b]
2   3   [a, b, c]
3   4   [a, b, d]
4   5   [a, b, e]

And I want to create a third column that checks if a value in col2 is in ref_a and return the matched item in ref_b

col1    col2    check
0   1   [a, b]  
1   2   [a, b]  
2   3   [a, b, c]   option1
3   4   [a, b, d]   option2
4   5   [a, b, e]   option3

I've tried something like this:

for index, row in df.iterrows():
    for (a, b) in zip(ref_a, ref_b):
        if a in row['col2']:
            df.loc[index, 'check'] = b
            print('true', b)
        else:
            df.loc[index, 'check'] = ''

And that prints out what I'm anticipating, but doesn't change the actual value, I don't have a large amount of data to iterate over but also not sure if there is a more elegant approach. There will only be one match in the list column, never multiple options.

CodePudding user response:

You could use a combination of explode, map, and groupby first for a nice (and probably pretty fast) pandas-esque one-liner:

df['check'] = df['col2'].explode().map(dict(zip(ref_a, ref_b))).groupby(level=0).first().fillna('')

Output:

>>> df
   col1       col2    check
0     1     [a, b]     
1     2     [a, b]     
2     3  [a, b, c]  option1
3     4  [a, b, d]  option2
4     5  [a, b, e]  option3

CodePudding user response:

Here's another approach if you think it's more elegant:

def get_option(xs):
    for x in xs:
        if x in ref_a:
            return ref_b[ref_a.index(x)]
    return 'no option' # or None or np.nan?


df['check'] = df['col2'].apply(get_option)

print(df)

CodePudding user response:

Using zip function, you can iterate on each one of the refs in parallel. Then, you can use map to make a list of booleans where each True value will be the sublist position where ref_a was found. After that, you need the positions itself where True value was found, for that reason is used index method.

As you should know, index method return a ValueError when an item was not found.

def mapping():
  try:
    col2 = df['col2'].to_list() #[ [a, b], [a, b], [a, b, c], ...]
    for a, b in zip(ref_a, ref_b):
      # Here you are going to get a list of boolean where True means ref_a was found in a sub-list
      booleans = list(map(lambda subl: a in subl, col2))
      index = booleans.index(True) # The possition where ref_a was found 
      df.loc[index, 'check'] = b
  except ValueError as e:
    print(e)

mapping()
print(df)

Output:

col1 col2 check
0 1 [a, b]
1 2 [a, b]
2 3 [a, b, c] option1
3 4 [a, b, d] option2
4 5 [a, b, e] option3
  •  Tags:  
  • Related