Have a dataframe with a list column like and two reference lists:
d = {'col1': [1,2,3,4,5], 'col2':[['a','b'],['a','b'],['a','b','c'],['a','b','d'], ['a','b','e']]}
df = pd.DataFrame(data=d)
ref_a = ['c','d','e','f']
ref_b = ['option1', 'option2', 'option3', 'option4']
col1 col2
0 1 [a, b]
1 2 [a, b]
2 3 [a, b, c]
3 4 [a, b, d]
4 5 [a, b, e]
And I want to create a third column that checks if a value in col2 is in ref_a and return the matched item in ref_b
col1 col2 check
0 1 [a, b]
1 2 [a, b]
2 3 [a, b, c] option1
3 4 [a, b, d] option2
4 5 [a, b, e] option3
I've tried something like this:
for index, row in df.iterrows():
for (a, b) in zip(ref_a, ref_b):
if a in row['col2']:
df.loc[index, 'check'] = b
print('true', b)
else:
df.loc[index, 'check'] = ''
And that prints out what I'm anticipating, but doesn't change the actual value, I don't have a large amount of data to iterate over but also not sure if there is a more elegant approach. There will only be one match in the list column, never multiple options.
CodePudding user response:
You could use a combination of explode, map, and groupby first for a nice (and probably pretty fast) pandas-esque one-liner:
df['check'] = df['col2'].explode().map(dict(zip(ref_a, ref_b))).groupby(level=0).first().fillna('')
Output:
>>> df
col1 col2 check
0 1 [a, b]
1 2 [a, b]
2 3 [a, b, c] option1
3 4 [a, b, d] option2
4 5 [a, b, e] option3
CodePudding user response:
Here's another approach if you think it's more elegant:
def get_option(xs):
for x in xs:
if x in ref_a:
return ref_b[ref_a.index(x)]
return 'no option' # or None or np.nan?
df['check'] = df['col2'].apply(get_option)
print(df)
CodePudding user response:
Using zip function, you can iterate on each one of the refs in parallel. Then, you can use map to make a list of booleans where each True value will be the sublist position where ref_a was found. After that, you need the positions itself where True value was found, for that reason is used index method.
As you should know, index method return a ValueError when an item was not found.
def mapping():
try:
col2 = df['col2'].to_list() #[ [a, b], [a, b], [a, b, c], ...]
for a, b in zip(ref_a, ref_b):
# Here you are going to get a list of boolean where True means ref_a was found in a sub-list
booleans = list(map(lambda subl: a in subl, col2))
index = booleans.index(True) # The possition where ref_a was found
df.loc[index, 'check'] = b
except ValueError as e:
print(e)
mapping()
print(df)
Output:
| col1 | col2 | check | |
|---|---|---|---|
| 0 | 1 | [a, b] | |
| 1 | 2 | [a, b] | |
| 2 | 3 | [a, b, c] | option1 |
| 3 | 4 | [a, b, d] | option2 |
| 4 | 5 | [a, b, e] | option3 |
