| A | B | C |
|---|---|---|
| catastrop | animal | True |
| ani | count | True |
| rows | cata | False |
| Second | rownumbers | False |
| counter | counters | True |
| column | inner | False |
| time | ||
| strong | ||
| kind | ||
| membrane |
A and B contain both string and substring, I want to check if column A has a value 'counter', and that value is either having a substring in column B 'count' or is a substring of column B 'counters'. If any of this satisfied, return True. (Something like combination of .isin and .str.contains). C is the output, which state whether the value of A is substring of B or A contains B. Like catastrop contain cata, so True, and ani is substring of animal so True
The initial code came into my mind is
list1=a1['A'].tolist()
output1=[]
for i in list1:
output1.append(any(a1['B'].str.contains(i,regex=False)))
and do an or operation with column B
But if I doing it reversely, like
list2=a1['B'].tolist()
output2=[]
for i in list2:
output2.append(any(a1['A'].str.contains(i,regex=False)))
list 2 would contain verify result about column B, rather than column A.
How could I write this code?
CodePudding user response:
One way to achieve this is as follows:
- For
a_contains_b, useSeries.str.containswith a string consisting of a list of values from columnB,joinedwith|delimiter to create alternatives (so:'animal|count|cata|rownumbers|counters|inner|time|strong|kind|membrane'). - For
b_contains_a, we want to useSeries.str.extractallinstead, since we need to feed the result back toSeries.isinto know which rows from columnAhave been matched. I.e. the result ofextractall([0].tolist()) will be['ani', 'counter'], which we use as input fordf.A.isin. - Finally, we use both boolean series as input for the new column. Use
|, since we needTrueif either series containsTrue. Wrap the result in brackets and applySeries.wherecombined withSeries.notnato overwrite allFalsevalues in rows that haveNaNvalues in columnA.
import pandas as pd
import numpy as np
data = {'A': {0: 'catastrop', 1: 'ani', 2: 'rows', 3: 'Second', 4: 'counter',
5: 'column', 6: np.nan, 7: np.nan, 8: np.nan, 9: np.nan},
'B': {0: 'animal', 1: 'count', 2: 'cata', 3: 'rownumbers',
4: 'counters', 5: 'inner', 6: 'time', 7: 'strong', 8: 'kind',
9: 'membrane'},
'C': {0: True, 1: True, 2: False, 3: False, 4: True, 5: False,
6: np.nan, 7: np.nan, 8: np.nan, 9: np.nan}}
df = pd.DataFrame(data)
a_contains_b = df.A.str.contains('|'.join(df.B.dropna().tolist()))
b_contains_a = df.A.isin(df.B.str.extractall(
'(' '|'.join(df.A.dropna().tolist()) ')')[0].tolist())
df['D'] = (b_contains_a | a_contains_b).where(df.A.notna())
print(df)
A B C D
0 catastrop animal True True
1 ani count True True
2 rows cata False False
3 Second rownumbers False False
4 counter counters True True
5 column inner False False
6 NaN time NaN NaN
7 NaN strong NaN NaN
8 NaN kind NaN NaN
9 NaN membrane NaN NaN
N.B. If you want the matching to be case-insensitive, consider using Series.str.lower. In the case of str.contains you can also make use of the parameter case, and set it to False.
