Filtering on a comma separated string by relative position in pandas columns-CodePudding

I have a pandas dataframe that shows on a row by row basis, the packages a user has active, and whether they use that package on a trial basis using true/false logic.:

| User | Packages  | TrialStatus                 |
|------|-----------|-----------------------------|
| 345  | A,B,C,D,F | False,False,False,True,True |
| 238  | B,C,F     | True,False,False            |
| 467  | A,B       | False,True                  |

As you can see, the values for columns packages and TrialStatus are simply comma seperated strings. The position of the true/false values in TrialStatus correspoding to the package positions in the packages column.

What I would like is an easy way to search the packages column for a specific package (e.g. B), look in the TrialStatus column for the trial status of package B and see if it's True based upon the position, and then return the original dataframe with a new column with the trial status of package B only. This should return:

| User | Packages  | TrialStatus                 | PackageB_TrialStatus |
|------|-----------|-----------------------------|----------------------|
| 345  | A,B,C,D,F | False,False,False,True,True | False                |
| 238  | B,C,F     | True,False,False            | True                 |
| 467  | A,B       | False,True                  | True                 |

My first thought is to combine the usual list comprehension way to create pandas conditional columns:

df['cond_column'] = ['true' if x == 'true_value' else 'false' for x in df['other_column']]

and combine this with some regex perhaps for the positional aspect but not sure if this would work. Can anybody please help? let me know if more explanation is needed

CodePudding user response：

You could create a helper dataframe with exploded values, then use it to filter your data and merge with the original dataframe:

df2 = (df.assign(Packages=df['Packages'].str.split(','),
                 TrialStatus=df['TrialStatus'].str.split(',')
                 )
         # multi-column explode requires pandas ≥ 1.3.0
         # if older use .apply(pd.Series.explode)
         .explode(column=['Packages', 'TrialStatus'])
       )

                             #  .eq('B')   # is also valid
df.merge(df2.loc[df2['Packages'].isin(['B']), ['User', 'TrialStatus']],
         on='User', suffixes=('', '_B')
        )

Having the helper dataframe will enable you to efficiently search for other possibilities, instead of repeating the split process every time you need to perform this action.

output:

   User   Packages                  TrialStatus TrialStatus_B
0   345  A,B,C,D,F  False,False,False,True,True         False
1   238      B,C,F             True,False,False          True
2   467        A,B                   False,True          True

CodePudding user response：

By using apply and converting string into list.No need of regex.

data = {"User": [345, 238, 467], "Packages": ["A,B,C,D,F", "B,C,F", "A,B"],
    "TrialStatus":["False,False,False,True,True", "True,False,False", "False,True"]}

df = pd.DataFrame(data)

User	Packages	TrialStatus
345	A,B,C,D,F	False,False,False,True,True
238	B,C,F	True,False,False
467	A,B	False,True

 df["PackageB_TrialStatus"] = df.apply(lambda x: "True" if x["Packages"].split(",").index("B") == x["TrialStatus"].split(",").index("True") else "False", axis=1)

Output DF

User	Packag	TrialStatus	PackageB_TrialStatus
345	A,B,C,D,F	False,False,False,True,True	False
238	B,C,F	True,False,False	True
467	A,B	False,True	True