I have a pandas dataframe that shows on a row by row basis, the packages a user has active, and whether they use that package on a trial basis using true/false logic.:
| User | Packages | TrialStatus |
|------|-----------|-----------------------------|
| 345 | A,B,C,D,F | False,False,False,True,True |
| 238 | B,C,F | True,False,False |
| 467 | A,B | False,True |
As you can see, the values for columns packages and TrialStatus are simply comma seperated strings. The position of the true/false values in TrialStatus correspoding to the package positions in the packages column.
What I would like is an easy way to search the packages column for a specific package (e.g. B), look in the TrialStatus column for the trial status of package B and see if it's True based upon the position, and then return the original dataframe with a new column with the trial status of package B only. This should return:
| User | Packages | TrialStatus | PackageB_TrialStatus |
|------|-----------|-----------------------------|----------------------|
| 345 | A,B,C,D,F | False,False,False,True,True | False |
| 238 | B,C,F | True,False,False | True |
| 467 | A,B | False,True | True |
My first thought is to combine the usual list comprehension way to create pandas conditional columns:
df['cond_column'] = ['true' if x == 'true_value' else 'false' for x in df['other_column']]
and combine this with some regex perhaps for the positional aspect but not sure if this would work. Can anybody please help? let me know if more explanation is needed
CodePudding user response:
You could create a helper dataframe with exploded values, then use it to filter your data and merge with the original dataframe:
df2 = (df.assign(Packages=df['Packages'].str.split(','),
TrialStatus=df['TrialStatus'].str.split(',')
)
# multi-column explode requires pandas ≥ 1.3.0
# if older use .apply(pd.Series.explode)
.explode(column=['Packages', 'TrialStatus'])
)
# .eq('B') # is also valid
df.merge(df2.loc[df2['Packages'].isin(['B']), ['User', 'TrialStatus']],
on='User', suffixes=('', '_B')
)
Having the helper dataframe will enable you to efficiently search for other possibilities, instead of repeating the split process every time you need to perform this action.
output:
User Packages TrialStatus TrialStatus_B
0 345 A,B,C,D,F False,False,False,True,True False
1 238 B,C,F True,False,False True
2 467 A,B False,True True
CodePudding user response:
By using apply and converting string into list.No need of regex.
data = {"User": [345, 238, 467], "Packages": ["A,B,C,D,F", "B,C,F", "A,B"],
"TrialStatus":["False,False,False,True,True", "True,False,False", "False,True"]}
df = pd.DataFrame(data)
Df
| User | Packages | TrialStatus |
|---|---|---|
| 345 | A,B,C,D,F | False,False,False,True,True |
| 238 | B,C,F | True,False,False |
| 467 | A,B | False,True |
df["PackageB_TrialStatus"] = df.apply(lambda x: "True" if x["Packages"].split(",").index("B") == x["TrialStatus"].split(",").index("True") else "False", axis=1)
Output DF
| User | Packag | TrialStatus | PackageB_TrialStatus |
|---|---|---|---|
| 345 | A,B,C,D,F | False,False,False,True,True | False |
| 238 | B,C,F | True,False,False | True |
| 467 | A,B | False,True | True |
