I'm really amateur-level with both python and pandas, but I'm trying to solve an issue for work that's stumping me.
I have two dataframes, let's call them dfA and dfB:
dfA:
project_id Category Initiative
10
20
30
40
dfB:
project_id field_id value
10 100 lorem
10 200 lorem1
10 300 lorem2
20 200 ipsum
20 300 ipsum1
20 500 ipsum2
Let's say I know "Category" from dfA correlates to field_id "100" from dfB, and "Initiative" correlates to field_id "200".
I need to look through dfB and for a given project_id/field_id combination, take the corresponding value in the "value" column and place it in the correct cell in dfA.
The result would look like this:
dfA:
project_id Category Initiative
10 lorem lorem1
20 ipsum
30
40
Bonus difficulty: not every project in dfA exists in dfB, and not every field_id is used in every project_id.
I hope I've explained this well enough; I feel like there must be a relatively simple way to handle this that I'm missing.
CodePudding user response:
You could do something like this although it's not very elegant, there must be a better way. I had to use try/except because of the cases where the project Id is not available in the dfB. I put NaN values for the missing ones but you can easily put empty strings.
def get_value(row):
try:
res = dfB[(dfB['field_id'] == 100) & (dfB['project_id'] == row['project_id'])]['value'].iloc[0]
except:
res = np.nan
row['Categorie'] = res
try:
res = dfB[(dfB['field_id'] == 200) & (dfB['project_id'] == row['project_id'])]['value'].iloc[0]
except:
res = np.nan
row['Initiative'] = res
return row
dfA = dfA.apply(get_value, axis=1)
EDIT: as mentioned in comment, this is not very flexible as some values are hardcoded but you can easily change that with something like the below. This way, if the field_id change or you need to add/remove a column, just update the dictionary.
columns_field = {"Category": 100, "Initiative": 200}
def get_value(row):
for key, value in columns_fields.items():
try:
res = dfB[(dfB['field_id'] == value) & (dfB['project_id'] == row['project_id'])]['value'].iloc[0]
except:
res = np.nan
row[key] = res
return row
dfA = dfA.apply(get_value, axis=1)
