I have the following dataframe where col2 is a dictionary with a list of tuples as values. The keys are consistantly 'added' and 'deleted' in the whole dataframe.
Input df
| col1 | col2 |
|---|---|
| value1 | {'added': [(59, 'dep1_v2'), (60, 'dep2_v2')], 'deleted': [(59, 'dep1_v1'), (60, 'dep2_v1')]} |
| value 2 | {'added': [(61, 'dep3_v2')], 'deleted': [(61, 'dep3_v1')]} |
Here's a copy-pasteable example dataframe:
jsons = ["{'added': [(59, 'dep1_v2'), (60, 'dep2_v2')], 'deleted': [(59, 'dep1_v1'), (60, 'dep2_v1')]}",
"{'added': [(61, 'dep3_v2')], 'deleted': [(61, 'dep3_v1')]}"]
df = pd.DataFrame({"col1": ["value1", "value 2"], "col2": jsons})
edit
col2 directly comes from the diff_parsed field of pydriller output
I want to "explode" col2 so that I obtain the following result:
Desired output
| col1 | number | added | deleted |
|---|---|---|---|
| value1 | 59 | dep1_v2 | dep1_v1 |
| value1 | 60 | dep2_v2 | dep2_v1 |
| value2 | 61 | dep3_v2 | dep3_v1 |
So far, I tried the following:
df = df.join(pd.json_normalize(df.col2))
df.drop(columns=['col2'], inplace=True)
The above code is simplified. I first manipulate the column to convert to proper json. It was in an attempt to first explode on 'added' and 'deleted' and then try to play around with the format to obtain what I want...but the list of tuples is not preserved and I obtain the following:
| col1 | added | deleted |
|---|---|---|
| value1 | 59, dep1_v2, 60, dep2_v2 | 59, dep1_v1, 60, dep2_v1 |
| value2 | 61, dep3_v1 | 61, dep3_v2 |
Thanks
CodePudding user response:
Well this certainly isn't elegant, but here's a potential solution that is at least easier to understand and reason about:
import ast # part of standard library
def explode_records(df):
new_records = []
def f(value, x):
temp = {}
for number, added in x["added"]:
temp[number] = {"value": value, "number": number, "added": added}
for number, deleted in x["deleted"]:
temp[number] = {**temp[number], "deleted": deleted}
new_records.extend(list(temp.values()))
df.apply(lambda r: f(r.col1, ast.literal_eval(r.col2)), axis=1)
return pd.DataFrame(new_records)
Usage:
In [4]: explode_records(df)
Out[4]:
value number added deleted
0 value1 59 dep1_v2 dep1_v1
1 value1 60 dep2_v2 dep2_v1
2 value 2 61 dep3_v2 dep3_v1
Note that I got value 2 from your original data. I'm assuming it's just a typo, and not that you also need value x -> valuex functionality.
I wasn't able to get the other solution working, so I wasn't able to compare its performance vs mine.
CodePudding user response:
Here's a solution. It's a little long, but it works:
tmp = pd.concat([df, pd.json_normalize(df['col2'])], axis=1).drop('col2', axis=1).explode(['added', 'deleted'])
new_df = pd.concat([tmp.drop(['added', 'deleted'], axis=1).reset_index(drop=True), pd.DataFrame(tmp['added'].tolist()).merge(pd.DataFrame(tmp['deleted'].tolist()), on=0).set_axis(['number', 'added', 'deleted'], axis=1)], axis=1)
Output:
>>> new_df
col1 number added deleted
0 value1 59 dep1_v2 dep1_v1
1 value1 60 dep2_v2 dep2_v1
2 value2 61 dep3_v2 dep3_v1
