Explode pandas column of dictionary with list of tuples as value-CodePudding

I have the following dataframe where col2 is a dictionary with a list of tuples as values. The keys are consistantly 'added' and 'deleted' in the whole dataframe.

Input df

col1	col2
value1	{'added': [(59, 'dep1_v2'), (60, 'dep2_v2')], 'deleted': [(59, 'dep1_v1'), (60, 'dep2_v1')]}
value 2	{'added': [(61, 'dep3_v2')], 'deleted': [(61, 'dep3_v1')]}

Here's a copy-pasteable example dataframe:

jsons = ["{'added': [(59, 'dep1_v2'), (60, 'dep2_v2')], 'deleted': [(59, 'dep1_v1'), (60, 'dep2_v1')]}",
         "{'added': [(61, 'dep3_v2')], 'deleted': [(61, 'dep3_v1')]}"]

df = pd.DataFrame({"col1": ["value1", "value 2"], "col2": jsons})

edit

col2 directly comes from the diff_parsed field of pydriller output

I want to "explode" col2 so that I obtain the following result:

Desired output

col1	number	added	deleted
value1	59	dep1_v2	dep1_v1
value1	60	dep2_v2	dep2_v1
value2	61	dep3_v2	dep3_v1

So far, I tried the following:

df = df.join(pd.json_normalize(df.col2))
df.drop(columns=['col2'], inplace=True)

The above code is simplified. I first manipulate the column to convert to proper json. It was in an attempt to first explode on 'added' and 'deleted' and then try to play around with the format to obtain what I want...but the list of tuples is not preserved and I obtain the following:

col1	added	deleted
value1	59, dep1_v2, 60, dep2_v2	59, dep1_v1, 60, dep2_v1
value2	61, dep3_v1	61, dep3_v2

Thanks

CodePudding user response：

Well this certainly isn't elegant, but here's a potential solution that is at least easier to understand and reason about:

import ast  # part of standard library


def explode_records(df):
    new_records = []
    def f(value, x):
        temp = {}
        for number, added in x["added"]:
            temp[number] = {"value": value, "number": number, "added": added}
        for number, deleted in x["deleted"]:
            temp[number] = {**temp[number], "deleted": deleted}
        new_records.extend(list(temp.values()))

    df.apply(lambda r: f(r.col1, ast.literal_eval(r.col2)), axis=1)
    return pd.DataFrame(new_records)

Usage:

In [4]: explode_records(df)
Out[4]:
     value  number    added  deleted
0   value1      59  dep1_v2  dep1_v1
1   value1      60  dep2_v2  dep2_v1
2  value 2      61  dep3_v2  dep3_v1

Note that I got value 2 from your original data. I'm assuming it's just a typo, and not that you also need value x -> valuex functionality.

I wasn't able to get the other solution working, so I wasn't able to compare its performance vs mine.

CodePudding user response：

Here's a solution. It's a little long, but it works:

tmp = pd.concat([df, pd.json_normalize(df['col2'])], axis=1).drop('col2', axis=1).explode(['added', 'deleted'])
new_df = pd.concat([tmp.drop(['added', 'deleted'], axis=1).reset_index(drop=True), pd.DataFrame(tmp['added'].tolist()).merge(pd.DataFrame(tmp['deleted'].tolist()), on=0).set_axis(['number', 'added', 'deleted'], axis=1)], axis=1)

Output:

>>> new_df
     col1  number    added  deleted
0  value1      59  dep1_v2  dep1_v1
1  value1      60  dep2_v2  dep2_v1
2  value2      61  dep3_v2  dep3_v1