I have a DataFrame like this. Each unique deal id has a number of bank subsidiaries, and each bank subsidiary has a row containing their parents where the number of bank parents is less than the number of bank subsidiaries.
[In]:
d = {'Bank Parent': [["Barclays", "UBS", "BofA"], ["Barclays", "UBS", "BofA"], ["Barclays", "UBS", "BofA"],
["China Merchants Bank Co Ltd", "UniCredit"], ["China Merchants Bank Co Ltd", "UniCredit"],
["China Merchants Bank Co Ltd", "UniCredit"]],
'Bank Subsidiary': ["Barclays plc", "UBS AG", "BofA Securities Inc", "CMB International Capital Ltd", "UniCredit Bank AG",
"Goldman Sachs (Asia) LLC"],
"Deal id": [1, 1, 1, 2, 2, 2]}
df = pd.DataFrame(data = d)
df
[Out]:
Bank Parent Bank Subsidiary Deal id
0 [Barclays, UBS, BofA] Barclays plc 1
1 [Barclays, UBS, BofA] UBS AG 1
2 [Barclays, UBS, BofA] BofA Securities Inc 1
3 [China Merchants Bank Co Ltd, UniCredit] CMB International Capital Ltd 2
4 [China Merchants Bank Co Ltd, UniCredit] UniCredit Bank AG 2
5 [China Merchants Bank Co Ltd, UniCredit] Goldman Sachs (Asia) LLC 2
I would like to explode the "Bank Parent" column and have the resulting DataFrame where the bank subsidiary that has no bank parent is "NA".
[Out]:
Bank Parent Bank Subsidiary Deal id
0 Barclays Barclays plc 1
1 UBS UBS AG 1
2 BofA BofA Securities Inc 1
3 China Merchants Bank Co Ltd CMB International Capital Ltd 2
4 UniCredit UniCredit Bank AG 2
5 NA Goldman Sachs (Asia) LLC 2
So far, I have tried pandas.DataFrame.explode() method, but I've not got the desired result.
CodePudding user response:
It looks like you want to pick the Parent "diagonally" from the same group. If that's the case, we first create a column idx that tells up the position of the Parent in the group:
df['idx'] = df.groupby(df['Bank Parent'].apply('_'.join), group_keys = False).cumcount()
then we go row by row and pick the relevant element from the list in 'Bank Parent', and stick into a gew column 'Parent':
df['Parent'] = df.apply(lambda r: r['Bank Parent'][r['idx']], axis=1)
df now looks like this:
Bank Parent Bank Subsidiary idx Parent
-- -------------------------------------------- ----------------------------- ----- ---------------------------
0 ['Barclays', 'UBS', 'BofA'] Barclays plc 0 Barclays
1 ['Barclays', 'UBS', 'BofA'] UBS AG 1 UBS
2 ['Barclays', 'UBS', 'BofA'] BofA Securities Inc 2 BofA
3 ['China Merchants Bank Co Ltd', 'UniCredit'] CMB International Capital Ltd 0 China Merchants Bank Co Ltd
4 ['China Merchants Bank Co Ltd', 'UniCredit'] UniCredit Bank AG 1 UniCredit
you can further df.drop(...) the columns you no longer need
Edit
If, as in the revised question, we have a list of Parent Banks that is too short, we can replace the second line of the solution with the following, where we essentially pad the list with NaNs:
df['Parent'] = df.apply(lambda r: (r['Bank Parent'] [None]*r['idx'])[r['idx']], axis=1)
output with the revised df:
Bank Parent Bank Subsidiary Deal id idx Parent
-- -------------------------------------------- ----------------------------- --------- ----- ---------------------------
0 ['Barclays', 'UBS', 'BofA'] Barclays plc 1 0 Barclays
1 ['Barclays', 'UBS', 'BofA'] UBS AG 1 1 UBS
2 ['Barclays', 'UBS', 'BofA'] BofA Securities Inc 1 2 BofA
3 ['China Merchants Bank Co Ltd', 'UniCredit'] CMB International Capital Ltd 2 0 China Merchants Bank Co Ltd
4 ['China Merchants Bank Co Ltd', 'UniCredit'] UniCredit Bank AG 2 1 UniCredit
5 ['China Merchants Bank Co Ltd', 'UniCredit'] Goldman Sachs (Asia) LLC 2 2
