I have below data in csv.
dataCenter,customer,companyID,UID,uba
dc1,customer1,companyID1,uid1,"uba1,uba2,uba3,uba4"
dc2,customer2,companyID1,uid2,"ubaA"
dc3,customer3,companyID3,uid3,"uba1,uba4"
dc4,customer4,companyID4,uid4,"uba1,uba2,uba5,uba6,uba10"
Now I want to convert the data to below format, assign the multi-values in 'uba' column into other new columns.
dataCenter,customer,companyID,UID,action1,action2,action3,action4,action5,...,
dc1,customer1,companyID1,uid1,uba1,uba2,uba3,uba4
dc2,customer2,companyID1,uid2,uba
dc3,customer3,companyID3,uid3,uba1,uba4
dc4,customer4,companyID4,uid4,uba1,uba2,uba5,uba6,uba10,uba11,uba12,uba13
I have tried below,but doesn't work.
a = a.explode('uba')
a = pd.concat([a,pd.DataFrame(a.pop('uba').tolist(),index=a.index)],axis=1)
I don't want to use str.split('',expand=True) as the performance is really bad when data is large.
is there any other good option with good performance for me?
CodePudding user response:
If done want to use str.split('',expand=True) and no missing values is possible use list comprehension:
a = pd.concat([a,pd.DataFrame([x.split(',') for x in a.pop('uba')],index=a.index).add_prefix('action')],axis=1)
print (a)
dataCenter customer companyID UID action0 action1 action2 action3 \
0 dc1 customer1 companyID1 uid1 uba1 uba2 uba3 uba4
1 dc2 customer2 companyID1 uid2 ubaA None None None
2 dc3 customer3 companyID3 uid3 uba1 uba4 None None
3 dc4 customer4 companyID4 uid4 uba1 uba2 uba5 uba6
action4
0 None
1 None
2 None
3 uba10
EDIT: For processing first N values use:
N = 2
a = pd.concat([a,pd.DataFrame([x.split(',', N)[:N] for x in a.pop('uba')],index=a.index).add_prefix('action')],axis=1)
print (a)
dataCenter customer companyID UID action0 action1
0 dc1 customer1 companyID1 uid1 uba1 uba2
1 dc2 customer2 companyID1 uid2 ubaA None
2 dc3 customer3 companyID3 uid3 uba1 uba4
3 dc4 customer4 companyID4 uid4 uba1 uba2
