separating a string that contains decimals and words and creating columns from the unique values in-CodePudding

I have this string:

''
'Storage:9.22Checkoff:6.90InElevation:0.00OutCharge:0.00Freightother:0.00'
''

so the first thing I need to do is separate values in the following form:

''
Storage:9.22 Checkoff:6.90 In_Elevation:0.00 Out_Charge:0.00 Freight_other:0.00
''

I will be looping through multiple rows with similar values, so I will have to make sure as soon as I see the name (and is unique) I create a new column and assign the value I found for that specific row, so at the end it should look something like this:

''
----------------------------------------------------------------
| Storage| Checkoff | In_Elevation | Out_Charge | Freight_other| 
---------------------------------------------------------------
|  9.22  |  6.90    |    0.00      |   0.00     |   0.00       |
----------------------------------------------------------------
''

I've been using a couple of examples at least to start separating the string but it does not give me what I really need:

This is one:

'''
word = ""
value = ""

for i in  range(0, len(df['Original'])):
    for j in df['Original'][i]:
        if j.isalpha():
            word = word   j
        elif j.isdecimal():
             value = value   j
        elif j.isascii():
            #print(j)
            None
'''

but this is the result:

'''
StorageCheckoffInElevationOutChargeFreightotherStorageCheckoffMiscellaneousChargesPremiumFreightStorageCheckoffOptionPremiumsforMinimumPriceContractsFITRUCKDiscountsFORAILCarryCostStorageCheckoffFreightInElevationOutChargeFreightotherStorageCheckoffFreightWeighingChgsFORAILCheckoffInElevationOutChargeFreightotherStorageCheckoffFreightMiscellaneousChargesStorageCheckoffInElevationOutChargeFreightotherInElevationOutChargeDiscountsFreightother
922690000000000061014372018602158602167642563191927552232584968331307341840509672628262873068122661185213241367192248181900000000074061234124424074596189800000000000000016635000
'''

and for the columns added to the dataframe I'm using this code snippet:

'''
cols = [i for i in new[0].unique()]
df1 = pd.DataFrame( index=range(len(cols)), columns=cols)
df1
'''

which might work but i still need the separation of the string correctly, none of the methods I have use really seems to give me the desirable output. if I use regex it separates words from values, but then there is no way to map which value correspond to what word.

as always any hint, suggestion will be greatly appreciated.

CodePudding user response：

Use Series.str.extractall with capturing groups to get the word and the numeric value (allowing for parenthesis to indicate negative values), which are separated by a colon. Then, pivot this DataFrame into the appropriate format. Since the extract pairs labels with values, they can even occur out of order in separate strings, like in the sample I created below.

Sample data

import pandas as pd
s = pd.Series(['Storage:9.22Checkoff:6.90InElevation:0.00OutCharge:0.00Freightother:0.00',
               'Checkoff:6.97Storage:19.22InElevation:0.00OutCharge:10.00Freightother:56.55',
               'Checkoff:(2.00)Storage:19.22InElevation:0.00OutCharge:10.00Freightother:56.55'])

Code

df = s.str.extractall(r'(.*?):([\(\)0-9.] )').reset_index()
df = df.pivot(index='level_0', columns=0, values=1).rename_axis(index=None, columns=None)

print(df)
#  Checkoff Freightother InElevation OutCharge Storage
#0     6.90         0.00        0.00      0.00    9.22
#1     6.97        56.55        0.00     10.00   19.22
#2   (2.00)        56.55        0.00     10.00   19.22