Creating series_route from multiple chunks of string in list

I am working on route-building code and have half a million records, which take around 3-4 hours to process.

To create the DataFrame:

import pandas as pd

# Initialize a list of lists (one route per row)
data = [[['1027', '(K)', 'TRIM']], [['SJCL', '(K)', 'EJ00', '(K)', 'ZQFC', '(K)', 'DYWH']]]

# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['route'])

It will look something like this:

route
[1027, (K), TRIM]
[SJCL, (K), EJ00, (K), ZQFC, (K), DYWH]

Code I have used to produce the output:

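def func_list(hd1):
    required_list = []

    for j, i in enumerate(hd1):
        if j == 0:
            # The first token always starts a new entry
            req = i
        elif i[0].isupper() or i[0].isdigit():
            # A token starting with an uppercase letter or digit starts a new entry
            required_list.append(req)
            req = i
        else:
            # Anything else, e.g. '(K)', is appended to the current entry
            req = req + i
    required_list.append(req)
    return required_list

df['route2'] = df['route'].apply(func_list)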

# Output:
route2
[1027(K), TRIM]
[SJCL(K), EJ00(K), ZQFC(K), DYWH]

For half a million rows it takes 3-4 hours. I don't know how to reduce this; please help.

CodePudding user response：

Use explode to flatten your dataframe:

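import numpy as np

# One token per row; the index keeps the original row label
sr1 = df['route'].explode()
# Prepend the previous token to every token that starts with '('
sr2 = pd.Series(np.where(sr1.str[0] == '(', sr1.shift() + sr1, sr1), index=sr1.index)

# Keep a token only if the next token was left unchanged, then regroup by row
df['route'] = sr2[sr1.eq(sr2).shift(-1, fill_value=True)].groupby(level=0).apply(list)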
print(df)

# Output:
0                      [1027(K), TRIM]
1    [SJCL(K), EJ00(K), ZQFC(K), DYWH]
dtype: object
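
To see why this works, here is a small step-by-step sketch of the same idea on the two sample rows from the question. The intermediate names `keep`, `merged` and `result` are only illustrative, and the sketch assumes that every row is a non-empty list whose first token does not start with '('.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'route': [['1027', '(K)', 'TRIM'],
                             ['SJCL', '(K)', 'EJ00', '(K)', 'ZQFC', '(K)', 'DYWH']]})

# Step 1: one token per row; the index repeats the original row label (0, 0, 0, 1, 1, ...).
sr1 = df['route'].explode()

# Step 2: every token starting with '(' becomes "previous token + this token".
# sr2 is now: 1027, 1027(K), TRIM, SJCL, SJCL(K), EJ00, EJ00(K), ZQFC, ZQFC(K), DYWH
sr2 = pd.Series(np.where(sr1.str[0] == '(', sr1.shift() + sr1, sr1), index=sr1.index)

# Step 3: a token is kept only if the token after it was left unchanged.
# This drops the bare '1027', 'SJCL', ... that were already absorbed into '1027(K)', 'SJCL(K)', ...
keep = sr1.eq(sr2).shift(-1, fill_value=True)

# Step 4: collect the surviving tokens back into one list per original row.
merged = sr2[keep].groupby(level=0).apply(list)

result = merged.tolist()
print(result)  # [['1027(K)', 'TRIM'], ['SJCL(K)', 'EJ00(K)', 'ZQFC(K)', 'DYWH']]
```

Note that `shift()` works across row boundaries of the exploded Series, which is why the assumption about the first token of each row matters.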

For 500K records:

7.46 s ± 97.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
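
If you want to check this on your own machine, a rough sketch like the following compares the loop version with the explode version on synthetic data. The helper names `merge_loop` and `merge_vectorized`, the row count `n_rows`, and the synthetic data (the two sample rows repeated) are assumptions for illustration, so timings on real routes may differ.

```python
import time

import numpy as np
import pandas as pd


def merge_loop(tokens):
    """Loop approach, as in the question: glue '(K)'-style tokens onto the previous token."""
    merged = []
    for pos, tok in enumerate(tokens):
        if pos == 0 or tok[0].isupper() or tok[0].isdigit():
            merged.append(tok)   # start a new entry
        else:
            merged[-1] += tok    # extend the current entry
    return merged


def merge_vectorized(routes):
    """Explode approach from this answer, wrapped in a function (assumes non-empty rows)."""
    sr1 = routes.explode()
    sr2 = pd.Series(np.where(sr1.str[0] == '(', sr1.shift() + sr1, sr1), index=sr1.index)
    keep = sr1.eq(sr2).shift(-1, fill_value=True)
    return sr2[keep].groupby(level=0).apply(list)


# Synthetic input: the two sample rows repeated. Raise n_rows (e.g. to 500_000) for a real test.
sample = [['1027', '(K)', 'TRIM'],
          ['SJCL', '(K)', 'EJ00', '(K)', 'ZQFC', '(K)', 'DYWH']]
n_rows = 10_000
routes = pd.Series([sample[i % 2] for i in range(n_rows)])

start = time.perf_counter()
loop_result = routes.apply(merge_loop)
loop_seconds = time.perf_counter() - start

start = time.perf_counter()
vec_result = merge_vectorized(routes)
vec_seconds = time.perf_counter() - start

# Both approaches should produce the same lists, row for row.
same = loop_result.tolist() == vec_result.tolist()
print(f"loop: {loop_seconds:.3f} s, explode: {vec_seconds:.3f} s, identical: {same}")
```

Checking that both results are identical before comparing speed helps catch edge cases in your real data, such as rows whose first token starts with '('.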