I am working on a route building code and have half a million record which taking around 3-4 hrs to get executed.
For creating dataframe:
# initialize list of lists
data = [[['1027', '(K)', 'TRIM']], [[SJCL, (K), EJ00, (K), ZQFC, (K), 'DYWH']]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['route'])
Will look like something this:
route
[1027, (K), TRIM]
[SJCL, (K), EJ00, (K), ZQFC, (K), DYWH]
O/P Code I have used:
def func_list(hd1):
required_list=[]
for j,i in enumerate(hd1):
#print(i,j)
if j==0:
req=i
else:
if (i[0].isupper() or i[0].isdigit()):
required_list.append(req)
req=i
else:
req=req i
required_list.append(req)
return required_list
df['route2']=df.route1.apply(lambda x : func_list (x))
#op
route2
[1027(K), TRIM]
[SJCL(K), EJ00(K), ZQFC(K), DYWH]
For half million rows, it taking 3-4 hrs, I dont know how to reduce it pls help.
CodePudding user response:
Use explode to flatten your dataframe:
sr1 = df['route'].explode()
sr2 = pd.Series(np.where(sr1.str[0] == '(', sr1.shift() sr1, sr1), index=sr1.index)
df['route'] = sr2[sr1.eq(sr2).shift(-1, fill_value=True)].groupby(level=0).apply(list)
print(df)
# Output:
0 [1027(K), TRIM]
1 [SJCL(K), EJ00(K), ZQFC(K), DYWH]
dtype: object
For 500K records:
7.46 s ± 97.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
