iterating large pandas DataFrame too slow-CodePudding

I have a large dataframe where I would like to make a new column based on existing columns.

test = pd.DataFrame({'Test1':["100","4242","3454","2","54"]})
test['Test2'] = ""
for i in range(0,len(test)):
    if len(test.iloc[i,0]) == 4:
        test.iloc[i,-1] = test.iloc[i,0][0:1]
    elif len(test.iloc[i,0]) == 3:
        test.iloc[i,-1] = test.iloc[i,0][0]
    elif len(test.iloc[i,0]) < 3:
        test.iloc[i,-1] = 0
    else:
        test.iloc[i,-1] = np.nan

This is working for a small dataframe, but when I have a large data set, (10 million rows), it is taking way too long. How can I make this process faster?

CodePudding user response：

Use str.len method to find the lengths of strings in the 'Test1' column and then using this information, use np.select to assign relevant parts of the strings in 'Test1' or default values to 'Test2'.

import numpy as np
lengths = test['Test1'].str.len()
test['Test2'] = np.select([lengths == 4, lengths == 3, lengths < 3], [test['Test1'].str[0:1], test['Test1'].str[0], 0], np.nan)

Output:

  Test1 Test2
0   100     1
1  4242     4
2  3454     3
3     2     0
4    54     0

Note that [0:1] only returns the first element (same as [0]) so maybe you meant [0:2] (or something else) otherwise you can save one condition there.

CodePudding user response：

So, basically you want to extract the first character of the string if it is at least 3 characters long. (NB. for a string, [0] and [0:1] yields exactly the same thing)

Just use a regex with a lookbehind for that.

test['Test2'] = test['Test1'].str.extract('^(.)(?=..)').fillna(0)

output:

  Test1 Test2
0   100     1
1  4242     4
2  3454     3
3     2     0
4    54     0

How the regex works:

^       # match beginning of string
(.)     # capture one character
(?=..)  # only if it is followed by at least two characters