is there efficient way for pandas to get tail rows with a condition-CodePudding

I want to get tail rows with a condition

For example: I want all negative tail rows in a column 'A', while the input pandas data frame with 'A' column looks like:

    test = pd.DataFrame({'A':[-8, -9, -10, 1, 2, 3, 0, -1,-2,-3]})

after the 'method' I expect to get new data frame like:

note that, it is not certain of how many 'negative' numbers are in the tail. So, I can not run test.tail(3)

It looks like the pandas provided 'tail()' function can only run with a given number.

But my input data frame might be too large that I dont want run a simple loop to check one by one

Is there any better way to do that?

CodePudding user response：

What's the tail for? It seems like you just need the negative numbers

test.query("A < 0")

Update: Find where sign changes, split the array and choose last one

split_points = (test.A.shift(1)<0) == (test.A<0) 
np.split(test, split_points.loc[lambda x: x==False].index.tolist())[-1]

Output:

CodePudding user response：

Is this what you wanted?

test = pd.DataFrame({'A':[-8, -9, -10, 1, 2, 3, 0, -1,-2,-3]})

test = test.iloc[::-1]

test.loc[test.index.max():test[test['A'].ge(0)].index[0] 1]

Output:

edit, if you want to get it back into the original order:

test.loc[test.index.max():test[test['A'].ge(0)].index[0] 1].iloc[::-1]

    A
7   -1
8   -2
9   -3

Optional also .reset_index(drop=True) if you need a index starting at 0.

CodePudding user response：

Just share a picture of performance comparing above two given answers Thansk Patry and Macro

CodePudding user response：

I improved my above test, and did another round test, as I feel the old 'testing sample' size was too small,and afaid the %%time measurement might not accurate.

My new test uses a very big head numbers with size of 10000000 and tail with 3 negative numbers

so the new test can prove how the whole data frame size impact the over all performance.

code is like bellow:

    %%time
    arr = np.arange(1,10000000,1)
    arr = np.concatenate((arr, [-2,-3,-4]))
    test = pd.DataFrame({'A':arr})
    test = test.iloc[::-1]
    test.loc[test.index.max():test[test['A'].ge(0)].index[0] 1].iloc[::-1]

    %%time
    arr = np.arange(1,10000000,1)
    arr = np.concatenate((arr, [-2,-3,-4]))
    test = pd.DataFrame({'A':arr})
    split_points = (test.A.shift(1)<0) == (test.A<0) 
    np.split(test, split_points.loc[lambda x: x==False].index.tolist())[-1]

due to system impacts, I tested 10 times, the above 2 methods are very much performs the similar. In about 50% cases Patryk's code even performs faster

Check out this image bellow