how I can do the following in pandas. let's I have a column which has an index. I want to find the index in col1 and do the following:
- If 3 rows above the index and 3 rows below the index, the sign of the values of
col1was constant and it changes in one of these rows and remains constant after the point pick the index of the row that the sign changes. If there is not such a point pick the value of index column.
It is hard to explain but I think it is more clearer in an example:
consider the following data:
ind_column col1
4 0.5
4 0.65
4 0.6
4 0.2
4 0.1
4 0.8
4 -0.3
4 -0.2
4 -0.3
here, the index column is 4, so we look at 3 rows before 4 and 3 rows after 4. Basically rows: 1,2,3,4,5,6,7. We see that in rows 6th the sign changed, since all signs are the same before this point (positive) and after this point (negative) the index 6th is selected.
Now lets consider the following data
ind_column col1
5 0.5
5 0.65
5 -0.6
5 0.2
5 -0.1
5 0.8
5 0.3
5 -0.2
5 -0.3
Now we should look at the rows 2,3,4,5,6,7,8. Since the sign is changing in these rows more than once, the index 5 is celected.
in the following data also index 5 is selected since the sign never changes.
ind_column col1
5 0.5
5 0.65
5 0.6
5 0.2
5 0.1
5 0.8
5 0.3
5 0.2
5 0.3
CodePudding user response:
One approach could be as follows:
import pandas as pd
import numpy as np
def get_index(df):
idx = df.loc[0,'ind_column']
tmp = np.sign(df.iloc[max(idx-3,0):idx 4]['col1']).diff().dropna().ne(0)
if tmp.sum() == 1:
return tmp[tmp].index.item()
return idx
indices = [get_index(df) for df in [df1,df2,df3]] # see `dfs` listed below
print(indices)
[6, 5, 5]
Explanation
- Inside the function, we first retrieve the "start" index from column
ind_column(e.g.4or5in the exampledfs). - Next, we use
df.ilocto select the index range -3 rows through to (and including) 3 rows fromidxvalue. We usemax(idx-3,0)to ensure that we won't be trying to select a negative index value (i.e. whenidx < 3) as the starting point, causing an emptydfselection. - From the
dfslice, we select columncol1and applynp.signto get a series with-1,0or1. - Now, we chain
Series.diffto get the difference between consecutive rows, and useSeries.dropnato get rid of the first row, which will beNaNby definition. - Finally, we chain
Series.newith0. This will get us the series with booleans:Truewhere there is a switch in sign,Falsewhere there is no switch. - The resulting
pd.Seriesis stored astmpand now, we simply have to check whether thesumof the series equals1. If it does, we want to get the index of theTruevalue responsible, returningtmp[tmp].index.item(). In all other cases, we either have no switches at all, or multiple ones. In both cases, we simply want to returnidx.
Data used for df1, df2, df3 above
data1 = {'ind_column': {0: 4, 1: 4, 2: 4, 3: 4, 4: 4, 5: 4, 6: 4, 7: 4, 8: 4},
'col1': {0: 0.5, 1: 0.65, 2: 0.6, 3: 0.2, 4: 0.1, 5: 0.8, 6: -0.3,
7: -0.2, 8: -0.3}}
df1 = pd.DataFrame(data1)
data2 = {'ind_column': {0: 5, 1: 5, 2: 5, 3: 5, 4: 5, 5: 5, 6: 5, 7: 5, 8: 5},
'col1': {0: 0.5, 1: 0.65, 2: -0.6, 3: 0.2, 4: -0.1, 5: 0.8, 6: 0.3,
7: -0.2, 8: -0.3}}
df2 = pd.DataFrame(data2)
data3 = {'ind_column': {0: 5, 1: 5, 2: 5, 3: 5, 4: 5, 5: 5, 6: 5, 7: 5, 8: 5},
'col1': {0: 0.5, 1: 0.65, 2: 0.6, 3: 0.2, 4: 0.1, 5: 0.8, 6: 0.3,
7: 0.2, 8: 0.3}}
df3 = pd.DataFrame(data3)
