Delete all columns for which value repents consecutively more than 3 times-CodePudding

I have adf that looks like this:

date	stock1	stock2	stock3	stock4	stock5	stock6	stock7	stock8	stock9	stock10
10/20	0.1	0.2	0.3	0.4	0.5	0.6	0.7	0.8	0.9	0.9
11/20	0.1	0.9	0.3	0.4	0.3	0.5	0.3	0.2	0.4	0.1
12/20	0.1	0.6	0.9	0.5	0.6	0.7	0.8	0.7	0.9	0.1
10/20	0.1	0.2	0.3	0.4	0.5	0.6	0.7	0.8	0.9	0.9
11/20	0.8	0.9	0.3	0.4	0.3	0.5	0.3	0.2	0.9	0.1
12/20	0.3	0.6	0.9	0.5	0.6	0.7	0.8	0.7	0.9	0.1
10/20	0.1	0.2	0.3	0.4	0.5	0.7	0.7	0.8	0.9	0.9
11/20	0.8	0.9	0.3	0.4	0.3	0.7	0.3	0.2	0.4	0.1
12/20	0.3	0.6	0.9	0.5	0.6	0.7	0.8	0.7	0.9	0.1

I want to delete all columns for which the same value repeats, consecutively, more than 3 times. In this example, the columns "stock1", "stock6" and "stock9" should be deleted. In the other columns, we have repeating values more than 3 times, but not one after the other. I think I can adapt the code from that question Removing values that repeat more than 5 times in Pandas DataFrame, but I could not manage to do that yet.

CodePudding user response：

You can set "date" aside as index, then check if the rows are different from the next one as use it to groupby cumcount.

Then compute the max count per column, if greater than N-1, drop the column:

df2 = df.set_index('date')
N = 3
df2.loc[:, df2.apply(lambda c: c.groupby(c.ne(c.shift()).cumsum()).cumcount()).max().lt(N-1)]

output:

        stock2   stock3   stock4   stock5   stock7   stock8   stock10
date                                                                 
10/20       0.2      0.3      0.4      0.5      0.7      0.8      0.9
11/20       0.9      0.3      0.4      0.3      0.3      0.2      0.1
12/20       0.6      0.9      0.5      0.6      0.8      0.7      0.1
10/20       0.2      0.3      0.4      0.5      0.7      0.8      0.9
11/20       0.9      0.3      0.4      0.3      0.3      0.2      0.1
12/20       0.6      0.9      0.5      0.6      0.8      0.7      0.1
10/20       0.2      0.3      0.4      0.5      0.7      0.8      0.9
11/20       0.9      0.3      0.4      0.3      0.3      0.2      0.1
12/20       0.6      0.9      0.5      0.6      0.8      0.7      0.1

intermediate count of successive values:

>>> df2.apply(lambda c: c.groupby(c.ne(c.shift()).cumsum()).cumcount())

        stock1   stock2   stock3   stock4   stock5   stock6   stock7   stock8   stock9   stock10
date                                                                                            
10/20         0        0        0        0        0        0        0        0        0        0
11/20         1        0        1        1        0        0        0        0        0        0
12/20         2        0        0        0        0        0        0        0        0        1
10/20         3        0        0        0        0        0        0        0        1        0
11/20         0        0        1        1        0        0        0        0        2        0
12/20         0        0        0        0        0        0        0        0        3        1
10/20         0        0        0        0        0        1        0        0        4        0
11/20         0        0        1        1        0        2        0        0        0        0
12/20         0        0        0        0        0        3        0        0        0        1

CodePudding user response：

You could want avoid apply here:

N = 3
df.loc[:, 
   df.ne(df.shift()).cumsum()
     .stack()
     .groupby(level=1)
     .value_counts()
     .max(level=0).le(N)]