I have adf that looks like this:
| date | stock1 | stock2 | stock3 | stock4 | stock5 | stock6 | stock7 | stock8 | stock9 | stock10 |
|---|---|---|---|---|---|---|---|---|---|---|
| 10/20 | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 | 0.9 |
| 11/20 | 0.1 | 0.9 | 0.3 | 0.4 | 0.3 | 0.5 | 0.3 | 0.2 | 0.4 | 0.1 |
| 12/20 | 0.1 | 0.6 | 0.9 | 0.5 | 0.6 | 0.7 | 0.8 | 0.7 | 0.9 | 0.1 |
| 10/20 | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 | 0.9 |
| 11/20 | 0.8 | 0.9 | 0.3 | 0.4 | 0.3 | 0.5 | 0.3 | 0.2 | 0.9 | 0.1 |
| 12/20 | 0.3 | 0.6 | 0.9 | 0.5 | 0.6 | 0.7 | 0.8 | 0.7 | 0.9 | 0.1 |
| 10/20 | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.7 | 0.7 | 0.8 | 0.9 | 0.9 |
| 11/20 | 0.8 | 0.9 | 0.3 | 0.4 | 0.3 | 0.7 | 0.3 | 0.2 | 0.4 | 0.1 |
| 12/20 | 0.3 | 0.6 | 0.9 | 0.5 | 0.6 | 0.7 | 0.8 | 0.7 | 0.9 | 0.1 |
I want to delete all columns for which the same value repeats, consecutively, more than 3 times. In this example, the columns "stock1", "stock6" and "stock9" should be deleted. In the other columns, we have repeating values more than 3 times, but not one after the other. I think I can adapt the code from that question Removing values that repeat more than 5 times in Pandas DataFrame, but I could not manage to do that yet.
CodePudding user response:
You can set "date" aside as index, then check if the rows are different from the next one as use it to groupby cumcount.
Then compute the max count per column, if greater than N-1, drop the column:
df2 = df.set_index('date')
N = 3
df2.loc[:, df2.apply(lambda c: c.groupby(c.ne(c.shift()).cumsum()).cumcount()).max().lt(N-1)]
output:
stock2 stock3 stock4 stock5 stock7 stock8 stock10
date
10/20 0.2 0.3 0.4 0.5 0.7 0.8 0.9
11/20 0.9 0.3 0.4 0.3 0.3 0.2 0.1
12/20 0.6 0.9 0.5 0.6 0.8 0.7 0.1
10/20 0.2 0.3 0.4 0.5 0.7 0.8 0.9
11/20 0.9 0.3 0.4 0.3 0.3 0.2 0.1
12/20 0.6 0.9 0.5 0.6 0.8 0.7 0.1
10/20 0.2 0.3 0.4 0.5 0.7 0.8 0.9
11/20 0.9 0.3 0.4 0.3 0.3 0.2 0.1
12/20 0.6 0.9 0.5 0.6 0.8 0.7 0.1
intermediate count of successive values:
>>> df2.apply(lambda c: c.groupby(c.ne(c.shift()).cumsum()).cumcount())
stock1 stock2 stock3 stock4 stock5 stock6 stock7 stock8 stock9 stock10
date
10/20 0 0 0 0 0 0 0 0 0 0
11/20 1 0 1 1 0 0 0 0 0 0
12/20 2 0 0 0 0 0 0 0 0 1
10/20 3 0 0 0 0 0 0 0 1 0
11/20 0 0 1 1 0 0 0 0 2 0
12/20 0 0 0 0 0 0 0 0 3 1
10/20 0 0 0 0 0 1 0 0 4 0
11/20 0 0 1 1 0 2 0 0 0 0
12/20 0 0 0 0 0 3 0 0 0 1
CodePudding user response:
You could want avoid apply here:
N = 3
df.loc[:,
df.ne(df.shift()).cumsum()
.stack()
.groupby(level=1)
.value_counts()
.max(level=0).le(N)]
