I have a script that produces bins using np.digitize of values like so:
| time | butter | bin |
| :---- | :------ | :--- |
| 2022-07-04 17:33:45 00:00 | 1.041967 | 3 |
| 2022-07-04 17:34:00 00:00 | 1.041967 | 4 |
| 2022-07-04 17:34:15 00:00 | 1.041966 | 4 |
| 2022-07-04 17:34:30 00:00 | 1.041967 | 4 |
| 2022-07-04 17:34:45 00:00 | 1.041968 | 4 |
| 2022-07-04 17:35:00 00:00 | 1.041969 | 4 |
| 2022-07-04 17:35:15 00:00 | 1.041971 | 4 |
| 2022-07-04 17:35:30 00:00 | 1.041973 | 4 |
| 2022-07-04 17:35:45 00:00 | 1.041975 | 4 |
| 2022-07-04 17:36:00 00:00 | 1.041977 | 5 |
| 2022-07-04 17:36:15 00:00 | 1.041979 | 5 |
| 2022-07-04 17:36:30 00:00 | 1.041981 | 5 |
| 2022-07-04 17:36:45 00:00 | 1.041983 | 5 |
| 2022-07-04 17:37:00 00:00 | 1.041985 | 5 |
| 2022-07-04 17:37:15 00:00 | 1.041986 | 6 |
| 2022-07-04 17:37:30 00:00 | 1.041987 | 6 |
| 2022-07-04 17:37:45 00:00 | 1.041988 | 6 |
| 2022-07-04 17:38:00 00:00 | 1.041989 | 6 |
bins can increase/decrease and they can skip bins.
How can I calculate the local min/max of each bin, so that the result looks like this:
| time | butter | bin | min | max |
| :---- | :------ | :--- | :--- | :--- |
| 2022-07-04 17:33:45 00:00 | 1.041967 | 3 | 1.041967 | 1.041967 |
| 2022-07-04 17:34:00 00:00 | 1.041968 | 4 | 1.041966 | 1.041975 |
| 2022-07-04 17:34:15 00:00 | 1.041966 | 4 | 1.041966 | 1.041975 |
| 2022-07-04 17:34:30 00:00 | 1.041967 | 4 | 1.041966 | 1.041975 |
| 2022-07-04 17:34:45 00:00 | 1.041968 | 4 | 1.041966 | 1.041975 |
| 2022-07-04 17:35:00 00:00 | 1.041969 | 4 | 1.041966 | 1.041975 |
| 2022-07-04 17:35:15 00:00 | 1.041971 | 4 | 1.041966 | 1.041975 |
| 2022-07-04 17:35:30 00:00 | 1.041973 | 4 | 1.041966 | 1.041975 |
| 2022-07-04 17:35:45 00:00 | 1.041975 | 4 | 1.041966 | 1.041975 |
| 2022-07-04 17:36:00 00:00 | 1.041977 | 5 | 1.041977 | 1.041985 |
| 2022-07-04 17:36:15 00:00 | 1.041979 | 5 | 1.041977 | 1.041985 |
| 2022-07-04 17:36:30 00:00 | 1.041981 | 5 | 1.041977 | 1.041985 |
| 2022-07-04 17:36:45 00:00 | 1.041983 | 5 | 1.041977 | 1.041985 |
| 2022-07-04 17:37:00 00:00 | 1.041985 | 5 | 1.041977 | 1.041985 |
| 2022-07-04 17:37:15 00:00 | 1.041986 | 6 | 1.041986 | 1.041989 |
| 2022-07-04 17:37:30 00:00 | 1.041987 | 6 | 1.041986 | 1.041989 |
| 2022-07-04 17:37:45 00:00 | 1.041988 | 6 | 1.041986 | 1.041989 |
| 2022-07-04 17:38:00 00:00 | 1.041989 | 6 | 1.041986 | 1.041989 |
I was trying something with np.where and np.diff with np.sign but I haven't figured it out nor do I feel that's the best way to go about the question.
thank you in advance.
CodePudding user response:
Here are two ways to do what your question asks, one using pandas and the other using numpy (UPDATED to reflect OP's clarification in comments regarding binning to be on the basis of contiguously grouped bin values):
res = df.assign(bin_index = (df.bin != df.bin.shift()).cumsum())
dfAggs = res[['butter', 'bin', 'bin_index']].groupby(['bin', 'bin_index']).agg([min, max])
dfAggs.columns = dfAggs.columns.droplevel()
res = res.join(dfAggs, on=['bin', 'bin_index']).drop(columns='bin_index')
print("", "pandas:", res, sep="\n")
a = df.copy().to_numpy()
print("", "input as numpy 2d array", a, sep="\n")
bin_index = a[:,2:3] != np.concatenate((np.full((1, 1), np.NaN), a[:-1,2:3]), axis = 0)
bin_index = np.cumsum(bin_index)
bins = np.unique(bin_index)
aggs = np.empty((a.shape[0], 2))
for b in bins:
mask = bin_index==b
aggs[mask, :] = (a[mask, 1].min(), a[mask, 1].max())
res = np.concatenate((a, aggs), axis=1)
print("", "numpy:", res, sep="\n")
Output:
input as pandas dataframe
time butter bin
0 2022-07-04 17:33:45 00:00 1.041967 3
1 2022-07-04 17:34:00 00:00 1.041967 4
2 2022-07-04 17:34:15 00:00 1.041966 4
3 2022-07-04 17:34:30 00:00 1.041967 4
4 2022-07-04 17:34:45 00:00 1.041968 4
5 2022-07-04 17:35:00 00:00 1.041969 4
6 2022-07-04 17:35:15 00:00 1.041971 4
7 2022-07-04 17:35:30 00:00 1.041973 4
8 2022-07-04 17:35:45 00:00 1.041975 4
9 2022-07-04 17:36:00 00:00 1.041977 5
10 2022-07-04 17:36:15 00:00 1.041979 5
11 2022-07-04 17:36:30 00:00 1.041981 5
12 2022-07-04 17:36:45 00:00 1.041983 5
13 2022-07-04 17:37:00 00:00 1.041985 5
14 2022-07-04 17:37:15 00:00 1.041986 6
15 2022-07-04 17:37:30 00:00 1.041987 6
16 2022-07-04 17:37:45 00:00 1.041988 6
17 2022-07-04 17:38:00 00:00 1.041989 6
18 2022-07-04 17:38:15 00:00 1.041990 4
19 2022-07-04 17:38:30 00:00 1.041995 4
pandas:
time butter bin min max
0 2022-07-04 17:33:45 00:00 1.041967 3 1.041967 1.041967
1 2022-07-04 17:34:00 00:00 1.041967 4 1.041966 1.041975
2 2022-07-04 17:34:15 00:00 1.041966 4 1.041966 1.041975
3 2022-07-04 17:34:30 00:00 1.041967 4 1.041966 1.041975
4 2022-07-04 17:34:45 00:00 1.041968 4 1.041966 1.041975
5 2022-07-04 17:35:00 00:00 1.041969 4 1.041966 1.041975
6 2022-07-04 17:35:15 00:00 1.041971 4 1.041966 1.041975
7 2022-07-04 17:35:30 00:00 1.041973 4 1.041966 1.041975
8 2022-07-04 17:35:45 00:00 1.041975 4 1.041966 1.041975
9 2022-07-04 17:36:00 00:00 1.041977 5 1.041977 1.041985
10 2022-07-04 17:36:15 00:00 1.041979 5 1.041977 1.041985
11 2022-07-04 17:36:30 00:00 1.041981 5 1.041977 1.041985
12 2022-07-04 17:36:45 00:00 1.041983 5 1.041977 1.041985
13 2022-07-04 17:37:00 00:00 1.041985 5 1.041977 1.041985
14 2022-07-04 17:37:15 00:00 1.041986 6 1.041986 1.041989
15 2022-07-04 17:37:30 00:00 1.041987 6 1.041986 1.041989
16 2022-07-04 17:37:45 00:00 1.041988 6 1.041986 1.041989
17 2022-07-04 17:38:00 00:00 1.041989 6 1.041986 1.041989
18 2022-07-04 17:38:15 00:00 1.041990 4 1.041990 1.041995
19 2022-07-04 17:38:30 00:00 1.041995 4 1.041990 1.041995
input as numpy 2d array
[['2022-07-04 17:33:45 00:00' 1.041967 3]
['2022-07-04 17:34:00 00:00' 1.041967 4]
['2022-07-04 17:34:15 00:00' 1.041966 4]
['2022-07-04 17:34:30 00:00' 1.041967 4]
['2022-07-04 17:34:45 00:00' 1.041968 4]
['2022-07-04 17:35:00 00:00' 1.041969 4]
['2022-07-04 17:35:15 00:00' 1.041971 4]
['2022-07-04 17:35:30 00:00' 1.041973 4]
['2022-07-04 17:35:45 00:00' 1.041975 4]
['2022-07-04 17:36:00 00:00' 1.041977 5]
['2022-07-04 17:36:15 00:00' 1.041979 5]
['2022-07-04 17:36:30 00:00' 1.041981 5]
['2022-07-04 17:36:45 00:00' 1.041983 5]
['2022-07-04 17:37:00 00:00' 1.041985 5]
['2022-07-04 17:37:15 00:00' 1.041986 6]
['2022-07-04 17:37:30 00:00' 1.041987 6]
['2022-07-04 17:37:45 00:00' 1.041988 6]
['2022-07-04 17:38:00 00:00' 1.041989 6]
['2022-07-04 17:38:15 00:00' 1.04199 4]
['2022-07-04 17:38:30 00:00' 1.041995 4]]
numpy:
[['2022-07-04 17:33:45 00:00' 1.041967 3 1.041967 1.041967]
['2022-07-04 17:34:00 00:00' 1.041967 4 1.041966 1.041975]
['2022-07-04 17:34:15 00:00' 1.041966 4 1.041966 1.041975]
['2022-07-04 17:34:30 00:00' 1.041967 4 1.041966 1.041975]
['2022-07-04 17:34:45 00:00' 1.041968 4 1.041966 1.041975]
['2022-07-04 17:35:00 00:00' 1.041969 4 1.041966 1.041975]
['2022-07-04 17:35:15 00:00' 1.041971 4 1.041966 1.041975]
['2022-07-04 17:35:30 00:00' 1.041973 4 1.041966 1.041975]
['2022-07-04 17:35:45 00:00' 1.041975 4 1.041966 1.041975]
['2022-07-04 17:36:00 00:00' 1.041977 5 1.041977 1.041985]
['2022-07-04 17:36:15 00:00' 1.041979 5 1.041977 1.041985]
['2022-07-04 17:36:30 00:00' 1.041981 5 1.041977 1.041985]
['2022-07-04 17:36:45 00:00' 1.041983 5 1.041977 1.041985]
['2022-07-04 17:37:00 00:00' 1.041985 5 1.041977 1.041985]
['2022-07-04 17:37:15 00:00' 1.041986 6 1.041986 1.041989]
['2022-07-04 17:37:30 00:00' 1.041987 6 1.041986 1.041989]
['2022-07-04 17:37:45 00:00' 1.041988 6 1.041986 1.041989]
['2022-07-04 17:38:00 00:00' 1.041989 6 1.041986 1.041989]
['2022-07-04 17:38:15 00:00' 1.04199 4 1.04199 1.041995]
['2022-07-04 17:38:30 00:00' 1.041995 4 1.04199 1.041995]]
Pandas explanation:
- Create
bin_indexcolumn which detects changes inbinand increments an id value for each such row - Use
DataFrame.groupby()to perform the aggregation (min,max) based onbin_index - Use
DataFrame.join()(with preprocessing of the aggregation dataframeaggsto remove the level of its MultiIndex namedbutter) to addminandmaxcolumns to the original dataframe.
Numpy explanation:
- Create
bin_indexarray which detects changes inbinand increments an id value for each such row - Prepare
aggsas an array with shapea.shape[0], 2for receivingminandmaxcolumns for the correspondingbinvalue in the input arraya - Use a boolean mask for each unique
binvalue inbin_indexto perform aggregation on the corresponding rows of thebuttercolumn ofaand to place these two values in the columns ofaggsfor these same rows - Use
numpy.concatenate()to glueaandaggstogether horizontally.
