Given a dataframe df, how to calculate the rolling count of unique values through rows' direction subject to a boundary condition: window size = n?
Input data:
import pandas as pd
import numpy as np
data = {'col_0':[7, 8, 9, 10, 11, 12],
'col_1':[4, 5, 6, 7, 8, 9],
'col_2':[2, 5, 8, 11, 14, 15],
'col_3':[2, 6, 10, 14, 18, 21],
'col_4':[7, 5, 7, 5, 7, 5],
'col_5':[2, 6, 10, 14, 18, 21]}
df = pd.DataFrame(data)
print(df)
###
col_0 col_1 col_2 col_3 col_4 col_5
0 7 4 2 2 7 2
1 8 5 5 6 5 6
2 9 6 8 10 7 10
3 10 7 11 14 5 14
4 11 8 14 18 7 18
5 12 9 15 21 5 21
Expected output (with window size = 2):
print(df)
###
col_0 col_1 col_2 col_3 col_4 col_5 rolling_nunique
0 7 4 2 2 7 2 3
1 8 5 5 6 5 6 6
2 9 6 8 10 7 10 6
3 10 7 11 14 5 14 8
4 11 8 14 18 7 18 7
5 12 9 15 21 5 21 10
For the example above with window size = 2.
At
window 0's array we have row[0].[[7 4 2 2 7 2]]rolling_nunique[0]is 3 with the elements being [2, 4, 7].At
window 1's array we have row[0] & row[1].[[7 4 2 2 7 2] [8 5 5 6 5 6]]rolling_nunique[1]is 6 with the elements being [2, 4, 5, 6, 7, 8].At
window 2's array we have row[1] & row[2].[[ 8 5 5 6 5 6] [ 9 6 8 10 7 10]]rolling_nunique[2]is 6 with the elements being [5, 6, 7, 8, 9, 10].etc.
CodePudding user response:
Using sliding_window_view, you can customize how the values are aggregated in the sliding window. To get values for all rows before the sliding window is full (i.e., emulate min_periods=1 in pd.rolling), we need to add some empty rows at the top. This can be done using vstack and full. At the end, we need to account for these added nan values by filtering them away.
from numpy.lib.stride_tricks import sliding_window_view
w = 2
values = np.vstack([np.full([w-1, df.shape[1]], np.nan), df.values])
m = sliding_window_view(values, w, axis=0).reshape(len(df), -1)
unique_count = [len(np.unique(r[~np.isnan(r)])) for r in m]
df['rolling_nunique'] = unique_count
Result:
col_0 col_1 col_2 col_3 col_4 col_5 rolling_nunique
0 7 4 2 2 7 2 3
1 8 5 5 6 5 6 6
2 9 6 8 10 7 10 6
3 10 7 11 14 5 14 8
4 11 8 14 18 7 18 7
5 12 9 15 21 5 21 10
CodePudding user response:
I found it could resolve by using sliding_window_view() from numpy,
Here's the approach:
rolling = 2
ar = df.values # turn into np.ndarray
length = ar.shape[1]
head_arrs = np.zeros((rolling-1, rolling*length))
cuboid = np.lib.stride_tricks.sliding_window_view(ar, (rolling,length)).astype(float)
plane = cuboid.reshape(-1,rolling*length)
for i in range(rolling-1,0,-1):
head_arr_l = plane[0,:i*length]
head_arr_l = np.pad(head_arr_l.astype(float), (0,length*(rolling-i)), 'constant', constant_values=np.nan)
head_arr_l = np.roll(head_arr_l, length*(rolling-i))
head_arrs[i-1,:] = head_arr_l
plane = np.insert(plane, 0, head_arrs, axis=0)
df['rolling_nunique'] = pd.DataFrame(plane).nunique(axis=1)
df
###
col_0 col_1 col_2 col_3 col_4 col_5 rolling_nunique
0 7 4 2 2 7 2 3
1 8 5 5 6 5 6 6
2 9 6 8 10 7 10 6
3 10 7 11 14 5 14 8
4 11 8 14 18 7 18 7
5 12 9 15 21 5 21 10
[reference] numpy.lib.stride_tricks.sliding_window_view
