[Edited to provide better working example data]
Imagine I have a panel in pandas with multiple time_vars for each id_var. For example, imagine the data contain a state variable and a year variable:
                   var1      var2
stcode year
WY     1996-01-01    14  0.000059
       1995-01-01     8  0.000059
       1994-01-01    21  0.000182
       1993-01-01    17  0.000063
       1992-01-01     9  0.000000
AK     1964-01-01    11  0.000213
       1965-01-01     6  0.000100
       1966-01-01    10  0.000189
       1967-01-01     9  0.000267
       1968-01-01     9  0.000084
Is there a way to use groupby and operate directly on the year level of the MultiIndex, or am I constrained to the columns of the grouped DataFrame?
As an example, suppose I wanted to check if the dates were sorted, but I only cared about whether the year level was sorted, and not the stcode level. I currently would implement this like so:
# unsetting year from the index, so that I can groupby the remaining level and
# check if year is monotonically increasing
df.reset_index('year').groupby('stcode').year.is_monotonic_increasing
This returns (as you can see):
stcode
AK True
WY False
Name: year, dtype: bool
However, unsetting the index in this way seems awfully inefficient. Is there a general way to do something like:
df.groupby(level=0)[ -- the other level -- ].apply( -- operation -- )
Similarly, and relatedly, is there a way to sort by the second level and ignore the first? In this example, I may need the time variable sorted within each stcode block to do time dependent operations like shifting, but I do not care if the outer level is sorted alphabetically or numerically. In other words, this would end up like:
                   var1      var2
stcode year
WY     1992-01-01     9  0.000000
       1993-01-01    17  0.000063
       1994-01-01    21  0.000182
       1995-01-01     8  0.000059
       1996-01-01    14  0.000059
AK     1964-01-01    11  0.000213
       1965-01-01     6  0.000100
       1966-01-01    10  0.000189
       1967-01-01     9  0.000267
       1968-01-01     9  0.000084
Here is a dict of the original DataFrame:
{'var1': {('WY', Timestamp('1996-01-01 00:00:00')): 14,
('WY', Timestamp('1995-01-01 00:00:00')): 8,
('WY', Timestamp('1994-01-01 00:00:00')): 21,
('WY', Timestamp('1993-01-01 00:00:00')): 17,
('WY', Timestamp('1992-01-01 00:00:00')): 9,
('AK', Timestamp('1964-01-01 00:00:00')): 11,
('AK', Timestamp('1965-01-01 00:00:00')): 6,
('AK', Timestamp('1966-01-01 00:00:00')): 10,
('AK', Timestamp('1967-01-01 00:00:00')): 9,
('AK', Timestamp('1968-01-01 00:00:00')): 9},
'var2': {('WY', Timestamp('1996-01-01 00:00:00')): 5.855486597283743e-05,
('WY', Timestamp('1995-01-01 00:00:00')): 5.91261159570422e-05,
('WY', Timestamp('1994-01-01 00:00:00')): 0.00018243736121803522,
('WY', Timestamp('1993-01-01 00:00:00')): 6.34477473795414e-05,
('WY', Timestamp('1992-01-01 00:00:00')): 0.0,
('AK', Timestamp('1964-01-01 00:00:00')): 0.0002131750516127795,
('AK', Timestamp('1965-01-01 00:00:00')): 0.00010040206689154729,
('AK', Timestamp('1966-01-01 00:00:00')): 0.0001891511055873707,
('AK', Timestamp('1967-01-01 00:00:00')): 0.00026726152282208204,
('AK', Timestamp('1968-01-01 00:00:00')): 8.391729352297261e-05}}
CodePudding user response:
IIUC, try:
df.groupby('stcode').apply(lambda x: x.index.get_level_values(1).is_monotonic_increasing)
For these data, timings:
%timeit df.groupby('stcode').apply(lambda x: x.index.get_level_values(1).is_monotonic_increasing)
820 µs ± 28.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit df.reset_index('year').groupby('stcode').year.is_monotonic_increasing
1.33 ms ± 38.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Sorting level 1 without sorting level 0 is tougher, but we could use a category dtype or create a pseudo-numeric key and sort on that.
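A minimal sketch of the pseudo-numeric key idea, assuming the outer level should keep its order of first appearance. The small frame is a stand-in for the question's data, and `_key` is a hypothetical helper column name, not something from the original post:

```python
import pandas as pd

# Small stand-in for the question's frame (fewer rows, one column).
idx = pd.MultiIndex.from_tuples(
    [
        ("WY", pd.Timestamp("1994-01-01")),
        ("WY", pd.Timestamp("1993-01-01")),
        ("WY", pd.Timestamp("1992-01-01")),
        ("AK", pd.Timestamp("1964-01-01")),
        ("AK", pd.Timestamp("1966-01-01")),
        ("AK", pd.Timestamp("1965-01-01")),
    ],
    names=["stcode", "year"],
)
df = pd.DataFrame({"var1": [21, 17, 9, 11, 10, 6]}, index=idx)

# factorize numbers the states by order of first appearance: WY -> 0, AK -> 1.
codes, _ = pd.factorize(df.index.get_level_values("stcode"))

# "_key" is a hypothetical helper column; sort_values accepts index level
# names such as "year" alongside column labels.
result = (
    df.assign(_key=codes)
      .sort_values(["_key", "year"])
      .drop(columns="_key")
)
print(result)
```

The key keeps WY ahead of AK while the years end up ascending within each state, and the MultiIndex is never reset.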
CodePudding user response:
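You can address individual levels of the MultiIndex through the level parameter. For example, to sort the dates in ascending order and the state codes in reverse lexicographic order (this still sorts level 0, and matches the requested output here only because WY comes before AK in descending order), you can do: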
df = df.sort_index(level=[0,1], ascending=[False, True])
Output:
               var1      var2
WY 1992-01-01     9  0.000000
   1993-01-01    17  0.000063
   1994-01-01    21  0.000182
   1995-01-01     8  0.000059
   1996-01-01    14  0.000059
AK 1964-01-01    11  0.000213
   1965-01-01     6  0.000100
   1966-01-01    10  0.000189
   1967-01-01     9  0.000267
   1968-01-01     9  0.000084
Or, if you want the average of var1 for each state:
out = df.groupby(level=0)['var1'].mean()
Output:
AK 9.0
WY 13.8
Name: var1, dtype: float64
CodePudding user response:
I personally find the approach you used clean and reasonable.
However, you could use:
df.groupby(level=0).apply(lambda g: g.index.is_monotonic_increasing)
This works because the first level is constant within each group, so whether the group's index is monotonic depends only on the second level.
Or, to really drop the first level, if needed (not required here):
df.groupby(level=0).apply(lambda g: g.index.droplevel(0).is_monotonic_increasing)
Output:
AK True
WY False
dtype: bool
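For the more general "group by one level, operate on the other" pattern from the question, one possible sketch is to turn the year level into a Series keyed by stcode and group that. The small frame is a stand-in for the question's data, and the name `years` is only illustrative:

```python
import pandas as pd

# Stand-in data: WY years are descending, AK years are ascending.
idx = pd.MultiIndex.from_tuples(
    [
        ("WY", pd.Timestamp("1994-01-01")),
        ("WY", pd.Timestamp("1993-01-01")),
        ("WY", pd.Timestamp("1992-01-01")),
        ("AK", pd.Timestamp("1964-01-01")),
        ("AK", pd.Timestamp("1965-01-01")),
        ("AK", pd.Timestamp("1966-01-01")),
    ],
    names=["stcode", "year"],
)
df = pd.DataFrame({"var1": [21, 17, 9, 11, 6, 10]}, index=idx)

# Series of years keyed by state; the frame's columns are never touched.
years = pd.Series(
    df.index.get_level_values("year"),
    index=df.index.get_level_values("stcode"),
)

# Any per-group operation on the year values can go in the lambda.
result = years.groupby(level="stcode").apply(lambda s: s.is_monotonic_increasing)
print(result)
```

Only the index values are grouped here, so the same pattern should work for other reductions on the second level (min, max, diff, and so on).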
Sorting only the second level
Use groupby with group_keys=False, sort=False:
df.groupby(level=0, group_keys=False, sort=False).apply(lambda g: g.sort_index())
Output:
               var1      var2
WY 1992-01-01     9  0.000000
   1993-01-01    17  0.000063
   1994-01-01    21  0.000182
   1995-01-01     8  0.000059
   1996-01-01    14  0.000059
AK 1964-01-01    11  0.000213
   1965-01-01     6  0.000100
   1966-01-01    10  0.000189
   1967-01-01     9  0.000267
   1968-01-01     9  0.000084
