Way to use groupby over indices in pandas

Time:01-11

[Edited to provide better working example data]

Imagine I have a panel in pandas with multiple time_vars for each id_var. For example, imagine the data contain a state variable and a year variable:

                   var1      var2
stcode year                      
WY     1996-01-01    14  0.000059
       1995-01-01     8  0.000059
       1994-01-01    21  0.000182
       1993-01-01    17  0.000063
       1992-01-01     9  0.000000
AK     1964-01-01    11  0.000213
       1965-01-01     6  0.000100
       1966-01-01    10  0.000189
       1967-01-01     9  0.000267
       1968-01-01     9  0.000084

Is there a way to use groupby and operate on the year level of the MultiIndex as the object, or am I constrained to the columns available within the groupby object?

As an example, suppose I wanted to check if the dates were sorted, but I only cared about whether the year level was sorted, and not the stcode level. I currently would implement this like so:

# unsetting year from the index, so that I can groupby the remaining level and 
# check if year is monotonically increasing
df.reset_index('year').groupby('stcode').year.is_monotonic_increasing

This returns (as you can see):

stcode
AK     True
WY    False
Name: year, dtype: bool

However, unsetting the index in this way seems awfully inefficient. Is there a general way to do something like:

df.groupby(level=0)[ -- the other level -- ].apply( -- operation -- )

Similarly, and relatedly, is there a way to sort by the second level and ignore the first? In this example, I may need the time variable sorted within each stcode block to do time dependent operations like shifting, but I do not care if the outer level is sorted alphabetically or numerically. In other words, this would end up like:

                   var1      var2
stcode year                      
WY     1992-01-01     9  0.000000
       1993-01-01    17  0.000063
       1994-01-01    21  0.000182
       1995-01-01     8  0.000059
       1996-01-01    14  0.000059
AK     1964-01-01    11  0.000213
       1965-01-01     6  0.000100
       1966-01-01    10  0.000189
       1967-01-01     9  0.000267
       1968-01-01     9  0.000084

Here is a dict of the original DataFrame:

{'var1': {('WY', Timestamp('1996-01-01 00:00:00')): 14,
  ('WY', Timestamp('1995-01-01 00:00:00')): 8,
  ('WY', Timestamp('1994-01-01 00:00:00')): 21,
  ('WY', Timestamp('1993-01-01 00:00:00')): 17,
  ('WY', Timestamp('1992-01-01 00:00:00')): 9,
  ('AK', Timestamp('1964-01-01 00:00:00')): 11,
  ('AK', Timestamp('1965-01-01 00:00:00')): 6,
  ('AK', Timestamp('1966-01-01 00:00:00')): 10,
  ('AK', Timestamp('1967-01-01 00:00:00')): 9,
  ('AK', Timestamp('1968-01-01 00:00:00')): 9},
 'var2': {('WY', Timestamp('1996-01-01 00:00:00')): 5.855486597283743e-05,
  ('WY', Timestamp('1995-01-01 00:00:00')): 5.91261159570422e-05,
  ('WY', Timestamp('1994-01-01 00:00:00')): 0.00018243736121803522,
  ('WY', Timestamp('1993-01-01 00:00:00')): 6.34477473795414e-05,
  ('WY', Timestamp('1992-01-01 00:00:00')): 0.0,
  ('AK', Timestamp('1964-01-01 00:00:00')): 0.0002131750516127795,
  ('AK', Timestamp('1965-01-01 00:00:00')): 0.00010040206689154729,
  ('AK', Timestamp('1966-01-01 00:00:00')): 0.0001891511055873707,
  ('AK', Timestamp('1967-01-01 00:00:00')): 0.00026726152282208204,
  ('AK', Timestamp('1968-01-01 00:00:00')): 8.391729352297261e-05}}

CodePudding user response:

IIUC, try:

df.groupby('stcode').apply(lambda x: x.index.get_level_values(1).is_monotonic_increasing)

For these data, timings:

%timeit df.groupby('stcode').apply(lambda x: x.index.get_level_values(1).is_monotonic_increasing)

820 µs ± 28.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit df.reset_index('year').groupby('stcode').year.is_monotonic_increasing

1.33 ms ± 38.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Sorting level 1 without sorting level 0 is pretty tough to do, but we could use a category dtype or create a pseudo-numeric key and sort on that.
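
One possible reading of the pseudo-numeric-key idea, as a minimal sketch rather than a definitive method: number each stcode by order of first appearance with pd.factorize, then sort on that key together with year. The helper frame and the names key, positions and result are illustrative assumptions, and the small frame reuses a few rows from the question.

```python
import pandas as pd

# A few rows from the question, in their original (unsorted) order.
idx = pd.MultiIndex.from_tuples(
    [
        ("WY", pd.Timestamp("1996-01-01")),
        ("WY", pd.Timestamp("1995-01-01")),
        ("WY", pd.Timestamp("1994-01-01")),
        ("AK", pd.Timestamp("1964-01-01")),
        ("AK", pd.Timestamp("1965-01-01")),
    ],
    names=["stcode", "year"],
)
df = pd.DataFrame({"var1": [14, 8, 21, 11, 6]}, index=idx)

# Pseudo numeric key: 0 for the first stcode seen, 1 for the next, and so on.
key, _ = pd.factorize(df.index.get_level_values("stcode"))

# Hypothetical helper frame: sort on (key, year) and read off row positions.
helper = pd.DataFrame({"key": key, "year": df.index.get_level_values("year")})
positions = helper.sort_values(["key", "year"]).index.to_numpy()

# Reorder the original frame; stcode blocks keep their order of appearance.
result = df.iloc[positions]
print(result)
```

Because the key follows order of appearance, WY stays ahead of AK while the years inside each block end up ascending.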

CodePudding user response:

You can access different levels of the MultiIndex through the level parameter. For example, if you want to sort the dates in ascending order but the state names in reverse lexicographic order, you can do it with:

df = df.sort_index(level=[0,1], ascending=[False, True])

Output:

               var1      var2
WY 1992-01-01     9  0.000000
   1993-01-01    17  0.000063
   1994-01-01    21  0.000182
   1995-01-01     8  0.000059
   1996-01-01    14  0.000059
AK 1964-01-01    11  0.000213
   1965-01-01     6  0.000100
   1966-01-01    10  0.000189
   1967-01-01     9  0.000267
   1968-01-01     9  0.000084

or if you want to find the average of var1 for each state, then you can do it by:

out = df.groupby(level=0)['var1'].mean()

Output:

AK     9.0
WY    13.8
Name: var1, dtype: float64

CodePudding user response:

I personally find the approach you used clean and reasonable.

However, you could use:

df.groupby(level=0).apply(lambda g: g.index.is_monotonic_increasing)

This works because the first level is constant within each group, so it is trivially monotonic and the result depends only on the year level.

Or, to really drop the first level, if needed (not required here):

df.groupby(level=0).apply(lambda g: g.index.droplevel(0).is_monotonic_increasing)

Output:

AK     True
WY    False
dtype: bool
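
For the more general pattern asked about in the question (group by one level, operate on the other), one option is to turn the other level into a Series that keeps the full MultiIndex and group that. This is a sketch under assumptions: the names years and is_sorted are illustrative, and the frame is a cut-down version of the question's data.

```python
import pandas as pd

# A few rows from the question, in their original (unsorted) order.
idx = pd.MultiIndex.from_tuples(
    [
        ("WY", pd.Timestamp("1996-01-01")),
        ("WY", pd.Timestamp("1995-01-01")),
        ("WY", pd.Timestamp("1994-01-01")),
        ("AK", pd.Timestamp("1964-01-01")),
        ("AK", pd.Timestamp("1965-01-01")),
    ],
    names=["stcode", "year"],
)
df = pd.DataFrame({"var1": [14, 8, 21, 11, 6]}, index=idx)

# Expose the year level as a Series indexed by the full MultiIndex,
# so it can be grouped by the stcode level like any ordinary column.
years = df.index.get_level_values("year").to_series(index=df.index)

# Any per-group operation can go in the lambda; here, the sortedness check.
is_sorted = years.groupby(level="stcode").apply(lambda s: s.is_monotonic_increasing)
print(is_sorted)
```

This avoids reset_index on the whole frame, although whether it is actually faster would need to be timed on real data.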

Sorting only the second level:

Use groupby with group_keys=False, sort=False:

df.groupby(level=0, group_keys=False, sort=False).apply(lambda g: g.sort_index())

Output:

               var1      var2
WY 1992-01-01     9  0.000000
   1993-01-01    17  0.000063
   1994-01-01    21  0.000182
   1995-01-01     8  0.000059
   1996-01-01    14  0.000059
AK 1964-01-01    11  0.000213
   1965-01-01     6  0.000100
   1966-01-01    10  0.000189
   1967-01-01     9  0.000267
   1968-01-01     9  0.000084
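
Since the question mentions shifting as the reason for sorting within each stcode, here is a small sketch of how the sorted result might be used. The names sorted_df and lagged are illustrative assumptions, and the frame is a cut-down version of the question's data.

```python
import pandas as pd

# A few rows from the question, in their original (unsorted) order.
idx = pd.MultiIndex.from_tuples(
    [
        ("WY", pd.Timestamp("1996-01-01")),
        ("WY", pd.Timestamp("1995-01-01")),
        ("WY", pd.Timestamp("1994-01-01")),
        ("AK", pd.Timestamp("1964-01-01")),
        ("AK", pd.Timestamp("1965-01-01")),
    ],
    names=["stcode", "year"],
)
df = pd.DataFrame({"var1": [14, 8, 21, 11, 6]}, index=idx)

# Sort years inside each stcode block without reordering the blocks.
sorted_df = df.groupby(level=0, group_keys=False, sort=False).apply(
    lambda g: g.sort_index()
)

# Lag var1 by one row within each stcode; the first row of each block is NaN.
lagged = sorted_df.groupby(level="stcode", sort=False)["var1"].shift(1)
print(sorted_df.assign(var1_lag=lagged))
```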