I have a pandas dataframe df that I built using 3 levels of columns, as follows:
a1 a2 a3
b1 b2 b1 b3 b1 b4
c1 c2 c1 c2 c1 c2 c1 c2 c1 c2 c1 c2
... (data) ...
Note that each a column may have different b subcolumns, but each b column has the same c subcolumns.
I can extract e.g. the subcolumns from a2 using df["a2"].
How can I select based on the second or third level without having to specify the first and second level respectively? For instance I would like to say "give me all the c2 columns you can find" and I would get:
a1 a2 a3
b1 b2 b1 b3 b1 b4
... (data for the c2 columns) ...
Or "give me all the b1 columns" and I would get:
a1 a2 a3
c1 c2 c1 c2 c1 c2
... (data for the b1 columns) ...
CodePudding user response:
The docs provide some info on that. Adapting the examples from there to your example, either use tuples with slice objects you pass None,
df.loc[:, (slice(None), slice(None), "c2")]
or use pd.IndexSliceto use the familiar : notation:
idx = pd.IndexSlice
df.loc[:, idx[:, :, "c2"]]
CodePudding user response:
When you have a dataframe with a MultiIndex the columns are a list of tuples containing the keys of each level.
You can get the columns you want with a list comprehension checking if the name you want is in the tuple:
# get all the "c2"
df[[col for col in df.columns if "c2" in col]]
# get all the "b1"
df[[col for col in df.columns if "b1" in col]]
If you want to be sure that the column name is in the right position:
# get all the "c2"
df[[col for col in df.columns if col[2] == "c2"]]
# get all the "b1"
df[[col for col in df.columns if col[1] == "b1"]]
