CSV & Pandas: Unnamed columns and multi-index-CodePudding

I have a set of data:

,,England,,,,,,,,,,,,France,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,
,,Store 1,,,,Store 2,,,,Store 3,,,,Store 1,,,,Store 2,,,,Store 3,,,
,,,,,,,,,,,,,,,,,,,,,,,,,
,,F,P,M,D,F,P,M,D,F,P,M,D,F,P,M,D,F,P,M,D,F,P,M,D
,,,,,,,,,,,,,,,,,,,,,,,,,
Year 1,M,0,5,7,9,2,18,5,10,4,9,6,2,4,14,18,11,10,19,18,20,3,17,19,13
,,,,,,,,,,,,,,,,,,,,,,,,,
,F,0,13,14,11,0,6,8,6,2,12,14,9,9,17,12,18,6,17,16,14,0,4,2,5
,,,,,,,,,,,,,,,,,,,,,,,,,
Year 2,M,5,10,6,6,1,20,5,18,4,9,6,2,10,13,15,19,2,18,16,13,1,19,5,12
,,,,,,,,,,,,,,,,,,,,,,,,,
,F,1,11,14,15,0,9,9,2,2,12,14,9,7,17,18,14,9,18,13,14,0,9,2,10
,,,,,,,,,,,,,,,,,,,,,,,,,
Evening,M,4,10,6,5,3,13,19,5,4,9,6,2,8,17,10,18,3,11,20,11,4,18,17,20
,,,,,,,,,,,,,,,,,,,,,,,,,
,F,4,12,12,13,0,9,3,8,2,12,14,9,0,18,11,18,1,13,13,10,0,6,2,8

The desired output I'm trying to achieve is:

I know that I can read the CSV and remove any NaN rows with:

df = pd.read_csv("Stores.csv",skipinitialspace=True)
df.dropna(how="all", inplace=True)

My 2 main issues are:

How do I group the unnamed columns so that they are just the countries "England" and "France"
How do I setup an index so that each of the 3 stores fall under the relevant countries?

I believe that I can use hierarchical indexing for the headings but all examples I've come across use nice, clean data frames unlike my CSV. I'd be very grateful if someone could point me in the right direction as I'm fairly new to pandas.

Thank you.

CodePudding user response：

You'll have to set the (multi) index and headers yourself:

df = pd.read_csv("Stores.csv", header=None)
df.dropna(how='all', inplace=True)
df.reset_index(inplace=True, drop=True)

# getting headers as a product of [England, France], [Store1, Store2, Store3] and [F, P, M, D]
headers = pd.MultiIndex.from_product([df.iloc[0].dropna().unique(),
                                    df.iloc[1].dropna().unique(),
                                    df.iloc[2].dropna().unique()])

df.drop([0, 1, 2], inplace=True) # removing header rows
df[0].ffill(inplace=True) # filling nan values for first index col 
df.set_index([0,1], inplace=True) # setting mulitiindex
df.columns = headers
print(df)

Output:

          England                                                  ...  France                                                
          Store 1             Store 2             Store 3          ... Store 1         Store 2             Store 3            
                F   P   M   D       F   P   M   D       F   P   M  ...       P   M   D       F   P   M   D       F   P   M   D
0       1                                                          ...                                                        
Year 1  M       0   5   7   9       2  18   5  10       4   9   6  ...      14  18  11      10  19  18  20       3  17  19  13
        F       0  13  14  11       0   6   8   6       2  12  14  ...      17  12  18       6  17  16  14       0   4   2   5
Year 2  M       5  10   6   6       1  20   5  18       4   9   6  ...      13  15  19       2  18  16  13       1  19   5  12
        F       1  11  14  15       0   9   9   2       2  12  14  ...      17  18  14       9  18  13  14       0   9   2  10
Evening M       4  10   6   5       3  13  19   5       4   9   6  ...      17  10  18       3  11  20  11       4  18  17  20
        F       4  12  12  13       0   9   3   8       2  12  14  ...      18  11  18       1  13  13  10       0   6   2   8

[6 rows x 24 columns]

CodePudding user response：

You can try this:

from io import StringIO
import pandas as pd
import numpy as np

test=StringIO(""",,England,,,,,,,,,,,,France,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,
,,Store 1,,,,Store 2,,,,Store 3,,,,Store 1,,,,Store 2,,,,Store 3,,,
,,,,,,,,,,,,,,,,,,,,,,,,,
,,F,P,M,D,F,P,M,D,F,P,M,D,F,P,M,D,F,P,M,D,F,P,M,D
,,,,,,,,,,,,,,,,,,,,,,,,,
Year 1,M,0,5,7,9,2,18,5,10,4,9,6,2,4,14,18,11,10,19,18,20,3,17,19,13
,,,,,,,,,,,,,,,,,,,,,,,,,
,F,0,13,14,11,0,6,8,6,2,12,14,9,9,17,12,18,6,17,16,14,0,4,2,5
,,,,,,,,,,,,,,,,,,,,,,,,,
Year 2,M,5,10,6,6,1,20,5,18,4,9,6,2,10,13,15,19,2,18,16,13,1,19,5,12
,,,,,,,,,,,,,,,,,,,,,,,,,
,F,1,11,14,15,0,9,9,2,2,12,14,9,7,17,18,14,9,18,13,14,0,9,2,10
,,,,,,,,,,,,,,,,,,,,,,,,,
Evening,M,4,10,6,5,3,13,19,5,4,9,6,2,8,17,10,18,3,11,20,11,4,18,17,20
,,,,,,,,,,,,,,,,,,,,,,,,,
,F,4,12,12,13,0,9,3,8,2,12,14,9,0,18,11,18,1,13,13,10,0,6,2,8""")

df = pd.read_csv(test, index_col=[0,1], header=[0,1,2], skiprows=lambda x: x%2 == 1)


df.columns = pd.MultiIndex.from_frame(df.columns
                                        .to_frame()
                                        .apply(lambda x: np.where(x.str.contains('Unnamed'), np.nan, x))\
                          .ffill())

df.index = pd.MultiIndex.from_frame(df.index.to_frame().ffill())

print(df)

Output:

0         England                                              ...  France                                            
1         Store 1             Store 2             Store 3      ... Store 1     Store 2             Store 3            
2               F   P   M   D       F   P   M   D       F   P  ...       M   D       F   P   M   D       F   P   M   D
0       1                                                      ...                                                    
Year 1  M       0   5   7   9       2  18   5  10       4   9  ...      18  11      10  19  18  20       3  17  19  13
        F       0  13  14  11       0   6   8   6       2  12  ...      12  18       6  17  16  14       0   4   2   5
Year 2  M       5  10   6   6       1  20   5  18       4   9  ...      15  19       2  18  16  13       1  19   5  12
        F       1  11  14  15       0   9   9   2       2  12  ...      18  14       9  18  13  14       0   9   2  10
Evening M       4  10   6   5       3  13  19   5       4   9  ...      10  18       3  11  20  11       4  18  17  20
        F       4  12  12  13       0   9   3   8       2  12  ...      11  18       1  13  13  10       0   6   2   8

[6 rows x 24 columns]