Gather data by year and also by industry-CodePudding

I have this very large Dataframe containing statistics for various firms for years 1950 to 2020. I have been trying to divide the data first by year and then by industry code (4 digits). Both 'year' and 'industry_code' are columns from the Dataframe. I have created a dictionary in order to obtain data by year, but then I find myself stuck when trying to divide each key by industry, since all of my columns from my initial Dataframe find themselves in the 'value' part of the dictionary. Here is my starting code:

df= pd.read_csv('xyz')

dictio = {}
for year in df['year'].unique():
    dictio[year] = df[ df['year'] == year ]

Could someone help me figure out a groupby / loc / if statement or other in order to complete the sampling by year and by industry? Thank you!

CodePudding user response：

Try using dict comprehension groupby:

dct = {key1: {key2: df2 for key2, df2 in df1.groupby('industry_code')} for key1, df1 in df.groupby('year')}

Now try accessing one:

firm_year_df = dct[1994]['My Firm']

CodePudding user response：

The pandas .groupby() method lets you analyze data with multiple keys (e.g., year and industry).

import random
import pandas as pd

# create data frame
random.seed(1234)
data = {
    'year': [2017, 2017, 2017, 2018, 2018, 2018, 2019, 2019, 2019],
    'industry': [111, 222, 333, 111, 222, 333, 111, 222, 333, ],
    'x': [random.randint(1000, 1100) for _ in range(9)],
    'y': [random.randint(5000, 6000) for _ in range(9)], }
df = pd.DataFrame(data)

# aggregate with `.groupby()`
gdf = df.groupby(['year', 'industry']).sum()
print(gdf)

                  x     y
year industry            
2017 111       1099  5085
     222       1056  5100
     333       1014  5784
2018 111       1000  5363
     222       1011  5242
     333       1074  5017
2019 111       1004  5031
     222       1085  5807
     333       1088  5016

The .groupby() aggregates on two keys, so the resulting data frame has a MultiIndex. You can index like this:

# perform lookup
print(gdf.loc[(2018, 222), :])

x    1011
y    5242
Name: (2018, 222), dtype: int64