I have this very large Dataframe containing statistics for various firms for years 1950 to 2020. I have been trying to divide the data first by year and then by industry code (4 digits). Both 'year' and 'industry_code' are columns from the Dataframe. I have created a dictionary in order to obtain data by year, but then I find myself stuck when trying to divide each key by industry, since all of my columns from my initial Dataframe find themselves in the 'value' part of the dictionary. Here is my starting code:
df= pd.read_csv('xyz')
dictio = {}
for year in df['year'].unique():
dictio[year] = df[ df['year'] == year ]
Could someone help me figure out a groupby / loc / if statement or other in order to complete the sampling by year and by industry? Thank you!
CodePudding user response:
Try using dict comprehension groupby:
dct = {key1: {key2: df2 for key2, df2 in df1.groupby('industry_code')} for key1, df1 in df.groupby('year')}
Now try accessing one:
firm_year_df = dct[1994]['My Firm']
CodePudding user response:
The pandas .groupby() method lets you analyze data with multiple keys (e.g., year and industry).
import random
import pandas as pd
# create data frame
random.seed(1234)
data = {
'year': [2017, 2017, 2017, 2018, 2018, 2018, 2019, 2019, 2019],
'industry': [111, 222, 333, 111, 222, 333, 111, 222, 333, ],
'x': [random.randint(1000, 1100) for _ in range(9)],
'y': [random.randint(5000, 6000) for _ in range(9)], }
df = pd.DataFrame(data)
# aggregate with `.groupby()`
gdf = df.groupby(['year', 'industry']).sum()
print(gdf)
x y
year industry
2017 111 1099 5085
222 1056 5100
333 1014 5784
2018 111 1000 5363
222 1011 5242
333 1074 5017
2019 111 1004 5031
222 1085 5807
333 1088 5016
The .groupby() aggregates on two keys, so the resulting data frame has a MultiIndex. You can index like this:
# perform lookup
print(gdf.loc[(2018, 222), :])
x 1011
y 5242
Name: (2018, 222), dtype: int64
