Could someone please clarify this for me:
import pandas as pd
df = pd.DataFrame({'years': [2015, 2016, 2017, 2017, 2018, 2019, 2019, 2020]})
df['years'] = df['years'].astype('category')
print(df.dtypes)
years category
dtype: object
Now I create a list of the years I want to keep:
subset_years = [2015, 2016, 2017, 2018]
Then I filter the rows on those years:
subset_df = df[df['years'].isin(subset_years)]
print(subset_df)
years
0 2015
1 2016
2 2017
3 2017
4 2018
Now I take the unique elements:
subset_df.years.unique()
and I get:
[2015, 2016, 2017, 2018]
Categories (4, int64): [2015, 2016, 2017, 2018]
But if I call subset_df.years.value_counts(), I get:
2015 1
2016 1
2017 2
2018 1
2019 0
2020 0
Name: years, dtype: int64
My question is: why does subset_df.years.value_counts() still return 2019 and 2020, each with a count of 0? I have already filtered the years, so wasn't the subset/filter supposed to remove them?
Could someone please clarify what is happening?
CodePudding user response:
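It's because 2019 and 2020 are still among the column's categories. Filtering rows does not change the categorical dtype, and value_counts() on a categorical column reports every category, including those with no remaining rows. You can reset the categories before calling value_counts() if you don't want the filtered-out years to show up: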
subset_df.years.cat.set_categories(subset_years).value_counts()
#2017 2
#2015 1
#2016 1
#2018 1
#Name: years, dtype: int64
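If you would rather not pass the list of years again, another option is to drop whichever categories no longer occur in the data. The following is a minimal sketch that rebuilds the question's data and uses pandas' cat.remove_unused_categories(); the variable names remaining, trimmed and counts are only illustrative and do not come from the original post.

```python
import pandas as pd

# Rebuild the example from the question
df = pd.DataFrame({'years': [2015, 2016, 2017, 2017, 2018, 2019, 2019, 2020]})
df['years'] = df['years'].astype('category')

subset_years = [2015, 2016, 2017, 2018]
subset_df = df[df['years'].isin(subset_years)]

# Filtering rows leaves the dtype's category list untouched,
# so 2019 and 2020 are still listed here
remaining = list(subset_df['years'].cat.categories)
print(remaining)

# Drop the categories that no longer occur in the data, then count
trimmed = subset_df['years'].cat.remove_unused_categories()
counts = trimmed.value_counts()
print(counts)
```

This returns a new Series and leaves subset_df unchanged; assign the result back to subset_df['years'] if the trimmed categories should persist.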
