I have this data frame called df, which has certain columns with object values. I want to turn those into categorical columns by using the following for loop:
df_numerized = df
for col in df_numerized.columns:
    if df_numerized[col].dtype == 'object':
        df_numerized[col] = df_numerized[col].astype('category')
        df_numerized[col] = df_numerized[col].cat.codes
Now when I call df_numerized I get what I want, but this has also changed the original data frame df in a similar way. How can I run my code without 'numerizing' the original data frame?
CodePudding user response:
Please start by using the copy method:
df_numerized = df.copy()
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.copy.html
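To see the difference between plain assignment and copy(), here is a minimal sketch. The sample frame and its column names (city, sales) are made up for illustration and are not from the question; the column is created with dtype=object explicitly so the dtype check from the question applies.

```python
import pandas as pd

# Hypothetical sample data; column names are illustrative only.
df = pd.DataFrame({
    "city": pd.Series(["Oslo", "Bergen", "Oslo"], dtype=object),
    "sales": [10, 20, 30],
})

alias = df                 # plain assignment: a second name for the same object
df_numerized = df.copy()   # independent copy (deep=True is the default)

# Same loop as in the question, but run against the copy.
for col in df_numerized.columns:
    if df_numerized[col].dtype == "object":
        df_numerized[col] = df_numerized[col].astype("category").cat.codes

print(alias is df)                    # the alias is the very same object
print(df["city"].tolist())            # original strings are untouched
print(df_numerized["city"].tolist())  # integer codes live only in the copy
```

With the copy in place, the loop only touches df_numerized, so df keeps its original string values.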
CodePudding user response:
df_numerized is simply another name for the same object as df, so any update made through df_numerized is also visible through df. To prevent this, create a copy of df, e.g. df_numerized = df.copy(). However, there is a more concise approach using factorize:
cols = df.select_dtypes('object')
df_numerized = df.assign(**cols.apply(lambda s: s.factorize(sort=True)[0]))
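As a quick check, the factorize approach can be run on a small frame and compared with the category codes from the question. This is only a sketch: the sample data and column names are hypothetical, the frame is assumed to have the default integer index, and the column is given dtype=object explicitly so select_dtypes('object') picks it up.

```python
import pandas as pd

# Hypothetical sample data with a default RangeIndex.
df = pd.DataFrame({
    "city": pd.Series(["Oslo", "Bergen", "Oslo"], dtype=object),
    "sales": [10, 20, 30],
})

# Factorize approach: assign() returns a new frame, df is left as it was.
cols = df.select_dtypes("object")
df_numerized = df.assign(**cols.apply(lambda s: s.factorize(sort=True)[0]))

# Codes produced by the loop in the question, for comparison.
expected = df["city"].astype("category").cat.codes

print(df_numerized["city"].tolist())
print(expected.tolist())
print(df["city"].tolist())
```

Because sort=True orders the unique values before numbering them, the codes here line up with the ones from astype('category').cat.codes, and no explicit copy() is needed since assign builds a new data frame.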
