I have a pandas DataFrame with several columns, some of which contain categorical entries. I am converting these entries to numerical values 'manually', one column at a time. For example,
df['gender'] = pd.Series(df['gender'].factorize()[0])
df['race'] = pd.Series(df['race'].factorize()[0])
df['city'] = pd.Series(df['city'].factorize()[0])
df['state'] = pd.Series(df['state'].factorize()[0])
With a large number of columns, repeating this line for every column is tedious and error-prone. Is there a way to do this with a loop over only those columns that contain categorical entries?
CodePudding user response:
Select the categorical columns into a variable cols, then use DataFrame.apply to factorize each of them:
cols = df.select_dtypes(['category']).columns
df[cols] = df[cols].apply(lambda x: x.factorize()[0])
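For reference, here is a minimal self-contained sketch of the same idea. The sample frame and its column values are made up for illustration, and it assumes a reasonably recent pandas version, where assigning a DataFrame to df[cols] replaces each column:

```python
import pandas as pd

# Made-up sample data; only 'gender' and 'city' are categorical.
df = pd.DataFrame({
    "gender": ["F", "M", "F"],
    "city": ["Oslo", "Rome", "Oslo"],
    "age": [31, 45, 28],
})
df = df.astype({"gender": "category", "city": "category"})

# Pick out the categorical columns and replace each with its integer codes.
cols = df.select_dtypes(["category"]).columns
df[cols] = df[cols].apply(lambda x: x.factorize()[0])

print(df)
```

The numeric 'age' column is not selected, so it is left unchanged.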
EDIT:
Your solution can be simplified, since the pd.Series wrapper is not needed:
for column in df.select_dtypes(['category']):
    df[column] = df[column].factorize()[0]
CodePudding user response:
I tried the following, which seems to work fine:
for column in df.select_dtypes(['category']):
    df[column] = pd.Series(df[column].factorize()[0])
where 'category' can be replaced with 'bool', 'object', etc., depending on the dtype of the columns to convert.
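One caveat with the pd.Series(...) wrapper: pandas aligns an assigned Series on index labels, so it only gives the expected result when df has the default 0..n-1 index. A small sketch with a made-up frame (the index values 10, 20, 30 are just an assumption to mimic a filtered DataFrame):

```python
import pandas as pd

# Hypothetical frame with a non-default index, e.g. after filtering rows.
df = pd.DataFrame({"state": ["NY", "TX", "NY"]}, index=[10, 20, 30])
df["state"] = df["state"].astype("category")

codes = df["state"].factorize()[0]  # plain NumPy array of integer codes

# Wrapping in pd.Series gives the codes a fresh 0..n-1 index, so the
# assignment aligns on labels 10, 20, 30 and finds no matching rows.
wrapped = df.copy()
wrapped["state"] = pd.Series(codes)

# Assigning the array directly is positional and keeps the codes.
direct = df.copy()
direct["state"] = codes

print(wrapped)
print(direct)
```

Assigning the array returned by factorize() directly avoids the alignment step altogether.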
