Map unique values in 2 columns to integers-CodePudding

I have a dataframe with 2 categorical columns (col1, col2).

  col1 col2
0    A   DE
1    A    B
2    B   BA
3    A    A
4    C    C

I want to map the unique string values to integers, for example (A:0, B:1, BA:2, C:3, DE:4)

  col1 col2  ideal1  ideal2
0    A   DE       0       4
1    A    B       0       1
2    B   BA       1       2
3    A    A       0       0
4    C    C       3       3

I am have tried to use factorize or category, but I am not getting the same unique value for both columns, as can be seen from ROW C:

Here is my code:

df = pd.DataFrame({'col1': ["A", "A", "B", "A" , "C"], 'col2': ["DE", "B", "BA", "A", "C"]}) 

#ideal map alphabetical: A:0, B:1, BA:2, C:3, DE:4

 #ideal result 
df["ideal1"] = [0, 0, 1,0, 3]
df["ideal2"] = [4,1,2,0,3]


 #trial #1 --> C value 2 & 3 : not matching
df["cat1"] = df['col1'].astype("category").cat.codes
df["cat2"] = df['col2'].astype("category").cat.codes

 #trial #2 --> C value 2 & 4 : not matching 
df["fac1"] = pd.factorize(df["col1"])[0]
df["fac2"] = pd.factorize(df["col2"])[0]


 
print (df)

OUT: 

  col1 col2  ideal1  ideal2  cat1  cat2  fac1  fac2
0    A   DE       0       4     0     4     0     0
1    A    B       0       1     0     1     0     1
2    B   BA       1       2     1     2     1     2
3    A    A       0       0     0     0     0     3
4    C    C       3       3     2     3     2     4

CodePudding user response：

To get the same categories across columns you need to reshape to a single dimension first. Then use factorize and restore the original shape.

Here is an example using stack/unstack:

x = df.stack()
x[:] = x.factorize()[0]
df2 = x.unstack()

Output:

  col1 col2
0    0    1
1    0    2
2    2    3
3    0    0
4    4    4

Joining to the original data:

x = df.stack()
x[:] = x.factorize()[0]
df2 = df.join(x.unstack().add_suffix('_cat'))

Output:

  col1 col2 col1_cat col2_cat
0    A   DE        0        1
1    A    B        0        2
2    B   BA        2        3
3    A    A        0        0
4    C    C        4        4

alphabetical order

If you really want alphabetical order, you could create you own custom mapping dictionary:

import numpy as np
cats = {k:v for v,k in enumerate(np.unique(df.values))}
df.replace(cats)

Output:

   col1  col2
0     0     4
1     0     1
2     1     2
3     0     0
4     3     3

CodePudding user response：

If you have a preference about which codes map to which integers, I would suggest using map, which accepts a dictionary specifying how the mapping should be done.

Example:

import pandas as pd
df = pd.DataFrame({'col1': ["A", "A", "B", "A" , "C"], 'col2': ["DE", "B", "BA", "A", "C"]}) 

mapping_dict = {'A':0, 'B':1, 'BA':2, 'C':3, 'DE':4}
df['ideal1'] = df['col1'].map(mapping_dict)
df['ideal2'] = df['col2'].map(mapping_dict)
print(df)

Output:

  col1 col2  ideal1  ideal2
0    A   DE       0       4
1    A    B       0       1
2    B   BA       1       2
3    A    A       0       0
4    C    C       3       3

The advantage of @mozway's solution is that you don't need to specify the mapping.