How to convert categorical data into indices with NaN values present in Python?

Context

I have written a function that converts categorical data into its unique indices. It works for every value except NaN, apparently because an equality comparison against NaN never matches (NaN is not equal to anything, including itself). This causes the two problems marked in the code below.
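The failing comparison can be reproduced in isolation. The following is a minimal sketch, assuming a series shaped like the one in the question; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

series = pd.Series(["male", "female", np.nan, "female"])

# NaN is not equal to anything, including itself.
nan_equals_itself = np.nan == np.nan

# An element-wise == comparison therefore never selects the NaN row ...
matches_by_eq = int((series == np.nan).sum())

# ... whereas isna() does find it.
matches_by_isna = int(series.isna().sum())

print(nan_equals_itself, matches_by_eq, matches_by_isna)
```

Any mask built with `series == value` will therefore skip the missing entries, which is consistent with both problems described above.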

Code

   col1
0  male
1  female
2  NaN
3  female

import pandas

def categorial(series: pandas.Series) -> pandas.Series:
    series = series.copy()

    for index, value in enumerate(series.unique()):
        # Problem 1: the share printed for NaN is always 0.0 %, even though NaN is present in the series.
        print(index, value, round(series[series == value].count() / len(series) * 100, 2), '%')

    for index, value in enumerate(series.unique()):
        # Problem 2: every unique value is replaced by its index, except NaN.
        series[series == value] = index

    return series.astype(pandas.Int64Dtype())

Question

How can I solve the two problems seen in the code above?

CodePudding user response:

You can combine fillna, astype and factorize, so that NaN is turned into an ordinary string and receives its own code:

df['col1'] = df['col1'].fillna('nan').astype(str).factorize()[0]

Sample:

import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': ['a', 'b', np.nan, 'c']})
print(df)
  col1
0    a
1    b
2  NaN
3    c

df['col1'] = df['col1'].fillna('nan').astype(str).factorize()[0]
print(df)
   col1
0     0
1     1
2     2
3     3
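Applied to the data from the question, the same idea might look like the sketch below. It also covers the percentage output (Problem 1) with value_counts, which is an addition not taken from the answer above; the variable names are illustrative assumptions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": ["male", "female", np.nan, "female"]})

# Problem 2: NaN becomes the string 'nan' and gets its own code.
codes = df["col1"].fillna("nan").astype(str).factorize()[0]
print(codes)

# Problem 1: dropna=False keeps NaN as its own row,
# normalize=True returns fractions instead of counts.
shares = df["col1"].value_counts(dropna=False, normalize=True).mul(100).round(2)
print(shares)

# Select the NaN row with isna() on the index rather than by label.
nan_share = float(shares[shares.index.isna()].iloc[0])
print(nan_share, "%")
```

Note that this approach cannot distinguish a real missing value from a genuine category that happens to be the string 'nan'.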

CodePudding user response:

How should missing values (NaNs) be encoded?

pandas itself encodes them as -1, both in factorize and in the codes of a categorical column:

print(pd.factorize(df['col1'])[0])
[ 0  1 -1  1]

print(df['col1'].astype('category').cat.codes)
0    1
1    0
2   -1
3    0
dtype: int8
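If the goal is to keep the nullable integer return type of the original function, the loop could be replaced by factorize. This is only a sketch: the helper name categorical_codes is hypothetical, and turning -1 into <NA> is an assumption about the desired result.

```python
import numpy as np
import pandas as pd


def categorical_codes(series: pd.Series) -> pd.Series:
    """Hypothetical helper: codes in order of appearance, missing values as <NA>."""
    # factorize assigns -1 to missing values by default.
    codes, _uniques = pd.factorize(series)
    result = pd.Series(codes, index=series.index, name=series.name, dtype="Int64")
    # Assumption: missing values should stay missing instead of being -1.
    result[result == -1] = pd.NA
    return result


df = pd.DataFrame({"col1": ["male", "female", np.nan, "female"]})
encoded = categorical_codes(df["col1"])
print(encoded)
```

Leaving out the last assignment keeps the -1 sentinel instead, which matches the pandas output shown above.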