Context
I have written a function that converts categorical data into the indices of its unique values. It works for every value except NaN.
The comparison with NaN does not seem to work, which causes the two problems marked in the code comments below.
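The comparison fails because NaN, under IEEE 754 rules, is not equal to any value, itself included, so an equality test against NaN is False on every row. A minimal sketch on the sample column (the variable names here are illustrative and not part of the original code):

```python
import numpy as np
import pandas as pd

s = pd.Series(["male", "female", np.nan, "female"], name="col1")

# NaN never compares equal to anything, including itself.
nan_equals_itself = np.nan == np.nan

# An equality mask therefore never selects the NaN row ...
matches_by_eq = int((s == np.nan).sum())

# ... while isna() is the test that does find it.
matches_by_isna = int(s.isna().sum())

print(nan_equals_itself, matches_by_eq, matches_by_isna)
```

This is why both loops in the function skip the NaN row: the mask built with `==` is empty for that value.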
Code
     col1
0    male
1  female
2     NaN
3  female
import pandas

def categorial(series: pandas.Series) -> pandas.Series:
    series = series.copy()
    for index, value in enumerate(series.unique()):
        # Problem 1: The output for the value NaN is always 0.0 %, even though NaN is present in the given series.
        print(index, value, round(series[series == value].count() / len(series) * 100, 2), '%')
    for index, value in enumerate(series.unique()):
        # Problem 2: Every unique value is converted to its index except NaN.
        series[series == value] = index
    return series.astype(pandas.Int64Dtype())
Question
- How can I solve the two problems seen in the code above?
Answer
You can combine fillna, astype and factorize so that NaN gets its own code:
df['col1'] = df['col1'].fillna('nan').astype(str).factorize()[0]
Sample:
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': ['a', 'b', np.nan, 'c']})
print(df)
  col1
0    a
1    b
2  NaN
3    c
df['col1'] = df['col1'].fillna('nan').astype(str).factorize()[0]
print(df)
   col1
0     0
1     1
2     2
3     3
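The snippet above addresses Problem 2. For Problem 1 (the percentage of each value, NaN included), one possible approach is `value_counts` with `dropna=False`. This is a sketch on the question's sample data, not the original author's method, and the names `s`, `shares` and `nan_share` are made up for the example:

```python
import numpy as np
import pandas as pd

s = pd.Series(["male", "female", np.nan, "female"], name="col1")

# dropna=False keeps NaN as its own row; normalize=True returns fractions
# of the total, which are scaled to percentages here.
shares = (s.value_counts(normalize=True, dropna=False) * 100).round(2)
print(shares)

# The NaN row cannot be looked up with ==, so select it through isna().
nan_share = float(shares[shares.index.isna()].iloc[0])
print(nan_share)
```

Because the counting is done by pandas rather than by an equality mask, the NaN row is no longer reported as 0 %.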
Answer
How should missing values (NaN) be encoded?
By default, pandas encodes them as -1:
# df is the DataFrame from the question (male, female, NaN, female)
print(pd.factorize(categorial(df['col1']))[0])
[ 0 1 -1 1]
print(df['col1'].astype('category').cat.codes)
0    1
1    0
2   -1
3    0
dtype: int8
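If -1 is not wanted, the sentinel can be remapped after factorizing. The following is a sketch only: the function name `encode_with_nan_code` is hypothetical, and giving NaN the next free code is an assumed convention rather than anything pandas prescribes:

```python
import numpy as np
import pandas as pd

def encode_with_nan_code(series: pd.Series) -> pd.Series:
    """Encode values by order of first appearance; NaN gets its own code."""
    # pd.factorize marks missing values with -1 by default.
    codes, uniques = pd.factorize(series)
    codes = pd.Series(codes, index=series.index, name=series.name)
    # Assumed convention: NaN receives the next unused code.
    codes = codes.mask(codes == -1, len(uniques))
    return codes.astype("Int64")

s = pd.Series(["male", "female", np.nan, "female"], name="col1")
encoded = encode_with_nan_code(s)
print(encoded.tolist())
```

Keeping -1 (or converting it to a missing value) is equally valid; which choice fits depends on how the codes are consumed downstream.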
