python pandas apply not accepting numpy.float64 args

I am having trouble passing numpy.float64 values as arguments to pandas.Series.apply(). Is there a way to force the pandas versions of .mean() and .std() so that apply() accepts the results?

The Code

def normalization(val_to_norm, col_mean, col_sd):
    return (val_to_norm - col_mean) / col_sd

voting_df['pop_estimate'].info()

pop_mean, pop_sd = voting_df['pop_estimate'].mean(), voting_df['pop_estimate'].std()

voting_df['pop_estimate'] = voting_df['pop_estimate'].apply(normalization, pop_mean, pop_sd)

Output

The key line is at the bottom.

<class 'pandas.core.series.Series'>
Int64Index: 3145 entries, 0 to 3144
Series name: pop_estimate
Non-Null Count  Dtype  
--------------  -----  
3145 non-null   float64
dtypes: float64(1)
memory usage: 49.1 KB

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In [46], line 7
      4 voting_df['pop_estimate'].info()
      6 pop_mean, pop_sd = voting_df['pop_estimate'].mean(), voting_df['pop_estimate'].std()
----> 7 voting_df['pop_estimate'] = voting_df['pop_estimate'].apply(normalization, pop_mean, pop_sd)

File c:\Users\chris\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\series.py:4774, in Series.apply(self, func, convert_dtype, args, **kwargs)
   4664 def apply(
   4665     self,
   4666     func: AggFuncType,
   (...)
   4669     **kwargs,
   4670 ) -> DataFrame | Series:
   4671     """
   4672     Invoke function on values of Series.
   4673 
   (...)
   4772     dtype: float64
   4773     """
-> 4774     return SeriesApply(self, func, convert_dtype, args, kwargs).apply()

File c:\Users\chris\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\apply.py:1100, in SeriesApply.apply(self)
   1097     return self.apply_str()
   1099 # self.f is Callable
-> 1100 return self.apply_standard()

File c:\Users\chris\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\apply.py:1151, in SeriesApply.apply_standard(self)
   1149     else:
   1150         values = obj.astype(object)._values
-> 1151         mapped = lib.map_infer(
   1152             values,
   1153             f,
   1154             convert=self.convert_dtype,
   1155         )
   1157 if len(mapped) and isinstance(mapped[0], ABCSeries):
   1158     # GH#43986 Need to do list(mapped) in order to get treated as nested
   1159     #  See also GH#25959 regarding EA support
   1160     return obj._constructor_expanddim(list(mapped), index=obj.index)

File c:\Users\chris\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\_libs\lib.pyx:2919, in pandas._libs.lib.map_infer()

File c:\Users\chris\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\apply.py:139, in Apply.__init__.<locals>.f(x)
    138 def f(x):
--> 139     return func(x, *args, **kwargs)

TypeError: Value after * must be an iterable, not numpy.float64

CodePudding user response:

To pass additional arguments to a function called through pd.Series.apply, supply them either as keyword arguments or as a tuple via the args keyword argument.

From the docs:

Series.apply(func, convert_dtype=True, args=(), **kwargs)

Invoke function on values of Series.

Can be ufunc (a NumPy function that applies to the entire Series) or a Python function that only works on single values.

Parameters

func: function
Python function or NumPy ufunc to apply.

convert_dtype: bool, default True
Try to find better dtype for elementwise function results. If False, leave as dtype=object. Note that the dtype is always preserved for some extension array dtypes, such as Categorical.

args: tuple
Positional arguments passed to func after the series value.

**kwargs
Additional keyword arguments passed to func.

So to call this with positional arguments:

voting_df['pop_estimate'].apply(normalization, args=(pop_mean, pop_sd))

Alternatively, with keyword arguments:

voting_df['pop_estimate'].apply(normalization, col_mean=pop_mean, col_sd=pop_sd)

CodePudding user response:

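This has nothing to do with the data type. You are passing pop_mean and pop_sd as positional arguments, so they are consumed by apply itself instead of being forwarded to normalization. As the signature in your traceback shows (Series.apply(self, func, convert_dtype, args, **kwargs)), pop_mean binds to convert_dtype and pop_sd binds to args. apply then evaluates func(x, *args, **kwargs), and unpacking a numpy.float64 with * raises the TypeError.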
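The binding problem can be reproduced without pandas. The sketch below uses a hypothetical stand-in, apply_like, that only copies the parameter order shown in the traceback (func, convert_dtype, args, **kwargs). It is not the pandas implementation, and the sample values are made up for illustration.

```python
def normalization(val_to_norm, col_mean, col_sd):
    return (val_to_norm - col_mean) / col_sd


def apply_like(values, func, convert_dtype=True, args=(), **kwargs):
    """Hypothetical stand-in that mirrors the parameter order of Series.apply."""
    # Extra positional arguments from the caller land in convert_dtype and
    # args, in that order, before func is ever called.
    return [func(x, *args, **kwargs) for x in values]


values = [1.0, 2.0, 3.0]  # made-up sample values
mean, sd = 2.0, 1.0

try:
    # mean binds to convert_dtype and sd binds to args, so *args fails.
    apply_like(values, normalization, mean, sd)
    error = None
except TypeError as exc:
    error = exc

# Passing a tuple through args reaches normalization as intended.
result = apply_like(values, normalization, args=(mean, sd))
```

Here error holds a TypeError of the same kind as in the question, while result contains the normalized values.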

To pass them to normalization, use args or keyword arguments:

# sample data setup
import pandas as pd

voting_df = pd.DataFrame({"pop_estimate": range(3144)})

def normalization(val_to_norm, col_mean, col_sd):
    return (val_to_norm - col_mean) / col_sd

pop_mean, pop_sd = voting_df['pop_estimate'].mean(), voting_df['pop_estimate'].std()

Method 1: Use args:

method1 = voting_df['pop_estimate'].apply(normalization, args=(pop_mean, pop_sd))

Method 2: Use keyword arguments:

method2 = voting_df['pop_estimate'].apply(normalization,  
                                          col_mean=pop_mean, 
                                          col_sd=pop_sd)

Besides, you don't need apply in this case. normalization uses only arithmetic, which pandas applies element-wise to a whole Series, so you can call it directly:

method3 = normalization(voting_df["pop_estimate"], pop_mean, pop_sd)

Or, even better, use an established library function such as scipy.stats.zscore:

from scipy.stats import zscore

method4 = zscore(voting_df["pop_estimate"], ddof=1)
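The ddof=1 argument matters because pandas' Series.std() defaults to the sample standard deviation (ddof=1), whereas scipy.stats.zscore and numpy.std default to ddof=0. A minimal sketch of the difference, using made-up values and only pandas and NumPy:

```python
import numpy as np
import pandas as pd

# Made-up sample values, used only for illustration.
s = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

sample_sd = s.std()                          # pandas default: ddof=1
population_sd = np.std(s.to_numpy())         # NumPy default: ddof=0
matching_sd = np.std(s.to_numpy(), ddof=1)   # NumPy told to match pandas

# The two conventions give different z-scores for the same data.
z_sample = (s - s.mean()) / sample_sd
z_population = (s - s.mean()) / population_sd
```

Without ddof=1, the library result would be scaled differently from the values produced with pop_sd.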

Validation:

import numpy as np

np.all([
    np.array_equal(method1, method2),
    np.array_equal(method2, method3),
    np.array_equal(method3, method4)    
])
# True
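np.array_equal requires bit-identical floats, which can be fragile when the mean and standard deviation are computed by different libraries. A tolerance-based check is one possible alternative. The sketch below assumes a small made-up column and a NumPy-only recomputation standing in for the library call:

```python
import numpy as np
import pandas as pd


def normalization(val_to_norm, col_mean, col_sd):
    return (val_to_norm - col_mean) / col_sd


# Made-up sample data, used only for illustration.
voting_df = pd.DataFrame({"pop_estimate": np.arange(10, dtype="float64")})
col = voting_df["pop_estimate"]
pop_mean, pop_sd = col.mean(), col.std()

via_apply = col.apply(normalization, args=(pop_mean, pop_sd))
vectorized = normalization(col, pop_mean, pop_sd)

# NumPy-only recomputation with the same ddof=1 convention as pandas.
arr = col.to_numpy()
via_numpy = (arr - arr.mean()) / arr.std(ddof=1)

# Compare within floating-point tolerance instead of exact equality.
all_close = np.allclose(via_apply, vectorized) and np.allclose(vectorized, via_numpy)
```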