I am working on a pandas DataFrame containing numerical columns as well as string columns (dtype is object), and would like to remove the rows containing outliers with respect to the distributions within a column. In other words, detect the outliers in each column and drop the corresponding rows.
I have found two solutions to this, but neither takes into account that my df does not contain only numbers, hence they both result in errors (when encountering strings, I assume).
from scipy import stats
df[(np.abs(stats.zscore(df)) < 3).all(axis=1)]
returns TypeError: unsupported operand type(s) for /: 'str' and 'int'.
This is why I guess the error arises from the df having mixed dtypes.
for col in df.columns:
lower = df[col].quantile(0.05)
upper = df[col].quantile(0.95)
df = df[col].clip(lower=lower, upper=upper)
returns KeyError with this traceback:
File omissis, in Class.remove_outliers(self, df)
423 def remove_outliers(self, df):
424 for col in df.columns:
--> 425 lower = df[col].quantile(0.05)
426 upper = df[col].quantile(0.95)
427 df = df[col].clip(lower=lower, upper=upper)
File omissis, in Series.__getitem__(self, key)
955 return self._values[key]
957 elif key_is_scalar:
--> 958 return self._get_value(key)
960 if is_hashable(key):
961 # Otherwise index.get_value will raise InvalidIndexError
962 try:
963 # For labels that don't resolve as scalars like tuples and frozensets
File omissis, in Series._get_value(self, label, takeable)
1066 return self._values[label]
1068 # Similar to Index.get_value, but we do not fall back to positional
-> 1069 loc = self.index.get_loc(label)
1070 return self.index._get_values_for_loc(self, loc, label)
File omissis, in RangeIndex.get_loc(self, key, method, tolerance)
387 raise KeyError(key) from err
388 self._check_indexing_error(key)
--> 389 raise KeyError(key)
390 return super().get_loc(key, method=method, tolerance=tolerance)
KeyError: 'colname'
How would you solve this?
EDIT: the idea is to skip the non numeric columns, to ignore them.
CodePudding user response:
I would break the problem into stages:
Firstly, identify (numeric) columns you want to do the outlier removal. Reference
newdf = df.select_dtypes(include=np.number)
Now perform whatever filtering/outlier removal you want on the rows of newdf. Afterwards, newdf should contain only rows you wish to retain.
Then keep only the rows of df those index are in newdf. Reference
df = df[df.index.isin(newdf.index)]
CodePudding user response:
@Ipounng 's solution in copy-paste ready code:
def remove_outliers(df):
newdf = df.select_dtypes(include=np.number)
newdf = newdf[(np.abs(stats.zscore(newdf)) < 3).all(axis=1)]
df = df[df.index.isin(newdf.index)]
return df
CodePudding user response:
In addition to @lpounng solution. For categorical variable you cannot use zscore, but you could consider low valued class as outliers. You can do it setting a threshold for the value counts.
Example with a toy dataset:
import random
import pandas as pd
colors = []
for i in range(100):
colors.append(random.choices(['yellow','white', 'red'], weights = [10, 1, 2])[0])
df = pd.DataFrame(colors, columns=['colors'])
I randomly generate a columns with yellow, white and red categorical values with a weights of 10,1,2.
With value_count() pandas method you can compute the counts of unique category in the column
df['colors'].value_counts()
>>> yellow 68
red 20
white 12
Name: colors, dtype: int64
Now you can set a threshold and remove the categories that are sparsely populated such us white.
