In the dataset I'm working on (the Adult dataset), missing values are marked with the string "?", and I want to discard the rows that contain them.
The documentation for df.dropna() lists no argument for passing a custom value to be treated as null/missing.
I know I can solve the problem with something like:
df_str = df.select_dtypes(['object'])  # get the columns containing strings
for col in df_str.columns:
    df = df[df[col] != '?']
but I was wondering whether there is a standard way of achieving this with the pandas API that offers more flexibility while also being faster.
CodePudding user response:
You can use any. df_str.eq('?') marks every cell equal to '?', any(axis=1) returns True for each row with at least one match, and ~ inverts that mask so only the rows without '?' are kept:
df = df[~df_str.eq('?').any(axis=1)]
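A minimal, self-contained sketch of the mask approach above. The toy frame and its column names are made up for illustration and only stand in for the Adult data; df_str is built here with select_dtypes(exclude="number"), which is an assumption swapped in for the question's select_dtypes(['object']) so the example does not depend on how a given pandas version labels string columns.

```python
import pandas as pd

# Toy frame standing in for the Adult data (illustrative columns only).
df = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "workclass": ["State-gov", "?", "Private", "Private"],
    "occupation": ["Adm-clerical", "Exec-managerial", "?", "Handlers-cleaners"],
})

# Only the non-numeric columns can contain the "?" marker.
df_str = df.select_dtypes(exclude="number")

# True for every row that has "?" in at least one string column.
mask = df_str.eq("?").any(axis=1)

# Keep the rows where the mask is False.
clean = df[~mask]
```

Keeping the mask in its own variable also makes it easy to inspect how many rows would be discarded (mask.sum()) before actually filtering.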
CodePudding user response:
You could replace it with NaN and dropna:
df = df.replace('?', float('nan')).dropna()
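One thing to keep in mind with this approach: once "?" has become NaN, dropna() cannot tell it apart from NaN values that were already in the data. A small sketch on a made-up frame (the column names are hypothetical), using the subset parameter of dropna to limit the check to chosen columns:

```python
import numpy as np
import pandas as pd

# Toy frame: one "?" marker plus one NaN that has nothing to do with "?".
df = pd.DataFrame({
    "workclass": ["State-gov", "?", "Private"],
    "hours": [40.0, 13.0, np.nan],
})

# Turn the "?" marker into a real missing value.
replaced = df.replace("?", np.nan)

# Drops every row with any missing value, including the unrelated NaN.
dropped_all = replaced.dropna()

# Only considers missing values in the listed columns.
dropped_subset = replaced.dropna(subset=["workclass"])
```

Here dropped_all keeps only the first row, while dropped_subset also keeps the row whose only missing value is in hours.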
CodePudding user response:
import numpy as np
df.replace('?', np.nan, inplace=True)
followed by df = df.dropna()
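A short sketch of the two steps on a made-up single-column frame. The in-place replace is written as a standalone statement rather than chained, and the result of dropna is assigned back because dropna returns a new frame by default:

```python
import numpy as np
import pandas as pd

# Toy frame with a single "?" marker (illustrative column name).
df = pd.DataFrame({"workclass": ["State-gov", "?", "Private"]})

# Step 1: mutate df in place so that "?" becomes a missing value.
df.replace("?", np.nan, inplace=True)

# Step 2: dropna does not modify df unless asked to, so keep its result.
df = df.dropna()
```

After these two statements df holds only the rows that never contained "?"; the original index labels are preserved, so reset_index(drop=True) can be added if a contiguous index is wanted.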
