I know that there are many ways to delete rows containing a specific value in a column in python, but I'm wondering if there is a more efficient way to do this by checking all columns in a dataset at once and deleting all rows that contain a specific value WITHOUT turning it into NaN and dropping all of them. To clarify, I don't want to lose all columns with strings/NaN I just want to lose rows that have a specific value.
For example, I'm looking to delete all rows with participants that contain an answer "refused" in any column. So if my table looked like this:
| Subject | Race | Gender | Weight |
|---|---|---|---|
| 1 | black | female | 123 |
| 2 | white | refused | 145 |
| 3 | white | male | 165 |
| 4 | asian | male | refused |
| 5 | refused | male | 128 |
| 6 | white | male | nan |
| 7 | asian | male | refused |
| 8 | black | male | nan |
I would want to implement a statement that would filter it to keep only subjects that didn't have any responses with a string containing "refused":
| Subject | Race | Gender | Weight |
|---|---|---|---|
| 1 | black | female | 123 |
| 3 | white | male | 165 |
| 6 | white | male | nan |
| 8 | black | male | nan |
Does anyone know how to filter this way across an entire dataset?
CodePudding user response:
You can replace all refused with NaN using df.replace and then use df.dropna to drop the rows with missing values.
df = df.replace("refused", np.nan)
And then drop the rows with NaN.
df = df.dropna()
CodePudding user response:
You could try this, don't really know if it is good performance-wise though:
df = df.replace({"refused": np.nan})
Then just drop rows with nan in it:
df = df.dropna()
CodePudding user response:
You can use isin with any.
df = df[~df.isin(['refused']).any(axis=1)]
CodePudding user response:
Another method with apply-lambda:
df = df.loc[~df.apply(lambda row : any('refused' in str(cell) for cell in row) ,axis=1)]
CodePudding user response:
df = df[(df.Gender != 'refused') & (df.Race != 'refused').... ]
or alternatively
filter = reduce(lambda column1, column2: (df[column1] != 'refused') & (df[column2] != 'refused'), df.columns)
df = df[filter]
