I have a Pandas dataframe that looks like this:
| Test | LSL | USL | Day0 | Day1 | Day2 | Day3 |
|---|---|---|---|---|---|---|
| 1 | 0 | 100 | 10 | 10 | 10 | 200 |
| 2 | 0 | 100 | 10 | 10 | 10 | 10 |
| 3 | 10 | 100 | 20 | 5 | 5 | 20 |
| 4 | 10 | 100 | 20 | 20 | 20 | 20 |
I want to keep all rows where at least one value in columns Day0-Day3 is less than LSL or greater than USL. In my example the dataframe I need is:
| Test | LSL | USL | Day0 | Day1 | Day2 | Day3 |
|---|---|---|---|---|---|---|
| 1 | 0 | 100 | 10 | 10 | 10 | 200 |
| 3 | 10 | 100 | 20 | 5 | 5 | 20 |
The number of columns and the names of columns for Day0-Day3 can change. For example, I could have:
| Test | LSL | USL | Week0 | Week1 |
|---|---|---|---|---|
| 1 | 0 | 100 | 10 | 200 |
| 2 | 0 | 100 | 10 | 10 |
| 3 | 10 | 100 | 20 | 5 |
| 4 | 10 | 100 | 20 | 20 |
I have not been able to figure out how to do this, any suggestions would be greatly appreciated.
CodePudding user response:
Find max, min values across columns and compare it with "LSL" and "USL" values to create a mask to filter df with:
max_vals = df.drop(columns=['Test','LSL','USL']).max(axis=1)
min_vals = df.drop(columns=['Test','LSL','USL']).min(axis=1)
mask = (df['LSL'] > min_vals) | (df['USL'] < max_vals)
out = df[mask]
Output:
Test LSL USL Day0 Day1 Day2 Day3
0 1 0 100 10 10 10 200
2 3 10 100 20 5 5 20
For the second example:
Test LSL USL Week0 Week1
0 1 0 100 10 200
2 3 10 100 20 5
CodePudding user response:
Something like this should do the trick:
df_subset = df.iloc[:, 3:]
m = (df_subset.lt(df["LSL"], axis=0) | df_subset.gt(df["USL"], axis=0)).any(axis=1)
filtered_df = df[m]
>>> filtered_df
Test LSL USL Week0 Week1
0 1 0 100 10 200
2 3 10 100 20 5
It's always easier to tackle these kinds of problems by trying to create a mask (a series of True/False values that define whether a row should be part of the result) and then apply that mask to your dataframe.
You can use DataFrame.gt and DataFrame.lt to perform element-wise comparisons, combine the results on each column using the or operator (|), and then combine the columns using DataFrame.any.
CodePudding user response:
You can use multiple conditions and boolean indexing.
use_cols = df[[c for c in df.columns if c[-1].isdigit()]]
df.loc[(use_cols.lt(df['LSL'].values) | (use_cols.gt(df['USL'].values))).any(1)]
Test LSL USL Day0 Day1 Day2 Day3
0 1 0 100 10 10 10 200
2 3 10 100 20 5 5 20
We are basically evaluating two conditions:
- is a value in
use_colscolumns less than LSL? (lt) - is a value in
use_colscolumns greater than LSL? (gt)
To use multiple conditions, using or | per se, you use brackets () surrounding each condition, and by chaining any(axis=1) checks whether at least 1 True is returned.
To allow for different column names, one idea would to select the columns that end with a digit (i.e. Week0, Day0, Month1, Year2 etc). However keep in mind that if any other column ends with a digit it will be included here. Or, equivalently, if the columns are always at a specific position you can use iloc. This requires some domain knowledge that you have.
