Filter dataframe based on values in multiple columns - column names can change-CodePudding

I have a Pandas dataframe that looks like this:

Test	LSL	USL	Day0	Day1	Day2	Day3
1	0	100	10	10	10	200
2	0	100	10	10	10	10
3	10	100	20	5	5	20
4	10	100	20	20	20	20

I want to keep all rows where at least one value in columns Day0-Day3 is less than LSL or greater than USL. In my example the dataframe I need is:

Test	LSL	USL	Day0	Day1	Day2	Day3
1	0	100	10	10	10	200
3	10	100	20	5	5	20

The number of columns and the names of columns for Day0-Day3 can change. For example, I could have:

Test	LSL	USL	Week0	Week1
1	0	100	10	200
2	0	100	10	10
3	10	100	20	5
4	10	100	20	20

I have not been able to figure out how to do this, any suggestions would be greatly appreciated.

CodePudding user response：

Find max, min values across columns and compare it with "LSL" and "USL" values to create a mask to filter df with:

max_vals = df.drop(columns=['Test','LSL','USL']).max(axis=1)
min_vals = df.drop(columns=['Test','LSL','USL']).min(axis=1)
mask = (df['LSL'] > min_vals) | (df['USL'] < max_vals)
out = df[mask]

Output:

Test  LSL  USL  Day0  Day1  Day2  Day3
0     1    0  100    10    10    10   200
2     3   10  100    20     5     5    20

For the second example:

   Test  LSL  USL  Week0  Week1
0     1    0  100     10    200
2     3   10  100     20      5

CodePudding user response：

Something like this should do the trick:

df_subset = df.iloc[:, 3:]
m = (df_subset.lt(df["LSL"], axis=0) | df_subset.gt(df["USL"], axis=0)).any(axis=1)

filtered_df = df[m]

>>> filtered_df 
   Test  LSL  USL  Week0  Week1
0     1    0  100     10    200
2     3   10  100     20      5

It's always easier to tackle these kinds of problems by trying to create a mask (a series of True/False values that define whether a row should be part of the result) and then apply that mask to your dataframe.

You can use DataFrame.gt and DataFrame.lt to perform element-wise comparisons, combine the results on each column using the or operator (|), and then combine the columns using DataFrame.any.

CodePudding user response：

You can use multiple conditions and boolean indexing.

use_cols = df[[c for c in df.columns if c[-1].isdigit()]]
df.loc[(use_cols.lt(df['LSL'].values) | (use_cols.gt(df['USL'].values))).any(1)]

   Test  LSL  USL  Day0  Day1  Day2  Day3
0     1    0  100    10    10    10   200
2     3   10  100    20     5     5    20

We are basically evaluating two conditions:

is a value in use_cols columns less than LSL? (lt)
is a value in use_cols columns greater than LSL? (gt)

To use multiple conditions, using or | per se, you use brackets () surrounding each condition, and by chaining any(axis=1) checks whether at least 1 True is returned.

To allow for different column names, one idea would to select the columns that end with a digit (i.e. Week0, Day0, Month1, Year2 etc). However keep in mind that if any other column ends with a digit it will be included here. Or, equivalently, if the columns are always at a specific position you can use iloc. This requires some domain knowledge that you have.