Home > database >  Filter dataframe based on values in multiple columns - column names can change
Filter dataframe based on values in multiple columns - column names can change

Time:01-22

I have a Pandas dataframe that looks like this:

Test LSL USL Day0 Day1 Day2 Day3
1 0 100 10 10 10 200
2 0 100 10 10 10 10
3 10 100 20 5 5 20
4 10 100 20 20 20 20

I want to keep all rows where at least one value in columns Day0-Day3 is less than LSL or greater than USL. In my example the dataframe I need is:

Test LSL USL Day0 Day1 Day2 Day3
1 0 100 10 10 10 200
3 10 100 20 5 5 20

The number of columns and the names of columns for Day0-Day3 can change. For example, I could have:

Test LSL USL Week0 Week1
1 0 100 10 200
2 0 100 10 10
3 10 100 20 5
4 10 100 20 20

I have not been able to figure out how to do this, any suggestions would be greatly appreciated.

CodePudding user response:

Find max, min values across columns and compare it with "LSL" and "USL" values to create a mask to filter df with:

max_vals = df.drop(columns=['Test','LSL','USL']).max(axis=1)
min_vals = df.drop(columns=['Test','LSL','USL']).min(axis=1)
mask = (df['LSL'] > min_vals) | (df['USL'] < max_vals)
out = df[mask]

Output:

Test  LSL  USL  Day0  Day1  Day2  Day3
0     1    0  100    10    10    10   200
2     3   10  100    20     5     5    20

For the second example:

   Test  LSL  USL  Week0  Week1
0     1    0  100     10    200
2     3   10  100     20      5

CodePudding user response:

Something like this should do the trick:

df_subset = df.iloc[:, 3:]
m = (df_subset.lt(df["LSL"], axis=0) | df_subset.gt(df["USL"], axis=0)).any(axis=1)

filtered_df = df[m]
>>> filtered_df 
   Test  LSL  USL  Week0  Week1
0     1    0  100     10    200
2     3   10  100     20      5

It's always easier to tackle these kinds of problems by trying to create a mask (a series of True/False values that define whether a row should be part of the result) and then apply that mask to your dataframe.

You can use DataFrame.gt and DataFrame.lt to perform element-wise comparisons, combine the results on each column using the or operator (|), and then combine the columns using DataFrame.any.

CodePudding user response:

You can use multiple conditions and boolean indexing.

use_cols = df[[c for c in df.columns if c[-1].isdigit()]]
df.loc[(use_cols.lt(df['LSL'].values) | (use_cols.gt(df['USL'].values))).any(1)]

   Test  LSL  USL  Day0  Day1  Day2  Day3
0     1    0  100    10    10    10   200
2     3   10  100    20     5     5    20

We are basically evaluating two conditions:

  1. is a value in use_cols columns less than LSL? (lt)
  2. is a value in use_cols columns greater than LSL? (gt)

To use multiple conditions, using or | per se, you use brackets () surrounding each condition, and by chaining any(axis=1) checks whether at least 1 True is returned.

To allow for different column names, one idea would to select the columns that end with a digit (i.e. Week0, Day0, Month1, Year2 etc). However keep in mind that if any other column ends with a digit it will be included here. Or, equivalently, if the columns are always at a specific position you can use iloc. This requires some domain knowledge that you have.

  •  Tags:  
  • Related