Mean of the values in the interquartile range in Python

above25percentile=df.loc[df["order_amount"]>np.percentile(df["order_amount"],25)]
below75percentile=df.loc[df["order_amount"]<np.percentile(df["order_amount"],75)]
interquartile=above25percentile & below75percentile
print(interquartile.mean())

I can't seem to get the mean here. Any thoughts?

CodePudding user response：

You are trying to build interquartile by combining two pieces with the & operator, but above25percentile and below75percentile are not boolean masks: df.loc[...] returns the selected rows of df, so each one is a DataFrame of values. The two are likely to be similar in size, but & works element-wise on the values and will not give you the intersection of their rows. And even if they were boolean masks, calling mean() on the combined mask would average a bunch of zeros and ones, which gives the fraction of the data that falls inside the IQR (roughly 0.5), not the mean of the values.
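To see the difference between averaging a mask and averaging the values it selects, here is a minimal sketch. The ten-row frame is made up for illustration and is not the asker's data.

```python
import numpy as np
import pandas as pd

# Toy data (an assumption for illustration only).
df = pd.DataFrame({"order_amount": range(10)})

# Both quartiles in one call.
q25, q75 = np.percentile(df["order_amount"], [25, 75])

# A boolean mask: True where the value lies strictly inside the IQR.
mask = (df["order_amount"] > q25) & (df["order_amount"] < q75)

# Mean of the mask itself: the fraction of rows inside the IQR.
mask_mean = mask.mean()

# Mean of the values selected by the mask: what the question asks for.
iqr_mean = df.loc[mask, "order_amount"].mean()

print(mask_mean, iqr_mean)
```

On this toy data, four of the ten rows fall strictly between the quartiles, so the first number is a fraction and only the second is an average of order amounts.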

First, compute interquartile as a proper mask. Pandas has its own quantile method, which, like np.percentile and siblings, accepts multiple percentiles simultaneously. You can combine that with between to get your mask more efficiently:

interquartile = df['order_amount'].between(*df['order_amount'].quantile([0.25, 0.75]))

You can apply the mask to the column and take the mean like this:

df.loc[interquartile, 'order_amount'].mean()
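One thing to watch: between keeps both endpoints by default, whereas the comparisons in the question are strict. The sketch below uses a made-up five-row frame, chosen so that the quartiles land exactly on data points, to show how the two choices can differ. The variable names are only for illustration.

```python
import pandas as pd

# Made-up data chosen so the quartiles land exactly on data points.
df = pd.DataFrame({"order_amount": [0, 1, 5, 6, 10]})

# One call returns both quartiles (1.0 and 6.0 for this data).
low, high = df["order_amount"].quantile([0.25, 0.75])

# Default behaviour: both endpoints are kept, so 1, 5 and 6 are selected.
inclusive_mask = df["order_amount"].between(low, high)
inclusive_mean = df.loc[inclusive_mask, "order_amount"].mean()

# Strict bounds, matching the > and < in the question: only 5 is selected.
# inclusive="neither" needs pandas 1.3 or newer.
strict_mask = df["order_amount"].between(low, high, inclusive="neither")
strict_mean = df.loc[strict_mask, "order_amount"].mean()

print(inclusive_mean, strict_mean)
```

Which variant is right depends on whether values equal to a quartile should count as inside the range. On large, continuous data the difference is usually negligible.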

CodePudding user response：

Try:

above25percentile = df["order_amount"]>np.percentile(df['order_amount'],25)
below75percentile = df['order_amount']<np.percentile(df['order_amount'],75)
print(df.loc[above25percentile & below75percentile, 'order_amount'].mean())

Or you can use between:

df.loc[df['order_amount'].between(np.percentile(df['order_amount'], 25),
                                  np.percentile(df['order_amount'], 75),
                                  inclusive='neither'), 'order_amount'].mean()

Suppose the following dataframe:

df = pd.DataFrame({'order_amount': range(0, 10)})
print(df)

# Output
   order_amount
0             0  # Excluded
1             1  # "
2             2  # "
3             3
4             4  # mean <- (3 + 4 + 5 + 6) / 4 = 4.5
5             5
6             6
7             7  # Excluded
8             8  # "
9             9  # "

Output:

>>> df.loc[df['order_amount'].between(np.percentile(df['order_amount'], 25),
                                  np.percentile(df['order_amount'], 75),
                                  inclusive='neither'), 'order_amount'].mean()
4.5
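If you need this in more than one place, the same logic can be wrapped in a small function. This is only a sketch: interquartile_mean is a hypothetical helper name, not something provided by pandas or NumPy, and it computes both percentiles in a single np.percentile call instead of two.

```python
import numpy as np
import pandas as pd


def interquartile_mean(values: pd.Series, inclusive: str = "neither") -> float:
    """Mean of the values between the 25th and 75th percentiles.

    Hypothetical helper for illustration. `inclusive` is passed straight
    to Series.between ("both", "neither", "left" or "right"; pandas 1.3+).
    """
    # Both percentiles in one call.
    low, high = np.percentile(values, [25, 75])
    inside = values.between(low, high, inclusive=inclusive)
    return values[inside].mean()


# Same toy frame as in the example above.
df = pd.DataFrame({"order_amount": range(0, 10)})
result = interquartile_mean(df["order_amount"])
print(result)
```

With the default strict bounds this selects 3, 4, 5 and 6 from the toy frame, the same rows as the expression above.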