Calculate average based on available data points

Imagine I have the following data frame:

Product	Month 1	Month 2	Month 3	Month 4	Total
Stuff A	5	0	3	3	11
Stuff B	10	11	4	8	33
Stuff C	0	0	23	30	53

This data frame shows the amount of units sold per product, per month.

Now, what I want to do is to create a new column called "Average" that calculates the average units sold per month. HOWEVER, notice in this example that Stuff C's values for months 1 and 2 are 0. This product was probably introduced in Month 3, so its average should be calculated based on months 3 and 4 only. Also notice that Stuff A's units sold in Month 2 were 0, but that does not mean the product was introduced in Month 3 since 5 units were sold in Month 1. That is, its average should be calculated based on all four months. Assume that the provided data frame may contain any number of months.

Based on these conditions, I have come up with the following solution in pseudo-code:

months = ["list of index names of months to calculate"]
x = len(months)

if df["Month 1"] != 0:
  df["Average"] = df["Total"] / x
elif df["Month 2"] != 0:
  df["Average"] = df["Total"] / (x - 1)
...
elif df["Month " + str(x)] != 0:
  df["Average"] = df["Total"] / 1
else:
  df["Average"] = 0

That way, the average would be calculated starting from the first month where units sold are different from 0. However, I haven't been able to translate this logical abstraction into actual working code. I couldn't manage to iterate over len(months) while maintaining the elif conditions. Or maybe there is a better, more practical approach.

I would appreciate any help, since I've been trying to crack this problem for a while with no success.

CodePudding user response：

Try:

df = df.set_index(['Product','Total'])
df['Average'] = df.where(df.ne(0).cummax(axis=1)).mean(axis=1)
df_out=df.reset_index()
print(df_out)

Output:

   Product  Total  Month 1  Month 2  Month 3  Month 4  Average
0  Stuff A     11        5        0        3        3     2.75
1  Stuff B     33       10       11        4        8     8.25
2  Stuff C     53        0        0       23       30    26.50

Details:

Move Product and Total into the dataframe index, so that the calculation runs only on the month columns.

First, create a boolean matrix with ne(0), which is True wherever a value is non-zero. Then use cummax along the rows: once a True appears, it stays True until the end of the row. If a row starts with zeros, those positions stay False until the first non-zero value, after which every position is True.

Next, use pd.DataFrame.where to keep only the values where that boolean matrix is True. The other values (the leading zeros) become NaN and are not used in the calculation of the mean.
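
To make the steps above concrete, here is a minimal sketch that prints each intermediate result. The data is copied from the question; the variable names (months, nonzero, started, masked) are only illustrative.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "Product": ["Stuff A", "Stuff B", "Stuff C"],
    "Month 1": [5, 10, 0],
    "Month 2": [0, 11, 0],
    "Month 3": [3, 4, 23],
    "Month 4": [3, 8, 30],
    "Total": [11, 33, 53],
})

# Keep only the month columns as data
months = df.set_index(["Product", "Total"])

# Step 1: True wherever the value is non-zero
nonzero = months.ne(0)

# Step 2: once a row has seen a True, every later column stays True
started = nonzero.cummax(axis=1)

# Step 3: leading zeros become NaN; mean() skips NaN by default
masked = months.where(started)
average = masked.mean(axis=1)

print(started)
print(masked)
print(average)
```

For Stuff C, the first two columns of masked are NaN, so its mean is taken over Month 3 and Month 4 only. Stuff A keeps its Month 2 zero because a sale had already been recorded in Month 1.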

CodePudding user response：

NumPy has a function, np.trim_zeros, that trims leading and/or trailing zeros from an array. Using a list comprehension, you can iterate over the relevant DataFrame rows, trim the leading zeros, and take the mean of what remains for each row.

Note that since 'Month 1' to 'Month 4' are consecutive, you can slice the columns between them using .loc.

import numpy as np
df['Average Sales'] = [np.trim_zeros(row, trim='f').mean() for row in df.loc[:, 'Month 1':'Month 4'].to_numpy()]

Output:

   Product  Month 1  Month 2  Month 3  Month 4  Total  Average Sales
0  Stuff A        5        0        3        3     11     2.75
1  Stuff B       10       11        4        8     33     8.25
2  Stuff C        0        0       23       30     53    26.50
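
The question mentions that there may be any number of months and that a product with no sales at all should get an average of 0. A minimal sketch covering both cases might look like the following. It assumes every month column name starts with "Month ", and it adds a made-up row, Stuff D, with no sales; the helper name mean_after_first_sale is hypothetical.

```python
import numpy as np
import pandas as pd

# Stuff A and Stuff C come from the question; Stuff D is a made-up all-zero row
df = pd.DataFrame({
    "Product": ["Stuff A", "Stuff C", "Stuff D"],
    "Month 1": [5, 0, 0],
    "Month 2": [0, 0, 0],
    "Month 3": [3, 23, 0],
    "Month 4": [3, 30, 0],
})


def mean_after_first_sale(row):
    """Mean of a 1-D array after trimming leading zeros; 0.0 if nothing is left."""
    trimmed = np.trim_zeros(row, trim="f")
    return float(trimmed.mean()) if trimmed.size else 0.0


# Assumption: month columns are exactly those whose name starts with "Month "
month_cols = [col for col in df.columns if col.startswith("Month ")]

df["Average Sales"] = [
    mean_after_first_sale(row) for row in df[month_cols].to_numpy()
]
print(df)
```

Without the size check, calling mean on the empty array left after trimming an all-zero row would give NaN and a runtime warning.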

CodePudding user response：

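If you don't mind it being a little memory inefficient, you could put your dataframe into a numpy array. NumPy's np.nonzero returns the indices of the non-zero elements, which you can use to drop the zeros before calling mean. Be aware that this drops every zero, not only the leading ones: if the array holds Stuff A's month values, the result is 11 / 3 rather than 11 / 4. It could look something like this: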

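import numpy as np

# Stuff_A_DF is assumed to hold only the monthly values for one product
arr = np.array(Stuff_A_DF)
# Keep the non-zero entries (every zero is dropped), then average them
mean = arr[np.nonzero(arr)].mean()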

Alternatively, you could manually extract the row to a list, then loop through to remove the zeroes.
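
A minimal sketch of that manual approach, assuming only the leading zeros should be skipped, as the question requires. The function name average_from_first_sale is hypothetical.

```python
def average_from_first_sale(values):
    """Average the values from the first non-zero entry onward."""
    values = list(values)
    start = 0
    # Skip leading zeros only; later zeros count as real months with no sales
    while start < len(values) and values[start] == 0:
        start += 1
    remaining = values[start:]
    # A row with no sales at all gets an average of 0
    return sum(remaining) / len(remaining) if remaining else 0.0


# Month values for Stuff A and Stuff C from the question
print(average_from_first_sale([5, 0, 3, 3]))
print(average_from_first_sale([0, 0, 23, 30]))
```

This is slower than the vectorized answers on large data frames, but it makes the rule explicit and needs no extra libraries.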