Conditional probability function that receives variable-length arguments-CodePudding

I have this dataframe df:

    A  B  C
0   0  1  0
1   0  1  1
2   0  1  1
3   1  0  1
4   0  0  0
5   1  0  0
6   0  0  0
7   0  0  1
8   1  0  0
9   0  0  0
10  1  0  1
11  1  0  1
12  0  1  1
13  1  0  0
14  1  0  0
15  0  1  0
16  1  1  0
17  0  0  1
18  1  0  1
19  1  0  0
20  1  0  1
21  1  1  0
22  1  1  1
23  1  1  1
24  1  0  0
25  1  1  0
26  0  0  1
27  0  1  1
28  0  1  0
29  1  1  0
30  1  0  1
31  0  1  0
32  0  0  1
33  1  1  1
34  0  1  0
35  1  1  0
36  0  1  0
37  0  0  1
38  0  1  1
39  0  1  1

I computed the joint probability P(A,B,C) like this:

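import pandas as pd

# Turn each row into a tuple so identical (A, B, C) combinations share one key
grp = df.apply(tuple, axis=1)
PrD = pd.concat(
    [df.groupby(grp).first(),  # one representative row per combination
     grp.groupby(grp).count().div(len(df)).rename("Probs")],  # relative frequency
    axis=1,
).reset_index(drop=True)
print(PrD)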

which outputs the joint probability P(A,B,C):

   A  B  C  Probs
0  0  0  0  0.075
1  0  0  1  0.125
2  0  1  0  0.150
3  0  1  1  0.150
4  1  0  0  0.150
5  1  0  1  0.150
6  1  1  0  0.125
7  1  1  1  0.075
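
As a side note, a table of this shape can probably be built more directly by grouping on the columns themselves and counting rows. The following is only a sketch: `sample` is a small made-up frame (not the 40-row df above) and `joint_table` is a hypothetical helper name.

```python
import pandas as pd

# Small illustrative frame, NOT the 40-row df from the question.
sample = pd.DataFrame({
    "A": [0, 0, 1, 1, 0, 1, 0, 0],
    "B": [1, 1, 0, 0, 1, 0, 0, 0],
    "C": [0, 1, 1, 0, 0, 1, 0, 0],
})

def joint_table(frame):
    """Relative frequency of every observed combination of column values."""
    cols = list(frame.columns)
    return (
        frame.groupby(cols)
        .size()               # number of rows per combination
        .div(len(frame))      # convert counts to relative frequencies
        .rename("Probs")
        .reset_index()        # turn the group keys back into columns
    )

print(joint_table(sample))
```

Only combinations that actually occur in the data get a row, so a combination that never appears is simply absent rather than listed with probability 0.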

I am trying to write a function that receives a subset of the column names of PrD and computes a conditional probability following the rule P(A|B) = P(A,B)/P(B). With 3 variables it should compute P(A|B,C) = P(A,B,C)/P(B,C), with 4 variables P(A|B,C,D) = P(A,B,C,D)/P(B,C,D), and so on.

For example, if the function receives P(A=0|B=0), the output should be (0.075 + 0.125)/(0.075 + 0.125 + 0.150 + 0.150) = 0.2/0.5 = 0.4, where the numerator sums the rows where both A and B are 0 and the denominator sums the rows where B is 0.

If it receives a single variable, for example A=0, it should return 0.075 + 0.125 + 0.150 + 0.150 = 0.5, which sums only the rows where A=0.

I tried loc and query, but I could only get them to work with one variable, not several. I want a function that does the calculation for whatever number of variables it receives.

CodePudding user response：

Since you have the raw data on a computer, you can take a more direct route than the textbook formula.

You don't need to build joint probability tables and such. You can count the rows where B = 0 and the rows where both A and B = 0, then divide the second count by the first:

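def prob(df, a, *cols):
    """Return the probability that column `a` is 0 given that all columns in `cols` are 0.

    With no `cols`, return the marginal probability that column `a` is 0.
    """
    if len(cols) == 0:
        return df[a].eq(0).sum() / len(df)
    else:
        given = list(cols)
        # rows where `a` and every column in `cols` are 0, over rows where every column in `cols` is 0
        return df[[a] + given].eq(0).all(axis=1).sum() / df[given].eq(0).all(axis=1).sum()

Usage:

prob(df, "A", "B")      # 0.4
prob(df, "A", "B", "C") # 0.333... (3 of the 9 rows with B = 0 and C = 0)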
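
The function above only handles the value 0. If you need arbitrary values, or want to work from the joint table PrD as in the question, one possible generalization is sketched below. The name `cond_prob`, its dict-based arguments and the `weight` parameter are assumptions made for illustration, not an established API. The `joint` frame simply retypes the PrD table from the question.

```python
import pandas as pd

def cond_prob(table, target, given=None, weight=None):
    """Estimate P(target | given); both are {column: value} dicts.

    weight=None  -> every row counts once (raw observations).
    weight="col" -> rows are weighted by that column (a joint table).
    """
    given = given or {}
    if weight is None:
        w = pd.Series(1.0, index=table.index)
    else:
        w = table[weight]

    def mask(conds):
        # True for rows that satisfy every column == value condition
        m = pd.Series(True, index=table.index)
        for col, val in conds.items():
            m &= table[col].eq(val)
        return m

    given_mask = mask(given)
    denom = w[given_mask].sum()
    if denom == 0:
        return float("nan")  # the conditioning event never occurs
    return w[given_mask & mask(target)].sum() / denom

# The joint table from the question, typed in by hand.
joint = pd.DataFrame({
    "A": [0, 0, 0, 0, 1, 1, 1, 1],
    "B": [0, 0, 1, 1, 0, 0, 1, 1],
    "C": [0, 1, 0, 1, 0, 1, 0, 1],
    "Probs": [0.075, 0.125, 0.150, 0.150, 0.150, 0.150, 0.125, 0.075],
})

print(cond_prob(joint, {"A": 0}, {"B": 0}, weight="Probs"))          # about 0.4
print(cond_prob(joint, {"A": 0}, weight="Probs"))                    # about 0.5
print(cond_prob(joint, {"A": 0}, {"B": 0, "C": 0}, weight="Probs"))  # about 0.333
```

With an empty `given`, every row passes the conditioning mask, so the result is the plain marginal. On the raw df you would leave `weight` out, for example `cond_prob(df, {"A": 0}, {"B": 0})`.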