I have this DataFrame df:
    A  B  C
0   0  1  0
1   0  1  1
2   0  1  1
3   1  0  1
4   0  0  0
5   1  0  0
6   0  0  0
7   0  0  1
8   1  0  0
9   0  0  0
10  1  0  1
11  1  0  1
12  0  1  1
13  1  0  0
14  1  0  0
15  0  1  0
16  1  1  0
17  0  0  1
18  1  0  1
19  1  0  0
20  1  0  1
21  1  1  0
22  1  1  1
23  1  1  1
24  1  0  0
25  1  1  0
26  0  0  1
27  0  1  1
28  0  1  0
29  1  1  0
30  1  0  1
31  0  1  0
32  0  0  1
33  1  1  1
34  0  1  0
35  1  1  0
36  0  1  0
37  0  0  1
38  0  1  1
39  0  1  1
I computed the joint probability P(A,B,C) like this:
import pandas as pd
# One tuple per row, e.g. (0, 1, 0), used as the grouping key
grp = df.apply(tuple, axis=1)
PrD = pd.concat([df.groupby(grp).first(),
                 grp.groupby(grp).count().div(len(df)).rename("Probs")],
                axis=1).reset_index(drop=True)
print(PrD)
which outputs the joint probability table P(A,B,C):
   A  B  C  Probs
0  0  0  0  0.075
1  0  0  1  0.125
2  0  1  0  0.150
3  0  1  1  0.150
4  1  0  0  0.150
5  1  0  1  0.150
6  1  1  0  0.125
7  1  1  1  0.075
I am trying to write a function that receives a subset of the column names of PrD and computes a conditional probability following the rule P(A|B) = P(A,B) / P(B).
With three variables it should compute P(A|B,C) = P(A,B,C) / P(B,C), with four variables P(A|B,C,D) = P(A,B,C,D) / P(B,C,D), and so on. For example, for P(A=0|B=0) the output should be (0.075 + 0.125) / (0.075 + 0.125 + 0.150 + 0.150) = 0.2 / 0.5 = 0.4, where the numerator sums the rows where both A and B are 0 and the denominator sums the rows where B is 0.
If it receives a single variable, for example A=0, it should return 0.075 + 0.125 + 0.150 + 0.150 = 0.5, which sums only the rows where A=0.
I tried loc and query, but I could only get them to filter on one variable, not several.
I want a function that works for however many variables it receives.
Answer:
Since you still have the raw data, you can compute these probabilities directly from row counts instead of going through a joint probability table.
For P(A=0|B=0), count the rows where both A and B are 0, count the rows where B is 0, and divide the first count by the second:
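def prob(df, a, *cols):
    """Return P(a = 0 | every column in `cols` = 0).
    With no `cols`, return the marginal P(a = 0).
    """
    if len(cols) == 0:
        return df[a].eq(0).sum() / len(df)
    # Rows where all conditioning columns are 0 (the denominator event)
    given = df[list(cols)].eq(0).all(axis=1)
    # Of those rows, the ones where `a` is 0 as well (the numerator event)
    both = given & df[a].eq(0)
    return both.sum() / given.sum()
Usage:
prob(df, "A", "B")       # 0.4, i.e. P(A=0|B=0) = 8 rows / 20 rows
prob(df, "A", "B", "C")  # 0.333..., i.e. P(A=0|B=0,C=0) = 3 rows / 9 rows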
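If you would rather work from the joint table PrD, as in the question, and condition on values other than 0, something along these lines might do. This is only a sketch: the names match and cond_prob and the dict-based arguments are my own choices for illustration, and the table is typed in by hand from the output shown in the question.

```python
import pandas as pd

# Joint table P(A, B, C), copied by hand from the question's printed output.
PrD = pd.DataFrame({
    "A": [0, 0, 0, 0, 1, 1, 1, 1],
    "B": [0, 0, 1, 1, 0, 0, 1, 1],
    "C": [0, 1, 0, 1, 0, 1, 0, 1],
    "Probs": [0.075, 0.125, 0.150, 0.150, 0.150, 0.150, 0.125, 0.075],
})


def match(table, assignment):
    """Boolean mask of rows where every column equals its requested value."""
    mask = pd.Series(True, index=table.index)
    for col, value in assignment.items():
        mask &= table[col].eq(value)
    return mask


def cond_prob(table, target, given=None, prob_col="Probs"):
    """Return P(target | given) from a joint table.

    `target` and `given` are dicts mapping column name -> value.
    With no `given`, this is just the marginal P(target).
    """
    given = given or {}
    given_rows = match(table, given)  # all rows when `given` is empty
    denominator = table.loc[given_rows, prob_col].sum()
    if denominator == 0:
        return float("nan")  # conditioning event has zero probability
    numerator = table.loc[given_rows & match(table, target), prob_col].sum()
    return numerator / denominator


print(cond_prob(PrD, {"A": 0}, {"B": 0}))          # P(A=0|B=0), about 0.4
print(cond_prob(PrD, {"A": 0}, {"B": 0, "C": 0}))  # P(A=0|B=0,C=0), about 0.333
print(cond_prob(PrD, {"A": 0}))                    # marginal P(A=0), about 0.5
```

Because the conditions are plain dicts, the same function handles any number of variables, and a table with a fourth column D would need no changes. Results are floating-point sums, so compare them with a tolerance rather than with ==.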
