Background -
I have a pandas DataFrame containing data that I would like to run a calculation on. Specifically, I need to go through every row of the DataFrame and compute what would be a =SUMIF in Excel.
Calculation (=SUMIFS([% Ownership],[Entity ID],[@[Entity ID]])) -
In this calculation, the integer value from the Entity ID column is taken, and for every instance of that Entity ID found in the Entity ID column throughout the DataFrame, its corresponding % Ownership value is summed and written to the Entity ID % column.
>>> df
   Entity ID  % Ownership  Entity ID %
0      12345       100.00     1.000000
1      45643        49.56     0.495551
2      00000       100.00     2.000000
3      00000       100.00     2.000000
In the above example:
- 12345 only appears once in the DataFrame, therefore its Entity ID % value is 1.000000.
- 45643 only appears once in the DataFrame (with a 49.56 value), therefore its Entity ID % value is 0.495551.
- 00000 appears twice in the DataFrame (both instances with 100.00 values), therefore its Entity ID % value is 2.000000.
Useful info:
- You will note that the number formatting in the % Ownership and Entity ID % columns is different (e.g., 100.00 vs 1.000000). This is how the cells are formatted in my Excel workbook, and I would ideally like to keep this formatting and decimal rounding if possible.
- I would like to understand how to iterate through every row in my DataFrame (it has 4000 rows) and perform this calculation for each row using a for loop.
CodePudding user response:
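We can use a groupby with transform('sum') to get the per-entity sum aligned to the original rows, and assign the result directly into the DataFrame. A plain .sum() would return a Series indexed by Entity ID, which does not line up with the DataFrame's row index when assigned:
>>> df.assign(Entity_ID_perc = df.groupby('Entity ID')['% Ownership'].transform('sum') / 100)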
  Entity ID  % Ownership  Entity_ID_perc
0     12345       100.00          1.0000
1     45643        49.56          0.4956
2     00000       100.00          2.0000
3     00000       100.00          2.0000
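
For completeness, here is a minimal, self-contained sketch of the same idea, with the rounding the question asks about and an explicit for loop for comparison. The sample data is rebuilt by hand from the question, and storing Entity ID as a string (to keep the leading zeros in 00000) is an assumption, as is the hypothetical column name "Entity ID % (loop)". The question's 0.495551 presumably comes from an unrounded value behind the displayed 49.56, so this sketch yields 0.4956 for that row.

```python
import pandas as pd

# Sample data rebuilt from the question. Entity ID is stored as a string
# (an assumption) so that the leading zeros in "00000" are preserved.
df = pd.DataFrame(
    {
        "Entity ID": ["12345", "45643", "00000", "00000"],
        "% Ownership": [100.00, 49.56, 100.00, 100.00],
    }
)

# Vectorized SUMIFS equivalent: sum % Ownership within each Entity ID group,
# broadcast the group total back onto every row of that group, then
# convert from percent to a fraction and round to 6 decimals.
df["Entity ID %"] = (
    df.groupby("Entity ID")["% Ownership"].transform("sum") / 100
).round(6)

# Explicit loop version, shown only because the question asks for it.
# It rescans the whole column for every row, so it scales much worse
# than the groupby approach on larger frames.
loop_result = []
for entity_id in df["Entity ID"]:
    matches = df.loc[df["Entity ID"] == entity_id, "% Ownership"]
    loop_result.append(round(float(matches.sum()) / 100, 6))
df["Entity ID % (loop)"] = loop_result

print(df)
```

Both columns should agree row by row. For 4000 rows either approach is fast enough, but the groupby version is the more idiomatic pandas choice. Rounding changes the stored values; if only the display matters, a format string applied at output time may be a better fit.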
