df[percent] = (df['column_name'] / df['column_name'].sum()) * 100
My dataset is:
| name_d | year | regime_origin | FLOW | GDP_d |
|---|---|---|---|---|
| ITALY | 1990 | 0 | 120 | 200 |
| ITALY | 1991 | 0 | 239 | 500 |
| SPAIN | 1990 | 1 | 123 | 399 |
| ITALY | 1993 | 1 | 120 | 200 |
I would like to have the percentage of the FLOW.
I would have:
df[percent] = (df['FLOW'] / df['GDP_d']) * 100
But some source says I have to use .sum(). Why is sum() necessary?
CodePudding user response:
If you're trying to compute the percentage that each value from the column FLOW has relative to the total flow, then you really need to divide each value by the sum of all values:
df[percent] = (df['column_name']/df['column_name'].sum()) * 100
When you have something like:
df[percent] = (df['FLOW'] / df['GDP_d']) * 100
What you're actually calculating is the value that the FLOW of each row has, relative to that same row's GDP_d. So for example, on the first line, a flow of 120 represents 60% of the GDP_d (200). The second line equals 47.80%, and so on. If FLOW and GDP_d columns are relatable in a sense that GDP_d represents the maximum that every element from FLOW could have then it's completely fine to use (df['FLOW'] / df['GDP_d']) * 100.
However if in reality you wanted to find the average percentage of each FLOW observation (observation = row), in respect to its total population (all values of FLOW combined), then you really need to divide each value by the sum of all values.
It's important to note that the sum of all values you get when you divide FLOW by GDP_d won't add up to 100%. In your example, the sum would be equal to 198.62%, and the average of df[percent] would be 49.65%. This means that, on average, the value of each row from FLOW is about 50% of size of GDP_d.
On the other hand, when you calculate the percentage FLOW in respect to the total FLOW population the sum of all values is always equal to 100%.
Here's an comparison:
| name_d | year | regime_origin | FLOW | GDP_d | FLOW/GDP_d | EQUATION(FLOW/GDP_d) | FLOW/SUM(FLOW) | EQUATION(FLOW/SUM(FLOW)) | FLOW/SUM(GDP_d) | EQUATION(FLOW/SUM(GDP_d)) | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ITALY | 1990 | 0 | 120 | 200 | 60 | 120/200 | 19.93 | 120/602 | 9.24 | 120/1299 |
| 1 | ITALY | 1991 | 0 | 239 | 500 | 47.8 | 239/500 | 39.7 | 239/602 | 18.4 | 239/1299 |
| 2 | SPAIN | 1990 | 1 | 123 | 399 | 30.83 | 123/399 | 20.43 | 123/602 | 9.47 | 123/1299 |
| 3 | ITALY | 1993 | 1 | 120 | 200 | 60 | 120/200 | 19.93 | 120/602 | 9.24 | 120/1299 |
import pandas as pd
df = pd.DataFrame(
{
"name_d": ["ITALY", "ITALY", "SPAIN", "ITALY"],
"year": [1990, 1991, 1990, 1993],
"regime_origin": [0, 0, 1, 1],
"FLOW": [120, 239, 123, 120],
"GDP_d": [200, 500, 399, 200],
}
)
df['FLOW/GDP_d'] = ((df['FLOW'] / df['GDP_d']) * 100).round(2)
df['EQUATION(FLOW/GDP_d)'] = df.apply(lambda row: f'{row["FLOW"]:0d}/{row["GDP_d"]:0d}', axis=1)
df['FLOW/SUM(FLOW)'] = ((df['FLOW'] / df['FLOW'].sum()) * 100).round(2)
df['EQUATION(FLOW/SUM(FLOW))'] = df.apply(lambda row: f'{row["FLOW"]:0d}/{df["FLOW"].sum():0d}', axis=1)
df['FLOW/SUM(GDP_d)'] = ((df['FLOW'] / df['GDP_d'].sum()) * 100).round(2)
df['EQUATION(FLOW/SUM(GDP_d))'] = df.apply(lambda row: f'{row["FLOW"]:0d}/{df["GDP_d"].sum():0d}', axis=1)
