I have 21 pandas data frames, all of them looking like this:
|country | reference | price_1|price_2 |price 3| price_4 |
|------------------------------------------------------------
|A |1 | 0 | 5 | 13 | 4 |
|A |2 | 3 | 8 | 4 | 5 |
|A |3 | 0 | 4 | 6 | 9 |
|A |4 | 1 | 12 | 11 | 12 |
|A |5 |45 | 9 | 3 | 13 |
|A |6 |3 | 6 | 5 | 16 |
|A |7 |34 | 7 | 2 | 17 |
Here I use A as the name of the country, and the dataframe will be name dataA.
I want to get the mean of each column, price_1, price_2 and price_3, so I did this
datosA1 = dataA.iloc[:,2:5]
A_mean_loss = dataA1.mean(axis=0)
So, I get the dataframe A_mean_loss:
|price_1|12.28|
|price_2|7.3 |
|price_3|6.29 |
|price_4|10.85|
What I need is to do the same in a efficiently way for the 21 panda dataframes and to get a new dataframe that looks like this:
|price id| A |B |C |D |E |.......|(Country21)|
---------------------------------------------------------
|1 |12.28|x |x |x |x |.......|x |
|2 |7.3 |x |x |x |x |.......|x |
|3 |6.29 |x |x |x |x |.......|x |
|4 |10.85|x |x |x |x |x |x |
I'm actually really rookie in Python and programing, but I think that this could be solved generalizing what I did with a function with a for loop over the list of data frames data_countries=[dataA,dataB,dataC,dataD,...,dataU]
I would really apreciate any help!
CodePudding user response:
If you use groupby to do the summing, you can just concatenate the resulting series:
import pandas as pd
import numpy as np
data = [
('A',1, 0, 5, 13, 4),
('A',2, 3, 8, 4, 5),
('A',3, 0, 4, 6, 9),
('A',4, 1, 12, 11, 12),
('A',5,45, 9, 3, 13),
('A',6,3, 6, 5, 16),
('A',7,34, 7, 2, 17)
]
data = pd.DataFrame( data, columns=['country','reference','price_1','price_2','price_3','price_4'])
print(data)
a1 = data.groupby(['country']).mean()
a2 = data.groupby(['country']).mean()
a3 = data.groupby(['country']).mean()
df4 = pd.concat([a1,a2,a3])
print(df4)
Output:
country reference price_1 price_2 price_3 price_4
0 A 1 0 5 13 4
1 A 2 3 8 4 5
2 A 3 0 4 6 9
3 A 4 1 12 11 12
4 A 5 45 9 3 13
5 A 6 3 6 5 16
6 A 7 34 7 2 17
reference price_1 price_2 price_3 price_4
country
A 4.0 12.285714 7.285714 6.285714 10.857143
A 4.0 12.285714 7.285714 6.285714 10.857143
A 4.0 12.285714 7.285714 6.285714 10.857143
You would want to drop the useless reference column here.
CodePudding user response:
I don't have datasets to test it but i think it work.
import os
import pandas as pd
def country_mean_loss(dir):
#not sure if your files are csv files.
data = pd.read_csv(dir)
country = data.columns[0]
data.drop(columns=['country','reference'], inplace=True)
mean_loss = data.mean(axis=0).values
return country, mean_loss
mean_loss_df = pd.DataFrame()
path = 'your folder path'
#this will list all the files inside your folder.
files = os.listdir(path)
for f in files:
country, mean_loss = country_mean_loss(path f)
mean_loss_df[country] = mean_loss
