I have a car_data df:
print(car_data.head())
price model_year model condition cylinders fuel odometer \
0 9400 2011 bmw x5 good 6 gas 145000
1 25500 2010 ford f-150 good 8 gas 88705
2 5500 2013 hyundai sonata like new 4 gas 110000
3 1500 2003 ford f-150 fair 8 gas 226000
4 14900 2017 chrysler 200 excellent 4 gas 80903
transmission type paint_color is_4wd date_posted days_listed car_age \
0 automatic SUV white 1 2018-06-23 19 7
1 automatic pickup white 1 2018-10-19 50 8
2 automatic sedan red 1 2019-02-07 79 6
3 automatic pickup white 1 2019-03-22 9 16
4 automatic sedan black 1 2019-04-02 28 2
average_mileage
0 20714.285714
1 11088.125000
2 18333.333333
3 14125.000000
4 40451.500000
As I want to figure out the median prices for model_year 2018, I created the following pivot table:
car_data[(car_data.model_year == 2018)].pivot_table(index=['model', 'model_year'], values='price', aggfunc='median').head(50)
Resluts:
price
model model_year
bmw x5 2018 40625.0
buick enclave 2018 37382.0
cadillac escalade 2018 51900.0
chevrolet camaro 2018 22988.0
chevrolet camaro lt coupe 2d 2018 24200.0
chevrolet colorado 2018 30588.0
chevrolet corvette 2018 47950.0
chevrolet cruze 2018 13995.0
chevrolet equinox 2018 16349.0
chevrolet impala 2018 15995.0
chevrolet malibu 2018 16690.0
chevrolet silverado 2018 34988.0
chevrolet silverado 1500 2018 31500.0
chevrolet silverado 1500 crew 2018 33750.0
chevrolet silverado 2500hd 2018 46995.0
chevrolet silverado 3500hd 2018 42980.0
chevrolet suburban 2018 42995.0
chevrolet tahoe 2018 16500.5
chevrolet traverse 2018 26947.5
chrysler 300 2018 12291.5
dodge charger 2018 1.0
Now the issue is that dodge charger is showing a value of 1.0, so I checked dodge charger prices, and there are prices that are above 1.0 so how come the median is 1.0?
car_data[(car_data['model']=='dodge charger') & (car_data['model_year']==2018)]['price']
The results I get :
11391 1
11396 1
11633 1
12935 17690
13105 1
13449 1
13457 1
13995 1
14006 1
14255 1
14332 32999
14762 1
14764 1
16711 23985
16876 1
20916 1
23245 1
23333 1
23370 1
26548 1
26550 1
30233 40000
31157 1
31399 21900
32079 17690
32201 1
33348 25000
40326 1
40488 21900
42596 27985
43453 38997
43524 38997
45058 1
45080 25390
45100 1
45593 25588
48968 1
CodePudding user response:
Do not confuse the mean and the median:
the median is the value separating the higher half from the lower half of a population (wikipedia)
# Mean of your data sample
>>> df['price'].mean()
9679.594594594595
# Median of your data sample
>>> df['price'].median()
1.0
# index of the median
>>> len(df) / 2
18.5 # lower half: 18, upper half: 19
Your data sorted:
>>> df['price'].sort_values()
11391 1
45100 1
45058 1
40326 1
32201 1
31157 1
26550 1
26548 1
23333 1
23245 1
20916 1
16876 1
23370 1
48968 1
14762 1
11396 1
14255 1
14006 1
11633 1 # lower half median (index 18)
13995 1 # upper half median (index 19)
13457 1
13449 1
13105 1
14764 1
# <- mean (9679.594594594595)
12935 17690
32079 17690
40488 21900
31399 21900
16711 23985
33348 25000
45080 25390
45593 25588
42596 27985
14332 32999
43453 38997
43524 38997
30233 40000
Name: price, dtype: int64
