Home > Software engineering >  pivot_table loosing median values after filtering?
pivot_table loosing median values after filtering?

Time:01-20

I have a car_data df:

print(car_data.head())

price  model_year           model  condition  cylinders fuel  odometer  \
0   9400        2011          bmw x5       good          6  gas    145000   
1  25500        2010      ford f-150       good          8  gas     88705   
2   5500        2013  hyundai sonata   like new          4  gas    110000   
3   1500        2003      ford f-150       fair          8  gas    226000   
4  14900        2017    chrysler 200  excellent          4  gas     80903   

  transmission    type paint_color  is_4wd date_posted  days_listed  car_age  \
0    automatic     SUV       white       1  2018-06-23           19        7   
1    automatic  pickup       white       1  2018-10-19           50        8   
2    automatic   sedan         red       1  2019-02-07           79        6   
3    automatic  pickup       white       1  2019-03-22            9       16   
4    automatic   sedan       black       1  2019-04-02           28        2   

   average_mileage  
0     20714.285714  
1     11088.125000  
2     18333.333333  
3     14125.000000  
4     40451.500000 

As I want to figure out the median prices for model_year 2018, I created the following pivot table:

car_data[(car_data.model_year == 2018)].pivot_table(index=['model', 'model_year'], values='price', aggfunc='median').head(50)

Resluts:

                                            price
model                         model_year         
bmw x5                        2018        40625.0
buick enclave                 2018        37382.0
cadillac escalade             2018        51900.0
chevrolet camaro              2018        22988.0
chevrolet camaro lt coupe 2d  2018        24200.0
chevrolet colorado            2018        30588.0
chevrolet corvette            2018        47950.0
chevrolet cruze               2018        13995.0
chevrolet equinox             2018        16349.0
chevrolet impala              2018        15995.0
chevrolet malibu              2018        16690.0
chevrolet silverado           2018        34988.0
chevrolet silverado 1500      2018        31500.0
chevrolet silverado 1500 crew 2018        33750.0
chevrolet silverado 2500hd    2018        46995.0
chevrolet silverado 3500hd    2018        42980.0
chevrolet suburban            2018        42995.0
chevrolet tahoe               2018        16500.5
chevrolet traverse            2018        26947.5
chrysler 300                  2018        12291.5
dodge charger                 2018            1.0

Now the issue is that dodge charger is showing a value of 1.0, so I checked dodge charger prices, and there are prices that are above 1.0 so how come the median is 1.0?

car_data[(car_data['model']=='dodge charger') & (car_data['model_year']==2018)]['price']

The results I get :

11391        1
11396        1
11633        1
12935    17690
13105        1
13449        1
13457        1
13995        1
14006        1
14255        1
14332    32999
14762        1
14764        1
16711    23985
16876        1
20916        1
23245        1
23333        1
23370        1
26548        1
26550        1
30233    40000
31157        1
31399    21900
32079    17690
32201        1
33348    25000
40326        1
40488    21900
42596    27985
43453    38997
43524    38997
45058        1
45080    25390
45100        1
45593    25588
48968        1

CodePudding user response:

Do not confuse the mean and the median:

the median is the value separating the higher half from the lower half of a population (wikipedia)

# Mean of your data sample
>>> df['price'].mean()
9679.594594594595

# Median of your data sample
>>> df['price'].median()
1.0

# index of the median
>>> len(df) / 2
18.5  # lower half: 18, upper half: 19

Your data sorted:

>>> df['price'].sort_values()
11391        1
45100        1
45058        1
40326        1
32201        1
31157        1
26550        1
26548        1
23333        1
23245        1
20916        1
16876        1
23370        1
48968        1
14762        1
11396        1
14255        1
14006        1
11633        1  # lower half median (index 18)
13995        1  # upper half median (index 19)
13457        1
13449        1
13105        1
14764        1
                # <- mean (9679.594594594595)
12935    17690
32079    17690
40488    21900
31399    21900
16711    23985
33348    25000
45080    25390
45593    25588
42596    27985
14332    32999
43453    38997
43524    38997
30233    40000
Name: price, dtype: int64
  •  Tags:  
  • Related