NumPy's np.std() and SciPy's scipy.stats.tstd() give slightly different standard deviations for the same array. I don't understand why. Can anyone explain that to me?
Here is an example.
import numpy as np
import scipy.stats
ar = np.arange(20)
print(np.std(ar))
print(scipy.stats.tstd(ar))
returns
5.766281297335398
5.916079783099616
CodePudding user response:
I ran into this a while ago. To get the same values, pass ddof=1 to np.std():
import numpy as np
import scipy.stats
ar = np.arange(20)
print(np.std(ar, ddof=1))
print(scipy.stats.tstd(ar))
Output:
5.916079783099616
5.916079783099616
My mentor used to say:
--> ddof=1 if you're calculating np.std() for a sample taken from your complete dataset.
--> ddof=0 if you're calculating for the full population.
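As a cross-check of that rule, here is a minimal sketch comparing np.std() against Python's standard-library statistics module, which has separate functions for the two cases (pstdev for a population, stdev for a sample). The variable names are only illustrative.

```python
import statistics

import numpy as np

ar = np.arange(20)
values = ar.tolist()  # plain Python ints for the statistics module

# Population standard deviation: divide by n (ddof=0, the np.std default).
population = statistics.pstdev(values)

# Sample standard deviation: divide by n - 1 (ddof=1).
sample = statistics.stdev(values)

# Each comparison is expected to hold up to floating-point rounding.
print(np.isclose(np.std(ar, ddof=0), population))
print(np.isclose(np.std(ar, ddof=1), sample))
```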
CodePudding user response:
With np.std() you are computing the population standard deviation, since ddof defaults to 0 and the sum of squared deviations is divided by n:
x = np.abs(ar - ar.mean())**2
std = np.sqrt(np.sum(x) / len(ar)) # 5.766281297335398
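However, scipy.stats.tstd divides by n-1 by default (its ddof parameter defaults to 1). Despite the name "trimmed standard deviation", nothing is trimmed when no limits are passed, so the result is the ordinary sample standard deviation: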
x = np.abs(ar - ar.mean())**2
std = np.sqrt(np.sum(x) / (len(ar) - 1)) # 5.916079783099616
Note that you are computing the square root of the mean of x when using np.std() (the mean of x is the sum of x divided by the length of x). With scipy.stats.tstd you are dividing by n-1, n being the length of the array; this comes from its default ddof=1, not from any trimming.
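To separate the two effects, here is a small sketch that first overrides ddof in scipy.stats.tstd and then passes limits so that trimming actually happens. The limits (5, 14) are an arbitrary choice for illustration, not something from the question.

```python
import numpy as np
import scipy.stats

ar = np.arange(20)

# No limits: nothing is trimmed, so only ddof decides the result.
sample_std = scipy.stats.tstd(ar)              # default ddof=1
population_std = scipy.stats.tstd(ar, ddof=0)  # same divisor as np.std(ar)

# With limits, values outside [5, 14] are dropped before the std is computed.
trimmed_std = scipy.stats.tstd(ar, limits=(5, 14))

# Manual equivalent: filter first, then take the sample std of what is left.
kept = ar[(ar >= 5) & (ar <= 14)]
manual_trimmed = np.std(kept, ddof=1)

print(np.isclose(population_std, np.std(ar)))
print(np.isclose(sample_std, np.std(ar, ddof=1)))
print(np.isclose(trimmed_std, manual_trimmed))
```

So the gap in the question comes entirely from the ddof default; the trimming only matters once limits are supplied.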
