Assume I have data like this:
x = np.random.randn(4, 100000)
and I fit a histogram
hist = np.histogramdd(x.T, density=True)  # histogramdd expects one sample per row, shape (n, d)
What I want is the probability of a number g, e.g. g = 0.1. Assume some hypothetical function foo:
g = 0.1
prob = foo(hist, g)
print(prob)
>> 0.2223124214
How could I do something like this, so that I get a probability back for a single number or a vector of numbers from a fitted histogram? I am especially interested in N-dimensional histograms.
CodePudding user response:
histogramdd takes O(r^D) memory, where r is the number of bins per dimension and D is the number of dimensions, so unless you have a very large dataset or very few dimensions you will get a poor estimate. Consider your example data, 100k points in 4-D space: the default histogram is 10 x 10 x 10 x 10, so it has 10k bins.
x = np.random.randn(4, 100000)
hist = np.histogramdd(x.transpose(), density=True)
np.mean(hist[0] == 0)
gives something around 0.77, meaning that about 77% of the bins in the histogram contain no points.
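If you still want to read values off the histogram itself, here is a minimal sketch of what the hypothetical foo from the question could look like. It assumes a query point has one coordinate per dimension, and the name hist_density is made up for illustration; it is not a NumPy function. Note that with density=True the histogram stores a density, not a probability, so the probability of landing in a bin is the density times the bin volume.

```python
import numpy as np

def hist_density(hist, edges, points):
    """Look up the histogram density at each row of `points`, shape (m, d).

    Hypothetical helper: returns 0 for points outside the histogram range.
    """
    points = np.atleast_2d(points)
    inside = np.ones(len(points), dtype=bool)
    idx = []
    for k, e in enumerate(edges):
        # Index of the bin whose left edge is <= the coordinate
        i = np.searchsorted(e, points[:, k], side="right") - 1
        # The last bin is closed on the right, as in np.histogramdd
        i[points[:, k] == e[-1]] = len(e) - 2
        inside &= (i >= 0) & (i <= len(e) - 2)
        idx.append(np.clip(i, 0, len(e) - 2))
    return np.where(inside, hist[tuple(idx)], 0.0)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 100_000))            # d x n, as in the question
hist, edges = np.histogramdd(x.T, density=True)  # histogramdd wants n x d

g = np.array([[0.1, 0.1, 0.1, 0.1],    # a point near the mode
              [10.0, 0.0, 0.0, 0.0]])  # a point outside the sampled range
dens = hist_density(hist, edges, g)

# Default bins are equal-width, so every bin has the same volume
bin_volume = np.prod([np.diff(e)[0] for e in edges])
prob = dens * bin_volume  # probability mass of the bin containing each point
print(dens, prob)
```

Because so many bins are empty, points away from the center will often get a density of exactly 0 this way, which is the main weakness of the raw histogram.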
You probably want to smooth the distribution. Unless you have a good reason not to, I would suggest using a Gaussian kernel density estimate:
import numpy as np
import scipy.stats
x = np.random.randn(4, 100000) # d x n array
f = scipy.stats.gaussian_kde(x) # d-dimensional PDF estimate
f([1,2,3,4]) # evaluate the PDF at a given point
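To cover the "vector of numbers" part of the question, the same object can be evaluated at several points in one call. A small sketch, assuming the query points are stacked as columns of a (d, m) array, which is the layout gaussian_kde expects; the sample size is reduced here only to keep the example quick.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 5_000))  # d x n; smaller n than in the question
f = gaussian_kde(x)                  # bandwidth chosen by the default rule

# Two query points in 4-D, transposed so each column is one point
pts = np.array([[0.0, 0.0, 0.0, 0.0],
                [1.0, 2.0, 3.0, 4.0]]).T  # shape (d, m)
dens = f(pts)                             # shape (m,), one density per point
print(dens)
```

These values are densities as well. For the standard normal data used here, the true density at the origin is (2*pi)**-2, roughly 0.025, so the first value should land in that neighborhood, while the second should be much smaller.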
