Merging bins in numpy array

I have a histogram saved in an array, with the right edges of the bins in the first column and the corresponding frequencies in the second one. For example:

array([[1.00000000e+00, 9.76765797e-02],
   [2.00000000e+00, 3.26260189e-02],
   [3.00000000e+00, 2.27720518e-03],
   [4.00000000e+00, 1.61188858e-01],
   [5.00000000e+00, 1.23496687e-01],
   [6.00000000e+00, 2.04377586e-01],
   [7.00000000e+00, 7.47678209e-02],
   [8.00000000e+00, 4.67140951e-02],
   [9.00000000e+00, 1.31659099e-01],
   [1.00000000e+01, 1.25216050e-01]])

What is the fastest way to rebin this histogram, for example by taking a bin size of 2.5?

The resulting array should have 2.5, 5.0, 7.5, 10.0 as its first column and the sum of the frequency values in the intervals [0,2.5], (2.5,5.0], (5.0,7.5], (7.5,10.0] as its second column.

I'm trying to find a compact way to make this transformation but cannot find it.

Edit: As Jakob Stark pointed out, it's not possible to rebin a histogram in general. However, it is possible to merge bins, for example by doubling or tripling the bin size. How can one do this in a compact way?

I have updated the question's title to reflect the edit.

CodePudding user response：

You cannot rebin a histogram. When you fill data into a histogram, you lose information (that's in fact often the reason why you want histograms). Unless you still have the original data, there is no way to get a histogram with a different binning.

If you have the original data, you can of course make a new histogram with the desired binning out of it.
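A minimal sketch of that case, assuming the raw samples are still around. The `samples` array here is made up purely for illustration, since the question only contains the binned values. Note that np.histogram uses half-open bins [a, b), with only the last bin closed on the right, so the edge convention differs slightly from the (a, b] intervals in the question.

```python
import numpy as np

# Hypothetical raw data, generated only for illustration;
# the question itself only provides the already-binned values.
rng = np.random.default_rng(0)
samples = rng.uniform(0.0, 10.0, size=1000)

# Edges for a bin size of 2.5: 0, 2.5, 5.0, 7.5, 10.0
edges = np.linspace(0.0, 10.0, 5)
counts, _ = np.histogram(samples, bins=edges)

# Same layout as in the question: right edge, then normalised frequency
rebinned = np.column_stack((edges[1:], counts / counts.sum()))
print(rebinned)
```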

Edit: You can merge bins, though. As long as your new bins can be expressed as merged old bins (e.g. doubling the bin size), you can just add the weights of each contributing bin to the merged bin.
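One way this could look with numpy, as a rough sketch: reshape the frequency column into rows of `factor` bins and sum each row. The helper name `merge_bins` is my own, and it assumes the number of bins is exactly divisible by the merge factor. The data is the array from the question.

```python
import numpy as np

def merge_bins(hist, factor):
    """Merge every `factor` adjacent bins of a (right_edge, frequency) array.

    Hypothetical helper; assumes the bin count is divisible by `factor`.
    """
    n_bins = hist.shape[0]
    if n_bins % factor != 0:
        raise ValueError("number of bins must be divisible by factor")
    # The right edge of a merged bin is the right edge of its last member
    edges = hist[factor - 1 :: factor, 0]
    # Group the frequencies into rows of `factor` values and sum each row
    freqs = hist[:, 1].reshape(-1, factor).sum(axis=1)
    return np.column_stack((edges, freqs))

# Histogram from the question: right edges 1..10 and their frequencies
freqs = [9.76765797e-02, 3.26260189e-02, 2.27720518e-03, 1.61188858e-01,
         1.23496687e-01, 2.04377586e-01, 7.47678209e-02, 4.67140951e-02,
         1.31659099e-01, 1.25216050e-01]
hist = np.column_stack((np.arange(1.0, 11.0), freqs))

merged = merge_bins(hist, 2)
print(merged)
```

With a factor that does not divide the bin count (e.g. 3 for these 10 bins), this sketch raises an error rather than guessing what to do with the leftover bin.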

CodePudding user response：

As @Jakob Stark pointed out, you can only rebin as long as your new bin size is a multiple of your old one; this allows you to merge bins cleanly.

Below is an example of how you could bin your data using different bin sizes:

import numpy as np

arr = np.array(
    [
        [1.00000000e00, 9.76765797e-02],
        [2.00000000e00, 3.26260189e-02],
        [3.00000000e00, 2.27720518e-03],
        [4.00000000e00, 1.61188858e-01],
        [5.00000000e00, 1.23496687e-01],
        [6.00000000e00, 2.04377586e-01],
        [7.00000000e00, 7.47678209e-02],
        [8.00000000e00, 4.67140951e-02],
        [9.00000000e00, 1.31659099e-01],
        [1.00000000e01, 1.25216050e-01],
    ]
)

rightmost = arr[-1][0]

bin_sizes = [2, 3, 5]
for size in bin_sizes:
    result = []
    for i in range(0, int(rightmost), size):
        bound = min(rightmost, i + size)
        freq = arr[i : i + size, 1].sum()

        result.append((bound, freq))

    print(np.array(result), end="\n\n")

This produces the following output:

[[ 2.          0.1303026 ]
 [ 4.          0.16346606]
 [ 6.          0.32787427]
 [ 8.          0.12148192]
 [10.          0.25687515]]

[[ 3.          0.1325798 ]
 [ 6.          0.48906313]
 [ 9.          0.25314101]
 [10.          0.12521605]]

[[ 5.          0.41726535]
 [10.          0.58273465]]
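If the Python loop turns out to be a bottleneck, the same merging can probably be done without it using np.add.reduceat, which sums the slices between consecutive start indices. This is only a sketch under the same assumptions as the loop above (equal-width input bins, a possibly shorter last group); the variable names are my own.

```python
import numpy as np

# Histogram from the question, split into edges and frequencies
edges = np.arange(1.0, 11.0)
freqs = np.array([9.76765797e-02, 3.26260189e-02, 2.27720518e-03,
                  1.61188858e-01, 1.23496687e-01, 2.04377586e-01,
                  7.47678209e-02, 4.67140951e-02, 1.31659099e-01,
                  1.25216050e-01])

size = 3  # number of old bins per merged bin

# Index of the first old bin in each group: [0, 3, 6, 9]
starts = np.arange(0, len(freqs), size)

# Sums freqs[0:3], freqs[3:6], freqs[6:9], freqs[9:]
merged_freqs = np.add.reduceat(freqs, starts)

# Right edge of each group is the edge of its last old bin;
# the final group may hold fewer than `size` bins
last = np.minimum(starts + size, len(freqs)) - 1
merged = np.column_stack((edges[last], merged_freqs))
print(merged)
```

For size = 3 this should reproduce the second block of the output above, including the leftover bin ending at 10.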

CodePudding user response：

In the end, I came up with this. Not terribly efficient, though, I'm afraid:

from numpy import array, asarray, array_split, column_stack

data=array([[1.00000000e+00, 9.76765797e-02],
   [2.00000000e+00, 3.26260189e-02],
   [3.00000000e+00, 2.27720518e-03],
   [4.00000000e+00, 1.61188858e-01],
   [5.00000000e+00, 1.23496687e-01],
   [6.00000000e+00, 2.04377586e-01],
   [7.00000000e+00, 7.47678209e-02],
   [8.00000000e+00, 4.67140951e-02],
   [9.00000000e+00, 1.31659099e-01],
   [1.00000000e+01, 1.25216050e-01]])

bin_size=2.

x=data[:,0]
y=data[:,1]
# number of merged bins; array_split expects an integer count
nbins=int(max(x)/bin_size)
# right edge of each merged bin is the largest edge in its group
x_merge=asarray([max(a) for a in array_split(x,nbins)])
# frequency of each merged bin is the sum over its group
y_merge=asarray([sum(a) for a in array_split(y,nbins)])
out_array=column_stack((x_merge,y_merge))

Still interested in more efficient/compact ways to do this.