Extract numpy array, from np.array of tuples containing tuples?

I have seen "Python get column vector from array of tuples", which I expected to answer my question, but it doesn't.

So, I've prepared an example based on an example in that post, which shows what I want to do, and where I get stuck:

import numpy as np

# based on https://stackoverflow.com/a/48716125/6197439
# arr is a numpy array of tuple "pairs" of floats

oarr = [(0.109, 0.5), (0.109, 0.55), (0.109, 0.6), (0.2, 0.4), (0.3, 0.5)]
arr = np.array(oarr)
print("arr type: {} shape: {} dt {}".format(
  type(arr), arr.shape, arr.dtype))            # arr type: <class 'numpy.ndarray'> shape: (5, 2) dt float64
print("slice arr[:, 1]: {}".format(arr[:, 1])) # slice arr[:, 1]: [0.5  0.55 0.6  0.4  0.5 ]
print("slice arr[0, :]: {}".format(arr[0, :])) # slice arr[0, :]: [0.109 0.5  ]
print("arr len: {}".format(len(arr)))          # arr len: 5

# arr2, instead, becomes a numpy array of tuple "pairs", 
# with first element tuple of string and float, and second element float
# arr2 can still be sliced by numpy fine:

oarr2 = []
for ix in range(len(arr)):
  oarr2.append( ( (str(oarr[ix][0]), oarr[ix][0]), oarr[ix][1] ) )
arr2 = np.array( oarr2, dtype=object )

print("arr2 type: {} shape: {} dt {}".format(
  type(arr2), arr2.shape, arr2.dtype))           # arr2 type: <class 'numpy.ndarray'> shape: (5, 2) dt object
print("slice arr2[:, 1]: {}".format(arr2[:, 1])) # slice arr2[:, 1]: [0.5 0.55 0.6 0.4 0.5]
print("slice arr2[0, :]: {}".format(arr2[0, :])) # slice arr2[0, :]: [('0.109', 0.109) 0.5]
print("arr2 len: {}".format(len(arr2)))          # arr2 len: 5

# arr2fc is where we attempt to extract the tuples in arr2 "first column",
# using numpy slicing syntax.
# arr2fc is now a numpy array of objects, as previously,
# but these objects (tuple pairs of string and float),
# are now *not* considered objects with lengths, (see .shape below)
# so extracting e.g. the first column (the string element) 
# of the tuple, with numpy slicing syntax, fails: 

arr2fc = arr2[:, 0]

print(arr2fc)                                        # [('0.109', 0.109) ('0.109', 0.109) ('0.109', 0.109) ('0.2', 0.2) ('0.3', 0.3)]
print("arr2fc type: {} shape: {} dt {}".format(
  type(arr2fc), arr2fc.shape, arr2fc.dtype))         # arr2fc type: <class 'numpy.ndarray'> shape: (5,) dt object
print("slice arr2fc[:, 1]: {}".format(arr2fc[:, 1])) # IndexError: too many indices for array: array is 1-dimensional, but 2 were indexed

Basically, I'd like to extract the "columns" formed by the tuples in arr2fc as separate numpy arrays. From the column formed by the first (string) element of each tuple, I'd like to get a numpy array of objects (here, strings):

[ '0.109', '0.109', '0.109', '0.2', '0.3' ]

... and from the column formed by the second (float) element of each tuple, I'd like to get a numpy array of floats:

[ 0.109, 0.109, 0.109, 0.2, 0.3 ]

Sure, I can always write a Python loop that populates an empty list, and then convert that list to a numpy array. However, is there something like numpy slicing syntax that would let me extract these "columns" with a one-liner, avoiding Python loops?

CodePudding user response：

For that you might want to use np.vectorize. It wraps ("vectorizes") a function so that it can be applied to an input array and produce a new array or a tuple of arrays. For your example that could look like:


vectorized_split = np.vectorize(lambda x: (x[0],x[1]))
string_array,float_array = vectorized_split(arr2fc)

Note that this will not give you any of numpy's vectorization performance gains, since np.vectorize essentially runs a Python-level loop under the hood. However, when true vectorization is not possible, as in this case, it at least keeps the code compact.
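
If the dtypes of the two results matter, np.vectorize also accepts an otypes argument. The following is a minimal sketch, not a definitive recipe: it rebuilds arr2fc from the question so that it runs on its own, and the choice of object for the strings and float for the numbers is an assumption about what is wanted.

```python
import numpy as np

# Rebuild arr2fc as in the question: a 1d object array of (str, float) tuples.
oarr = [(0.109, 0.5), (0.109, 0.55), (0.109, 0.6), (0.2, 0.4), (0.3, 0.5)]
oarr2 = [((str(a), a), b) for a, b in oarr]
arr2fc = np.array(oarr2, dtype=object)[:, 0]

# otypes fixes the output dtypes up front (assumed here: object and float).
vectorized_split = np.vectorize(lambda x: (x[0], x[1]), otypes=[object, float])
string_array, float_array = vectorized_split(arr2fc)

print(string_array.dtype, string_array)
print(float_array.dtype, float_array)
```

Without otypes, np.vectorize infers the output types by calling the function on the first element, so stating them explicitly makes the result independent of what that first element happens to be.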

CodePudding user response：

Your code as displayed in ipython:

In [178]: oarr = [(0.109, 0.5), (0.109, 0.55), (0.109, 0.6), (0.2, 0.4), (0.3,0.5)]
     ...: arr = np.array(oarr)
In [179]: oarr
Out[179]: [(0.109, 0.5), (0.109, 0.55), (0.109, 0.6), (0.2, 0.4), (0.3, 0.5)]
In [180]: arr
Out[180]: 
array([[0.109, 0.5  ],
       [0.109, 0.55 ],
       [0.109, 0.6  ],
       [0.2  , 0.4  ],
       [0.3  , 0.5  ]])

So starting with a list of tuples, we get a 2d array, with float dtype. A list of lists would work the same way.

Your next array:

In [181]: oarr2 = []
     ...: for ix in range(len(arr)):
     ...:   oarr2.append( ( (str(oarr[ix][0]), oarr[ix][0]), oarr[ix][1] ) )
     ...: arr2 = np.array( oarr2, dtype=object )
In [182]: oarr2
Out[182]: 
[(('0.109', 0.109), 0.5),
 (('0.109', 0.109), 0.55),
 (('0.109', 0.109), 0.6),
 (('0.2', 0.2), 0.4),
 (('0.3', 0.3), 0.5)]
In [183]: arr2
Out[183]: 
array([[('0.109', 0.109), 0.5],
       [('0.109', 0.109), 0.55],
       [('0.109', 0.109), 0.6],
       [('0.2', 0.2), 0.4],
       [('0.3', 0.3), 0.5]], dtype=object)

Again a 2d array, (5,2), but with a tuple as one element in each row.

Selecting a column:

In [184]: arr2fc = arr2[:, 0]
In [185]: arr2fc
Out[185]: 
array([('0.109', 0.109), ('0.109', 0.109), ('0.109', 0.109), ('0.2', 0.2),
       ('0.3', 0.3)], dtype=object)
In [186]: _.shape
Out[186]: (5,)

A 1d array of objects - each a tuple.

Converting it back to a list, we can make a 2d array and again index a column:

In [187]: arr2fc.tolist()
Out[187]: 
[('0.109', 0.109),
 ('0.109', 0.109),
 ('0.109', 0.109),
 ('0.2', 0.2),
 ('0.3', 0.3)]
In [188]: np.array(arr2fc.tolist(),object)
Out[188]: 
array([['0.109', 0.109],
       ['0.109', 0.109],
       ['0.109', 0.109],
       ['0.2', 0.2],
       ['0.3', 0.3]], dtype=object)
In [189]: _[:,1]
Out[189]: array([0.109, 0.109, 0.109, 0.2, 0.3], dtype=object)

or with a list comprehension:

In [190]: [x[1] for x in arr2fc]
Out[190]: [0.109, 0.109, 0.109, 0.2, 0.3]
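
A related option, if named "columns" with their own dtypes are wanted, is to build a structured array from the same list of tuples. This is only a sketch: the field names 'label' and 'value' and the string width 'U10' are assumptions chosen for illustration, and the data still passes through a Python list first.

```python
import numpy as np

# Same content as arr2fc.tolist() in the transcript above.
pairs = [('0.109', 0.109), ('0.109', 0.109), ('0.109', 0.109),
         ('0.2', 0.2), ('0.3', 0.3)]

# Hypothetical field names; 'U10' limits each string to 10 characters.
rec = np.array(pairs, dtype=[('label', 'U10'), ('value', 'f8')])

labels = rec['label']   # string column
values = rec['value']   # float64 column, not dtype=object

print(rec.shape, labels, values)
```

Unlike the 2d object array, indexing a field here gives the float column a real float dtype.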

Multidimensional indexing only works on the dimensions shown by the shape. It does not "reach through" and index the objects, even if they are, by themselves, indexable.
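
A small sketch of that distinction, using a shortened (3-row) version of the question's data as an assumption: indexing the array and then the tuple works, while asking the array for a second dimension does not.

```python
import numpy as np

oarr2 = [(('0.109', 0.109), 0.5), (('0.2', 0.2), 0.4), (('0.3', 0.3), 0.5)]
arr2fc = np.array(oarr2, dtype=object)[:, 0]   # shape (3,), each element a tuple

first = arr2fc[0]      # array indexing: returns the stored tuple
inner = arr2fc[0][1]   # ordinary tuple indexing on that object

try:
    arr2fc[0, 1]       # asks the array itself for a second dimension
    reached_through = True
except IndexError:
    reached_through = False

print(first, inner, reached_through)
```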

Some comparative times:

In [194]: timeit string_array,float_array = vectorized_split(arr2fc)
31.5 µs ± 277 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [195]: timeit [x[1] for x in arr2fc]
1.57 µs ± 1.07 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [196]: timeit np.array(arr2fc.tolist(),object)[:,1]
3.77 µs ± 65 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

For this 5-element array, the "vectorize" method is by far the slowest, about 20 times slower than the list comprehension. For large arrays, "vectorize" speeds are closer to the list comprehension speeds.
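
To check how the comparison changes with size, the timings can be repeated on a larger array. The sketch below uses the standard timeit module instead of the IPython magic. The array size n, the repeat count, and the synthetic data are all assumptions, and the numbers it prints will depend on the machine and numpy version.

```python
import timeit
import numpy as np

n = 10_000  # hypothetical size; adjust as needed
arr2fc = np.empty(n, dtype=object)
for i, v in enumerate(np.linspace(0.0, 1.0, n).tolist()):
    arr2fc[i] = (str(v), v)   # store one (str, float) tuple per element

vectorized_split = np.vectorize(lambda x: (x[0], x[1]))

# Each candidate extracts the float "column" from the tuples.
candidates = {
    "vectorize": lambda: vectorized_split(arr2fc)[1],
    "list comprehension": lambda: [x[1] for x in arr2fc],
    "tolist + 2d object array": lambda: np.array(arr2fc.tolist(), object)[:, 1],
}

repeats = 20
for name, fn in candidates.items():
    seconds = timeit.timeit(fn, number=repeats) / repeats
    print(f"{name:>26}: {seconds * 1e6:10.1f} us per call")
```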