I am using pyodbc to perform queries to a large SQL Server database. Because SQL can not hold native numpy arrays, data is stored in regular strings representative of numpy arrays. When I load this data to a DataFrame, it is as follows:
Id Array1 Array2
1 -21.722315 11.017685 -23.340452 2754.642 481.94247 21.728323
...
149001 1.611342 1.526262 -35.415166 6252.124 61.51516 852.15167
However, I then want to perform operations on Array1 and Array2, so I need to convert them to actual numpy arrays. My current way of doing this is applying np.fromstring to the entire dataset column.
df['Array1'] = df['Array1'].apply(lambda x: np.fromstring(x, dtype=np.float32, sep = ' '))
df['Array2'] = df['Array2'].apply(lambda x: np.fromstring(x, dtype=np.float32, sep = ' '))
# Elapsed Time: 9.524s
Result:
Id Array1 Array2
1 [-21.722315, 11.017685, -23.340452] [2754.642, 481.94247, 21.728323]
...
149001 [1.611342, 1.526262, -35.415166] [6252.124, 61.51516, 852.15167]
While this code works, I don't believe it is efficient nor scalable. Are there more computationally efficient ways of transforming a large amount of data in numpy arrays?
CodePudding user response:
Using tobytes() to get the underlying bytestring and frombuffer() to convert back will be much more efficient. One detail is that the array dimensions are lost after tobytes(), so you may need to resize the after frombuffer().
