How to turn a Pandas Dataframe with Numpy Lists as a single column into having each index represent

Time:01-26

Hi all, I'm currently working with word vectors in Python and would like to run Bayesian Hierarchical Clustering in R, which seems to cluster only when each vector index is given its own column. I have code to retrieve the vectors, but each one comes back as a NumPy array stored in a single column:

label                                             vector  \
0       1 Crónicas  [ 5.26622403e-03,  2.76202578e-02, -2.03670934e-...   
1           1 Juan  [-4.13045213e-02, -3.40997241e-04,  6.59986138e-...   
2          1 Pedro  [ 1.93648413e-03,  7.61903543e-03,  5.45683019e-...   
3          1 Reyes  [-0.01713392,  0.01234968, -0.00780387,  0.013362...

Ideally I would want it to be something like this:

label               x1               x2               x3         \
0       1 Crónicas  5.26622403e-03   2.76202578e-02   -2.03670934e-...   
1           1 Juan  -4.13045213e-02  -3.40997241e-04  6.59986138e-...   
2          1 Pedro  1.93648413e-03   7.61903543e-03   5.45683019e-...   
3          1 Reyes  -0.01713392      0.01234968       -0.00780387...

Here's some reproducible code I came up with:

import pandas as pd
import random
import numpy as np

row_names = ["train", "car", "tractor", "truck", "boat", "plane"]
random_vectors = []

# Build one random 10-dimensional vector per label
for _ in row_names:
    vector = [random.uniform(0, 1) for _ in range(10)]
    random_vectors.append(np.array(vector))

label_DF = pd.DataFrame({'label':row_names, 'vector':random_vectors})

Any and all tips are welcome. Have a good day :)

CodePudding user response:

You can stack the column of NumPy arrays into a single 2D NumPy array and construct the final DataFrame from it:

import pandas as pd
import random
import numpy as np

row_names = ["train", "car", "tractor", "truck", "boat", "plane"]
random_vectors = []

# Build one random 10-dimensional vector per label
for _ in row_names:
    vector = [random.uniform(0, 1) for _ in range(10)]
    random_vectors.append(np.array(vector))

label_DF = pd.DataFrame({'label':row_names, 'vector':random_vectors})

# Create a 2D NumPy array of shape (n_rows, n_dims) from the column of 1D arrays
temp = label_DF.vector.values
temp = np.array(list(temp))

# Create final DataFrame using the numpy array
output = pd.DataFrame(temp, index=label_DF.index)
output['label'] = label_DF.label

print(output)

which gives me (the vectors are random, so your values will differ):

          0         1         2         3         4         5         6         7         8         9    label
0  0.971427  0.608333  0.415566  0.139951  0.870935  0.219539  0.972286  0.345405  0.567477  0.087404    train
1  0.568816  0.178477  0.497407  0.415878  0.356035  0.915570  0.119754  0.064307  0.327284  0.899719      car
2  0.947162  0.622367  0.930498  0.362429  0.177074  0.828043  0.434496  0.334775  0.586800  0.685099  tractor
3  0.790544  0.630087  0.323274  0.656123  0.462856  0.437417  0.908296  0.883913  0.028340  0.901321    truck
4  0.110653  0.647129  0.902092  0.597604  0.312707  0.688970  0.889833  0.874016  0.292510  0.256918     boat
5  0.364499  0.149350  0.275034  0.959932  0.890455  0.548498  0.476552  0.146530  0.273142  0.008246    plane
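
If you want the layout from the question (label first, then columns named x1, x2, ...), a variation along these lines should work. This is only a sketch: it assumes every vector has the same length, the names `matrix`, `columns` and `wide` are made up for illustration, and it uses `np.vstack` in place of `np.array(list(...))`, which should give the same 2D array here.

```python
import numpy as np
import pandas as pd

# Stand-in for the question's frame: one 1D NumPy array per row.
rng = np.random.default_rng(0)
row_names = ["train", "car", "tractor", "truck", "boat", "plane"]
label_DF = pd.DataFrame({
    "label": row_names,
    "vector": [rng.uniform(0, 1, size=10) for _ in row_names],
})

# Stack the per-row arrays into one (n_rows, n_dims) matrix.
# Assumption: all vectors have the same length.
matrix = np.vstack(label_DF["vector"].tolist())

# Name the columns x1..xN, following the desired output in the question.
columns = [f"x{i + 1}" for i in range(matrix.shape[1])]
wide = pd.DataFrame(matrix, columns=columns, index=label_DF.index)

# Put the label in the first position.
wide.insert(0, "label", label_DF["label"])

print(wide.head())
```

From there, something like `wide.to_csv(...)` would be one way to hand the table over to R, though I haven't tried that part with the clustering package.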