Home > Software engineering >  Trying to convert string values to array in pandas dataframe
Trying to convert string values to array in pandas dataframe

Time:02-01

I have a dataframe that looks like this

Company   CompanyDetails
A         [{"companyId": 1482, "companyAddress": 'sampleaddress1', "numOfEmployees": 500}]
B         [{"companyId": 1437, "companyAddress": 'sampleaddress2', "numOfEmployees": 50}]
C         [{"companyId": 1452, "companyAddress": 'sampleaddress3', "numOfEmployees": 10000}]

When I execute df.dtypes I find that both the Company and CompanyDetails columns are objects.

df[['CompanyDetails']].iloc[0, :] would return '{["companyId": 1482, "companyAddress": 'sampleaddress1', numOfEmployees: 500]}' (there will be quotes ' ' around my array").

I am trying to extract the details within the dictionary in the CompanyDetails column so that I can add new columns to my dataframe to look like this:

Company   CompanyId   CompanyAddress   numOfEmployees   
A          1482      'sampleaddress1'         500
B          1437      'sampleaddress2'         50
C          1452      'sampleaddress3'       10000

I tried something like this as I was trying to convert the CompanyDetails column to contain arrays for all my values so I can easily extract each property in the object.

import ast
df['CompanyDetails'] = df['CompanyDetails'].apply(ast.literal_eval)

However, the above code caused this error

ValueError: malformed node or string: <ast.Name object at 0x000002D73D0C13A0>

Would appreciate any help on this, thanks!

CodePudding user response:

You're getting the error because it's actually not valid JSON. numOfEmployees is not quoted, and JSON required ALL keys to be double-quoted.

The easiest, safest (in terms of likelyhood to break) way I can think of to fix this would be to repair the JSON using a regular expression replace:

df['CompanyDetails'] = df['CompanyDetails'].str.replace(r',\s*(\w )\s*:', r', "\1":', regex=True)

Then do your other stuff:

import ast
df['CompanyDetails'] = df['CompanyDetails'].apply(ast.literal_eval)
df = pd.concat([df.drop('CompanyDetails', axis=1), pd.json_normalize(df['CompanyDetails'].explode())], axis=1)

...or whatever you have in mind.

CodePudding user response:

You can use

import pandas as pd

# Test dataframe
df = pd.DataFrame({'Company':['A'], 'CompanyDetails':[[{"companyId": 1482, "companyAddress": 'sampleaddress1', "numOfEmployees": 500}]]})

df['CompanyDetails'] = df['CompanyDetails'].str[0]
df = pd.concat([df.drop(['CompanyDetails'], axis=1), df['CompanyDetails'].apply(pd.Series)], axis=1)
# => >>> df
#      Company  companyId  companyAddress  numOfEmployees
#    0       A       1482  sampleaddress1             500

Note:

  • df['CompanyDetails'] = df['CompanyDetails'].str[0] gets the first item from each list since each of them only contains one item
  • pd.concat([df.drop(['CompanyDetails'], axis=1), df['CompanyDetails'].apply(pd.Series)], axis=1) does the actual expansion and merging with the current dataframe.
  •  Tags:  
  • Related