Trying to convert string values to array in pandas dataframe-CodePudding

I have a dataframe that looks like this

Company   CompanyDetails
A         [{"companyId": 1482, "companyAddress": 'sampleaddress1', "numOfEmployees": 500}]
B         [{"companyId": 1437, "companyAddress": 'sampleaddress2', "numOfEmployees": 50}]
C         [{"companyId": 1452, "companyAddress": 'sampleaddress3', "numOfEmployees": 10000}]

When I execute df.dtypes I find that both the Company and CompanyDetails columns are objects.

df[['CompanyDetails']].iloc[0, :] would return '{["companyId": 1482, "companyAddress": 'sampleaddress1', numOfEmployees: 500]}' (there will be quotes ' ' around my array").

I am trying to extract the details within the dictionary in the CompanyDetails column so that I can add new columns to my dataframe to look like this:

Company   CompanyId   CompanyAddress   numOfEmployees   
A          1482      'sampleaddress1'         500
B          1437      'sampleaddress2'         50
C          1452      'sampleaddress3'       10000

I tried something like this as I was trying to convert the CompanyDetails column to contain arrays for all my values so I can easily extract each property in the object.

import ast
df['CompanyDetails'] = df['CompanyDetails'].apply(ast.literal_eval)

However, the above code caused this error

ValueError: malformed node or string: <ast.Name object at 0x000002D73D0C13A0>

Would appreciate any help on this, thanks!

CodePudding user response：

You're getting the error because it's actually not valid JSON. numOfEmployees is not quoted, and JSON required ALL keys to be double-quoted.

The easiest, safest (in terms of likelyhood to break) way I can think of to fix this would be to repair the JSON using a regular expression replace:

df['CompanyDetails'] = df['CompanyDetails'].str.replace(r',\s*(\w )\s*:', r', "\1":', regex=True)

Then do your other stuff:

import ast
df['CompanyDetails'] = df['CompanyDetails'].apply(ast.literal_eval)
df = pd.concat([df.drop('CompanyDetails', axis=1), pd.json_normalize(df['CompanyDetails'].explode())], axis=1)

...or whatever you have in mind.

CodePudding user response：

You can use

import pandas as pd

# Test dataframe
df = pd.DataFrame({'Company':['A'], 'CompanyDetails':[[{"companyId": 1482, "companyAddress": 'sampleaddress1', "numOfEmployees": 500}]]})

df['CompanyDetails'] = df['CompanyDetails'].str[0]
df = pd.concat([df.drop(['CompanyDetails'], axis=1), df['CompanyDetails'].apply(pd.Series)], axis=1)
# => >>> df
#      Company  companyId  companyAddress  numOfEmployees
#    0       A       1482  sampleaddress1             500

Note:

df['CompanyDetails'] = df['CompanyDetails'].str[0] gets the first item from each list since each of them only contains one item
pd.concat([df.drop(['CompanyDetails'], axis=1), df['CompanyDetails'].apply(pd.Series)], axis=1) does the actual expansion and merging with the current dataframe.