I have a dataframe that looks like this
Company CompanyDetails
A [{"companyId": 1482, "companyAddress": 'sampleaddress1', "numOfEmployees": 500}]
B [{"companyId": 1437, "companyAddress": 'sampleaddress2', "numOfEmployees": 50}]
C [{"companyId": 1452, "companyAddress": 'sampleaddress3', "numOfEmployees": 10000}]
When I execute df.dtypes I find that both the Company and CompanyDetails columns are objects.
df[['CompanyDetails']].iloc[0, :] would return '{["companyId": 1482, "companyAddress": 'sampleaddress1', numOfEmployees: 500]}' (there will be quotes ' ' around my array").
I am trying to extract the details within the dictionary in the CompanyDetails column so that I can add new columns to my dataframe to look like this:
Company CompanyId CompanyAddress numOfEmployees
A 1482 'sampleaddress1' 500
B 1437 'sampleaddress2' 50
C 1452 'sampleaddress3' 10000
I tried something like this as I was trying to convert the CompanyDetails column to contain arrays for all my values so I can easily extract each property in the object.
import ast
df['CompanyDetails'] = df['CompanyDetails'].apply(ast.literal_eval)
However, the above code caused this error
ValueError: malformed node or string: <ast.Name object at 0x000002D73D0C13A0>
Would appreciate any help on this, thanks!
CodePudding user response:
You're getting the error because it's actually not valid JSON. numOfEmployees is not quoted, and JSON required ALL keys to be double-quoted.
The easiest, safest (in terms of likelyhood to break) way I can think of to fix this would be to repair the JSON using a regular expression replace:
df['CompanyDetails'] = df['CompanyDetails'].str.replace(r',\s*(\w )\s*:', r', "\1":', regex=True)
Then do your other stuff:
import ast
df['CompanyDetails'] = df['CompanyDetails'].apply(ast.literal_eval)
df = pd.concat([df.drop('CompanyDetails', axis=1), pd.json_normalize(df['CompanyDetails'].explode())], axis=1)
...or whatever you have in mind.
CodePudding user response:
You can use
import pandas as pd
# Test dataframe
df = pd.DataFrame({'Company':['A'], 'CompanyDetails':[[{"companyId": 1482, "companyAddress": 'sampleaddress1', "numOfEmployees": 500}]]})
df['CompanyDetails'] = df['CompanyDetails'].str[0]
df = pd.concat([df.drop(['CompanyDetails'], axis=1), df['CompanyDetails'].apply(pd.Series)], axis=1)
# => >>> df
# Company companyId companyAddress numOfEmployees
# 0 A 1482 sampleaddress1 500
Note:
df['CompanyDetails'] = df['CompanyDetails'].str[0]gets the first item from each list since each of them only contains one itempd.concat([df.drop(['CompanyDetails'], axis=1), df['CompanyDetails'].apply(pd.Series)], axis=1)does the actual expansion and merging with the current dataframe.
