I have the following pandas dataframe:
pd.DataFrame({'keys': {3: 'brandId', 5: 'price', 14: 'sizes', 18: 'brandId', 20: 'price', 29: 'sizes', 30: 'condition', 31: 'condition', 32: 'colour', 33: 'age', 36: 'brand', 40: 'colour', 41: 'brand', 44: 'productType', 50: 'brandId', 52: 'price', 61: 'sizes', 62: 'condition', 63: 'colour', 64: 'age', 67: 'brand', 70: 'productType'}, 'values': {3: 925, 5: {'currencyName': 'GBP', 'priceAmount': '50.00', 'nationalShippingCost': '3.00'}, 14: {'id': 4, 'name': 'UK 4', 'quantity': 1}, 18: 925, 20: {'currencyName': 'GBP', 'priceAmount': '11.00', 'nationalShippingCost': '0.00'}, 29: {'id': 3, 'name': 'S', 'quantity': 1}, 30: {'id': 'used_like_new', 'name': 'Like new'}, 31: {'id': 'brand_new', 'name': 'Brand new'}, 32: {'id': 'multi', 'name': 'Multi'}, 33: {'id': 'modern', 'name': 'Modern'}, 36: 'chinese-laundry', 40: {'id': 'white', 'name': 'White'}, 41: 'chinese-laundry', 44: 'tshirts', 50: 925, 52: {'currencyName': 'GBP', 'priceAmount': '20.00', 'nationalShippingCost': '3.00'}, 61: {'id': 11, 'name': 'M', 'quantity': 1}, 62: {'id': 'brand_new', 'name': 'Brand new'}, 63: {'id': 'black', 'name': 'Black'}, 64: {'id': '90s', 'name': '90s'}, 67: 'chinese-laundry', 70: 'jackets'}})
Which looks like this:
keys values
3 brandId 925
5 price {'currencyName': 'GBP', 'priceAmount': '50.00'...
14 sizes {'id': 4, 'name': 'UK 4', 'quantity': 1}
18 brandId 925
20 price {'currencyName': 'GBP', 'priceAmount': '11.00'...
29 sizes {'id': 3, 'name': 'S', 'quantity': 1}
30 condition {'id': 'used_like_new', 'name': 'Like new'}
...
I want to flatten out the dictionary for specific values belonging to their key. For example, grab only the value from priceAmount, and the value from name in any other dictionary key.
So expected output:
keys values
3 brandId 925
5 price 50.00
14 sizes UK 4
18 brandId 925
20 price 11.00
29 sizes S
30 condition Like new}
I can do this with the following, which takes an awfully long time if I had more to replace!
price_data = []
for price in data[data['keys'].str.contains('price', na=False)].values:
price_data.append(price[1]['priceAmount'])
condition_data = []
for condition in data[data['keys'].str.contains('condition', na=False)].values:
condition_data.append(condition[1]['name'])
age_data = []
for age in data[data['keys'].str.contains('age', na=False)].values:
age_data.append(age[1]['name'])
sizes_data = []
for sizes in data[data['keys'].str.contains('sizes', na=False)].values:
sizes_data.append(sizes[1]['name'])
colour_data = []
for colour in data[data['keys'].str.contains('colour', na=False)].values:
colour_data.append(colour[1]['name'])
#replace the values
data=data.replace(data[data['keys'].str.contains('price', na=False)]['values'].values, price_data)
data=data.replace(data[data['keys'].str.contains('condition', na=False)]['values'].values, condition_data)
data=data.replace(data[data['keys'].str.contains('age', na=False)]['values'].values, age_data)
data=data.replace(data[data['keys'].str.contains('sizes', na=False)]['values'].values, sizes_data)
data=data.replace(data[data['keys'].str.contains('colour', na=False)]['values'].values, colour_data)
Is there a faster and smoother alternative to this?
CodePudding user response:
Perhaps, if you have access to the dict that you used to make the df, you can instead use json_normalize().
For example:
d = {
'keys': {
3: 'brandId', 5: 'price', 14: 'sizes', 18: 'brandId', 20: 'price', 29: 'sizes', 30: 'condition',
31: 'condition', 32: 'colour', 33: 'age', 36: 'brand', 40: 'colour', 41: 'brand', 44: 'productType',
50: 'brandId', 52: 'price', 61: 'sizes', 62: 'condition', 63: 'colour', 64: 'age', 67: 'brand',
70: 'productType',
},
'values': {
3: 925, 5: {'currencyName': 'GBP', 'priceAmount': '50.00', 'nationalShippingCost': '3.00'},
14: {'id': 4, 'name': 'UK 4', 'quantity': 1}, 18: 925,
20: {'currencyName': 'GBP', 'priceAmount': '11.00', 'nationalShippingCost': '0.00'},
29: {'id': 3, 'name': 'S', 'quantity': 1}, 30: {'id': 'used_like_new', 'name': 'Like new'},
31: {'id': 'brand_new', 'name': 'Brand new'}, 32: {'id': 'multi', 'name': 'Multi'}, 33:
{'id': 'modern', 'name': 'Modern'}, 36: 'chinese-laundry', 40: {'id': 'white', 'name': 'White'},
41: 'chinese-laundry', 44: 'tshirts', 50: 925,
52: {'currencyName': 'GBP', 'priceAmount': '20.00', 'nationalShippingCost': '3.00'},
61: {'id': 11, 'name': 'M', 'quantity': 1}, 62: {'id': 'brand_new', 'name': 'Brand new'},
63: {'id': 'black', 'name': 'Black'}, 64: {'id': '90s', 'name': '90s'}, 67: 'chinese-laundry',
70: 'jackets',
},
}
Note that this dict is in a bit of an unusual form, with keys and values separated at the top level. In order to put them together, so that json_normalize() can be used, we want to merge them so that all records are complete (keys and values). That factors out the numeric keys for each. Note, since I assume there are many records in values (probably a list of dicts?), you would have to do that on each of them.
>>> mod_d = {d['keys'][i]: v for i, v in d['values'].items()}
>>> mod_d
{'brandId': 925,
'price': {'currencyName': 'GBP',
'priceAmount': '20.00',
'nationalShippingCost': '3.00'},
'sizes': {'id': 11, 'name': 'M', 'quantity': 1},
'condition': {'id': 'brand_new', 'name': 'Brand new'},
'colour': {'id': 'black', 'name': 'Black'},
'age': {'id': '90s', 'name': '90s'},
'brand': 'chinese-laundry',
'productType': 'jackets'}
With this, we can now use json_normalize():
>>> df = pd.json_normalize(mod_d)
>>> df
brandId brand productType price.currencyName price.priceAmount \
0 925 chinese-laundry jackets GBP 20.00
price.nationalShippingCost sizes.id sizes.name sizes.quantity \
0 3.00 11 M 1
condition.id condition.name colour.id colour.name age.id age.name
0 brand_new Brand new black Black 90s 90s
,
CodePudding user response:
pandas string methods allow accessing values from lists/tuples/dictionaries:
df['val'] = np.where(df['keys'] == 'price',
df['values'].str['priceAmount'],
df['values'].str['name'])
df['val'] = df['val'].fillna(df['values'])
keys values val
3 brandId 925 925
5 price {'currencyName': 'GBP', 'priceAmount': '50.00'... 50.00
14 sizes {'id': 4, 'name': 'UK 4', 'quantity': 1} UK 4
18 brandId 925 925
20 price {'currencyName': 'GBP', 'priceAmount': '11.00'... 11.00
29 sizes {'id': 3, 'name': 'S', 'quantity': 1} S
30 condition {'id': 'used_like_new', 'name': 'Like new'} Like new
31 condition {'id': 'brand_new', 'name': 'Brand new'} Brand new
32 colour {'id': 'multi', 'name': 'Multi'} Multi
33 age {'id': 'modern', 'name': 'Modern'} Modern
36 brand chinese-laundry chinese-laundry
40 colour {'id': 'white', 'name': 'White'} White
41 brand chinese-laundry chinese-laundry
44 productType tshirts tshirts
50 brandId 925 925
52 price {'currencyName': 'GBP', 'priceAmount': '20.00'... 20.00
61 sizes {'id': 11, 'name': 'M', 'quantity': 1} M
62 condition {'id': 'brand_new', 'name': 'Brand new'} Brand new
63 colour {'id': 'black', 'name': 'Black'} Black
64 age {'id': '90s', 'name': '90s'} 90s
67 brand chinese-laundry chinese-laundry
70 productType jackets jackets
CodePudding user response:
Another option is to use a simple list comprehension:
df['values'] = [i.get('priceAmount') or i.get('name') if isinstance(i, dict) else i for i in df['values'].tolist()]
Output:
keys values
3 brandId 925
5 price 50.00
14 sizes UK 4
18 brandId 925
20 price 11.00
29 sizes S
30 condition Like new
31 condition Brand new
32 colour Multi
33 age Modern
36 brand chinese-laundry
40 colour White
41 brand chinese-laundry
44 productType tshirts
50 brandId 925
52 price 20.00
61 sizes M
62 condition Brand new
63 colour Black
64 age 90s
67 brand chinese-laundry
70 productType jackets
