I have a dataframe containing a 'COUNTRY' column. I want to insert into the dataframe a new column, 'REGION', that includes in each row a list of regions that the country belongs to. For example, if the 'COUNTRY' column is:
In [5]: df['COUNTRY']
Out[5]:
0 ITALY
1 UNITED STATES
2 CHINA
3 FRANCE
4 BRAZIL
Name: COUNTRY, dtype: object
I want the 'REGION' column to be:
df['REGION']
Out[8]:
0 [ROW, EU]
1 [UNITED STATES]
2 [ROW, ASIA]
3 [ROW, EU]
4 [ROW]
Name: REGION, dtype: object
'ROW' stands for rest-of-the-world, 'EU' for Europe, etc.
I tried the following with np.select(), but I am getting an error.
# insert region column
condlist = [
(df.COUNTRY == 'ITALY'),
(df.COUNTRY == 'UNITED STATES'),
(df.COUNTRY != 'UNITED STATES'),
(df.COUNTRY == 'CHINA'),
]
region = [
['ROW', 'EU'],
['UNITED STATES'],
['ROW'],
['ROW', 'ASIA']
]
print(np.select(condlist, region, ['ROW']))
Note that for a country that is not included in the condlist[] tests, I want to assign it the region ['ROW'].
The error I am getting is:
ValueError: shape mismatch: objects cannot be broadcast to a single shape
CodePudding user response:
Here's a way I believe. Unfortunately I wasn't able to get the different sized lists into the Dataframe so you'd just have to do a query on the contents of Region to determine if "ROW" and "EU" are present.
countries = ['ITALY', 'UNITED STATES', 'CHINA', 'FRANCE', 'BRAZIL']
regions = [['ROW, EU'], ['UNITED STATES'], ['ROW, ASIA'], ['ROW, EU'], ['ROW']]
region_map = dict(zip(countries, regions))
df = pd.DataFrame(data={'COUNTRY': countries})
df['REGION'] = ""
for country in region_map:
df.loc[df['COUNTRY'] == country, 'REGION'] = region_map[country]
print(df)
COUNTRY REGION
0 ITALY ROW, EU
1 UNITED STATES UNITED STATES
2 CHINA ROW, ASIA
3 FRANCE ROW, EU
4 BRAZIL ROW
CodePudding user response:
The region_map construction proposed by @nanobennett gets a little hard to manage as the number of countries and regions increases. Also, using the pd.Series.map would make the solution more pythonic. What I like about the solution by @nanobennett is the use of a string for the regions instead of a list. Using a string for the region makes it easier to insert the region into a dataframe column and manipulate using the .str accessors. Here is what I ended up doing:
regions = {}
for country in countries:
# test for each country once
if country not in regions:
region = ''
# ROW
if country != 'UNITED STATES':
region = 'ROW, '
# all other countries
if country in alant.local.EU:
region = 'EU, '
if country in alant.local.DACH:
region = 'DACH, '
elif country in alant.local.NORDIC:
region = 'NORDIC, '
elif country in alant.local.BENELUX:
region = 'BENELUX, '
elif country in alant.local.SA:
region = 'SA, '
elif country in ['CANADA', 'UNITED STATES']:
region = 'NA, '
if country == 'UNITED STATES':
region = 'US, '
# special country blocs
if country in alant.local.BRICS:
region = 'BRICS, '
regions[country] = region.rstrip(' ,')
return countries.map(lambda country: regions.get(country, 'ROW'))
where for the lists used in the if elif tests:
alant.local.EU is a list of the European countries, alant.local.DACH are a list of the DACH countries, i.e., ['AUSTRIA', 'GERMANY', 'SWITZERLAND'], etc.
I believe this solution can be improved, perhaps using the 3.10 switch statement, and the pycountry or other packages I am not aware of. but for now it does the job for me.
