Mapping country name to multiple-region designation-CodePudding

I have a dataframe containing a 'COUNTRY' column. I want to insert into the dataframe a new column, 'REGION', that includes in each row a list of regions that the country belongs to. For example, if the 'COUNTRY' column is:

In [5]: df['COUNTRY']
Out[5]: 
0            ITALY
1    UNITED STATES
2            CHINA
3           FRANCE
4           BRAZIL
Name: COUNTRY, dtype: object

I want the 'REGION' column to be:

df['REGION']
Out[8]: 
0          [ROW, EU]
1    [UNITED STATES]
2        [ROW, ASIA]
3          [ROW, EU]
4          [ROW]
Name: REGION, dtype: object

'ROW' stands for rest-of-the-world, 'EU' for Europe, etc.

I tried the following with np.select(), but I am getting an error.

    # insert region column
    condlist = [
        (df.COUNTRY == 'ITALY'),
        (df.COUNTRY == 'UNITED STATES'),
        (df.COUNTRY != 'UNITED STATES'),
        (df.COUNTRY == 'CHINA'),
    ]
    region = [
        ['ROW', 'EU'],
        ['UNITED STATES'],
        ['ROW'],
        ['ROW', 'ASIA']
    ]
    print(np.select(condlist, region, ['ROW']))

Note that for a country that is not included in the condlist[] tests, I want to assign it the region ['ROW'].

The error I am getting is:

ValueError: shape mismatch: objects cannot be broadcast to a single shape

CodePudding user response：

Here's a way I believe. Unfortunately I wasn't able to get the different sized lists into the Dataframe so you'd just have to do a query on the contents of Region to determine if "ROW" and "EU" are present.

countries = ['ITALY', 'UNITED STATES', 'CHINA', 'FRANCE', 'BRAZIL']

regions = [['ROW, EU'], ['UNITED STATES'], ['ROW, ASIA'], ['ROW, EU'], ['ROW']]

region_map = dict(zip(countries, regions))

df = pd.DataFrame(data={'COUNTRY': countries})

df['REGION'] = ""

for country in region_map:
    df.loc[df['COUNTRY'] == country, 'REGION'] = region_map[country]

print(df)

         COUNTRY         REGION
0          ITALY        ROW, EU
1  UNITED STATES  UNITED STATES
2          CHINA      ROW, ASIA
3         FRANCE        ROW, EU
4         BRAZIL            ROW

CodePudding user response：

The region_map construction proposed by @nanobennett gets a little hard to manage as the number of countries and regions increases. Also, using the pd.Series.map would make the solution more pythonic. What I like about the solution by @nanobennett is the use of a string for the regions instead of a list. Using a string for the region makes it easier to insert the region into a dataframe column and manipulate using the .str accessors. Here is what I ended up doing:

   regions = {}
   for country in countries:

       # test for each country once
       if country not in regions:

           region = ''

           # ROW
           if country != 'UNITED STATES':
               region  = 'ROW, '

           # all other countries
           if country in alant.local.EU:
               region  = 'EU, '
               if country in alant.local.DACH:
                   region  = 'DACH, '
               elif country in alant.local.NORDIC:
                   region  = 'NORDIC, '
               elif country in alant.local.BENELUX:
                   region  = 'BENELUX, '
           elif country in alant.local.SA:
               region  = 'SA, '
           elif country in ['CANADA', 'UNITED STATES']:
               region  = 'NA, '
               if country == 'UNITED STATES':
                   region  = 'US, '

           # special country blocs
           if country in alant.local.BRICS:
               region  = 'BRICS, '

           regions[country] = region.rstrip(' ,')

   return countries.map(lambda country: regions.get(country, 'ROW'))

where for the lists used in the if elif tests: alant.local.EU is a list of the European countries, alant.local.DACH are a list of the DACH countries, i.e., ['AUSTRIA', 'GERMANY', 'SWITZERLAND'], etc.

I believe this solution can be improved, perhaps using the 3.10 switch statement, and the pycountry or other packages I am not aware of. but for now it does the job for me.