i have this series which contains country,state,city and i would like to extract them accordingly- refer to the output table
| Region |
|---|
| US* |
| Arizona** |
| Phoenix |
| Mesa |
| California** |
| Los Angeles |
| San Diego |
| Sacramento |
| Florida** |
| Tampa |
| Miami |
| Canada* |
| Central Canada** |
| Montreal |
| London |
my desired output
| Region | State | City | |
|---|---|---|---|
| US* | Arizona** | Phoenix | |
| US* | Arizona** | Mesa | |
| US* | California** | Los Angeles | |
| US* | California** | San Diego | |
| US* | California** | Sacramento | |
| US* | Florida** | Tampa | |
| US* | Florida** | Miami | |
| Canada* | Central Canada** | Montreal | |
| Canada* | Central Canada** | London |
is this even possible?
I tried some panda operations with isin() but failed miserably
CodePudding user response:
of course it's possible:
def split_by_country(region_list: pd.Series):
result = []
start_idx = None
for i, region in enumerate(region_list):
if region.endswith("*") and not region.endswith("**"):
if start_idx is None:
start_idx = i
elif isinstance(start_idx, int):
result.append(region_list[start_idx: i])
start_idx = i
result.append(region_list[start_idx:])
return result
countries = split_by_country(regions_s)
countries
Above code will splits the series/list of regions to list of lists. Every sublist starts (index 0) with country name. Then u can do something like that:
country_dict = {country[0]: split_by_region(country[1:])
for country in countries}
split_by_region is the same as split_by_country by with different condition (region.endswith("*") and not region.endswith("**") > region.endswith("**"))
and at the end to (belowe code i write without checking, so it may contains some syntax error) :
result_df = pd.DataFrame(columns=["country","subregion","city"])
for i, (country, subregions) in enumerate(country_dict.iteritems()):
for subregion, city in subregions.iteritems():
result_df.loc[i] = [country, subregion, city]
