I have a df that looks similar to the one below, but with hundreds of columns, where each column references a different text resolution.
| Index | A/RES/73/262 | A/RES/73/263 |
|---|---|---|
| Issue-Primary | ME | HR |
| Issue-Secondary | NaN | NaN |
| Description | Protection of the Palestinian civilian | Situation of human rights in Myanmar |
I have a script that loops through each resolution description and sets a value in the "Issue-Primary" row based on a defined dictionary of search terms.
import re
import numpy as np
import pandas as pd

df = pd.read_excel("file_name")
issue_dict = {'human rights':'HR', 'Protection':'HR', 'Palestinian':'ME'}
key_pattern = re.compile(rf'({"|".join(issue_dict.keys())})')
for columnName, columnData in df.iteritems():
    matched_key = re.search(key_pattern, str(columnData[1]))
    if matched_key:
        columnData[0] = issue_dict.get(matched_key.group(), "Issue-Primary")
    else:
        columnData[0] = np.NaN
There are resolutions whose descriptions are relevant to multiple issues, so I would like to apply a similar script to my "Issue-Secondary" row. However, I need to ensure that when the script sets values for "Issue-Secondary", it ignores the terms relevant to the code already in "Issue-Primary".
For instance, "Protection" relates to 'HR', so in the first column after the script sets the value for "ME" in "Issue-Primary", it should return to "Issue-Secondary" and apply the same process without looking for "ME" terms (otherwise it simply duplicates the "ME"). Ideally, "Issue-Primary" will be set to "ME" and "Issue-Secondary" gets "HR" in the first resolution column.
Below is my first attempt. However, as you can imagine, since I am using the same issue_dict, it doesn't produce the desired results. I am unsure what the best way is to ensure that whatever term is in "Issue-Primary" gets left out when applying the script to "Issue-Secondary".
issue_codes_list = ['HR', 'ME']
key_pattern = re.compile(rf'({"|".join(issue_dict.keys())})')
for columnName, columnData in df.iteritems():
    matched_key = re.search(key_pattern, str(columnData[1]))
    if matched_key:
        columnData[0] = issue_dict.get(matched_key.group(), "Issue-Primary")
    else:
        columnData[0] = np.NaN
    for j in issue_codes_list:
        if columnData[0] == j:
            if matched_key:
                columnData[1] = issue_dict.get(matched_key.group(), "Issue-Secondary")
            else:
                columnData[1] = np.NaN
CodePudding user response:
You can use re.findall to find all occurrences of the strings to match at once, and then only keep unique values:
import re
import pandas as pd

df = pd.DataFrame(index=['Issue-Primary', 'Issue-Secondary', 'Description'], columns=['a', 'b'])
df.loc['Description'] = ['Protection of the Palestinian civilian',
                         'Situation of human rights in Myanmar']
issue_dict = {'human rights': 'HR', 'Protection': 'HR', 'Palestinian': 'ME'}
key_pattern = re.compile(rf'({"|".join(issue_dict.keys())})')
# items() replaces iteritems(), which was removed in pandas 2.0
for columnName, columnData in df.items():
    # findall returns every matched term, in order of appearance
    matched_key = re.findall(key_pattern, str(columnData['Description']))
    if matched_key:
        all_issues = [issue_dict.get(x, "Issue-Primary") for x in matched_key]
        # drop duplicate codes while keeping first-seen order
        issues = list()
        for i in all_issues:
            if i not in issues:
                issues.append(i)
        # write through df.loc so the DataFrame itself is updated
        if len(issues) > 0:
            df.loc['Issue-Primary', columnName] = issues[0]
        if len(issues) > 1:
            df.loc['Issue-Secondary', columnName] = issues[1]
returns for df:
| | a | b |
|---|---|---|
| Issue-Primary | HR | HR |
| Issue-Secondary | ME | NaN |
| Description | Protection of the Palestinian civilian | Situation of human rights in Myanmar |
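The order of the codes in each column follows the order in which the terms appear in the description, since re.findall scans the text from left to right. A minimal check using only the standard library (the pattern here is written without a capture group, which is my own simplification):

```python
import re

issue_dict = {'human rights': 'HR', 'Protection': 'HR', 'Palestinian': 'ME'}
key_pattern = re.compile("|".join(issue_dict.keys()))

description = 'Protection of the Palestinian civilian'
# findall reports non-overlapping matches from left to right
matched_terms = key_pattern.findall(description)
matched_codes = [issue_dict[term] for term in matched_terms]
print(matched_terms)  # ['Protection', 'Palestinian']
print(matched_codes)  # ['HR', 'ME']
```

'Protection' comes first in the text, so 'HR' ends up in "Issue-Primary" for that column.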
I am not sure how you are getting 'ME' for Issue-Primary in the first column though. If the ordering doesn't matter, you can save a few lines by replacing the block that starts at all_issues with
issues = list(set([issue_dict.get(x, "Issue-Primary") for x in matched_key]))
set() will remove duplicates.
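If the ordering does matter, one possible alternative is dict.fromkeys, which also drops duplicates but keeps the first-seen order (dictionaries preserve insertion order from Python 3.7 on). A small sketch with a made-up list of matched codes:

```python
# made-up example of codes collected from the matched terms
all_issues = ['HR', 'ME', 'HR']

# dict keys are unique and keep insertion order, so this de-duplicates in order
issues = list(dict.fromkeys(all_issues))
print(issues)  # ['HR', 'ME']
```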
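I am also unsure whether issue_dict.get(matched_key.group(), "Issue-Primary") gains you anything over plain issue_dict[matched_key.group()]: the pattern is built from the dictionary's keys, so with these keys every match is itself a key and the default is never used.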
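If you would rather do what you describe literally, that is, search a second time while leaving out the terms that belong to the primary code, something along these lines might work. This is only a sketch: first_code and primary_and_secondary are hypothetical helper names of mine, and it returns None where your script uses NaN.

```python
import re

issue_dict = {'human rights': 'HR', 'Protection': 'HR', 'Palestinian': 'ME'}

def first_code(description, terms):
    """Return the code of the first term found in description, or None."""
    if not terms:
        return None
    # re.escape guards against search terms that contain regex metacharacters
    pattern = re.compile("|".join(re.escape(t) for t in terms))
    match = pattern.search(description)
    return issue_dict[match.group()] if match else None

def primary_and_secondary(description):
    """Hypothetical helper: pick a primary code, then search again without its terms."""
    primary = first_code(description, list(issue_dict))
    if primary is None:
        return None, None
    # keep only the terms that map to a different code than the primary one
    remaining = [t for t, code in issue_dict.items() if code != primary]
    secondary = first_code(description, remaining)
    return primary, secondary

print(primary_and_secondary('Protection of the Palestinian civilian'))  # ('HR', 'ME')
print(primary_and_secondary('Situation of human rights in Myanmar'))    # ('HR', None)
```

The two values can then be written into the "Issue-Primary" and "Issue-Secondary" rows of each column. As with the findall loop, the first description still gets 'HR' as primary, because 'Protection' is the first term found in the text.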
