How to implement a dynamic search over multiple rows

I have a df that looks similar to the below but with 100s of columns, where each column references a different text resolution.

Index	A/RES/73/262	A/RES/73/263
Issue-Primary	ME	HR
Issue-Secondary	NaN	NaN
Description	Protection of the Palestinian civilian	Situation of human rights in Myanmar

I have a script that loops through each resolution description and sets the value in the "Issue-Primary" row based on a defined dictionary of search terms.

import re
import numpy as np
import pandas as pd

df = pd.read_excel("file_name")
issue_dict = {'human rights':'HR', 'Protection':'HR', 'Palestinian':'ME'} 
key_pattern = re.compile(rf'({"|".join(issue_dict.keys())})')

# df.items() replaces df.iteritems(), which was removed in pandas 2.0
for columnName, columnData in df.items():
    matched_key = re.search(key_pattern, str(columnData[1]))
    if matched_key:
        columnData[0] = issue_dict.get(matched_key.group(), "Issue-Primary")
    else:
        columnData[0] = np.nan

Some resolutions have descriptions that are relevant to multiple issues, so I would like to apply a similar script to my "Issue-Secondary" row. However, I need to ensure that when the script sets values for "Issue-Secondary", it ignores the terms that map to the code already in "Issue-Primary".

For instance, "Protection" relates to 'HR', so in the first column after the script sets the value for "ME" in "Issue-Primary", it should return to "Issue-Secondary" and apply the same process without looking for "ME" terms (otherwise it simply duplicates the "ME"). Ideally, "Issue-Primary" will be set to "ME" and "Issue-Secondary" gets "HR" in the first resolution column.

Below is my first attempt. However, as you can imagine, since I am using the same issue_dict, it doesn't produce the desired results. I am unsure what the best way is to ensure that whatever term is in "Issue-Primary" gets left out when applying the script to "Issue-Secondary".

issue_codes_list = ['HR', 'ME']
key_pattern = re.compile(rf'({"|".join(issue_dict.keys())})')

for columnName, columnData in df.items():
    matched_key = re.search(key_pattern, str(columnData[1]))
    if matched_key:
        columnData[0] = issue_dict.get(matched_key.group(), "Issue-Primary")
    else:
        columnData[0] = np.nan

    for j in issue_codes_list:
        if columnData[0] == j:
            if matched_key:
                columnData[1] = issue_dict.get(matched_key.group(), "Issue-Secondary")
            else:
                columnData[1] = np.nan

CodePudding user response:

You can use re.findall to find all occurrences of the search terms at once, and then keep only the unique issue codes:

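import re
import pandas as pd

df = pd.DataFrame(index=['Issue-Primary','Issue-Secondary','Description'],columns=['a','b'])
df.loc['Description']=['Protection of the Palestinian civilian',
                       'Situation of human rights in Myanmar']

issue_dict = {'human rights':'HR', 'Protection':'HR', 'Palestinian':'ME'} 
key_pattern = re.compile(rf'({"|".join(issue_dict.keys())})')

# df.items() replaces df.iteritems(), which was removed in pandas 2.0
for columnName, columnData in df.items():
    # findall returns every matched search term, in order of appearance
    matched_key = re.findall(key_pattern, str(columnData['Description']))
    if matched_key:

        all_issues = [issue_dict.get(x, "Issue-Primary") for x in matched_key]
        # drop duplicate codes while keeping the first-seen order
        issues=list()
        for i in all_issues:
            if i not in issues: issues.append(i)

        # write through df.loc so that df itself is updated, not a copy of the column
        if len(issues)>0:
            df.loc['Issue-Primary', columnName] = issues[0]
        if len(issues)>1:
            df.loc['Issue-Secondary', columnName] = issues[1]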

This gives the following df:

	a	b
Issue-Primary	HR	HR
Issue-Secondary	ME	NaN
Description	Protection of the Palestinian civilian	Situation of human rights in Myanmar

I am not sure how you are getting 'ME' for Issue-Primary in the first column though: "Protection" appears before "Palestinian" in that description, so 'HR' is matched first. If the ordering doesn't matter, you can save a few lines by replacing the block that starts at all_issues with

issues = list(set([issue_dict.get(x, "Issue-Primary") for x in matched_key]))

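set() will remove duplicates, but it does not keep the codes in any particular order, so only use it if you don't care which code ends up in Issue-Primary.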
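When the order of the codes does matter, a one-line alternative is dict.fromkeys, which drops duplicates while keeping the first-seen order (regular dicts keep insertion order in Python 3.7+). This is only a sketch: the matched_key list below is a made-up sample standing in for the output of re.findall.

```python
issue_dict = {'human rights': 'HR', 'Protection': 'HR', 'Palestinian': 'ME'}

# Made-up sample standing in for the list returned by re.findall
matched_key = ['Protection', 'Palestinian', 'Protection']

# dict.fromkeys keeps the first occurrence of each code, in order
issues = list(dict.fromkeys(issue_dict[x] for x in matched_key))
print(issues)  # ['HR', 'ME']
```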

I also don't see a reason to use issue_dict.get(matched_key.group(), "Issue-Primary") instead of just issue_dict[matched_key.group()]: the pattern is built from the dictionary's own keys, so every match is a key and the default value is never used.
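If you would rather follow the approach described in the question literally (search once for the primary code, then search again while ignoring every term that maps to that code), a minimal sketch could look like the one below. The helper names build_pattern and find_issues are hypothetical, and re.escape is an addition of mine to guard against search terms that contain regex special characters.

```python
import re

issue_dict = {'human rights': 'HR', 'Protection': 'HR', 'Palestinian': 'ME'}


def build_pattern(terms):
    # Hypothetical helper: escape each term so it is matched literally
    return re.compile("|".join(re.escape(t) for t in terms))


def find_issues(description, issue_dict):
    """Return (primary, secondary); secondary skips terms of the primary code."""
    first = build_pattern(issue_dict).search(description)
    if first is None:
        return None, None
    primary = issue_dict[first.group()]

    # Keep only the terms whose code differs from the primary one
    remaining = [term for term, code in issue_dict.items() if code != primary]
    if not remaining:
        return primary, None

    second = build_pattern(remaining).search(description)
    secondary = issue_dict[second.group()] if second else None
    return primary, secondary


print(find_issues('Protection of the Palestinian civilian', issue_dict))
print(find_issues('Situation of human rights in Myanmar', issue_dict))
```

Inside the loop over the columns you would call find_issues on the description and assign the two returned values to the Issue-Primary and Issue-Secondary rows. For the two sample descriptions this produces the same codes as the findall version, since the primary code is still whichever term appears first in the text.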