I currently have a function that returns a term and the sentence it occurs in. At the moment, the function only retrieves the first matching term for each sentence. I would like to retrieve all matches instead of just the first.
For example, given the terms list_of_matches = ["heart attack", "cardiovascular", "hypoxia"]
and the sentences text_list = ["A heart attack is a result of cardiovascular...", "Chronic intermittent hypoxia is the..."]
The ideal output is:
['heart attack', 'a heart attack is a result of cardiovascular...'],
['cardiovascular', 'a heart attack is a result of cardiovascular...'],
['hypoxia', 'chronic intermittent hypoxia is the...']
# this is the current function
def find_word(list_of_matches, line):
    for words in list_of_matches:
        if any([words in line]):
            return words, line

# returns list of 'term, matched string'
key_vals = [list(find_word(list_of_matches, line.lower())) for line in text_list if
            find_word(list_of_matches, line.lower()) != None]
# output is currently
['heart attack', 'a heart attack is a result of cardiovascular...'],
['hypoxia', 'chronic intermittent hypoxia is the...']
CodePudding user response:
You're going to want to use regex here.
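import re

def find_all_matches(words_to_search, text):
    matches = []
    for word in words_to_search:
        # re.search returns None when the word is absent, so check before calling .group()
        # re.escape keeps any regex metacharacters in the word literal
        found = re.search(re.escape(word), text)
        if found:
            matches.append(found.group())
    return [matches, text]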
Please note that this returns a nested list: the list of matched terms first, then the text.
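If you want the regex approach to produce the [term, sentence] pairs from the question directly, something like the following might work. This is only a sketch: the name find_term_sentence_pairs is made up for illustration, it assumes the terms are already lowercase, and the \b word boundaries are an extra assumption (they stop a term from matching inside a longer word).

```python
import re

def find_term_sentence_pairs(terms, sentences):
    """Return one [term, lowercased sentence] pair per term found in each sentence."""
    pairs = []
    for sentence in sentences:
        lowered = sentence.lower()
        for term in terms:
            # Escape the term so it is matched literally, and require word boundaries
            pattern = r"\b" + re.escape(term) + r"\b"
            if re.search(pattern, lowered):
                pairs.append([term, lowered])
    return pairs

list_of_matches = ["heart attack", "cardiovascular", "hypoxia"]
text_list = ["A heart attack is a result of cardiovascular...",
             "Chronic intermittent hypoxia is the..."]
pairs = find_term_sentence_pairs(list_of_matches, text_list)
```

Drop the \b parts if you do want substring matches, which is what the plain `in` check in the question does.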
CodePudding user response:
The solution needs 2 steps:
- fix the function
- process the output
Given that your desired output follows the pattern
output = [
    [word1, sentence1],
    [word2, sentence1],
    [word3, sentence2],
]
- Fix the function: move the return out of the 'for' loop so that the loop goes through every word in list_of_matches and collects all the words that match, not only the first. It should look like this:
def find_word(list_of_matches, line):
    answer = []
    for words in list_of_matches:
        if words in line:
            answer.append([words, line])
    # return only after every word has been checked
    return answer
With the function above, the output will be:
key_vals = [
    [
        ['heart attack', 'a heart attack is a result of cardiovascular...'],
        ['cardiovascular', 'a heart attack is a result of cardiovascular...']
    ],
    [
        ['hypoxia', 'chronic intermittent hypoxia is the...']
    ]
]
- Process the output: "key_vals" now holds one list of [word, sentence] pairs per sentence, so flatten it into a single list with the following code:
output = []
for word_sentence_list in key_vals:
    for word_sentence in word_sentence_list:
        output.append(word_sentence)
and, finally, the output will be:
output = [
    ['heart attack', 'a heart attack is a result of cardiovascular...'],
    ['cardiovascular', 'a heart attack is a result of cardiovascular...'],
    ['hypoxia', 'chronic intermittent hypoxia is the...']
]
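If you prefer, the two steps can probably be combined into one nested list comprehension, which builds the flat list in a single pass. This is a sketch rather than a drop-in replacement: the name find_words is hypothetical, and it assumes the terms in list_of_matches are already lowercase, as in the question.

```python
def find_words(list_of_matches, text_list):
    """Return a flat list of [word, lowercased line] pairs, one per match."""
    return [
        [word, line.lower()]
        for line in text_list          # outer loop: each sentence
        for word in list_of_matches    # inner loop: each term
        if word in line.lower()        # keep only terms found in the sentence
    ]

list_of_matches = ["heart attack", "cardiovascular", "hypoxia"]
text_list = ["A heart attack is a result of cardiovascular...",
             "Chronic intermittent hypoxia is the..."]
output = find_words(list_of_matches, text_list)
```

Sentences with no matching term simply contribute nothing, so the `!= None` check from the original comprehension is no longer needed.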
