I used an NLP chunker that incorrectly splits the terms 'C++' and 'C#' into: C (NN), + (SYM), + (SYM), C (NN), # (SYM).
The resulting incorrectly chunked list looks like this:
l = [['C', 'NN'], ['+', 'SYM'], ['+', 'SYM'], ['C', 'NN'], ['#', 'NN']]
I would like to post-process this list by finding each inner list whose first element (index 0) is 'C' and which is followed by '+', '+' or by '#'. I then want to concatenate those strings, so that 'C', '+', '+' becomes 'C++'. This has to be generalisable: it should work on lists that contain many different words and still concatenate only the desired strings.
Desired result:
l_desired = [['C++', 'NN'], ['C#', 'NN']]
I can identify the individual items in the list (index 0), but I don't know how to go about identifying the desired sequence. My idea was to use the next() function, but I do not know where to begin.
CodePudding user response:
You can loop over the list and check whether the first element starts with a letter. If it does, append it as a new item; otherwise, concatenate it onto the last item:
from string import ascii_letters
letters = set(ascii_letters)
out = []
for e in l:
    if e[0][0] in letters:
        out.append(e.copy())  # copy so the original list is not modified
    elif out:  # make sure out is not empty (edge case)
        out[-1][0] += e[0]  # concatenate the symbol onto the previous token
Or using a blacklist of symbols:
symbols = set('+#')
out = []
for e in l:
    if e[0] in symbols and out:
        out[-1][0] += e[0]  # concatenate the symbol onto the previous token
    else:
        out.append(e.copy())
Output:
[['C++', 'NN'], ['C#', 'NN']]
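Since the question asks for something that works on lists with other words in them, here is a minimal sketch of the symbol approach wrapped in a function and run on a mixed list. The function name merge_symbols, its symbols parameter, and the sample sentence with its tags are made up for illustration.

```python
def merge_symbols(tokens, symbols="+#"):
    """Attach symbol-only tokens to the token that precedes them."""
    symbol_set = set(symbols)
    out = []
    for text, tag in tokens:
        if text in symbol_set and out:
            out[-1][0] += text        # glue the symbol onto the previous token
        else:
            out.append([text, tag])   # new list, so the input is left untouched
    return out


# Hypothetical sample input, not taken from a real chunker run
mixed = [['I', 'PRP'], ['like', 'VBP'], ['C', 'NN'], ['+', 'SYM'], ['+', 'SYM'],
         ['and', 'CC'], ['C', 'NN'], ['#', 'NN']]
print(merge_symbols(mixed))
# [['I', 'PRP'], ['like', 'VBP'], ['C++', 'NN'], ['and', 'CC'], ['C#', 'NN']]
```

Note that this attaches a symbol to whatever token comes before it, not only to 'C', so a stray '#' after another word would be glued onto that word as well.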
CodePudding user response:
Here is a naive implementation of a generator which will read the original list as an iterator and "tokenize" it (again - naively):
def the_generator(l):
    it = iter(l)

    def get_tok():
        # next(it, None) returns None at the end instead of raising
        # StopIteration, which must not escape from a generator in Python 3
        x = next(it, None)
        return (",".join(x), x) if x is not None else None

    while True:
        tok1 = get_tok()
        if tok1 is None:  # input exhausted
            return
        tok2 = tok3 = None
        if tok1[0] != 'C,NN':
            yield tok1[1]
            continue
        tok2 = get_tok()
        if tok2 and tok2[0] == '#,NN':
            yield ['C#', 'NN']
            continue
        if tok2 and tok2[0] == '+,SYM':
            tok3 = get_tok()
            if tok3 and tok3[0] == '+,SYM':
                yield ['C++', 'NN']
                continue
        # no match: emit the tokens read so far unchanged
        yield tok1[1]
        if tok2:
            yield tok2[1]
        if tok3:
            yield tok3[1]

l = [['Dog', 'NN'], ['C', 'NN'], ['+', 'SYM'], ['+', 'SYM'], ['C', 'NN'], ['#', 'NN'], ['C', 'NN'], ['+', 'SYM'], ['#', 'NN']]
for x in the_generator(l):
    print(x)
The output:
['Dog', 'NN']
['C++', 'NN']
['C#', 'NN']
['C', 'NN']
['+', 'SYM']
['#', 'NN']
The generator does not convert the list all at once, only as needed. To create a new list all at once you can do list(the_generator(l)).
I am stringifying the individual tokens with join() to make comparisons simple. The while True loop ends when the original iterable is exhausted: next(it, None) then returns None and the generator returns.
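A side note on how such a loop terminates, as a small sketch: since Python 3.7 (PEP 479), a StopIteration that escapes from inside a generator body is turned into a RuntimeError, so a generator that calls next() on an inner iterator has to handle exhaustion itself. The names leaky, safe and _END below are made up for illustration.

```python
_END = object()  # unique sentinel, so None can still be a legitimate item


def leaky(items):
    it = iter(items)
    while True:
        yield next(it)  # StopIteration escapes here once items run out


def safe(items):
    it = iter(items)
    while True:
        item = next(it, _END)  # the default is returned instead of raising
        if item is _END:
            return             # ends the generator cleanly
        yield item


print(list(safe([1, 2])))
try:
    list(leaky([1, 2]))
except RuntimeError as exc:
    print("leaky failed:", exc)
```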
CodePudding user response:
I concatenate the fragments one by one into a running string.
When that string matches one of the two terms we're looking for ("C++" or "C#"), it is added to the list and the string is reset.
l = [['C', 'NN'], ['+', 'SYM'], ['+', 'SYM'], ['C', 'NN'], ['#', 'NN']]
l_desired = []
x, y = "", ""
for item in l:
    if x == "":
        y = item[1]  # keep the tag of the first fragment
    x += item[0]  # build up the term fragment by fragment
    if x in ["C++", "C#"]:
        l_desired.append([x, y])
        x, y = "", ""
print("RESULT:", l_desired)
# RESULT: [['C++', 'NN'], ['C#', 'NN']]
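If the list also contains other words, one possible generalisation is to look ahead for known fragment sequences and leave everything else alone. This is only a sketch: the PATTERNS table and the function name merge_known_terms are assumptions for illustration, and you would extend the table with whatever terms your chunker breaks.

```python
# Hypothetical pattern table: sequence of fragments -> merged term
PATTERNS = {
    ("C", "+", "+"): "C++",
    ("C", "#"): "C#",
}


def merge_known_terms(tokens, patterns=PATTERNS):
    tokens = list(tokens)
    longest = max(map(len, patterns), default=0)
    out = []
    i = 0
    while i < len(tokens):
        # try the longest window first so 'C', '+', '+' wins over shorter matches
        for size in range(longest, 1, -1):
            window = tuple(t[0] for t in tokens[i:i + size])
            if window in patterns:
                # keep the tag of the first fragment
                out.append([patterns[window], tokens[i][1]])
                i += size
                break
        else:
            out.append(list(tokens[i]))  # no pattern starts here, copy as-is
            i += 1
    return out


l = [['Dog', 'NN'], ['C', 'NN'], ['+', 'SYM'], ['+', 'SYM'],
     ['C', 'NN'], ['#', 'NN'], ['C', 'NN'], ['+', 'SYM'], ['#', 'NN']]
print(merge_known_terms(l))
# [['Dog', 'NN'], ['C++', 'NN'], ['C#', 'NN'], ['C', 'NN'], ['+', 'SYM'], ['#', 'NN']]
```

Sequences that only partly match, such as the trailing 'C', '+', '#', are passed through unchanged.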
