Python: how to get the next sequence of a list of lists based on a condition?

I used an NLP chunker that incorrectly splits the terms 'C++' and 'C#' into: C (NN), + (SYM), + (SYM), C (NN), # (SYM).

The resulting list of incorrect chunking looks like this:

l = [['C', 'NN'], ['+', 'SYM'], ['+', 'SYM'], ['C', 'NN'], ['#', 'NN']]

I would like to post-process this list by finding each inner list whose string at index 0 is 'C' and whose following items are '+', '+' or '#'. I'd then like to concatenate these strings, so that 'C', '+', '+' becomes 'C++'. This has to be generalisable: it should work on lists that contain many other words and still concatenate only the desired strings.

desired result:

l_desired = [['C++', 'NN'], ['C#', 'NN']]

I can identify the items in the list independently (index 0) but I don't know how to go about identifying the desired sequence. My idea was to use the next() function, although I do not know where to begin.

CodePudding user response：

You can loop over the list and check whether each token starts with a letter. If it does, append it as a new item; otherwise, concatenate it onto the last item:

from string import ascii_letters

letters = set(ascii_letters)

out = []
for e in l:
    if e[0][0] in letters:
        out.append(e.copy()) # making a copy not to affect original list
    elif out: # this is to check that out is not empty (edge case)
        out[-1][0] += e[0]

Or using a blacklist of symbols:

symbols = set('+#')

out = []
for e in l:
    if e[0] in symbols and out:
        out[-1][0] += e[0]
    else:
        out.append(e.copy())

output:

[['C++', 'NN'], ['C#', 'NN']]
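
To check the generalisation the question asks for, the second variant can be wrapped in a function and run on a longer list. This is only a minimal sketch: it assumes the split-off symbols are '+' and '#', and the name merge_symbols and the sample sentence are illustrative, not part of the original answer.

```python
def merge_symbols(tokens, symbols=frozenset('+#')):
    """Glue symbol tokens onto the text of the preceding token.

    `merge_symbols` and the `symbols` default are illustrative names,
    assumed for this sketch.
    """
    out = []
    for text, tag in tokens:
        if text in symbols and out:
            out[-1][0] += text       # extend the previous token's text
        else:
            out.append([text, tag])  # new inner list, input stays untouched
    return out


sample = [['I', 'PRP'], ['like', 'VBP'], ['C', 'NN'], ['+', 'SYM'], ['+', 'SYM'],
          ['and', 'CC'], ['C', 'NN'], ['#', 'NN']]
print(merge_symbols(sample))
# [['I', 'PRP'], ['like', 'VBP'], ['C++', 'NN'], ['and', 'CC'], ['C#', 'NN']]
```

Note that this merges a symbol onto whatever token precedes it, not only onto 'C', so a stray '#' after another word would be glued to that word as well.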

CodePudding user response：

Here is a naive implementation of a generator which will read the original list as an iterator and "tokenize" it (again - naively):

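def the_generator(l):
    it = iter(l)

    def get_tok():
        # Return None once the input is exhausted. In Python 3.7+ a
        # StopIteration escaping a generator becomes RuntimeError (PEP 479),
        # so the end of input is signalled with a sentinel instead.
        x = next(it, None)
        return None if x is None else (",".join(x), x)

    while True:
        tok1 = get_tok()
        if tok1 is None:
            return
        tok3 = None
        if tok1[0] != 'C,NN':
            yield tok1[1]
            continue
        tok2 = get_tok()
        if tok2 is None:
            # Input ended right after a lone 'C'; emit it unchanged.
            yield tok1[1]
            return
        if tok2[0] == '#,NN':
            yield ['C#', 'NN']
            continue
        if tok2[0] == '+,SYM':
            tok3 = get_tok()
            if tok3 is not None and tok3[0] == '+,SYM':
                yield ['C++', 'NN']
                continue
        # No match: emit the tokens that were read ahead, unchanged.
        yield tok1[1]
        yield tok2[1]
        if tok3:
            yield tok3[1]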


l = [['Dog', 'NN'], ['C', 'NN'], ['+', 'SYM'], ['+', 'SYM'], ['C', 'NN'], ['#', 'NN'], ['C', 'NN'], ['+', 'SYM'], ['#', 'NN']]

for x in the_generator(l):
  print(x)

The output:

['Dog', 'NN']
['C++', 'NN']
['C#', 'NN']
['C', 'NN']
['+', 'SYM']
['#', 'NN']

The generator does not convert the list all at once, only as needed. To create a new list all at once you can do list(the_generator(l)).

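I am stringifying the individual tokens with join() to make comparisons simple. The while True loop ends once the original iterable is exhausted. In Python 3.7+ a StopIteration that escapes from inside a generator is converted to RuntimeError (PEP 479), so the end of input has to be detected explicitly rather than left to propagate out of get_tok().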
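
The hard-coded comparisons in the generator could also be replaced by matching runs of tokens against a set of known terms. The following is a sketch under that assumption, not the approach used above: merge_known_terms and its terms parameter are hypothetical names, and the merged item simply keeps the tag of its first token.

```python
def merge_known_terms(tokens, terms=("C++", "C#")):
    """Merge consecutive tokens whose joined text equals a known term.

    Hypothetical helper: `terms` lists the strings the chunker is assumed
    to have split apart. It must not be empty.
    """
    max_len = max(len(t) for t in terms)  # a term of n chars spans at most n tokens
    out = []
    i = 0
    while i < len(tokens):
        # Try the longest window first, down to a window of two tokens.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            joined = "".join(text for text, _ in tokens[i:i + n])
            if joined in terms:
                out.append([joined, tokens[i][1]])  # keep the first token's tag
                i += n
                break
        else:
            out.append(list(tokens[i]))  # no term starts here; copy as-is
            i += 1
    return out


tokens = [['Dog', 'NN'], ['C', 'NN'], ['+', 'SYM'], ['+', 'SYM'],
          ['C', 'NN'], ['#', 'NN'], ['C', 'NN'], ['+', 'SYM'], ['#', 'NN']]
for item in merge_known_terms(tokens):
    print(item)
```

On this input the sketch yields the same six items as the generator, leaving the trailing 'C', '+', '#' unmerged because 'C+#' is not a known term.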

CodePudding user response：

Basically, I just append each token's string to a buffer, one by one.

When the buffer matches one of the two strings we're looking for ("C++" or "C#"), that value is added to the list and the buffer is reset.

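    l = [['C', 'NN'], ['+', 'SYM'], ['+', 'SYM'], ['C', 'NN'], ['#', 'NN']]

    l_desired = []
    x, y = "", ""  # x: accumulated text, y: tag of the first token in the run

    for item in l:
        if x == "":
            y = item[1]

        x += item[0]

        if x in ["C++", "C#"]:
            l_desired.append([x, y])
            x, y = "", ""

    # Pass the list as a separate argument; "RESULT: " + l_desired raises TypeError
    print("RESULT:", l_desired)
    # RESULT: [['C++', 'NN'], ['C#', 'NN']]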
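
The loop above only resets x after a match, so any other word in the input would stay in the buffer and block later matches. One possible way to handle mixed input is to emit the buffered tokens unchanged as soon as the buffer stops being a prefix of a target. This is a rough sketch, assuming the targets are "C++" and "C#"; TARGETS and merge_with_buffer are made-up names.

```python
TARGETS = ("C++", "C#")  # assumed target terms


def merge_with_buffer(tokens, targets=TARGETS):
    result = []
    buffer = []  # tokens that might still grow into a target

    def is_prefix(text):
        return any(t.startswith(text) for t in targets)

    def flush():
        # Give up on the buffered tokens and emit them unchanged.
        result.extend(list(tok) for tok in buffer)
        buffer.clear()

    for tok in tokens:
        candidate = "".join(t[0] for t in buffer) + tok[0]
        if not is_prefix(candidate):
            flush()
            candidate = tok[0]
            if not is_prefix(candidate):
                result.append(list(tok))  # cannot start a target either
                continue
        buffer.append(tok)
        if candidate in targets:
            result.append([candidate, buffer[0][1]])  # tag of the first token
            buffer.clear()
    flush()  # emit whatever is left at the end of the input
    return result


mixed = [['Dog', 'NN'], ['C', 'NN'], ['+', 'SYM'], ['+', 'SYM'], ['C', 'NN'], ['#', 'NN']]
print("RESULT:", merge_with_buffer(mixed))
# RESULT: [['Dog', 'NN'], ['C++', 'NN'], ['C#', 'NN']]
```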