Tagging character matrices in NumPy

I'm trying to come up with a way of encoding strings with character-level tags in NumPy. For example, given the following accepted characters:

chars = ['a', 'b', 'c', 'd', '1', '2', '3', '4']

The string:

s = ['1','b','a','c','3','4','1','1']

gets encoded like so:

# one row per accepted character, one column per position in s
char_mat = np.array([[c]*len(s) for c in chars])

s_mat = 1*(char_mat==s)

and the resulting s_mat looks like this:

array([[0, 0, 1, 0, 0, 0, 0, 0],
       [0, 1, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 1, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0],
       [1, 0, 0, 0, 0, 0, 1, 1],
       [0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 0, 0, 0],
       [0, 0, 0, 0, 0, 1, 0, 0]])

Each row corresponds to a character in chars, and each column corresponds to a position in the string. So the a in s is in the third column of the first row, the b is in the second column of the second row, and so on.
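For reference, the same encoding can be sketched without building the repeated char_mat, by comparing a column of accepted characters against the string and letting NumPy broadcast. This is only a minimal sketch; the names chars_col and s_arr are made up for illustration and are not part of the original code.

```python
import numpy as np

chars = ['a', 'b', 'c', 'd', '1', '2', '3', '4']
s = ['1', 'b', 'a', 'c', '3', '4', '1', '1']

# Accepted characters as a column, shape (len(chars), 1) (hypothetical name).
chars_col = np.array(chars)[:, None]
# The string as a 1-D array, shape (len(s),) (hypothetical name).
s_arr = np.array(s)

# Broadcasting compares every accepted character with every position,
# giving a (len(chars), len(s)) boolean matrix; cast it to 0/1 integers.
s_mat = (chars_col == s_arr).astype(int)
```

Because the shape comes from broadcasting, this form also works when the string and the list of accepted characters have different lengths.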

Let's say I also have a class tag for each character in the string, and these are, for instance:

'1' 'b' 'a' 'c' '3' '4' '1' '1'
 |   |   |   |   |   |   |   |
 v   v   v   v   v   v   v   v
 1   2   2   0   0   3   3   1

I'd like to come up with a way of outputting a tag_matrix that has the same shape as s_mat but contains the tag for each element, like this:

array([[0, 0, 2, 0, 0, 0, 0, 0],
       [0, 2, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0],
       [1, 0, 0, 0, 0, 0, 3, 1],
       [0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 3, 0, 0]])

But I can't figure out how to do this. Many thanks in advance for any help or suggestions. Note that this is a scaled-down version of the actual problem I'm working on, in which the strings are much longer and the accepted characters include ASCII lowercase letters, digits, and punctuation.

CodePudding user response:

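It should be as simple as multiplying s_mat by the class tags array. Broadcasting lines the tags up with the columns, so each column is scaled by the tag for that position: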

class_tags = np.array([1, 2, 2, 0, 0, 3, 3, 1])
tag_matrix = s_mat * class_tags
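A self-contained sketch of the whole thing, using the example data from the question. One caveat: with plain multiplication, a tag of 0 looks the same as "character not present". If that matters, one possible workaround (an assumption on my part, not something the question asks for) is to fill the empty cells with a sentinel such as -1 using np.where; the name tag_matrix_filled is made up for illustration.

```python
import numpy as np

chars = ['a', 'b', 'c', 'd', '1', '2', '3', '4']
s = ['1', 'b', 'a', 'c', '3', '4', '1', '1']
class_tags = np.array([1, 2, 2, 0, 0, 3, 3, 1])

# 0/1 matrix: rows are accepted characters, columns are positions in s.
s_mat = (np.array(chars)[:, None] == np.array(s)).astype(int)

# (len(chars), len(s)) * (len(s),) broadcasts class_tags across the rows,
# so column j of s_mat is multiplied by class_tags[j].
tag_matrix = s_mat * class_tags

# Hypothetical variant: keep tag 0 distinguishable from "no character here"
# by writing -1 wherever s_mat is 0.
tag_matrix_filled = np.where(s_mat == 1, class_tags, -1)
```

Here tag_matrix matches the desired output in the question, while tag_matrix_filled has a 0 in row c, column 4 (the tag for 'c') and -1 everywhere that row has no character.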