I'm trying to come up with a way of encoding strings with character-level tags in NumPy. For example, given the following accepted characters:
chars = ['a', 'b', 'c', 'd', '1', '2', '3', '4']
The string:
s = ['1','b','a','c','3','4','1','1']
gets encoded like so:
char_mat = np.array([[c]*len(chars) for c in chars])
s_mat = 1*(char_mat==s)
and the resulting s_mat looks like this:
array([[0, 0, 1, 0, 0, 0, 0, 0],
       [0, 1, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 1, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0],
       [1, 0, 0, 0, 0, 0, 1, 1],
       [0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 0, 0, 0],
       [0, 0, 0, 0, 0, 1, 0, 0]])
Each row corresponds to a character in chars, and each column corresponds to a position in the string. So the a in s is in the third column, the b is in the second, and so on.
Let's say I also have a class tag for each character in the string, and these are, for instance:
'1' 'b' 'a' 'c' '3' '4' '1' '1'
 |   |   |   |   |   |   |   |
 v   v   v   v   v   v   v   v
 1   2   2   0   0   3   3   1
I'd like to come up with a way of outputting a tag_matrix that has the same shape as s_mat but contains the tag for each element, like this:
array([[0, 0, 2, 0, 0, 0, 0, 0],
       [0, 2, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0],
       [1, 0, 0, 0, 0, 0, 3, 1],
       [0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 3, 0, 0]])
But I really can't figure out how to do this. Many thanks in advance for any help or suggestions. Note also that this is a small version of the actual problem I'm working on, in which the strings are much longer and the accepted characters include ASCII lowercase letters, digits, and punctuation.
Answer:
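It should be as simple as multiplying s_mat by the class tags array. NumPy broadcasts the 1-D tags array across the rows of s_mat, so each column is scaled by the tag of the corresponding string position: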
class_tags = np.array([1, 2, 2, 0, 0, 3, 3, 1])
tag_matrix = s_mat * class_tags
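For completeness, here is a minimal self-contained sketch of the whole thing. The helper name tag_matrix_for is made up for illustration and is not part of NumPy. It also builds s_mat by comparing a column vector of chars against the string, on the assumption that the real strings will not have the same length as chars (the char_mat construction in the question only lines up because both happen to have 8 elements here).

```python
import numpy as np

def tag_matrix_for(chars, s, tags):
    """Hypothetical helper: one row per accepted character, one column per string position."""
    chars_col = np.asarray(chars)[:, None]   # shape (len(chars), 1)
    s_row = np.asarray(s)[None, :]           # shape (1, len(s))
    # Broadcasting the comparison gives a (len(chars), len(s)) one-hot matrix.
    s_mat = (chars_col == s_row).astype(int)
    # tags has shape (len(s),), so it is broadcast across every row:
    # each column is multiplied by the tag of that string position.
    return s_mat * np.asarray(tags)

chars = ['a', 'b', 'c', 'd', '1', '2', '3', '4']
s = ['1', 'b', 'a', 'c', '3', '4', '1', '1']
class_tags = [1, 2, 2, 0, 0, 3, 3, 1]

tag_matrix = tag_matrix_for(chars, s, class_tags)
print(tag_matrix)
```

One thing to keep in mind: a tag of 0 cannot be told apart from "character not present", which is why the rows for c and 3 come out all zeros in the expected output. If that matters for your real data, one option is to number the tags from 1 instead.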
