TensorFlow Keras texts_to_sequences returns a list of lists

I have a problem with texts_to_sequences in tf.keras.

test_data = 'The invention relates to the fields of biotechnology, virology, epidemiology and public health, and is method for obtaining of new inactivated vaccine against coronavirus COVID-19. The essence matter of invention is COVID-19 virus SARS-CoV-2/KZ_Almaty/04.2020 strain isolated on the territory of the Republic of Kazakhstan. The strain of COVID-19 virus according to the optimal cultivation conditions is produced in the Vero cell culture system, inactivated by formaldehyde, clarified by low-speed centrifugation, purified and concentrated by diafiltration on diafiltration unit of Millipore Pellicon Cassette system. Sterilizing filtration is carried out through cascades of filters with a pore diameter of 0.45/0.22 μm. 2 % aluminum hydroxide gel Algidrogel, 85 is added in the obtained virus pool (viral concentrate) to final concentration of 0.5 mg/0.5 ml and bottled in glass vials. The vaccine obtained in this way is safe at intraperitoneal introduction to white mice and intravenously - to rabbits. The vaccine provides 80 % protection against COVID-19 infection for at least 6 months after two vaccinations. The vaccine keeps its properties for 12 months at 4-6°C.'

I have this test string and I am trying to predict its classification with a model that I have trained. The problem appears when I call texts_to_sequences:

test = tf.keras.preprocessing.text.text_to_word_sequence(test_data)
test = token.texts_to_sequences(test)
print(test)

Somehow it returns a list of lists, one per word, rather than a single list of word tokens:

[[1], [7726], [1], [13], [7726], [1], [2997], [1], [1], [7509], [1], [1], [1], [1], [4842], [1], [7167], [1], [1], [1], [1], [1], [4842], [1], [1], [1], [8383], [1], [1], [1], [1], [1], [1], [7167], [1], [1], [1], [1], [1], [1], [1], [1], [1], [1], [1], [5979], [1], [6054], [1], [13], [1], [1], [7509], [1], [13], [1], [1], [1], [1], [1], [14214], [1], [1], [1], [1], [689], [1], [1], [1], [4842], [1], [7167], [1], [1], [1], [1], [1], [1], [1], [7167], [1], [1], [7509], [1], [9204], [1], [1], [1], [1], [1], [7167], [1], [1], [1], [4842], [1], [7167], [1], [1], [5979], [1], [1], [7167], [1], [1], [1], [6054], [1], [1], [1], [7509], [1], [1], [4842], [1], [7167], [6054], [1], [1], [1], [1]]

test = pad_sequences(test, maxlen=max_length, padding='post')
test

So the output of the padding for max_length 200 is this:

array([[   1,    0,    0, ...,    0,    0,    0],
       [7726,    0,    0, ...,    0,    0,    0],
       [   1,    0,    0, ...,    0,    0,    0],
       ...,
       [   1,    0,    0, ...,    0,    0,    0],
       [   1,    0,    0, ...,    0,    0,    0],
       [   1,    0,    0, ...,    0,    0,    0]], dtype=int32)

whereas it should be a single array of length 200.

I have done some tests, and the problem seems to be texts_to_sequences, which returns this faulty list.

Any idea what the cause is? Should I change the input to texts_to_sequences, or is there another solution?

CodePudding user response：

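You should not call text_to_word_sequence if you are already using the Tokenizer class, because the tokenizer does the same word splitting itself. texts_to_sequences expects a list of texts and returns one sequence per text; when you pass it the list of words produced by text_to_word_sequence, each word is treated as a separate text, which is why you get one single-element list per word. Pass the whole string wrapped in a list instead, for example token.texts_to_sequences([test_data]). Try something like this: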

import tensorflow as tf

tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=300, filters=' ', oov_token='UNK')
test_data = 'The invention relates to the fields of biotechnology, virology, epidemiology and public health, and is method for obtaining of new inactivated vaccine against coronavirus COVID-19. The essence matter of invention is COVID-19 virus SARS-CoV-2/KZ_Almaty/04.2020 strain isolated on the territory of the Republic of Kazakhstan. The strain of COVID-19 virus according to the optimal cultivation conditions is produced in the Vero cell culture system, inactivated by formaldehyde, clarified by low-speed centrifugation, purified and concentrated by diafiltration on diafiltration unit of Millipore Pellicon Cassette system. Sterilizing filtration is carried out through cascades of filters with a pore diameter of 0.45/0.22 μm. 2 % aluminum hydroxide gel Algidrogel, 85 is added in the obtained virus pool (viral concentrate) to final concentration of 0.5 mg/0.5 ml and bottled in glass vials. The vaccine obtained in this way is safe at intraperitoneal introduction to white mice and intravenously - to rabbits. The vaccine provides 80 % protection against COVID-19 infection for at least 6 months after two vaccinations. The vaccine keeps its properties for 12 months at 4-6°C.'
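# Wrap the string in a list: texts_to_sequences expects a list of texts.
test = [test_data]
# Fitting here only makes the demo self-contained; for prediction,
# reuse the tokenizer that was fitted on the training data.
tokenizer.fit_on_texts(test)
test = tokenizer.texts_to_sequences(test)
# One text in, one padded row out.
test = tf.keras.preprocessing.sequence.pad_sequences(test, maxlen=200, padding='post')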

print(test.shape)
# (1, 200)
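
To make the cause of the list of lists easier to see, here is a minimal pure-Python sketch of the behaviour. The helper texts_to_sequences_sketch, the toy word_index and the oov_index parameter are all hypothetical names made up for illustration; this is a simplified stand-in, not the Keras implementation, and it ignores filters, num_words and the other Tokenizer options. The only point it shows is that every element of the input list is treated as a separate text.

```python
# Simplified stand-in for Tokenizer.texts_to_sequences (hypothetical helper,
# NOT the Keras implementation): one output sequence per element of `texts`.
def texts_to_sequences_sketch(texts, word_index, oov_index=1):
    sequences = []
    for text in texts:
        # Each element is treated as its own document and split into words.
        words = text.lower().split()
        # Unknown words map to the out-of-vocabulary index.
        sequences.append([word_index.get(w, oov_index) for w in words])
    return sequences


# Toy vocabulary (assumption for the demo); index 1 is reserved for OOV words.
word_index = {"the": 2, "vaccine": 3, "is": 4, "safe": 5}
sentence = "The vaccine is safe"

# Passing a list of words: every word becomes its own one-element sequence.
per_word = texts_to_sequences_sketch(sentence.lower().split(), word_index)
print(per_word)      # [[2], [3], [4], [5]]

# Passing a list that holds the whole string: one sequence for the document.
per_document = texts_to_sequences_sketch([sentence], word_index)
print(per_document)  # [[2, 3, 4, 5]]
```

The first call mirrors what happens in the question (a pre-split word list goes in, so one short sequence per word comes out); the second mirrors the fix of wrapping the full string in a list.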