Does the BERT model need preprocessed text?

Do BERT models need pre-processed text (e.g. with special characters and stopwords removed), or can I pass my text directly to BERT models as it is? (Hugging Face libraries.)

Note: this is a follow-up question to: String cleaning/preprocessing for BERT

CodePudding user response：

Refer to this for more information on the workflow for training a BERT model.

CodePudding user response：

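You can pass your text as it is; the only required step is tokenization. The BertTokenizer class handles everything from raw text to tokens (lowercasing for the uncased models, splitting off punctuation, and breaking words into WordPiece subwords), so you do not need to remove special characters or stopwords yourself. See this: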

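from transformers import BertTokenizer, BertModel
import torch

# Load the pretrained tokenizer and model (downloaded on first use)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Pass the raw string; the tokenizer lowercases it, splits punctuation, and adds [CLS]/[SEP]
inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")

# Inference only, so skip gradient tracking
with torch.no_grad():
    outputs = model(**inputs)

# Shape: (batch_size, sequence_length, hidden_size)
last_hidden_states = outputs.last_hidden_state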
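
To see why manual cleaning is usually unnecessary, it can help to look at roughly what an uncased BERT tokenizer does to raw text. The following is a simplified pure-Python sketch, not the library's implementation: the names TOY_VOCAB, basic_tokenize, wordpiece and toy_tokenize and the tiny vocabulary are invented for illustration, and the real tokenizer also handles accents, special tokens such as [CLS] and [SEP], and a much larger vocabulary.

```python
import re

# Toy vocabulary, invented for this illustration only.
# The real bert-base-uncased vocabulary is far larger.
TOY_VOCAB = {"[UNK]", "hello", ",", "my", "dog", "is", "cute", "!",
             "play", "##ing", "##ful"}


def basic_tokenize(text):
    """Lowercase the text and split it into words and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def wordpiece(word, vocab=TOY_VOCAB):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits: the whole word maps to the unknown token.
            return ["[UNK]"]
        pieces.append(match)
        start = end
    return pieces


def toy_tokenize(text):
    """Run the basic split, then WordPiece on every resulting word."""
    return [piece for word in basic_tokenize(text) for piece in wordpiece(word)]


print(toy_tokenize("Hello, my dog is playful!"))
# ['hello', ',', 'my', 'dog', 'is', 'play', '##ful', '!']
```

Punctuation and common words each become tokens of their own, and a word missing from the vocabulary ("playful") is split into known pieces rather than dropped. The model was pretrained on text tokenized this way, which is why stripping stopwords or special characters beforehand is generally not needed.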