Do BERT models need pre-processed text (e.g. removing special characters, stopwords, etc.), or can I pass my text as-is to BERT models (Hugging Face libraries)?
Note: Follow-up question to: String cleaning/preprocessing for BERT
CodePudding user response:
Refer to this for more information on the training workflow of a BERT model.
CodePudding user response:
You don't need to clean the text yourself, but you do need to tokenize it first. The BertTokenizer class handles everything from raw text to tokens. For example:
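from transformers import BertTokenizer, BertModel
import torch
# Load the pretrained tokenizer and the matching model.
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
# Pass the raw string directly; the tokenizer lowercases it, splits it into subwords, and adds [CLS]/[SEP].
inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
# Inference only, so gradient tracking is not needed.
with torch.no_grad():
    outputs = model(**inputs)
# Shape: (batch_size, sequence_length, hidden_size)
last_hidden_states = outputs.last_hidden_state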
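To see why manual cleaning is usually unnecessary, here is a rough pure-Python sketch of the kind of work an uncased BERT tokenizer does: lowercasing, stripping accents, splitting off punctuation, then greedy longest-match WordPiece splitting. This is a simplified illustration, not the library's actual implementation; the function names and `TOY_VOCAB` are made up for the example, and the real tokenizer handles more cases (for instance, extra symbols treated as punctuation).

```python
import unicodedata

def basic_tokenize(text):
    """Lowercase, strip accents, and split punctuation into separate tokens."""
    text = unicodedata.normalize("NFD", text.lower())
    # Drop combining marks (accents) left over after NFD decomposition.
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Mn")
    tokens, current = [], ""
    for ch in text:
        if ch.isspace():
            if current:
                tokens.append(current)
                current = ""
        elif unicodedata.category(ch).startswith("P"):
            # Punctuation becomes its own token.
            if current:
                tokens.append(current)
                current = ""
            tokens.append(ch)
        else:
            current += ch
    if current:
        tokens.append(current)
    return tokens

def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces are prefixed
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched, so the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

# Hypothetical toy vocabulary; the real bert-base-uncased vocabulary has about 30,000 entries.
TOY_VOCAB = {"hello", ",", "my", "dog", "is", "cute", "!", "play", "##ing"}

def tokenize(text, vocab=TOY_VOCAB):
    """Wrap the pieces in the [CLS] and [SEP] markers BERT expects."""
    tokens = ["[CLS]"]
    for word in basic_tokenize(text):
        tokens.extend(wordpiece(word, vocab))
    tokens.append("[SEP]")
    return tokens

print(tokenize("Héllo, my dog is playing!"))
# ['[CLS]', 'hello', ',', 'my', 'dog', 'is', 'play', '##ing', '!', '[SEP]']
```

Note that punctuation and common words such as "my" and "is" are kept as ordinary tokens rather than removed, which is why stripping them beforehand is not required.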
