The BertTokenizer.from_pretrained() method is a class method in the Hugging Face Transformers library that allows you to load a pre-trained tokenizer for the BERT model. This tokenizer converts text input into a format the BERT model can understand.
To use BertTokenizer.from_pretrained(), first make sure you have the transformers library installed:
pip install transformers
Next, load the tokenizer for a specific BERT model in your Python script:
from transformers import BertTokenizer
# Load the tokenizer for the 'bert-base-uncased' model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
There are several pre-trained BERT models available. Some common ones include:
- 'bert-base-uncased': The 12-layer base model, trained on lowercased text.
- 'bert-large-uncased': The 24-layer large model, also trained on lowercased text, with more layers and parameters for higher accuracy at greater computational cost.
- 'bert-base-cased': The base model, trained on text with its original casing preserved.
- 'bert-large-cased': The large model, trained on text with its original casing preserved.
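The practical difference between the cased and uncased checkpoints is the normalization applied to text before tokenization. As a rough pure-Python sketch (the real BasicTokenizer in transformers also splits punctuation and handles CJK characters), an uncased tokenizer lowercases the input and strips accents:

```python
import unicodedata

def uncased_normalize(text: str) -> str:
    # Sketch of the normalization an *uncased* BERT tokenizer applies:
    # lowercase, then drop combining accent marks after NFD decomposition.
    text = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")

print(uncased_normalize("Héllo Wörld"))  # hello world
```

A cased tokenizer skips this step, so "Apple" (the company) and "apple" (the fruit) remain distinct tokens.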
After loading the tokenizer, you can use it to tokenize text:
# Tokenize a single sentence
sentence = "This is John Wick"
encoded_sentence = tokenizer.encode(sentence)
print("Tokenize a single sentence")
print(encoded_sentence)
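Words that are missing from the model's vocabulary do not map to a single ID; BERT's WordPiece scheme splits them into subword pieces using a greedy longest-match-first rule, marking continuation pieces with a "##" prefix. A minimal sketch of that rule, using a tiny made-up vocabulary rather than BERT's real ~30,000-entry one:

```python
def wordpiece_split(word, vocab):
    # Greedy longest-match-first subword split, as in WordPiece:
    # repeatedly take the longest prefix (or "##"-prefixed chunk)
    # that exists in the vocabulary.
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub
            if sub in vocab:
                match = sub
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece matched at all
        pieces.append(match)
        start = end
    return pieces

toy_vocab = {"this", "is", "john", "wi", "##ck"}
print(wordpiece_split("wick", toy_vocab))  # ['wi', '##ck']
```

This is why a short sentence can produce more token IDs than it has words.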
Here is the complete code:
from transformers import BertTokenizer
# Load the tokenizer for the 'bert-base-uncased' model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# Tokenize a single sentence
sentence = "This is John Wick"
encoded_sentence = tokenizer.encode(sentence)
print("Tokenize a single sentence")
print(encoded_sentence)
Output
Tokenize a single sentence
[101, 2023, 2003, 2198, 15536, 3600, 102]
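Note that the list contains seven IDs for a four-word sentence. That is because encode() wraps every sequence in special tokens: 101 is [CLS] and 102 is [SEP]. A toy sketch of that wrapping step, reusing the IDs visible in the output above (the tiny vocabulary mapping here is illustrative, not BERT's real one):

```python
def encode_toy(tokens, vocab):
    # Sketch of what encode() does after tokenization: wrap the token
    # sequence in [CLS]/[SEP] and look up each token's integer ID.
    wrapped = ["[CLS]"] + tokens + ["[SEP]"]
    return [vocab[t] for t in wrapped]

# IDs for "this" and "is" taken from the output above; [CLS]/[SEP]
# are always 101/102 in BERT's vocabulary.
toy_vocab = {"[CLS]": 101, "[SEP]": 102, "this": 2023, "is": 2003}
print(encode_toy(["this", "is"], toy_vocab))  # [101, 2023, 2003, 102]
```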
Each word in the sentence has been mapped to an integer ID from the model's vocabulary. Notice that "wick" is not in the vocabulary as a whole word, so it was split into two subword IDs (15536, 3600).
That's it.