How to Use BertTokenizer.from_pretrained() Method in Transformers

The BertTokenizer.from_pretrained() method is a class method in the Hugging Face Transformers library that loads a pre-trained tokenizer for a BERT model. The tokenizer converts raw text into the token IDs that the BERT model expects as input.

To use BertTokenizer.from_pretrained(), first make sure you have the transformers library installed:

pip install transformers

Next, load the tokenizer for a specific BERT model in your Python script.

from transformers import BertTokenizer

# Load the tokenizer for the 'bert-base-uncased' model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

There are several pre-trained BERT models available. Some common ones include:

  1. 'bert-base-uncased': The base BERT model (12 layers), trained on lowercased text; the tokenizer lowercases input before splitting it.
  2. 'bert-large-uncased': The large BERT model (24 layers), also trained on lowercased text, with more layers and parameters for higher accuracy at a greater computational cost.
  3. 'bert-base-cased': The base BERT model trained on text with its original casing preserved.
  4. 'bert-large-cased': The large BERT model trained on case-preserving text, with more layers and parameters. The cased and uncased tokenizers split the same sentence differently, as the sketch after this list shows.
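
As a rough illustration of the cased/uncased difference, here is a small sketch that loads both base tokenizers and compares how they split the same sentence (the exact word pieces depend on the vocabulary that ships with each checkpoint):

from transformers import BertTokenizer

# Load an uncased and a cased tokenizer to compare their behavior
uncased_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
cased_tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

sentence = "This is John wick"

# The uncased tokenizer lowercases the text before splitting it into word pieces
print(uncased_tokenizer.tokenize(sentence))

# The cased tokenizer keeps the original capitalization, so capitalized words
# may be split into different word pieces
print(cased_tokenizer.tokenize(sentence))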

After loading the tokenizer, you can use it to tokenize text:

# Tokenize a single sentence

sentence = "This is John wick"
encoded_sentence = tokenizer.encode(sentence)
print("Tokenize a single sentence")
print(encoded_sentence)

See the complete code below.

from transformers import BertTokenizer

# Load the tokenizer for the 'bert-base-uncased' model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Tokenize a single sentence
sentence = "This is John wick"
encoded_sentence = tokenizer.encode(sentence)
print("Tokenize a single sentence")
print(encoded_sentence)

Output

Tokenize a single sentence
[101, 2023, 2003, 2198, 15536, 3600, 102]

You can see that the sentence was converted into a list of token IDs. The 101 at the start and 102 at the end are the IDs of BERT's special [CLS] and [SEP] tokens, which encode() adds automatically.
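
If you want to check which word pieces those IDs correspond to, you can map them back with the same tokenizer (a small sketch using the encoded_sentence from the code above):

# Map the token IDs back to their word pieces; this makes the special
# [CLS] and [SEP] tokens visible at the start and end of the sequence
tokens = tokenizer.convert_ids_to_tokens(encoded_sentence)
print(tokens)

# decode() reassembles the IDs into a readable string
print(tokenizer.decode(encoded_sentence))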

That’s it.
