How to Encode Multiple Sentences using transformers.BertTokenizer

To encode multiple sentences with transformers.BertTokenizer, first load a pre-trained tokenizer using the BertTokenizer.from_pretrained() class method, then call the tokenizer on a list of sentences. BertTokenizer.from_pretrained() is part of the Hugging Face Transformers library and loads the vocabulary and configuration of a pre-trained BERT tokenizer, which converts raw text into the numeric format the BERT model expects.

Before using the BertTokenizer.from_pretrained() method, make sure the transformers library is installed. If it is not, install it with the following command.

pip install transformers
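If you want to confirm the installation, a quick version check like the one below is enough. Note that the return_tensors="pt" option used later in this article also requires PyTorch (pip install torch).

# Confirm that transformers is importable and print its version
import transformers
print(transformers.__version__)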

Next, load the tokenizer for a specific BERT model as follows.

from transformers import BertTokenizer

# Load the tokenizer for the 'bert-base-uncased' model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
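As a quick sanity check, the loaded tokenizer can split a single sentence into WordPiece tokens. The exact token strings shown in the comment are illustrative.

# Inspect how the tokenizer splits a single sentence into WordPiece tokens
print(tokenizer.tokenize("Hello, how are you?"))
# expected output along the lines of: ['hello', ',', 'how', 'are', 'you', '?']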

Now, you can tokenize multiple sentences using BertTokenizer like this:

from transformers import BertTokenizer

# Load the tokenizer for the 'bert-base-uncased' model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Define your sentences
sentences = [
  "This is the John wick.",
  "I am a non-violent guy",
  "Don't Afraid of me!"
]

encoded_sentences = tokenizer(
  sentences, padding=True, truncation=True, return_tensors="pt")

print(encoded_sentences)

Output

{'input_ids': tensor([[ 101, 2023, 2003, 1996, 2198, 15536, 3600, 1012, 102],
[ 101, 1045, 2572, 1037, 2512, 1011, 6355, 3124, 102],
[ 101, 2123, 1005, 1056, 4452, 1997, 2033, 999, 102]]), 
'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1, 1, 1, 1]])}
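If you want to verify what each row of input_ids represents, one way (not part of the original example) is to convert the IDs back into tokens:

# Convert each row of input IDs back to tokens to sanity-check the encoding
for ids in encoded_sentences["input_ids"]:
    print(tokenizer.convert_ids_to_tokens(ids.tolist()))
# Each row starts with [CLS] and ends with [SEP]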

The tokenizer function takes a list of sentences, and you can use optional arguments like padding, truncation, and return_tensors to control the output format.

  1. padding=True: Pads the encoded sequences to the same length.
  2. truncation=True: Truncates the encoded sequences to the maximum model length.
  3. return_tensors="pt": Returns the output as PyTorch tensors. Replace "pt" with "tf" for TensorFlow tensors, or omit the argument to get plain Python lists (see the short sketch after this list).
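As a rough illustration of how these arguments change the result (the max_length value here is arbitrary):

# Without return_tensors, the same call returns plain Python lists
encoded_lists = tokenizer(sentences, padding=True, truncation=True)
print(type(encoded_lists["input_ids"]))   # <class 'list'>

# padding="max_length" pads every sentence to a fixed length instead
encoded_fixed = tokenizer(
    sentences, padding="max_length", truncation=True,
    max_length=16, return_tensors="pt")
print(encoded_fixed["input_ids"].shape)   # torch.Size([3, 16])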

The encoded_sentences variable now holds a dictionary containing input IDs, token type IDs, and attention masks. These can be fed directly into a BERT model for further processing.
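As a minimal sketch of that last step (assuming PyTorch is installed and the bert-base-uncased weights can be downloaded):

import torch
from transformers import BertModel

# Load the pre-trained BERT model that matches the tokenizer
model = BertModel.from_pretrained('bert-base-uncased')

# The encoded dictionary can be unpacked directly into the model
with torch.no_grad():
    outputs = model(**encoded_sentences)

# One hidden-state vector per token for each sentence: (batch, seq_len, hidden)
print(outputs.last_hidden_state.shape)   # torch.Size([3, 9, 768])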

That’s it.
