How to Fix TypeError: doc2bow expects an array of unicode tokens on input, not a single string


The "TypeError: doc2bow expects an array of unicode tokens on input, not a single string" error occurs when you pass a single string to the doc2bow() function from the Gensim library, which expects a list of unicode tokens instead.

The doc2bow() function is used to create a bag-of-words representation of a document. A bag-of-words representation is a list of the words that appear in a document, along with the number of times each word appears.

Before going further, install the gensim library if you have not already:

pip install gensim

Reproducing the error

from gensim.corpora import Dictionary

# Create a Dictionary
dct = Dictionary()

# Passing a raw string raises the error (doc2bow expects a list of tokens)
dct.doc2bow("This is a sentence.")

Output

TypeError: doc2bow expects an array of unicode tokens on input, not a single string

How to fix it?

To fix this error, convert the single string into a list of unicode tokens, for example with the split() method. split() breaks the string on whitespace into a list of words, where each word is a token.

from gensim.corpora import Dictionary

# Create a Dictionary
dct = Dictionary()

# Add document (correctly formatted)
document = "This is a sentence.".split()

# Add the document to the dictionary to update token-to-id mapping
dct.add_documents([document])

# Convert document to bag-of-words format
bow = dct.doc2bow(document)

print(bow)

Output

[(0, 1), (1, 1), (2, 1), (3, 1)]

The output of the doc2bow() function is a list of tuples. Each tuple contains the word ID and the number of times the word appears in the document.

That’s it!

