The "TypeError: doc2bow expects an array of unicode tokens on input, not a single string" error occurs when you pass a single string to the doc2bow() method of Gensim's Dictionary class, which expects a list of string tokens.
The doc2bow() method creates a bag-of-words representation of a document. A bag-of-words representation records which words appear in a document and how many times each one appears, ignoring word order. In Gensim, each word is represented by its integer ID from the dictionary.
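To make the idea concrete, here is a minimal sketch of a bag-of-words count using only the Python standard library. It does not use Gensim, and the helper name bag_of_words is a hypothetical one chosen for illustration; it only shows the counting idea, not how doc2bow() is implemented.

```python
from collections import Counter

def bag_of_words(tokens):
    """Count how often each token occurs in an already-tokenized document."""
    return Counter(tokens)

# The document is a list of tokens, not a single string
tokens = ["the", "cat", "saw", "the", "dog"]
counts = bag_of_words(tokens)

print(counts["the"])  # 2
print(counts["cat"])  # 1
```

Note that the input is already a list of tokens, which is the same shape of input that doc2bow() expects.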
Before going further, install the gensim library if you have not already done so:
pip install gensim
Reproducing the error
from gensim.corpora import Dictionary
# Create a Dictionary
dct = Dictionary()
# Pass a raw string instead of a list of tokens (raises TypeError)
dct.doc2bow("This is a sentence.")
Output
TypeError: doc2bow expects an array of unicode tokens on input, not a single string
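Why reject a string at all? Iterating over a Python string yields individual characters, so a string silently treated as a document would be counted letter by letter. The sketch below illustrates this with a hypothetical guard function, ensure_tokens, that mimics the kind of type check involved. It is a simplified stand-in, not Gensim's actual code.

```python
def ensure_tokens(document):
    """Hypothetical guard: reject a raw string, accept an iterable of tokens."""
    if isinstance(document, str):
        raise TypeError("expected a list of tokens, not a single string")
    return list(document)

# Iterating a string yields characters, not words
print(list("a b"))  # ['a', ' ', 'b']

# A raw string is rejected
try:
    ensure_tokens("This is a sentence.")
except TypeError as exc:
    print(f"TypeError: {exc}")

# A list of tokens passes through unchanged
print(ensure_tokens(["This", "is"]))  # ['This', 'is']
```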
How to fix it?
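To fix this error, convert the string into a list of tokens before passing it to doc2bow(), for example with the split() method. Called without arguments, split() breaks the string on whitespace and returns a list of words. Punctuation stays attached to the neighbouring word, so the last token here is "sentence." with the period included.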
from gensim.corpora import Dictionary
# Create a Dictionary
dct = Dictionary()
# Tokenize the document into a list of words
document = "This is a sentence.".split()
# Add the document to the dictionary to update token-to-id mapping
dct.add_documents([document])
# Convert document to bag-of-words format
bow = dct.doc2bow(document)
print(bow)
Output
[(0, 1), (1, 1), (2, 1), (3, 1)]
The output of the doc2bow() function is a list of tuples. Each tuple contains the word ID and the number of times the word appears in the document.
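Since the IDs alone are hard to read, it can help to translate them back into words. The following is a minimal sketch with a hypothetical helper, decode_bow, and a plain dict standing in for the ID-to-token mapping. The mapping shown is assumed for illustration; the actual IDs depend on the dictionary you built.

```python
def decode_bow(bow, id2token):
    """Replace each token ID in a bag-of-words list with its token."""
    return [(id2token[token_id], count) for token_id, count in bow]

# Stand-in mapping, assumed for illustration only
id2token = {0: "This", 1: "a", 2: "is", 3: "sentence."}
bow = [(0, 1), (1, 1), (2, 1), (3, 1)]

decoded = decode_bow(bow, id2token)
print(decoded)
```

A Gensim Dictionary supports lookup by ID (dct[token_id]), so passing dct in place of the plain dict should work the same way.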
That’s it!

Krunal Lathiya is a seasoned Computer Science expert with over eight years in the tech industry. He boasts deep knowledge in Data Science and Machine Learning. Versed in Python, JavaScript, PHP, R, and Golang. Skilled in frameworks like Angular and React and platforms such as Node.js. His expertise spans both front-end and back-end development. His proficiency in the Python language stands as a testament to his versatility and commitment to the craft.