The error "TypeError: doc2bow expects an array of unicode tokens on input, not a single string" occurs when you pass a single string to the doc2bow() method of the Gensim library's Dictionary class, which expects a list of string tokens.
The doc2bow() method creates a bag-of-words representation of a document. A bag-of-words representation records which words appear in a document and how many times each one appears, ignoring word order.
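To make the idea concrete, here is a minimal sketch in plain Python, without Gensim. The names bag_of_words and token2id are illustrative only and are not Gensim's API. The sketch assigns IDs in order of first appearance, which is an assumption of this example and may differ from the IDs Gensim assigns.

```python
from collections import Counter


def bag_of_words(tokens, token2id):
    """Return sorted (token_id, count) pairs, assigning new IDs as needed."""
    for token in tokens:
        # Give each previously unseen token the next free integer ID.
        token2id.setdefault(token, len(token2id))
    counts = Counter(token2id[token] for token in tokens)
    return sorted(counts.items())


token2id = {}  # hypothetical token-to-ID mapping, filled in as tokens are seen
bow = bag_of_words("the cat saw the dog".split(), token2id)

print(token2id)
print(bow)
```

Here "the" occurs twice, so its ID is paired with a count of 2, while every other token has a count of 1.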
Before going further, install the Gensim library if you have not already done so:
pip install gensim
Reproducing the error
from gensim.corpora import Dictionary

# Create a Dictionary
dct = Dictionary()

# Add document (incorrectly formatted: a raw string instead of a list of tokens)
dct.doc2bow("This is a sentence.")
TypeError: doc2bow expects an array of unicode tokens on input, not a single string
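The guard is presumably there because a Python string is itself iterable: looping over it yields single characters, not words. The following sketch uses only the standard library (no Gensim involved) to compare counting over a raw string with counting over a list of tokens.

```python
from collections import Counter

text = "This is a sentence."

# Iterating over a string yields one character at a time,
# so this counts characters (including spaces and the period).
char_counts = Counter(text)

# Iterating over a list of tokens yields whole words.
word_counts = Counter(text.split())

print(char_counts["s"])   # occurrences of the character "s"
print(word_counts["is"])  # occurrences of the token "is"
```

Counting over the raw string would silently produce a bag of characters, which is almost never what is wanted, so raising an error early is the safer behaviour.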
How to fix it?
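To fix this error, convert the string into a list of tokens before passing it to doc2bow(), for example with the split() method. split() breaks the string on whitespace and returns a list of words, each of which is a string token. Note that punctuation stays attached to the word it follows, so "sentence." is kept as a single token.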
from gensim.corpora import Dictionary

# Create a Dictionary
dct = Dictionary()

# Add document (correctly formatted)
document = "This is a sentence.".split()

# Add the document to the dictionary to update token-to-id mapping
dct.add_documents([document])

# Convert document to bag-of-words format
bow = dct.doc2bow(document)
print(bow)
[(0, 1), (1, 1), (2, 1), (3, 1)]
The output of doc2bow() is a list of tuples. Each tuple contains a token ID from the dictionary and the number of times that token appears in the document.