MemoryError: unable to allocate array with shape and data type float32 while using word2vec in python
Solution 1
Ideally, you should paste the text of your error into your question, rather than a screenshot. However, I see the two key lines:
<TIMESTAMP> : INFO : estimated required memory for 2372206 words and 400 dimensions: 8777162200 bytes
...
MemoryError: unable to allocate array with shape (2372206, 400) and data type float32
After making one pass over your corpus, the model has learned how many unique words will survive, and reports how large a model must be allocated: one taking about 8777162200 bytes (about 8.8GB). But when trying to allocate the required vector array, you're getting a MemoryError, which indicates that not enough addressable memory (RAM) is available.
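As a rough sanity check on that estimate (a sketch using the numbers from the log lines above; each float32 value takes 4 bytes):

vocab_size, dims = 2372206, 400
bytes_per_array = vocab_size * dims * 4   # one float32 array: 3795529600 bytes, ~3.8GB
# training keeps at least two such arrays (input and output weights),
# plus per-word vocabulary overhead, which is how the total reaches ~8.8GB
print(bytes_per_array)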
You can either:
- run where there's more memory, perhaps by adding RAM to your existing system; or
- reduce the amount of memory required, chiefly by reducing either the number of unique word-vectors you'd like to train, or their dimensional size.
You could reduce the number of words by increasing the default min_count=5 parameter to something like min_count=10, min_count=20, or min_count=50. (You probably don't need over 2 million word-vectors – many interesting results are possible with just a vocabulary of a few tens-of-thousands of words.)
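For instance, a minimal sketch (assuming the gensim 3.x parameter name size, renamed vector_size in gensim 4+, and a hypothetical corpus file wiki.txt) that lets you see how many words survive a given min_count:

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

model = Word2Vec(size=400, min_count=50)     # no corpus yet: just configure
model.build_vocab(LineSentence('wiki.txt'))  # one pass over the corpus; also allocates the (now smaller) arrays
print(len(model.wv.vocab))                   # words surviving min_count=50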
You could also set a max_final_vocab value to specify an exact number of unique words to keep. For example, max_final_vocab=500000 would keep just the 500000 most-frequent words, ignoring the rest.
Reducing the size will also save memory. A setting of size=300 is popular for word-vectors, and would reduce the memory requirements by a quarter. Together, using size=300, max_final_vocab=500000 should trim the required memory to under 2GB.
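Put together, that might look like the following sketch (again assuming gensim 3.x parameter names and a hypothetical input file; 500000 words * 300 dimensions * 4 bytes is 0.6GB per vector array):

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

model = Word2Vec(LineSentence('wiki.txt'), size=300, window=5,
                 max_final_vocab=500000,
                 workers=4)
model.save('wiki.model')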
Solution 2
I encountered the same problem while working on a pandas DataFrame. I solved it by converting float64 columns to uint8 (of course, only for columns that don't actually need to be float64; you can also try float32 instead of float64):
import numpy as np

data['label'] = data['label'].astype(np.uint8)
If you encounter conversion errors:
data['label'] = data['label'].astype(np.uint8, errors='ignore')
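To confirm the savings, you can compare the column's memory footprint before and after conversion (a sketch with a made-up million-row DataFrame; the column name 'label' follows the snippet above):

import numpy as np
import pandas as pd

data = pd.DataFrame({'label': np.zeros(1000000)})  # float64 by default
print(data['label'].memory_usage(deep=True))       # roughly 8000000 bytes
data['label'] = data['label'].astype(np.uint8)
print(data['label'].memory_usage(deep=True))       # roughly 1000000 bytes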
suraj
Updated on July 09, 2022

Comments
-
suraj over 1 year
I am trying to train a word2vec model on Wikipedia text data; for that, I am using the following code:
import logging
import os.path
import sys
import multiprocessing

from gensim.corpora import WikiCorpus
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)

    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))

    # check and process input arguments
    if len(sys.argv) < 3:
        print(globals()['__doc__'])
        sys.exit(1)
    inp, outp = sys.argv[1:3]

    model = Word2Vec(LineSentence(inp), size=400, window=5, min_count=5,
                     workers=multiprocessing.cpu_count())

    # trim unneeded model memory = use (much) less RAM
    model.init_sims(replace=True)
    model.save(outp)
But after the program has been running for 20 minutes, I get the following error:
-
Oliver W. over 4 years
So, you have a memory error. The machine you're trying to run Word2Vec on doesn't have enough memory to process the entire input. Consider making the input smaller, or read up on distributed frameworks.
-
suraj over 4 years
@Oliver W. Can you tell me how I can do it using distributed frameworks, as I am new to this field? I am using a Windows machine.
-
Oliver W. over 4 years
That's a really broad question, suraj, and I believe you're better served by reading some material first. Check out modern frameworks, like Apache Spark. But ask yourself whether you need that. Perhaps all you really need is to reduce the problem size?
-
suraj over 4 years
Here onward I will paste the error text. Thanks for the wonderful description and solution you have given.
-
user3710004 over 2 years
What does "size" actually measure in a vectorizer?
-
gojomo over 2 years
In these word2vec implementations, size usually refers to the number of dimensions in the trained word-vectors. (In recent versions of Gensim, this parameter's name has been changed to vector_size.)
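For example (assuming gensim 4+), you would write Word2Vec(sentences, vector_size=300) where older code wrote Word2Vec(sentences, size=300).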