word2vec lemmatization of corpus before training


Solution 1

I think it really depends on the task you want to solve with this.

Essentially, by lemmatization you make the input space sparser, which can help if you don't have enough training data.

But since Word2Vec is usually trained on fairly big corpora, if you have enough training data, lemmatization shouldn't gain you much.
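
As a rough illustration of the setup being discussed (lemmatize first, then train), here is a minimal sketch assuming gensim 4.x and NLTK with its 'punkt' and 'wordnet' data downloaded; the toy corpus and the training parameters are purely illustrative.

```python
from gensim.models import Word2Vec
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

lemmatizer = WordNetLemmatizer()

raw_corpus = [
    "The cells in these samples were dividing rapidly.",
    "Each cell divides once it has grown large enough.",
]

# Lemmatize each token (WordNetLemmatizer defaults to noun POS, so it mainly
# collapses plural nouns such as "cells" -> "cell"; verbs would need POS tags).
sentences = [
    [lemmatizer.lemmatize(tok.lower()) for tok in word_tokenize(sent)]
    for sent in raw_corpus
]

# Train Word2Vec on the lemmatized sentences (tiny, illustrative parameters).
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=20)
print(model.wv.most_similar("cell", topn=3))
```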

Something more interesting is how to do tokenization with respect to the existing dictionary of word-vectors inside the W2V model (or anything else). For example, "Good muffins cost $3.88\nin New York." needs to be tokenized to ['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New York.'] so that you can replace each token with its vector from W2V. The challenge is that some tokenizers may tokenize "New York" as ['New', 'York'], which doesn't make much sense. (For example, NLTK makes this mistake: https://nltk.googlecode.com/svn/trunk/doc/howto/tokenize.html) This is a problem when you have many multi-word phrases.
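
One possible way to keep such phrases together is sketched below, assuming NLTK's MWETokenizer and a hand-written phrase list; in practice the phrase list would be derived from the multi-word entries already present in the vector vocabulary.

```python
from nltk.tokenize import MWETokenizer, word_tokenize

text = "Good muffins cost $3.88\nin New York."

# Plain word_tokenize splits the phrase into separate tokens 'New' and 'York'.
plain = word_tokenize(text)

# Re-merge known multi-word phrases so they match multi-word entries
# in the vector vocabulary (e.g. "New York").
mwe = MWETokenizer([("New", "York")], separator=" ")
merged = mwe.tokenize(plain)

print(plain)
print(merged)
```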

Solution 2

The current project I am working on involves identifying gene names within biology paper abstracts using the vector space created by Word2Vec. When we run the algorithm without lemmatizing the corpus, mainly two problems arise:

  • The vocabulary gets way too big, since you have words in different forms which in the end have the same meaning.
  • As noted above, your space gets less sparse, since you get more representatives of a certain "meaning"; but at the same time, some of these meanings might get split among their representatives. Let me clarify with an example.

We are currently interested in a gene recognized by the acronym BAD. At the same time, "bad" is an English word with different forms (badly, worst, ...). Since Word2Vec builds its vectors based on the probability of the context (the surrounding words), if you don't lemmatize some of these forms, you might end up losing the relationship between some of these words. In the BAD case, you might then end up with a word closer to gene names than to adjectives in the vector space.
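
A minimal sketch of one way to lemmatize while protecting such acronyms; the all-caps heuristic is an assumption on my part, not part of the approach described above.

```python
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def normalize(token):
    # Leave short all-caps tokens (likely acronyms / gene symbols) untouched,
    # so "BAD" the gene is never merged with "bad" the adjective.
    if token.isupper() and 2 <= len(token) <= 6:
        return token
    return lemmatizer.lemmatize(token.lower())

tokens = ["BAD", "proteins", "regulate", "apoptosis", "in", "these", "cells"]
print([normalize(t) for t in tokens])
```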


Comments

  • Luca Fiaschi
    Luca Fiaschi almost 2 years

    Word2vec seems to be mostly trained on raw corpus data. However, lemmatization is a standard preprocessing step for many semantic similarity tasks. I was wondering if anybody had experience in lemmatizing the corpus before training word2vec and whether this is a useful preprocessing step.

  • Luca Fiaschi
    Luca Fiaschi almost 10 years
    >> "Something more interesting is, how to do tokenization with respect to the existing disction of words-vectors inside the W2V (or anything else)" what do you mean with tokenization in this context?, Thanks
  • Daniel
    Daniel almost 10 years
    Like "Good muffins cost $3.88\nin New York." needs to be tokenized to ['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New York.'] Then you can replace it with its vectors from W2V. The challenge is that some tokenizers my tokenize "New York" as ['New' 'York'], which doesn't make much sense. (For example NLTK is making this mistake nltk.googlecode.com/svn/trunk/doc/howto/tokenize.html) This is a problem when you have many multi-word phrases.
  • samsamara
    samsamara over 8 years
    >>"Essentially by lemmatization you make the input space sparser". did you mean if you keep both the lemmatized and the original form of the tokens? otherwise wouldn't lemmatization make the input space much smaller?
  • Eli Korvigo
    Eli Korvigo over 7 years
    Lemmatisation makes data denser, thus reducing the amount of data required for adequate training.
  • lucid_dreamer
    lucid_dreamer about 6 years
    Make the tokens case sensitive if the acronym is always BAD.
  • N4ppeL
    N4ppeL almost 6 years
    But then you get a lot of noise for every word that occurs at the beginning of a sentence and therefore starts with a capital letter. Another approach would be to use POS-tagging to identify "bad" either as a noun or as an adjective (see the sketch below).
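
A minimal sketch of that POS-tagging idea, assuming NLTK with the 'punkt' and 'averaged_perceptron_tagger' data downloaded; appending the tag to each token is just one possible way to keep homographs apart.

```python
from nltk import pos_tag, word_tokenize

sentence = "The weather in New York was bad."
tagged = pos_tag(word_tokenize(sentence))

# Append the tag to each token (e.g. something like 'bad/JJ') so homographs
# end up as distinct vocabulary entries when training.
tokens = ["{}/{}".format(word, tag) for word, tag in tagged]
print(tokens)
```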