Best machine learning approach to automate text/fuzzy matching

18,812

To frame it as an ML problem, you could learn a similarity function.

Instead of classifying "Acme Corp" as matching the target class "Acme" (a classifier), you would learn a function that can tell that "Acme Corp" is similar to "Acme" but dissimilar to "ABC Corp".

This is usually called "similarity learning". In your case it is perhaps more specifically "ranking similarity learning", since your goal is not to learn a function that outputs a similarity value but to rank potential candidates.
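To make the ranking idea concrete, here is a minimal sketch. It does not learn anything: it uses the standard library's `difflib.SequenceMatcher` as a stand-in for a learned similarity function, and the names `similarity`, `rank_candidates`, `pick_match` and the `min_score` cutoff are illustrative assumptions, not part of any existing tool.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Fixed stand-in for a learned similarity: a ratio in [0, 1], case-insensitive."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def rank_candidates(query, candidates):
    """Return (score, candidate) pairs sorted from most to least similar."""
    scored = [(similarity(query, c), c) for c in candidates]
    return sorted(scored, reverse=True)


def pick_match(query, candidates, min_score=0.5):
    """Return the top-ranked candidate, or None if nothing is similar enough.

    min_score is an assumed cutoff; in practice it would be tuned on labelled data.
    """
    ranked = rank_candidates(query, candidates)
    if ranked and ranked[0][0] >= min_score:
        return ranked[0][1]
    return None


if __name__ == "__main__":
    for score, name in rank_candidates("Acme", ["ABC Corp", "Acme Corp"]):
        print(f"{score:.2f}  {name}")
```

Returning `None` when no candidate clears the cutoff mirrors the case where the human reviewer decides that none of the results is a match.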

But before using full ML algorithms, I would start with a string distance metric, for instance the Levenshtein distance (very common, and implementations are easy to find). Transform your data into positive and negative examples (a positive example: "Acme" is a match for "Acme Corp"). The simplest learning function would then be to find the edit distance threshold that maximizes your score. You can also add preprocessing options such as "remove Corp.", "remove Ltd", etc., and find which combination works best.
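A rough sketch of that baseline, under some assumptions: the labelled data is a list of `(query, candidate, is_match)` tuples, the score being maximized is plain accuracy, and the suffix list in `SUFFIXES` is only an example. The helper names (`normalize`, `levenshtein`, `best_threshold`) are made up for illustration.

```python
import re

# Example legal suffixes to strip; which ones to include is an assumption to tune.
SUFFIXES = re.compile(r"\b(corp|ltd|inc|llc)\b\.?", re.IGNORECASE)


def normalize(name, strip_suffixes=True):
    """Lowercase, optionally drop legal suffixes, and collapse punctuation/whitespace."""
    name = name.lower()
    if strip_suffixes:
        name = SUFFIXES.sub("", name)
    return " ".join(re.sub(r"[^a-z0-9 ]", " ", name).split())


def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def best_threshold(pairs, strip_suffixes=True):
    """Find the distance threshold t that maximizes accuracy of 'distance <= t means match'.

    pairs: list of (query, candidate, is_match). Returns (threshold, accuracy).
    """
    dists = [(levenshtein(normalize(q, strip_suffixes),
                          normalize(c, strip_suffixes)), y)
             for q, c, y in pairs]
    best = (0, -1.0)
    for t in range(max(d for d, _ in dists) + 1):
        acc = sum((d <= t) == y for d, y in dists) / len(dists)
        if acc > best[1]:
            best = (t, acc)
    return best


if __name__ == "__main__":
    pairs = [
        ("Acme", "Acme Corp", True),
        ("Acme", "ABC Corp", False),
        ("Stack Overflow", "Stack Overflow Ltd.", True),
        ("Stack Overflow", "Stacking Overflowing Shelves Ltd.", False),
    ]
    # Compare preprocessing options and keep whichever scores best.
    for strip in (True, False):
        print(strip, best_threshold(pairs, strip_suffixes=strip))
```

Running `best_threshold` once per preprocessing option is the "find which combination works best" step; with real data the comparison should be made on held-out examples rather than on the pairs used to pick the threshold.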

Author: Anonymous

Updated on June 19, 2022

Comments

  • Anonymous
    Anonymous almost 2 years

    I'm reasonably new to machine learning; I've done a few projects in Python. I'm looking for advice on how to approach the problem below, which I believe could be automated.

    A user in a data quality team in my organisation has a daily task: he takes a list of company names (with addresses) that have been manually entered, then searches a database of companies to find the matching result using his judgement, i.e. there is no hard and fast rule.

    An example of the input would be:

    Company Name, Address Line 1, Country

    Of this, the user takes the company name and enters it into the search tool, where he is presented with a list of results. He picks the best match, but may choose not to pick any. The search tool is built in-house and talks to an external API. I have access to the source code, so I can modify the search tool to capture the input and the list of results, and I could add a checkbox to record which result was used and another to signify that none was chosen. This would become my labelled training data.

    The columns used from the results to make the judgement are roughly the same:

    Company Name, Address Line 1, Country

    Given a company name like Stack Overflow, the search may return Stack Overflow Ltd., Stacking Overflowing Shelves Ltd., etc. The input data is reasonably good, so a search usually yields about 10 results, and to a human it's fairly obvious which one to pick.

    My thought is that with enough training data I could call the API directly with the search term, and then choose the appropriate result from the list of results.

    Is this something that could be achieved through ML? I'm struggling with the fact that the data will be different every time. Thoughts on the best way to achieve this are welcome, in particular how to structure the data for the model and what kind of classifier to use etc.