Check out this video if you prefer to listen to or watch this content:
Background
Before we talk about PolyFuzz and fuzzy matching, its numerous applications in the world of SEO, its limitations and pitfalls, and how to get started with it regardless of your coding experience, let’s first take a moment to thank the sponsor of today’s article – Ahrefs.
I want to give a huge shoutout to Ahrefs for sponsoring this article following a popular request from this Twitter thread, and specifically to the Ahrefs Webmaster Tools, which are perfect for small website owners, as they’re totally free to use and super easy to get around. If you are looking to make improvements to your website and bring instant value on a budget, the Webmaster Tools are the perfect place to start.
What is string matching?
String matching in machine learning is a problem that dates back to as early as the 1980s. The problem essentially measures the distance between two strings and, on that basis, calculates a similarity score between them – or, alternatively, makes an approximate match to classify the strings as equivalent, similar, or distant.
A study from Hall and Dowling (1980), reads:
Approximate matching of strings is reviewed with the aim of surveying techniques suitable for finding an item in a database when there may be a spelling mistake or other error in the keyword. The methods found are classified as either equivalence or similarity problems.
In the same study, the authors describe that some of the reasons for implementing string matching algorithms are broadly: error correction or information retrieval.
Error correction approaches the problem from a corruption-correction point of view: given a specified input, the program identifies patterns in a large corpus of data, finds similarity mismatches, and corrects the errors.
Information retrieval is all about providing an input that best describes the information we are trying to retrieve from the dataset. Here, there are two risks – the program returning unwanted words, and missing the words that are required.
The similarity problem in string matching is about understanding the approximation between two strings – in other words, how alike are the two strings? So, let’s talk about how this is measured in string matching.
In programmable information systems, string variation is measured by errors in spelling and typing. Early studies in the field found that mistaking one letter for another was the most common typing error, followed by omitting a letter or inserting an extra one by mistake.
What are the different methods in string matching and what do they do?
Since the introduction of string matching, a lot of work has been done in the field with a number of different algorithms, and methods introduced.
In this section, I’d like to briefly go over the main types of string matching used, and the main libraries that you might encounter starting to do this type of work.
Exact matching – Method and Limitations
Exact matching, elsewhere referred to as direct matching, is a method that searches a text directly for occurrences of the exact pattern (or partial occurrences of it), comparing characters one by one.
The Boyer–Moore algorithm is one of the best-known pattern matching algorithms and is considered to be very fast in practice. It is designed for exact matching of a single keyword (pattern) against a text. Here is how it works in practice:
- you have an input string (keyword) that you want to find in a dataset entry (the text)
- the algorithm aligns the keyword against the text and compares the characters from right to left
- if all the characters match, an exact match has been found at that position
- if there is a mismatch, the algorithm uses the mismatching character to decide how far the keyword can safely be shifted along the text, skipping alignments that cannot possibly match
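To make the steps above concrete, here is a minimal Python sketch of the bad-character rule, the simpler of Boyer–Moore’s two shift heuristics (the function name and example strings are my own):

```python
def boyer_moore_find(text: str, pattern: str) -> int:
    """Return the index of the first occurrence of pattern in text,
    or -1 if absent, using only the bad-character heuristic."""
    # Rightmost position of each character in the pattern
    last = {ch: i for i, ch in enumerate(pattern)}
    n, m = len(text), len(pattern)
    i = 0
    while i <= n - m:
        j = m - 1
        # Compare the pattern against the text from right to left
        while j >= 0 and pattern[j] == text[i + j]:
            j -= 1
        if j < 0:
            return i  # full match found at position i
        # Shift so the mismatching text character lines up with its
        # rightmost occurrence in the pattern (or skip past it entirely)
        i += max(1, j - last.get(text[i + j], -1))
    return -1

print(boyer_moore_find("string matching in machine learning", "matching"))  # 7
```

Because mismatches let the keyword jump several positions at once, the algorithm often inspects only a fraction of the text’s characters.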
Distance-based matching, the most prominent being Levenshtein and Jaro – Method and Limitations
The edit distance algorithm is one of the most widely used ways to find the distance between two strings. The edit distance between two strings ‘s’ and ‘t’ is the minimum number of edit operations needed to convert the string ‘s’ into ‘t’. Here is how this works:
- you have a keyword (input) and a dataset entry
- the program calculates the minimum number of character edits (insertions, deletions, and substitutions) needed in order to get from the input to the entry
The limitation of this method is that it is based on simple character-level edits, without any comprehension of the semantic similarity between the two keywords. For instance, the words HARD and HAND will be considered more similar than the words HARD and HARDER, as the latter pair needs two character additions, whereas the former needs just one character replacement.
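The HARD/HAND/HARDER comparison can be checked with a short dynamic-programming implementation of Levenshtein distance (a sketch; the function name is my own):

```python
def levenshtein(s: str, t: str) -> int:
    """Minimum number of insertions, deletions, and substitutions
    needed to turn s into t (classic dynamic-programming version)."""
    prev = list(range(len(t) + 1))  # distances from "" to each prefix of t
    for i, cs in enumerate(s, start=1):
        curr = [i]  # distance from s[:i] to ""
        for j, ct in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1,               # delete cs from s
                curr[j - 1] + 1,           # insert ct into s
                prev[j - 1] + (cs != ct),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]

print(levenshtein("HARD", "HAND"))    # 1 (one substitution)
print(levenshtein("HARD", "HARDER"))  # 2 (two insertions)
```

As the output shows, the typo-like pair scores “closer” than the semantically related pair – exactly the limitation described above.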
Phonetic Matching such as Metaphone – Method and Limitations
Phonetic matching plays a key role in information retrieval in multilingual environments, where differences in pronunciation or writing style for words with the same meaning may be present. In such cases, the phonetic matching technique is also used for languages other than English.
Some popular examples of such algorithms are Metaphone, DMetaphone, Caverphone, and the New York State Identification and Intelligence System (NYSIIS) phonetic code. However, amongst all of these algorithms, studies have shown that Metaphone excels compared to the other techniques for all types of errors (e.g. misspellings, missing letters, swapped letters, extra letters, etc.), followed by Caverphone and NYSIIS.
Some criticisms or limitations of these algorithms are that they are not very precise, with most of them returning a large number of false positives. Such algorithms don’t detect all close matches, and it’s important to adapt the type of algorithm used to the type of database it’s going to sift through.
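To show the general idea behind phonetic codes, here is a sketch of Soundex, an older and simpler relative of Metaphone (this is a simplified American Soundex; the h/w adjacency subtleties are deliberately omitted):

```python
def soundex(word: str) -> str:
    """Simplified American Soundex: words that sound alike
    map to the same letter-plus-three-digits code."""
    groups = [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
              ("l", "4"), ("mn", "5"), ("r", "6")]
    code = {ch: d for letters, d in groups for ch in letters}
    digits = [code.get(ch, "0") for ch in word.lower()]  # vowels/h/w/y -> "0"
    # Collapse runs of the same digit, then drop the vowel markers
    collapsed = [d for i, d in enumerate(digits) if i == 0 or d != digits[i - 1]]
    tail = [d for d in collapsed[1:] if d != "0"]
    return (word[0].upper() + "".join(tail) + "000")[:4]

print(soundex("Robert"), soundex("Rupert"))  # R163 R163 -- a phonetic match
print(soundex("Smith"), soundex("Smyth"))    # S530 S530
```

Note how differently spelled names collapse onto the same code – which is also where the false positives criticized above come from.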
N-gram String matching – Method and Limitations
N-gram matching refers to detecting the occurrences of a fixed set of pattern arrays as embedded sub-arrays in an input array. In simple terms, here is how this works:
- you enter the keyword “what is string matching in machine learning” as your input
- this can be split into 2-grams (bi-grams) like “what is”, “is string”, “string matching”, “matching in”, “in machine”, “machine learning”, or 3-grams (tri-grams) like “what is string”, “is string matching”, “string matching in”, “matching in machine”, “in machine learning”, or 4-grams, 5-grams, 6-grams, or one 7-gram.
- all of these variations are searched for in the dataset and categorized based on the n-gram input – e.g. matches a bi-gram, or matches a 6-gram – with similarity based on the presence of the n-grams in the dataset entry
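The splitting and scoring steps above can be sketched in a few lines of Python, with n-gram overlap (Jaccard similarity) as a simple scoring function (the function names are my own):

```python
def word_ngrams(text: str, n: int) -> list[str]:
    """Split a phrase into overlapping word n-grams."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def jaccard(a: str, b: str, n: int = 2) -> float:
    """Share of n-grams two phrases have in common."""
    ga, gb = set(word_ngrams(a, n)), set(word_ngrams(b, n))
    return len(ga & gb) / len(ga | gb) if ga | gb else 0.0

print(word_ngrams("what is string matching in machine learning", 2))
# ['what is', 'is string', 'string matching', 'matching in',
#  'in machine', 'machine learning']
print(jaccard("string matching in machine learning",
              "string matching in deep learning"))
```

Because scoring only needs set intersections over precomputed n-grams, it scales well to large candidate lists.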
N-gram based algorithms are efficient for quickly extracting matches from data involving large numbers of patterns.
This type of string matching algorithm has a number of different applications, among which (but certainly not limited to):
- similarity detection for plagiarism identification,
- keyword or key-phrase detection from a large corpus of data (e.g. from the sphere of SEO what title or meta description matches a niche keyphrase or parts of it),
- finding articles that mention a certain keyword pattern (e.g. how-to articles, ultimate guide articles, etc).
TF-IDF String matching
Cosine similarity with tf-idf is a well-established metric for comparing text, which has been adapted for flexibly matching a query string with values in a single attribute of a relation.
TF-IDF analyzes the corpus of words as a whole, and weights each token as more important to the string if it is less common in the corpus, as highlighted in this project by Adrial Pearl.
Some limitations are that this approach does not take into consideration semantic similarities between the input and the found database entries, and it is also not speedy when tuned for high accuracy.
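A stdlib-only sketch of the idea – character trigrams weighted by smoothed TF-IDF, compared with cosine similarity (all names and the toy corpus are my own; production code would typically use scikit-learn’s `TfidfVectorizer` instead):

```python
import math
from collections import Counter

def trigrams(s: str) -> Counter:
    s = f" {s.lower()} "  # pad so word boundaries become trigrams too
    return Counter(s[i:i + 3] for i in range(len(s) - 2))

def tfidf_vectors(docs: list) -> list:
    tfs = [trigrams(d) for d in docs]
    df = Counter(tok for tf in tfs for tok in tf)  # document frequency
    n = len(docs)
    # Smoothed IDF so tokens shared by all docs keep a small weight
    idf = {tok: math.log((1 + n) / (1 + df[tok])) + 1 for tok in df}
    return [{tok: cnt * idf[tok] for tok, cnt in tf.items()} for tf in tfs]

def cosine(u: dict, v: dict) -> float:
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0

docs = ["cheap running shoes", "cheap runing shoes", "garden furniture sale"]
v = tfidf_vectors(docs)
print(cosine(v[0], v[1]))  # high: near-duplicate with a typo
print(cosine(v[0], v[2]))  # low: unrelated strings
```

The character-level trigrams are what make this robust to typos, while the IDF weighting downplays trigrams that appear everywhere in the corpus.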
What are some commonly used string-matching libraries that you might encounter?
Fuzzy Pandas (Python)
Fuzzy Pandas is a simple, robust, and lean library that allows you to do fuzzy matching with pandas data frames. You can find the Python project description, code snippets, and docs here.
PolyFuzz (Python)
In the most general sense, PolyFuzz can be used for fuzzy string matching, grouping, and evaluation.
PolyFuzz uses different fuzzy string matching techniques as a framework, such as Levenshtein distance, TF-IDF character-based, and n-gram methods. This framework can be customized to model fuzzy string matching, which is what makes the library not only powerful but also very valuable to string-matching tasks.
What are the benefits of using PolyFuzz?
One huge benefit is the ability to customize the algorithm you use – with just a few lines of code you can quickly implement different matching models, depending on the need and your data.
In addition to this, another benefit is that with PolyFuzz you can also select the edit distance algorithm, specifically if you don’t want to be hindered by the limitations of basic algorithms (e.g. exact match, or Levenshtein distance match). There is an abundance of edit distance algorithms out there that you might want to use, and we’ve only covered a small fraction of them in this article. PolyFuzz allows for experimentation and customization like no other package. You can use any distance measure – see the documentation here.
Check out Maarten Grootendorst’s tutorial on string matching with BERT, TF-IDF, and other algorithms and distance measures, leveraged via PolyFuzz.
Another cool thing about PolyFuzz is the ability to match, group, and visualize multiple models within a single PolyFuzz instance, which can be used in the model selection process to compare the performance of different algorithms on a sample of your dataset.
Fuzzywuzzy (Python)
Fuzzywuzzy is a Python library that uses Levenshtein Distance to calculate the differences between sequences and patterns. This works in a way we’ve already explained – by calculating the number of corrections needed in order to go from the identified misspelling of an entry to the input keyword.
Check out this amazing tutorial for getting started with this library, published by Catherine Gitau.
What can fuzzy matching be used for in the context of SEO work?
Fuzzy Matching for Internal Link Opportunities Identification
Finding Similarity Between Two Strings – Keywords, URLs, Titles
The quickest way to find how similar two strings are is to use fuzzy matching as an Apps Script Google Sheets formula, or FUZZYLOOKUP as a formula in Excel.
If you want to use Excel, the FUZZYLOOKUP formula is developed as an add-on and you can find a great tutorial on using it here. Essentially, the formula is an advanced version of VLOOKUP, using advanced mathematics to calculate the probability that what it finds matches up with your search entry, which means the tool works even when characters (numbers, letters, punctuation) do not match up exactly.
However, the FuzzyLookup application in Google Sheets, implemented as an Apps Script, is a bit more nuanced. It’s easier to install and get going with, and also a bit more user-friendly to use than the Excel equivalent.
Let’s look at how to quickly assess the similarity of pages, titles, or keywords.
Finding Internal page linking opportunities in the same topic cluster or across different page groups
Based on an extraction of your pages’ content from Screaming Frog, you can also find similar pages to link to.
The recommendation here is to use this only for pages that are somewhat similar in nature in terms of the content structure (e.g. product pages), as otherwise you will be comparing apples to oranges. Another important note is that this will not help you evaluate content semantics as part of the process, so ensure you are aligned on the limitations of fuzzy matching before proceeding.
It’s also best to keep the comparisons relatively low in terms of volume, so better to compare paragraphs of a page, or titles, as opposed to the entire content of the page. Whilst this is great for experimentation, it’s also very important to review the provided recommendations and sense-check them after, to ensure the links made are sensible.
Fuzzy Matching for Competitor Research
Performing Competitor Analysis of URL and Title Differences, identifying Keyword Use Opportunities
Greg Bernhardt also created a script and Streamlit app, using PolyFuzz to perform a competitive analysis of URLs and other site data, such as titles.
The aim of the created tools is to find differences between your ranking URLs’ structure, titles, and use of keywords and those of your competitor, discover where they are outranking you (via the use of the Semrush API), and highlight opportunities and quick wins.
Fuzzy Matching for Redirect Mapping & Content Plagiarism Identification
Evaluating URL Redirect Mapping Outcomes and/ or Content Plagiarism
Francis Angelo Reyes presented a nifty URL Redirect mapping tool using Beautiful Soup for content scraping after a redirect project has been executed, and PolyFuzz for analysis to quickly check for content similarity. The aim of the tool, in the words of Francis, is to:
Find identical URLs in a website migration. This is useful when you need to map out the pages from ‘current’ or ‘old’ domain to the ‘new’ domain.
Francis Angelo Reyes, Lupage Digital
The tool can also double as a content plagiarism checker – overall a super useful app.
This type of check can also be done via the Google Sheets template and video I shared earlier.
Match 404s to existing content and generate a redirect list
The legendary technical SEO and Python wiz Greg Bernhardt created a simple but effective Python script, using PolyFuzz to match 404s to existing content and generate a redirect list. For each amazing use-case, Greg releases a script and a Streamlit app as well – check these out on his website.
Fuzzy Matching for Keyword Clustering is not recommended – Here’s why
I do want to mention that PolyFuzz, and fuzzy matching in general, can be used for keyword clustering and grouping of keywords in the process of keyword research, or of backlinks in the process of backlink research. The code for doing that can be found in the library’s documentation itself.
However, as Lee Foot has stated after his experiments in using PolyFuzz for keyword clustering, this is not at all an ideal or recommended way to do keyword clustering, for the reasons mentioned in the theoretical portion of this article. Namely, that fuzzy matching performs simple character-level edits, as opposed to seeking semantic relations between the words in the cluster. Hence, as Lee states:
While it got the job done, there were always some head-scratching clusters which I felt the original result could be improved on. Words that shared a similar pattern of letters would be clustered even if they were unrelated semantically.
Lee Foot, Search Engine Journal
Final thoughts and additional resources
So, now that you know how fuzzy matching works, its benefits, limitations, and use cases for SEO, you have all the tools needed to go and test it yourself. This is undoubtedly one of the easiest ways to get started with machine learning for SEOs, as it has benefits in terms of automation and scalability, but also some pretty significant limitations, which make it useful in certain contexts but useless in others.
In this article, I’ve featured the work of amazing creators in the SEO space, but I also want to leave you with some additional learning, if you want to dig deeper into this topic. Here are some great resources and creators that have written about fuzzy matching and its use cases:
- Maarten Grootendorst has written a great intro to Fuzzy Matching for Python – jump into the code directly here.
- Raoof Naushad has written another great guide on the different fuzzy matching methods.
- If you’re a video-learning type of person (like myself), here are two videos that will walk you through fuzzy matching:
- String Similarity, Find Best Match in JavaScript & Google Sheets – (*) the code for my template is from this video
- Google Colab Tutorial – Fuzzy Match Lookup with Google Sheets Data Using Python Fuzzy Pandas
- Needless to say, the docs will be your best friend when building custom tools, so read the documentation
- Another cool string similarity js app script by Stephen Brown
Thank you for reading and happy learning!
Frequently Asked Questions
What is string matching?
String matching in machine learning is a problem that dates back to as early as the 1980s. The problem essentially measures the distance between two strings and, on that basis, calculates a similarity score between them – or, alternatively, makes an approximate match to classify the strings as equivalent, similar, or distant.
What are the different methods in string matching and what do they do?
1. Exact matching searches a text directly for occurrences of the exact pattern, comparing characters one by one.
2. Distance-based matching, the most prominent being Levenshtein and Jaro, evaluates the distance between two strings by calculating the minimum number of edit operations needed to convert one string into the other.
3. Phonetic matching checks strings for phonetical similarities and differences and is suitable for multilingual assessments.
4. N-gram matching detects the occurrences of a fixed set of pattern arrays as embedded sub-arrays in an input array.
5. TF-IDF matching analyzes the corpus of words as a whole, and weights each token as more important to the string if it is less common in the corpus.
What can fuzzy matching be used for in the context of SEO work?
Fuzzy Matching can be used for:
– Finding Similarity Between Two Strings – Keywords, URLs, Titles, which can help with cannibalization analysis or duplication analysis
– Finding Internal page linking opportunities in the same topic cluster or across different page groups
– Performing Competitor Analysis of URL and Title Differences, identifying Keyword Use Opportunities
– Evaluating URL Redirect Mapping Outcomes and/ or Content Plagiarism
– Matching 404s to existing content and generating a redirect list
What is Fuzzy matching not recommended for in SEO work?
Fuzzy matching is not recommended for large-scale clustering of keywords in keyword research, as the models typically used have no semantic comprehension. When working with big datasets, it’s best to use more advanced approaches such as deep learning methods or entity extraction.