With the rise of interest in AI as a technology that can make our jobs as marketers easier, and the recent surge in generative AI, there has never been a time when AI resources for marketing were more accessible or more popular. With that come a few challenges:
- Newcomers rarely learn anything beyond LLMs, or even stop at ChatGPT (or similar chatbots) as their interface for working with LLMs.
- That puts a ceiling on what can really be achieved with predictable, repeatable, and trustworthy results when it comes to AI in marketing.
- Which in turn leads to people abandoning the technology, worsening the risk of an AI winter.
In this article, I’ll cover three things:
- what you need to get started in Machine Learning (if you’ve been postponing it)
- the tasks that you will get immediate value from
- and what you actually need to grow as a machine learning engineer. I'm kidding. As an SEO or organic search marketer that uses machine learning, because let's face it, we're not becoming machine learning engineers anytime soon.
This post is based on several talks and masterclasses I did recently (for TechSEO Connect, BrightonSEO, and Search ‘n Stuff Antalya), the deck for which you can find here.
If you prefer video, here it is:
How to approach Machine Learning as an SEO
Overcome limiting beliefs and focus on what you need to execute a task
Most people getting into machine learning struggle with limiting beliefs:
- negative past experiences – maybe you started a project and failed, or it took you way too long to execute, making you give up on it
- imposter syndrome – thinking that you're not technical enough, that this goes over your head, or 'I'm not good enough to do this', 'why should I be doing this?', 'isn't there someone more technical to do this?'
These are challenges that not only marketers trying to do something technical struggle with, but also data scientists and machine learning engineers whenever they approach a new model or niche. What you really need to know is:
- When to search for ML
- What model to use
- How to find suitable ML tools
- What you can achieve in a short time-frame
- How to drive value via ML
For every project you work on, you need to think about three things: the characteristics of the task, the data, and the solution.
How to understand the characteristics of the ML task you are trying to do
In terms of task characteristics, a very basic way to think about machine learning is to split it into supervised and unsupervised ML.
Supervised ML simply means that you have a way to validate the results the model gives you. Unsupervised ML is when you don't. Examples of supervised learning are regression and classification; for unsupervised, clustering or dimensionality reduction.
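To make the distinction concrete, here's a minimal Python sketch (using scikit-learn, with made-up toy data) showing a supervised classifier next to an unsupervised clusterer:

```python
# pip install scikit-learn
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

# Supervised: we have labels, so we can validate the model's predictions.
X = [[0.1], [0.2], [0.9], [1.0]]  # toy feature values
y = [0, 0, 1, 1]                  # known labels to learn from
clf = LogisticRegression().fit(X, y)
print(clf.predict([[0.15], [0.95]]))  # -> [0 1]

# Unsupervised: no labels – the model just groups similar points,
# and there is no "right answer" to check the clusters against.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)
```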
This is a very simplified view of the most common things you'll be working on, but I want to highlight that machine learning is a massive field. The fact of the matter is, you're not going to become an expert in two days (regardless of what some LinkedIn post tells you).
The ML field is massive, so bear that in mind. You're doing something very small, implementing a small task to aid your workflow by borrowing technology from an otherwise massive field. That's a good mindset to start with.
Whenever you are scoping the task – in other words, what you're actually going to do – also think about whether you have the data to self-train a model, whether to use a pre-trained model, or whether to fine-tune one.
- To Self-Train a model means to develop a machine learning model using a new or existing architecture from scratch with your own data. This requires a ton of data and a considerable amount of expertise, not only for training, but also for testing and validation.
- To use a Pre-Trained Model means to use a model that has already been trained by a third party. This approach is the easiest to get started with, and requires the least amount of data, if any (see the sketch after this list).
- To Fine-Tune a model means taking an ML model that a third party has developed and re-training it on a more specific, custom dataset to improve and adapt its performance. This works very well when you are working in a niche domain but executing an otherwise popular task, like classifying medical images related to a specific disease.
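To show how low the barrier for the pre-trained route is, here's a minimal sketch using Hugging Face's transformers library – the pipeline below pulls the library's default pre-trained sentiment model, but you can swap in any model that fits your task:

```python
# pip install transformers
from transformers import pipeline

# Using a pre-trained model: no training data needed on our side.
classifier = pipeline("sentiment-analysis")
print(classifier("This template saved me hours of manual tagging."))
# -> [{'label': 'POSITIVE', 'score': 0.99...}]
```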
Understand what data you are working with
When it comes to data, ask yourself: is your data textual, numeric, image-based, or time-series?
Understand the characteristics of the solution you are trying to implement
Then, when it comes to the solution, ask yourself these questions:
- Is the task mission-critical? Does your job depend on the project you're working on? – If it does, don't rely on AI.
- Do you need consistent results every time? – If so, unsupervised machine learning is not the way to go. Transformers, LLMs, and generative AI are not the way to go. Deep learning is also not the way to go.
- Do you need your results to be easy to understand and explain to stakeholders? – If so, again, deep learning is not the way to go.
- If the goal is to simply outperform the current methods that you have (and that can be – in terms of time saved, quality, speed, or anything else), then go with an automated solution.
Always assess the usefulness of machine learning models against multiple factors: how easy the model is to deploy, what the bottom line is, how much time you'll be saving, how easy it is to actually get started with the technology, whether you need it to scale across an organization, and so forth.
So, now you know: when you have an idea and want to find a suitable approach, keep your queries specific to these three characteristics – task, data, and solution.
What ML models to try to get immediate value out of Machine Learning as an SEO
Text classification
Text classification is a supervised ML approach where you have a list of categories (labels, buckets) on one side, and content that you'd like to sort into those buckets on the other. We are working with textual data, we need a prediction, and the labels come from the API we'll be using.
By using Google's Natural Language API for text classification, you have a list of over 1,300 categories to sort your content into, and by using the MLforSEO Text Classification Template in Google Sheets, you can complete the process in less than 20 minutes.
Here’s a summary of the process:
- Create an API key
- Identify the content to scrape
- Scrape the content with Screaming Frog or your crawler of choice
- Enter the URLs and the content, then get the data back (a code sketch of the underlying API call follows this list)
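If you'd rather call the API outside of Google Sheets, here's a hedged Python sketch of the underlying classification request against the Natural Language API's REST endpoint (the API key and content are placeholders):

```python
# pip install requests
import requests

API_KEY = "YOUR_GOOGLE_CLOUD_API_KEY"  # placeholder – use your own key
url = f"https://language.googleapis.com/v1/documents:classifyText?key={API_KEY}"

payload = {"document": {"type": "PLAIN_TEXT", "content": "Your scraped page content here."}}
response = requests.post(url, json=payload).json()

# Each category comes back with a confidence score you can threshold on.
for category in response.get("categories", []):
    if category["confidence"] >= 0.7:
        print(category["name"], category["confidence"])
```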
This is a great approach because you don't just get one category: you get primary, secondary, and tertiary categories, and even deeper levels for some. You also get a confidence score, which is useful because it lets you set a threshold – say, everything below 70 percent is not a good classification. That helps justify sending some articles for manual review and reclassification, or fine-tuning the model with AutoML.
MLforSEO also has a free Looker Studio Template for visualising the output data from text classification, where you can visually explore the classification categories, and modify the filters.
Why not use an LLM for text classification?
With OpenAI, for instance, the results are hit or miss: GPT is a general-purpose model that is not efficient on large datasets and produces unreliable, unpredictable results. It's unsupervised ML – generative AI on a transformer architecture – meaning it's great at being creative, not at being precise.
How to implement this in SEO?
Classify your content articles and pages with the API. The primary categories it returns can become the primary categories on your website – things like news, arts and entertainment, business, etc. You can use the sub-category levels to improve the tagging system in place.
Topic Modelling
Content clustering, more commonly known as topic modelling, is similar in concept to text classification, but without the need for predefined labels. Instead, we work solely with the content, aiming to identify patterns and similarities across topics. The goal is to group similar pieces of content together based on their inherent characteristics.
This approach is especially useful when you’re faced with a large number of categories — for instance, when your site’s content doesn’t align perfectly with existing categories (like those from Google’s NLP API). By clustering content, you can identify and create new, relevant groups that better reflect your site’s structure.
By implementing something like LDA topic modelling, you can get your site's main topics, how these topics are connected (how semantically similar they are to one another), and also how they interconnect across separate pages. Every topic is tackled by different pages simultaneously, because there are subtopics mentioned in the content – as humans, we don't write about just one thing; we mention different entities and semantically related concepts as we go.
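As a minimal sketch of what LDA looks like in practice (scikit-learn, with toy documents standing in for your scraped page content):

```python
# pip install scikit-learn
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "credit cards and personal loans for students",
    "best travel credit cards with airline miles",
    "how to book cheap flights and hotels",
]

# LDA works on word counts, so vectorise the documents first.
vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# Top words per discovered topic.
words = vec.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    print(f"Topic {i}:", [words[j] for j in topic.argsort()[-4:]])

# Each row shows how strongly a page mixes the topics –
# the 'pages tackle several topics at once' idea from above.
print(lda.transform(X))
```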
BERTopic is another, more recent alternative to LDA, which gives you the opportunity to build your own custom model by making different choices at each step of the model's pipeline.
This works very well if you're a really big nerd and a techie – you're doing topic modelling all the time for different companies, industries, and websites, and you want to test different configurations to see what works best. BERTopic follows the same principles (a code sketch follows the list):
- It embeds the documents
- It semantically groups them together and
- it extracts the topic models.
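Under the hood that's only a few lines – a sketch with BERTopic's defaults (sentence-transformer embeddings, UMAP, HDBSCAN), where `docs` is your list of page texts and realistically needs to be a few hundred documents or more:

```python
# pip install bertopic
from bertopic import BERTopic

docs = ["...list of page or article texts..."]  # your scraped content

topic_model = BERTopic()  # default pipeline: embed -> reduce -> cluster -> extract
topics, probs = topic_model.fit_transform(docs)

print(topic_model.get_topic_info())  # one row per discovered topic
```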
How to implement this in SEO?
We already talked about classification: get those classes – they become your primary categories, an overall name or group for your content. Then you can also get topic models and subtopics. And then you can extract entities, and link articles together based on these three things:
- You can link pages that mention the same topic or subtopic
- Improve your categorization and tagging system on your website
- Link based on the same or semantically related entities
Keyword Clustering
Let's talk about keyword clustering. Here, we are reducing the noise from the keyword – essentially taking a keyword and selecting only one or two words from it, depending on your configuration, to say: these are the most important, most semantically meaningful terms in this text.
How KeyBERT works is: it tokenizes the terms, extracts embeddings, and then gives you the most important word. In simple terms, from every sentence, from every keyword, you get the one word that is the most vital.
I highly recommend this video to learn KeyBERT, but essentially you can identify a single word, or choose to identify a bigram – a pair of two words (which, importantly, don't need to follow one another in the keyword you provide) – or you can even feed in a full content piece to identify its main terms, for instance to check whether you're overstuffing it with keywords. A minimal code sketch follows below.
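Here's roughly what that looks like in Python, using KeyBERT's documented API (the example keyword is made up):

```python
# pip install keybert
from keybert import KeyBERT

kw_model = KeyBERT()
doc = "best running shoes for flat feet and overpronation"

# Single most important terms (unigrams).
print(kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 1), top_n=3))

# Bigrams – built after stop-word removal, so the two words
# don't have to sit next to each other in the original keyword.
print(kw_model.extract_keywords(doc, keyphrase_ngram_range=(2, 2), stop_words="english", top_n=3))
```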
But maybe you don't want to fiddle with any of this. You just want to hand it a Google Sheets spreadsheet and have it give back the keyword categories, the clusters, and all of that. You can do that.
So, here's an MLforSEO Google Colab I created: you give it the keywords, and it gives you back the primary term and the bigram for each. Awesome, right?
Entity Extraction and Analysis
Why is it important to think not just about content categories and topics, but also about the entities that make up a topic? Well, because that's what all of the search engines are doing, and it's what people do as well.
Whenever you are thinking about a certain topic, you’re also thinking about all of the things that need to happen, people involved, and concepts related to that topic. Take travelling to a conference for instance – you’re thinking about your travel and companies (entities) like KLM or Uber that you will use to facilitate it; you’re thinking about the people you will meet (entities), the companies that will exhibit there, the things you will do and landmarks you will see.
The process of identifying entities shouldn't take you more than 20 minutes if you follow the MLforSEO Tutorial on Entity Analysis and use the attached no-code template for extracting entities from text in Google Sheets. Here's how it works (a code sketch of the API call follows the list):
- You get an API key from Google Cloud for their NLP API
- You insert your content in the template
- You run the script and get the entity data in seconds
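For reference, here's a hedged Python sketch of the request the template makes for you (placeholders throughout; note that entity-level sentiment comes from the separate analyzeEntitySentiment method):

```python
# pip install requests
import requests

API_KEY = "YOUR_GOOGLE_CLOUD_API_KEY"  # placeholder – use your own key
url = f"https://language.googleapis.com/v1/documents:analyzeEntities?key={API_KEY}"

payload = {
    "document": {"type": "PLAIN_TEXT", "content": "KLM flew me to BrightonSEO in April."},
    "encodingType": "UTF8",
}
response = requests.post(url, json=payload).json()

for entity in response.get("entities", []):
    print(
        entity["name"],                                   # the entity itself
        entity["type"],                                   # e.g. ORGANIZATION, LOCATION
        entity["salience"],                               # how prominent it is in the text
        entity.get("metadata", {}).get("wikipedia_url"),  # Wikipedia link, when one exists
    )
```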
What's awesome about this is that you get not only the identified entity from the content, but also what type of entity it is, how prominent it is, what the sentiment around it is, the different variations in how it's been mentioned, and links to any Wikipedia pages related to the entity.
What data points can you use this on in SEO?
You can analyze your own content, analyze your competitors' content, or scrape YouTube videos and analyze the entities being discussed there. You can do the same for Reddit and any social mentions; you can analyze the customer feedback you're collecting that's just sitting there, gathering dust, with nobody analyzing it; and you can also look at internal link anchor texts, and all of this cool stuff.
Does it matter how you do it? Why not use ChatGPT?
Yeah, of course it does. Don't use ChatGPT for entity analysis: the data it provides is not reliable, and it doesn't give you all of the additional data points we mentioned. Use a model that has been custom-trained for this job.
Why does an API like Google Cloud's win, or even the equivalents from Azure or AWS? Because they're trained for this job. They give you a ton of precise data – data you can actually analyze further. And they don't hallucinate.
Fuzzy matching
Let's talk about fuzzy matching – measuring the similarity between two strings. I have a 30-minute video on this concept, plus a blog post and a no-code template, so I'm not going to bore you with the details (there's a quick code sketch after the list below).
In a nutshell, this can be implemented in SEO for things like:
- Redirect mapping
- Title, H1, and content 1:1 mapping and similarity assessments (which can help with duplication management).
- SERP analysis – things like checking title similarity
- and also more advanced use cases like detecting the need for structured data on your website
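As a minimal sketch of the redirect-mapping use case (using thefuzz, the maintained successor to FuzzyWuzzy; the URLs are made up):

```python
# pip install thefuzz
from thefuzz import process

old_urls = ["/blog/keyword-research-guide", "/blog/link-building-tips"]
new_urls = ["/resources/keyword-research", "/resources/link-building"]

# For each retired URL, find the closest match among the new URLs.
for old in old_urls:
    best_match, score = process.extractOne(old, new_urls)
    print(f"{old} -> {best_match} (similarity: {score})")
```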
Content Moderation
Content moderation can be done with a module of the Google NLP API (the same API discussed in the text classification and entity analysis sections). The same API can also be used for sentiment analysis and syntax analysis.
What the content moderation model does is automatically analyze text for inappropriate or undesirable content, helping you maintain a clean and professional dataset without manually reviewing each entry.
When I researched this module and worked with it, I thought: this is just detecting whether the content is Your Money or Your Life (YMYL), because it sounds awfully similar to how Lily Ray has described that concept – topics that affect your happiness, health, financial stability, or safety.
And if you look at the categories this Google-owned API covers, it is in fact those four areas.
To do content moderation on your own website's content with this API, follow the MLforSEO Tutorial for content moderation and use the associated no-code template for identifying toxic content in Google Sheets. The process is simple (a code sketch follows the list):
- Scrape the content to analyse
- Get your API key and run the formulas in the sheet
- Get data back on whether your content falls into any of these categories
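Here's a hedged sketch of what the template does behind the scenes, using the API's moderateText method (placeholders for the key and content; check the current docs for the exact response fields):

```python
# pip install requests
import requests

API_KEY = "YOUR_GOOGLE_CLOUD_API_KEY"  # placeholder – use your own key
url = f"https://language.googleapis.com/v1/documents:moderateText?key={API_KEY}"

payload = {"document": {"type": "PLAIN_TEXT", "content": "Your scraped content here."}}
response = requests.post(url, json=payload).json()

# Each moderation category comes back with a confidence score.
for category in response.get("moderationCategories", []):
    if category["confidence"] >= 0.5:
        print(category["name"], category["confidence"])
```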
The possible data points for content moderation analysis are your own content, your competitors' content, YouTube video transcripts, social media mentions again, and comments on community posts you manage on your website (like forums) – just to make sure that everything associated with your brand is safe.
Speech-to-text transformation
Speech-to-text transformation can be amazing for cases where you have a presence on audio or video platforms, but no good written counterpart of that content on your blog. In that case, you can transcribe the content.
There are a ton of tools you can use, including no-code approaches you can incorporate into your day-to-day for smaller projects. But the models from the bigger companies are still better, because they've been trained on big datasets – though even OpenAI's Whisper has limited quality for languages other than English.
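If you want to try the self-hosted route, transcription with Whisper is only a handful of lines (a sketch; the audio file name is hypothetical):

```python
# pip install openai-whisper
import whisper

# "base" is fast but rough; "medium" or "large" are more accurate.
model = whisper.load_model("base")
result = model.transcribe("webinar_recording.mp3")  # hypothetical file
print(result["text"])
```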
A caveat – I'm not saying that you should:
- Spam your blog with auto-transcribed content
- Scrape your competitor's YouTube videos, transcribe them, and 'get traffic to the moon, bro'
I'm not saying that. What I am saying is that you can bridge the gaps between different teams if you're working in an enterprise organization: if they are creating really good webinar-style interviews and content, you can actually turn that into written pieces.
Make your content work harder, distribute better, and use transcription for competitor analysis as well if you are trying to enter YouTube as a niche.
Mix and match different approaches: for instance, you can scrape audio, transcribe it, classify the content, categorize it, create subtopics, extract entities, and all of this cool stuff. Incorporate that into your strategy and boom – you're not only thinking about blog content, you're thinking about the YouTube landscape as well.
Text-to-speech transformation
Text-to-speech works in the opposite direction, transforming written content into audio or even video snippets.
Here, I'm not saying:
- Fill YouTube with trash – please don't, we already have enough of it.
What I am saying is that certain content formats don't really require a rich video production to accompany them – tutorials, documentation, or interview-style pieces that only exist in blog format. You can also add automation like person-to-camera videos (such as Synthesia avatars) without having to record hundreds or thousands of full-blown tutorials, by creating digital avatars that work quite well for content that doesn't need to be as polished.
Text-to-text transformation
Text-to-text transformation is where LLMs shine. Any kind of LLM works for tasks like:
- Transforming your blog content into social media posts (sketched in code at the end of this section) – shout out to the custom GPTs from Caitlin Hathaway
- Transforming your blog posts into newsletter editions
- Transforming PDFs into guides, and so on
Other use cases for text-to-text transformation that utilise LLMs, as well as other approaches, were also recently published on MLforSEO:
- Combining structured data with generative AI to create content based on databases of information like statistics for products, product reviews, or other product information. This was described in this tutorial by Elias Dabbas, How to use generative AI with structured data for programmatic SEO
- Combining fuzzy matching, GSC data, and generative AI to rewrite your titles or other metadata to be more closely aligned with user queries. The process is described in Natzir Ruiz's tutorial, How to Automatically Optimize your SEO Metadata with FuzzyWuzzy and OpenAI in Google Colab
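As a minimal sketch of the blog-post-to-social-post idea using OpenAI's Python SDK (the model name and prompt are purely illustrative):

```python
# pip install openai
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from your environment

blog_post = "...your article text..."

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative – use whatever model fits your budget
    messages=[
        {"role": "system", "content": "You turn blog posts into short LinkedIn posts."},
        {"role": "user", "content": blog_post},
    ],
)
print(response.choices[0].message.content)
```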
How to grow as a marketer using Machine Learning
I've shared a lot of ideas. Hopefully, you'll implement at least some of them. But please know, this is just scratching the surface!
What I think you need to grow in your skillset are three things:
- The right mindset.
- A community of people learning, failing, and achieving things right alongside you
- Resources that span beginner, intermediate, and advanced levels
That’s what I’m trying to build at MLforSEO.com
Check it out. Join the newsletter and the Slack community (with over 400 people in it already). Start learning with the free templates and tutorials, and implement everything discussed in this blog post.
There will also be an academy launching later this year, with courses available to preorder.