Export the content of a website to a spreadsheet or CSV in under 5 minutes, using a custom extraction

When solving real-life business problems with data science and NLP, you almost always need to create a dataset first, on which you can then run machine learning models.

Web scraping, in simple terms, is extracting data from the web, or more specifically – from websites.

Web scraping with Screaming Frog is useful not only for SEOs but also for data scientists who want to skip the coding bit and get to their datasets quicker.

This tutorial walks you through using the Screaming Frog SEO Spider’s custom extraction feature to scrape data from websites in the quickest and easiest way: through a CSSPath.

A couple of things (or limitations, if you wish) to note about this approach before we start:

Custom extractions are not available in the free version of the tool, so to access this feature you need to pay a yearly fee (£149.00).
You can only extract text from pages that can be crawled by the SEO Spider, so they should return a 200 (OK) status code.
You can switch to JavaScript rendering mode to extract data from the rendered HTML.
This tutorial does not cover web scraping from the HTML using XPath and regex, but both are thoroughly covered in Screaming Frog’s own blog post on custom extractions.

To get started, you’ll need to download & install the SEO Spider software.

1. Copy the selector of the element you want to extract text from.

First, you need to locate the element that contains the text on the page and copy the selector.

To do this, open your browser’s developer tools (right-click the page and choose Inspect) to examine the HTML.

To identify the element, you can read the code or, if you are less familiar with HTML, hover over different parts of the page to see which part of the code corresponds to each element.

As you hover, the matching element is highlighted in dark blue.

Right-click the element in the code and select Copy > Copy Selector.

Copy the selector of a webpage’s HTML, image by author
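To make the idea of a selector more concrete, here is a minimal Python sketch of roughly what a CSSPath extraction with the Extract Text option does: find the matching element and keep only its text. It is not Screaming Frog’s implementation, and it is a simplified stand-in for real CSS selector matching, since it only handles a single class name (a copied selector is usually longer). The sample HTML, the `post-content` class and the `ClassTextExtractor` helper are made up for illustration.

```python
from html.parser import HTMLParser

# Tags without a closing tag; they must not change the nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}

# Hypothetical page: the text we want lives in <div class="post-content">.
SAMPLE_HTML = """
<html><body>
  <nav class="menu">Home | Blog</nav>
  <div class="post-content">
    <p>Web scraping is <b>extracting data</b> from websites.</p>
    <p>No code required.</p>
  </div>
</body></html>
"""


class ClassTextExtractor(HTMLParser):
    """Collect the text inside every element that carries a given class."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.depth = 0      # > 0 while the parser is inside a matching element
        self.chunks = []    # text pieces of the element currently being read
        self.results = []   # one cleaned-up string per matching element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth > 0:
            self.depth += 1          # nested tag inside the match
        elif self.class_name in classes:
            self.depth = 1           # entering a matching element
            self.chunks = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self.depth == 0:
            return
        self.depth -= 1
        if self.depth == 0:          # leaving the matching element
            # Join the pieces and collapse runs of whitespace.
            self.results.append(" ".join(" ".join(self.chunks).split()))

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def extract_text(html, class_name):
    """Return the text of each element with the given class, tags removed."""
    parser = ClassTextExtractor(class_name)
    parser.feed(html)
    parser.close()
    return parser.results


print(extract_text(SAMPLE_HTML, "post-content"))
```

The navigation text is skipped because it sits outside the selected element, which is why picking a selector that wraps only the content you care about matters.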

Now, let’s head over to Screaming Frog.

2. Set up a custom extraction in Screaming Frog.

Open Screaming Frog.

Click on the Configuration menu and select Custom > Extraction.

Create a custom extraction with Screaming Frog, image by author

In terms of configuration, you first need to name the extraction. Bear in mind that this name becomes the name of the column that holds the extracted data later on.

In my example, I’ve named it ‘Content’.

Then, select ‘CSSPath’ and paste the selector you copied in step 1. Finally, select the ‘Extract Text’ option.

Configure your custom extraction in Screaming Frog, image by author

Click on OK to close the extraction menu.

Then, run the crawl.

3. Export your data.

Once your crawl has finished, you can navigate to the Custom Extraction tab to export only the data from the extraction you set up. You can also export the entire crawl in one spreadsheet.

Screaming Frog enables data exports in CSV, Excel files, or even directly to a Google Sheet. The possibilities really are endless.
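If you do want to pick the export up in code afterwards, a minimal Python sketch could look like the one below. It assumes the CSV has an `Address` column for the URL and a `Content 1` column for the extracted text (the extraction name followed by a number); check the header row of your own export, as the column names here are assumptions. The sample rows are made up and stand in for a real file.

```python
import csv
import io

# Made-up stand-in for an exported CSV; the column names are assumptions.
SAMPLE_EXPORT = '''"Address","Status Code","Status","Content 1"
"https://example.com/a","200","OK","First page text."
"https://example.com/b","200","OK",""
"https://example.com/c","404","Not Found",""
'''


def load_extracted_text(csv_file, column="Content 1"):
    """Return {url: text} for every row with non-empty text in `column`."""
    reader = csv.DictReader(csv_file)
    return {
        row["Address"]: row[column].strip()
        for row in reader
        if (row.get(column) or "").strip()
    }


# Pages with nothing extracted (empty cell) are dropped from the dataset.
dataset = load_extracted_text(io.StringIO(SAMPLE_EXPORT))
print(dataset)
```

With a real export you would pass an opened file instead of the in-memory sample, for example `open("custom_extraction.csv", newline="", encoding="utf-8")`, where the file name is hypothetical.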

Final Thoughts

As I mentioned, this tutorial is just scraping the surface (no pun intended) of what this tool can do, so I highly suggest checking out the XPath examples for web scraping provided by Screaming Frog’s content team.

Data in this format is easy to analyze, and the tool can be especially useful for NLP and data science professionals who specialize in text analytics. And as I mentioned before, it’s totally code-free.

2 Comments

Valerio Cusinato


September 3, 2021, 10:20 am

I tried your method, but it is not working in Google Colab or in Jupyter Notebook. Some libraries seem to be missing. Which ones do you use to get the code working?

Thank you!
- Lazarina Stoy
  
  
  September 5, 2021, 7:49 am
  
  Hi Valerio,
  
  I am assuming this is in response to the automation of meta descriptions article.
  
  Please try to troubleshoot by ensuring that all necessary modules are installed, specifically in the console, try:
  
  If that doesn’t work, see if this can resolve the issue: https://github.com/dmmiller612/bert-extractive-summarizer/issues/81
  Or any of the other common errors with this model, resolved in these threads: https://github.com/dmmiller612/bert-extractive-summarizer/issues
  This is another common issue I am seeing amongst people, so try this as well: https://github.com/dmmiller612/bert-extractive-summarizer/issues/48