
Screaming Frog Exclude WordPress URLs: Huge List of 65+ URLs

If you’re running Screaming Frog Spider on a WordPress website, use this list of WordPress URL exclusions to keep the wrong URLs out of your crawl.

In this post, I’m going to provide a list of WordPress folders, files, and other URLs to exclude from your Screaming Frog spider crawls.

If you run a standard/default Screaming Frog Spider crawl on a WordPress website, you may run into some unnecessary results in your sitemap.

This post will show how to exclude them — and tell you which ones to exclude.

Already know what you’re doing? Jump to the full list. Otherwise, if you want to learn the how and why of these sitemap exclusions, keep reading.

Let’s dive in.

 


Screaming Frog Spider Sitemaps

If you’ve been in SEO for more than a few days, especially if you focus on technical or on-page SEO, you’ve definitely created a sitemap.

And Screaming Frog’s Spider tool is by far the best in the business.

A sitemap is a blueprint of your website that helps search engines find, crawl, and index all of your website’s content.

These giant URL lists tell search engines which pages on your site are most important.

You don’t NEED a sitemap. As Google puts it:

“If your site’s pages are properly linked, our web crawlers can usually discover most of your site.”

But there are a few cases where a sitemap is a huge help, like if your website is brand new, has recently changed a ton of URLs, or is big (1,000+ pages).

Unless your internal linking is PERFECT and all of your hundreds or thousands of URLs have earned external backlinks, search bots are going to have a hard time finding all of those pages.

That’s where sitemaps come in.

 


A Good Sitemap is a Clean Sitemap

But not EVERYTHING needs to be in your sitemap. In fact, you definitely shouldn’t put everything in there.

The whole point of a sitemap—and technical SEO in general—is to make sure you’re giving Google the best intel possible on how and where to crawl (and therefore index and rank) your pages.

Therefore, you only want to submit your important, public, searchable pages in your sitemap. The others should be excluded.

That’s where Screaming Frog’s exclude options come in.

From Screaming Frog’s Configuration > Exclude documentation

The exclude configuration allows you to exclude URLs from a crawl by supplying a list of regular expressions. A URL that matches an exclude is not crawled at all (it’s not just ‘hidden’ in the interface).

This will mean other URLs that do not match the exclude, but can only be reached from an excluded page will also not be found in the crawl.

The exclude list is applied to new URLs that are discovered during the crawl. This exclude list does not get applied to the initial URL(s) supplied in crawl or list mode.
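To make the matching behavior concrete, here is a minimal Python sketch. It assumes each exclude pattern has to match the whole URL, not just part of it, and it uses Python’s `re` module as a stand-in for Screaming Frog’s own regex engine, which may differ in small ways. The `EXCLUDES` list and the `is_excluded` helper are hypothetical names for illustration only.

```python
import re

# Hypothetical exclude list: one regular expression per line,
# the same way you would type them into the Exclude box.
EXCLUDES = [
    r"https://example\.com/wp-admin/.*",
    r"https://example\.com/wp-login\.php",
]

def is_excluded(url: str, patterns: list[str]) -> bool:
    """Return True if any pattern matches the entire URL."""
    return any(re.fullmatch(pattern, url) for pattern in patterns)

# Everything under /wp-admin/ is caught by the trailing .*
print(is_excluded("https://example.com/wp-admin/options.php", EXCLUDES))  # True
# A blog post that merely mentions wp-admin in its slug is not.
print(is_excluded("https://example.com/blog/wp-admin-tips/", EXCLUDES))   # False
```

Testing a handful of real URLs like this before a big crawl is a cheap way to catch an exclude that is too greedy.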

Most SEO pros who have used Screaming Frog’s Spider tool are probably already familiar with the Exclude option. But in case you need a refresher, check out Screaming Frog’s extensive guide to the Exclude configuration.

 

What should your sitemap exclude?

Let’s look at the WordPress folders, files, and URLs you’ll likely want to exclude from your Screaming Frog spider crawls.

Throughout the rest of this post, I have to make some assumptions about the intended use-case of your sitemap.

Your mileage may vary, and the suggestions in this post won’t work for every site in every case. Please modify these lists for your own needs.

 

WP-Content Folder

  • https://example.com/wp-content/.*

This will exclude everything in your WordPress install’s /wp-content/ folder.

On the off chance that you actually want to allow some of those WordPress directories in your Screaming Frog spider crawl (your /uploads/ folder for PDF assets, for example), here are the individual folders.

Pick the ones you want to exclude:

  • https://example.com/wp-content/mu-plugins/.*
  • https://example.com/wp-content/plugins/.*
  • https://example.com/wp-content/themes/.*
  • https://example.com/wp-content/upgrade/.*
  • https://example.com/wp-content/uploads/.*
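If you would rather generate the per-folder patterns than edit them by hand, something like the sketch below could work. The `wp_content_excludes` function and its `keep` parameter are my own hypothetical names, not part of Screaming Frog. It also escapes the dots in the domain so they only match literal dots, which is slightly stricter than the patterns in the list above.

```python
import re

# The five folders listed above.
WP_CONTENT_FOLDERS = ["mu-plugins", "plugins", "themes", "upgrade", "uploads"]

def wp_content_excludes(domain: str, keep: set[str]) -> list[str]:
    """Build one exclude pattern per wp-content folder, skipping folders in `keep`."""
    host = re.escape(domain)  # turn "example.com" into "example\.com"
    return [
        f"https://{host}/wp-content/{folder}/.*"
        for folder in WP_CONTENT_FOLDERS
        if folder not in keep
    ]

# Keep /uploads/ crawlable (for PDF assets, say) and exclude the rest.
for pattern in wp_content_excludes("example.com", keep={"uploads"}):
    print(pattern)
```

Each printed line can be pasted into the exclude box as its own pattern.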

 

Other WordPress Directories

I can’t think of any reason why you’d want these directories included in your Screaming Frog WordPress website crawl.

  • https://example.com/wp-includes/.*
  • https://example.com/wp-admin/.*

Most of them are going to be unreachable for Screaming Frog’s spider tool, anyway — if you’re running it with the default “Respect noindex” configuration.

 

WordPress Default Files

Like the directories above, these are likely going to be skipped by Screaming Frog anyway. If your theme or install is making these files public, you’ve got bigger problems than just a sitemap.

But that’s another post.

In the meantime, you almost certainly want to exclude these WordPress file URLs from your Screaming Frog crawl.

  • https://example.com/index.php
  • https://example.com/license.txt
  • https://example.com/readme.html
  • https://example.com/wp-activate.php
  • https://example.com/wp-blog-header.php
  • https://example.com/wp-comments-post.php
  • https://example.com/wp-config.php
  • https://example.com/wp-config-sample.php
  • https://example.com/wp-cron.php
  • https://example.com/wp-links-opml.php
  • https://example.com/wp-load.php
  • https://example.com/wp-login.php
  • https://example.com/wp-mail.php
  • https://example.com/wp-settings.php
  • https://example.com/wp-signup.php
  • https://example.com/wp-trackback.php
  • https://example.com/xmlrpc.php

 

Post Taxonomies

This one is tough. It’s impossible for my list to be exhaustive of taxonomies, since WordPress admins can create their own.

  • https://example.com/author/.*
  • https://example.com/category/.*
  • https://example.com/tag/.*

 

Pagination

WordPress post archives often get paginated, leading to lots of URLs like this. Should you include these URLs in your Screaming Frog crawl, or exclude them?

  • https://example.com/page/2/.*
  • https://example.com/page/3/.* (and so on for each page number)

It’s a matter of opinion and use case. Personally, I don’t see how these URLs are helpful in a typical spider crawl. They’re not real URLs, per se.

And most SEO professionals agree these types of URL should not be indexed by Google and other search engines. So if they’re not indexed, and therefore can’t drive organic search traffic, do they matter for your SEO sitemap or crawl?

Maybe. Depends on why you’re making it. Again, your mileage may vary. Exclude them if you want to. Entirely optional.
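If you do exclude pagination, you don’t need one line per page number. The sketch below shows a single pattern that uses `\d+` for any page number, plus an optional prefix so it also covers paginated archives like /category/news/page/3/. This pattern is my own suggestion rather than something from Screaming Frog’s documentation, and it is checked here with Python’s `re` module on the assumption that a pattern must match the whole URL.

```python
import re

# Assumed pattern: "(.*/)?" allows an optional parent path before /page/N/.
PAGINATION = r"https://example\.com/(.*/)?page/\d+/.*"

test_urls = [
    "https://example.com/page/2/",
    "https://example.com/page/17/",
    "https://example.com/category/news/page/3/",
    "https://example.com/about-page/",
]

for url in test_urls:
    matched = re.fullmatch(PAGINATION, url) is not None
    print(url, "excluded" if matched else "kept")
```

The first three URLs come out as excluded, while a normal page whose slug happens to end in "page" is kept.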

 

Server Binaries, etc.

  • https://example.com/bin/.*
  • https://example.com/boot/.*
  • https://example.com/cdn-cgi/.*
  • https://example.com/cgi-bin/.*
  • https://example.com/dev/.*
  • https://example.com/etc/.*
  • https://example.com/home/.*
  • https://example.com/lib/.*
  • https://example.com/media/.*
  • https://example.com/mnt/.*
  • https://example.com/opt/.*
  • https://example.com/run/.*
  • https://example.com/sbin/.*
  • https://example.com/srv/.*
  • https://example.com/tmp/.*
  • https://example.com/usr/.*
  • https://example.com/var/.*

No idea what these are? StackExchange has a great explanation of each. But suffice it to say: you should probably exclude them from your WordPress Screaming Frog spider crawl.

 

International & Language Groupings

  • https://example.com/en/.*
  • https://example.com/es/.*
  • https://example.com/fr/.*

— and/or —

  • https://en.example.com/.*
  • https://es.example.com/.*
  • https://fr.example.com/.*

On the other hand, maybe you explicitly want these directories/subdomains. Obviously you’ll have to modify these lists for your needs.

 

Subdomains

If your WordPress website contains other info or installations on a subdomain, you may want to exclude those.

Common examples include blogs, forums, funnels, and shopping mini-sites.

  • https://blog.example.com/.*
  • https://forum.example.com/.*
  • https://info.example.com/.*
  • https://shop.example.com/.*
  • https://store.example.com/.*

In some cases, you may explicitly want these subdomains in your crawl. But especially if they’re non-indexed or canonicalized, you may want to exclude them from your Screaming Frog crawl.
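Since exclude patterns are regular expressions, several subdomain lines can be folded into one with alternation. Here is a rough sketch of that idea. The `subdomain_exclude` helper is a hypothetical name of mine, and the resulting pattern is only verified against Python’s `re` module, so test it in Screaming Frog before relying on it.

```python
import re

def subdomain_exclude(root_domain: str, subdomains: list[str]) -> str:
    """Fold several subdomain excludes into one alternation pattern."""
    names = "|".join(re.escape(name) for name in subdomains)
    return f"https://({names})\\.{re.escape(root_domain)}/.*"

pattern = subdomain_exclude("example.com", ["blog", "forum", "info", "shop", "store"])
print(pattern)  # https://(blog|forum|info|shop|store)\.example\.com/.*

# The subdomain is excluded, but a /shop/ folder on the main site is not.
print(bool(re.fullmatch(pattern, "https://shop.example.com/cart/")))  # True
print(bool(re.fullmatch(pattern, "https://example.com/shop/")))       # False
```

One line per subdomain is easier to read and to toggle, so the combined form is mostly useful when the list gets long.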

 

Third-party Tools

These tools often require subdomains due to their technical setup. I’m thinking of HubSpot, Clickfunnels, Unbounce, etc.

In case any of these tools apply to you, here’s a list of likely subdomains you may be using with these tools:

  • https://clickfunnels.example.com/.*
  • https://eloqua.example.com/.*
  • https://hubspot.example.com/.*
  • https://instapage.example.com/.*
  • https://kajabi.example.com/.*
  • https://leadpages.example.com/.*
  • https://marketo.example.com/.*
  • https://unbounce.example.com/.*

 

Full list of WordPress URL exclusions

If you’ve decided which of the above URLs or URL types you need to exclude, you can grab this full list and modify it for your needs.

You probably could paste this list into your Screaming Frog exclude options box, but it might have some unintended consequences. Be careful not to over-exclude!

And obviously you’ll have to replace example.com with your website’s domain.

Here’s the full list (also available on GitHub):

/**
* WordPress URL Exclude List for Screaming Frog Spider
* @author TJ Kelly – https://tjkelly.com
* @desc Full article — https://tjkelly.com/blog/screaming-frog-exclude-wordpress/
* @date 2021-07-08
*/
https://example.com/wp-content/.*
https://example.com/wp-content/mu-plugins/.*
https://example.com/wp-content/plugins/.*
https://example.com/wp-content/themes/.*
https://example.com/wp-content/upgrade/.*
https://example.com/wp-content/uploads/.*
https://example.com/wp-includes/.*
https://example.com/wp-admin/.*
https://example.com/index.php
https://example.com/license.txt
https://example.com/readme.html
https://example.com/wp-activate.php
https://example.com/wp-blog-header.php
https://example.com/wp-comments-post.php
https://example.com/wp-config.php
https://example.com/wp-config-sample.php
https://example.com/wp-cron.php
https://example.com/wp-links-opml.php
https://example.com/wp-load.php
https://example.com/wp-login.php
https://example.com/wp-mail.php
https://example.com/wp-settings.php
https://example.com/wp-signup.php
https://example.com/wp-trackback.php
https://example.com/xmlrpc.php
https://example.com/author/.*
https://example.com/category/.*
https://example.com/tag/.*
https://example.com/page/2/.*
https://example.com/page/3/.*
https://example.com/bin/.*
https://example.com/boot/.*
https://example.com/cdn-cgi/.*
https://example.com/cgi-bin/.*
https://example.com/dev/.*
https://example.com/etc/.*
https://example.com/home/.*
https://example.com/lib/.*
https://example.com/media/.*
https://example.com/mnt/.*
https://example.com/opt/.*
https://example.com/run/.*
https://example.com/sbin/.*
https://example.com/srv/.*
https://example.com/tmp/.*
https://example.com/usr/.*
https://example.com/var/.*
https://example.com/en/.*
https://example.com/es/.*
https://example.com/fr/.*
https://en.example.com/.*
https://es.example.com/.*
https://fr.example.com/.*
https://blog.example.com/.*
https://forum.example.com/.*
https://info.example.com/.*
https://shop.example.com/.*
https://store.example.com/.*
https://clickfunnels.example.com/.*
https://eloqua.example.com/.*
https://hubspot.example.com/.*
https://instapage.example.com/.*
https://kajabi.example.com/.*
https://leadpages.example.com/.*
https://marketo.example.com/.*
https://unbounce.example.com/.*
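Swapping example.com for your own domain is easy to script. The sketch below is one possible approach, not an official workflow: `prepare_excludes` is a hypothetical helper and `mysite.test` is a placeholder domain. It drops the comment header (which is not a URL pattern), replaces the domain, removes duplicates, and confirms that each line at least compiles as a regular expression in Python. Screaming Frog’s regex engine may not behave identically, so treat the check as a sanity test only.

```python
import re

def prepare_excludes(raw_list: str, domain: str) -> list[str]:
    """Swap in the real domain and drop comment, blank, and duplicate lines."""
    patterns: list[str] = []
    for line in raw_list.splitlines():
        line = line.strip()
        # Skip blank lines and the /** ... */ comment header.
        if not line or line.startswith(("/*", "*")):
            continue
        pattern = line.replace("example.com", domain)
        re.compile(pattern)  # raises re.error if the pattern is malformed
        if pattern not in patterns:
            patterns.append(pattern)
    return patterns

# A short excerpt of the list above, used as sample input.
sample = """/**
* WordPress URL Exclude List for Screaming Frog Spider
*/
https://example.com/wp-admin/.*
https://example.com/xmlrpc.php
https://en.example.com/.*
"""

for pattern in prepare_excludes(sample, "mysite.test"):
    print(pattern)
```

Paste the printed lines into the exclude box, then remove any you decided to keep in your crawl.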
