If you are the owner of a massive e-commerce website, an inbound marketer, or an SEO manager, you probably know how frustrating getting actionable information from a large e-tailing website can be. Backend code can feel inaccessible, and no one wants to deal with the intermediary of designers & developers when money’s on the table or deadlines are looming. Screaming Frog bridges that knowledge gap and lets anyone check to see if a site is optimized for search engines on a technical level.
Crawling large websites in Screaming Frog can present challenges beyond the typical functionality beginners are used to. We have already covered the onsite SEO applications of Screaming Frog, the SEO information you can get, and why it matters here. This guide highlights some of the more advanced features in Screaming Frog that will let you easily crawl 5,000+ page sites without errors. Check out the video below:
Although the basics of using Screaming Frog are pretty straightforward, I’ve often found in my work that older sites, and those that generate dynamic URLs, create the following issues when trying to get a clean crawl:
Over-indexing from product review pages, comparison pages, referral pages, or contact-us pages
Under-indexing from network time-outs during the crawl
Crawling: Can’t Stop… Won’t Stop
Over-indexing in this case refers to the fact that Screaming Frog has already crawled thousands of pages on your site but is still nowhere near finishing. If you are a site owner and know you only have 6,000 products on your site, this should tip you off that something is wrong. Typically, the problem is that dynamically generated pages, like product review questionnaires, side-by-side comparisons, referral pages, or contact-us pages, are multiplying on top of each other. This is why your e-commerce site that only sells 6,000 products supposedly has 30,000 pages. There are two solutions to this issue. The first is the optimal one, while the second is fast but will reduce accuracy tremendously.
Screaming Frog: Exclude
The principle behind Exclude is that we block Screaming Frog from crawling any pages we know don’t matter from an SEO perspective and are adding bloat to our crawl. If you go to “Configuration” > “Exclude” you will see this option.
I would recommend starting and stopping a crawl repeatedly until you have identified any pages in your site architecture that are causing problems. You don’t need to be deeply familiar with regex to make this fix: the Exclude field accepts regular expressions, so appending .* after the offending directory path will get rid of the issue.
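To make the idea concrete, here is a minimal sketch of how regex exclude patterns like the ones Screaming Frog accepts would filter a crawl list. The domain, paths, and patterns below are hypothetical examples, not values from any real site:

```python
import re

# Hypothetical exclude patterns, in the regex style Screaming Frog's
# Exclude field accepts: block everything under /reviews/ and /compare/,
# plus any URL carrying a referral query parameter.
EXCLUDE_PATTERNS = [
    r"https://www\.example\.com/reviews/.*",
    r"https://www\.example\.com/compare/.*",
    r".*\?ref=.*",
]

def should_crawl(url, patterns=EXCLUDE_PATTERNS):
    """Return False if the URL matches any exclude pattern."""
    return not any(re.fullmatch(p, url) for p in patterns)

urls = [
    "https://www.example.com/products/widget-42",
    "https://www.example.com/reviews/widget-42",
    "https://www.example.com/products/widget-42?ref=homepage",
]
# Only the plain product page survives the filter.
print([u for u in urls if should_crawl(u)])
```

The point of the trailing .* is that a single pattern catches every dynamically generated page under a directory, which is why one or two well-chosen excludes can cut tens of thousands of junk URLs from a crawl.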
Limiting URL Crawl Length
This is the fastest solution, but it can also be the messiest. What you are doing here is telling Screaming Frog to skip any URL that is too long. If you go to “Configuration” > “Spider” > “Limits” > “Limit Max URL Length to Crawl” and set the value to something high, anything past 115 characters for example, you can quickly eliminate long, dynamically generated URLs.
This is great if you know that most of your page URLs only have around 60 characters in them (for example, there are four characters in “http”), but if your URLs vary drastically in length, this may unintentionally get rid of useful pages. So I would recommend the first strategy.
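The cutoff is just a character count on the full URL string, which is why it is blunt: a short sketch with an illustrative threshold and made-up URLs shows how a long query string pushes a page over the limit while clean product URLs survive:

```python
# Mirrors the "Limit Max URL Length to Crawl" setting (value is illustrative).
MAX_URL_LENGTH = 115

urls = [
    "https://www.example.com/products/widget-42",
    # A dynamically generated URL: session, comparison, and referral
    # parameters pile up until the URL is well past the limit.
    "https://www.example.com/products/widget-42/review?step=2&session=abc123"
    "&compare=widget-43&referrer=homepage&sort=price&page=7",
]

kept = [u for u in urls if len(u) <= MAX_URL_LENGTH]
```

The bluntness is also the risk: a legitimate deep category page with a long but meaningful URL gets dropped by exactly the same test, which is why the Exclude approach is usually the safer first move.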
Why Isn’t Google Showing All My Products?
Wanting to rank number 1 may be why you are asking yourself this question, but it’s important to note that if you have an older site, or an older sitemap on your website, you may not be getting all the attention from Google that you deserve. In practical terms, this means you’re limiting your audience, traffic, and ultimately, revenue.
Typically, one of the first things I do when digging into a site is use the site: search operator to see how big Google thinks a site is. (You can read more about search operators here.) While this approach isn’t totally accurate, I have encountered situations where I increased the number of pages Google was crawling for a site, and the pages indexed specifically, by updating the sitemap.xml in Webmaster Tools. While no one knows the magic behind Google, my theory is that the Google bots crawl quite quickly, and if your site is slow to respond, on the order of 10 seconds, pages may be dropped where you wouldn’t want them to be. So my solution for this situation is to use the “Configuration” > “Spider” > “Advanced” > “Response Timeout (secs)” option to increase how long I’m willing to wait for a page to load.
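Under the hood, a response timeout is just how long a crawler waits on a request before giving up on the page. A standalone sketch of that behavior, with a hypothetical fetch helper playing the role of one crawl request, looks like this:

```python
import socket
import urllib.error
import urllib.request

def fetch(url, timeout_secs=20):
    """Fetch a page, waiting up to timeout_secs for a response.

    timeout_secs plays the role of Screaming Frog's
    "Response Timeout (secs)" setting: raising it gives slow
    pages a chance to answer instead of being dropped.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout_secs) as resp:
            return resp.read()
    except (socket.timeout, urllib.error.URLError):
        # A crawler would drop this page from the crawl entirely,
        # which is how slow servers end up with "missing" pages.
        return None
```

A server that regularly takes 10+ seconds to respond will look fine under a generous timeout and lose pages under a tight one, which is exactly the pattern described above.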
Ultimately, this lets me surface all the pages on my site that matter. While slow responses are indicative of other issues, either technical or in the server architecture, using Screaming Frog here is a great quick fix.
The methods shown in the video, and briefly outlined here, are pretty straightforward and should keep you from spending too much time waiting on a crawl that will never finish. Let us know how your re-crawls go and if you have any questions.