Many webmasters are affected by a weird issue: Google is indexing (or at least crawling) non-existing URLs. The issue doesn't depend on whether one uses WordPress or another CMS. The question of why Google crawls and/or indexes non-existing URLs comes up in all webmaster forums, Google Groups and so on, but without a clear solution.
The fact that Googlebot creates and crawls a bunch of non-existing URLs raises some questions:
- Where do non-existing URLs come from?
- Why is it not optimal if non-existing URLs are crawled or indexed?
- How can the risks related to non-existing URLs be minimized?
The crawler bot reads the web documents of a site one by one, using the sitemap (presuming there are no sitemap errors that could make the crawler stop and leave). Once the crawler is done with the existing URLs, it begins a kind of brute-force attack to find spare parts for building URLs. Googlebot in particular looks through URLs and source code for everything
- that is part of any URL slug,
- that it thinks could be part of a URL slug,
- that could be used in a URL slug, like IDs, parameters, labels, values, anchors, variables, relative paths, folder names and so on.
Here Google explains the crawler bot’s behavior, like:
In short, the crawler takes everything that could be used for URL building, builds URLs and tries to fetch every byte from them.
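To make the URL-building idea more concrete, here is a minimal sketch of how string fragments found on a page can turn into new URLs once they are resolved against the page address. It only illustrates the description above and is not Googlebot's actual algorithm; the function name candidate_urls and the sample fragments are made up for this example.

```python
from typing import Iterable, List
from urllib.parse import urldefrag, urljoin


def candidate_urls(page_url: str, fragments: Iterable[str]) -> List[str]:
    """Resolve each fragment against page_url and return the new URLs.

    Hypothetical helper: it mimics a crawler that treats every string
    it finds as a possible relative link.
    """
    seen = set()
    candidates = []
    for fragment in fragments:
        # Resolve the fragment like a relative link, then drop any "#anchor".
        url, _anchor = urldefrag(urljoin(page_url, fragment))
        # Skip the page itself and anything already collected.
        if url != page_url and url not in seen:
            seen.add(url)
            candidates.append(url)
    return candidates


if __name__ == "__main__":
    found = ["index.php/page/", "../tag/seo", "?id=42", "#comments"]
    for url in candidate_urls("https://example.com/blog/post/", found):
        print(url)
```

None of these URLs has to exist on the site. Whether they become a problem depends only on how the server answers them.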
This behavior of Google wouldn't be a problem at all if a properly configured server answered every request for a non-existing URL with the error code 404. Google would get a bunch of 404 errors and that's all: such errors have no negative impact on the site where Google got them.
But in most cases servers are misconfigured and answer HTTP requests for non-existing pages with the code 200, as if these pages existed.
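A quick way to see how your own server behaves is to request an address that certainly doesn't exist and look at the status code. The following is a minimal sketch using only the Python standard library; the helper names, the "probe-" path prefix and the User-Agent string are assumptions made for this example.

```python
import urllib.error
import urllib.request
import uuid
from urllib.parse import urljoin


def random_probe_url(base_url: str) -> str:
    """Return a URL on the same host that almost certainly doesn't exist."""
    return urljoin(base_url, f"/probe-{uuid.uuid4().hex}/")


def fetch_status(url: str, timeout: float = 10.0) -> int:
    """Return the final HTTP status code of a GET request (redirects are followed)."""
    request = urllib.request.Request(url, headers={"User-Agent": "soft-404-check"})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urllib raises for 4xx/5xx answers; the code is still the server's reply.
        return error.code


def answers_200_for_missing_pages(base_url: str) -> bool:
    """True if the server answers a non-existing URL with code 200."""
    return fetch_status(random_probe_url(base_url)) == 200


if __name__ == "__main__":
    # Replace example.com with the site you want to check.
    print(answers_200_for_missing_pages("https://example.com/"))
```

If the function returns True, the server treats made-up addresses like real pages, which is exactly the misconfiguration described above. A redirect to the home page that ends in code 200 is reported the same way.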
Why it isn't optimal if Google gets code 200 from non-existing URLs
In general there are two problems:
- wasting crawl budget: the crawler has a limited crawl budget per website. If it crawls non-existing pages, the crawl budget may run out before important new pages are crawled.
- indexing of non-existing URLs: suppose there is an existing page example.com/page/ and a page example.com/index.php/page/ that doesn't exist but answers with code 200 and shows the content of the first page. Then both pages may appear in the index, or even only the second one.
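The second problem can be spotted by comparing what different URLs deliver. Here is a minimal sketch, assuming the page bodies have already been downloaded into a dictionary; the function name and the sample HTML are made up for this example, and real pages that differ only in small details (a timestamp, a token) would need a fuzzier comparison than an exact hash.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List


def urls_with_same_content(pages: Dict[str, str]) -> List[List[str]]:
    """Group URLs whose bodies are byte-for-byte identical.

    pages maps a URL to its downloaded body. Only groups with
    more than one URL are returned.
    """
    groups = defaultdict(list)
    for url, body in pages.items():
        digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
        groups[digest].append(url)
    return [sorted(urls) for urls in groups.values() if len(urls) > 1]


if __name__ == "__main__":
    pages = {
        "https://example.com/page/": "<h1>Page</h1>",
        "https://example.com/index.php/page/": "<h1>Page</h1>",
        "https://example.com/about/": "<h1>About</h1>",
    }
    for group in urls_with_same_content(pages):
        print(group)
```

Every group in the result is a set of addresses that could compete with each other in the index.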