Many website publishers, especially those running recent sites, have had the bitter experience of Google no longer indexing, or even crawling, many of their pages. Is this just a temporary bug, or a deeper problem born of a desire to combat spam and low-quality content? Here are some answers…
We know that Google has been experiencing very serious problems indexing web pages for many months:
- More specifically on recent sites (but not only).
- On many European sites (e.g. .fr, .es, .it, .de), but not limited to those.
On many sites, either the pages are crawled but not indexed, or they are simply not crawled at all. In Search Console, the URLs are shown as “Excluded” in the “Coverage” report, with messages like “Discovered, currently not indexed” (the URL is waiting to be crawled) or “Crawled, currently not indexed” (the URL has been crawled and is awaiting indexing).
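These coverage states can also be checked programmatically via the Search Console URL Inspection API (`urlInspection.index.inspect` in the `searchconsole` v1 service). A minimal sketch follows; the property and page URLs are placeholders, and service-account authentication is assumed rather than shown.

```python
# Sketch: querying a URL's coverage state via the URL Inspection API.
# "sc-domain:example.com" and the page URL below are placeholder values.

def build_inspection_request(site_url: str, page_url: str) -> dict:
    """Build the request body for urlInspection.index.inspect."""
    return {
        "siteUrl": site_url,        # the Search Console property, e.g. "sc-domain:example.com"
        "inspectionUrl": page_url,  # the page whose coverage state we want
        "languageCode": "en-US",
    }

body = build_inspection_request("sc-domain:example.com",
                                "https://example.com/new-article")

# With google-api-python-client and suitable credentials, the call would be:
# from googleapiclient.discovery import build
# service = build("searchconsole", "v1", credentials=creds)
# result = service.urlInspection().index().inspect(body=body).execute()
# coverage = result["inspectionResult"]["indexStatusResult"]["coverageState"]
# e.g. "Crawled - currently not indexed" or "Discovered - currently not indexed"
print(body["inspectionUrl"])
```

This makes it possible to monitor, at scale, which pages are stuck at the “discovered” stage versus the “crawled” stage.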
The phenomenon is widespread enough to make clear that this is not an isolated bug affecting a single website: it is a real, strong trend in Google’s crawling and indexing at the moment.
Page quality as the first criterion
Many webmasters have asked me about this in recent weeks, providing the address of their site, whose pages were not being visited or taken into account by the engine’s robots. It must be said that among these sites, many were of very low quality:
- Content that is too short, from sites designed with a 100% SEO vision.
- Articles written solely to create a link to a page on another site (via a link-selling platform or similar).
For this type of page, it is normal that Google has implemented an algorithm to sort the wheat from the chaff; publishers of such content have to accept the consequences. But for other pages (and especially other sites) that are completely valid and of good quality, the problem is still present all the same.
Tools are trying to correct the situation, but…
A number of tools, most of them using Google’s Indexing API, have since appeared. Using them, even if the situation remains imperfect, improves things a little. And the fact that these forced-indexing tools work (at least somewhat) clearly shows Google’s total inconsistency here. Indeed:
- Either the engine considers that the content in question is of low quality, in which case it must refuse to index it whatever the submission method. If a page is refused for indexing via natural methods (robot crawl, XML Sitemap, etc.) but accepted via the API, that is simply nonsense!
- Or it indexes the content via the API, in which case the quality of the content is not in question, and this instead demonstrates the engine’s current inability to crawl the Web naturally and efficiently.
It could therefore be either a bug in the engine and its robots, or a flaw in its crawl system preventing it from crawling websites, particularly recent ones, cleanly and efficiently. A very serious point, you will agree, for an engine that wants to be a world leader in the field!
(Note, however, that certain URLs accepted via the API submission tools are sometimes subsequently deindexed by the engine.)
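For reference, the forced-indexing tools mentioned above typically call Google’s Indexing API endpoint (`urlNotifications:publish`), which Google officially documents only for job postings and livestream pages. A minimal sketch of the kind of payload involved, assuming a service account with the `indexing` OAuth scope (credentials and the page URL are placeholders):

```python
# Sketch: the notification payload sent to Google's Indexing API.
# The endpoint is real; the submitted URL below is a placeholder.
import json

ENDPOINT = "https://indexing.googleapis.com/v3/urlNotifications:publish"

def build_notification(url: str, update: bool = True) -> dict:
    """Build the JSON body announcing a new/updated (or deleted) URL."""
    return {"url": url, "type": "URL_UPDATED" if update else "URL_DELETED"}

payload = build_notification("https://example.com/new-article")
print(json.dumps(payload))

# An authenticated POST of this payload asks Google to (re)crawl the URL:
# requests.post(ENDPOINT, json=payload,
#               headers={"Authorization": f"Bearer {token}"})
```

The inconsistency described above is precisely that this push channel succeeds where the natural crawl of the same URL fails.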
We have been watching crawling capabilities degrade for several years now: first the multiple indexing bugs that punctuated recent months, and now this inability to crawl and index recent content. We can even say that, at present, Bing indexes the Web much better than its historical competitor. Who would have dared say that a few years ago? Bing is even more innovative in this area, notably with the IndexNow protocol, available for several months now.
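IndexNow is a simple push protocol: you host a key file at the root of your site and POST the URLs you publish to a shared endpoint, which forwards them to all participating engines (Bing among them). A minimal sketch, where the host, key, and URL list are placeholder values:

```python
# Sketch: building an IndexNow submission body.
# "example.com" and the key are hypothetical; the real key must be
# hosted at https://<host>/<key>.txt so the engine can verify ownership.
import json

def build_indexnow_payload(host: str, key: str, urls: list) -> dict:
    """Build the JSON body for a POST to https://api.indexnow.org/indexnow."""
    return {
        "host": host,
        "key": key,
        "keyLocation": f"https://{host}/{key}.txt",
        "urlList": urls,
    }

payload = build_indexnow_payload(
    "example.com",
    "0123456789abcdef",                      # hypothetical verification key
    ["https://example.com/new-article"],
)
print(json.dumps(payload, indent=2))

# The actual submission would then be:
# requests.post("https://api.indexnow.org/indexnow", json=payload,
#               headers={"Content-Type": "application/json; charset=utf-8"})
```

One submission notifies every engine that has adopted the protocol, which is exactly the kind of cooperative crawling mechanism Google has not embraced.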
So where do we stand today? After analyzing numerous sites struggling with indexing and conducting my own internal tests, here are my conclusions:
The current problems are so widespread and striking that Google cannot be unaware of them. There must therefore be a logical explanation.
Google may now be rolling out a filter system to index only good-quality content. But it is an understatement to say that this system is not yet ready, especially for recent content, which has not yet given the engine positive signals about the quality of the page and, above all, of the site that publishes it.
While one of the criteria for filtering content quality is, of course, based on analysis of the texts published online, it seems essential to quickly obtain backlinks from sites Google “trusts” (sites in which the engine has a certain confidence: established, never having spammed, with strong authority and legitimacy in their field, etc.). Each time we created a link from a trusted site to a web page that had previously struggled to get indexed, indexing was triggered, as if by magic, within the day. However, this had no impact on the indexing of the target site’s other pages: indexing one page does not trigger the indexing of the others.
Google is certainly trying to build firewalls against a potential invasion of spam content automatically written by GPT-3.5 or GPT-4 type algorithms. Even if, a priori, the engine today knows how to distinguish automated content from text written by humans, what will happen in a few months or years? It is therefore entirely possible that Google is implementing algorithms in this direction and is dealing first with web pages that have a history whose signals can be analyzed. Could Google’s current incredible situation mean that the next content to be processed will be recently published pages, still waiting, which the algorithm will then be able to handle correctly? We can imagine so, without being certain, of course.
In any case, we must hope the situation evolves quickly, because it clearly does not give a positive image of the Mountain View firm and its ability to keep its search engine on top of the Web’s current growth. Admittedly, this was not the case a few years ago. But the Web was different, and the level of spam the engine had to process was very different too (remember that Google discovers 40 billion spam pages every day, and the current evolution of SEO methods plays its part in that).
Is the engine overwhelmed by the exponential growth of the Web, of the pages and information available online, and therefore of the spam with which it is bombarded? Or is this ultimately just a temporary blip that Google’s technical teams will quickly correct? The near future will certainly tell us more. One thing is certain in any case: the current situation must absolutely improve if Google is to maintain its current hegemony…