August 30, 2023

Indexed Though Blocked By Robots.txt vs. Blocked By Robots.txt

Key Points

Robots.txt guides search engine bots on what to crawl or ignore, but not what to index.

"Blocked By Robots.txt" is usually intentional but can become an issue if not handled properly.

"Indexed Though Blocked By Robots.txt" usually indicates your website is linked to somewhere you’re not aware of.

Careful management and regular review of the robots.txt file can mitigate these challenges.

Understanding the interaction between search engine crawlers and your website is a vital part of an SEO service and one of the key tenets of technical SEO. Two of our greatest tools for controlling and examining how search engines view our sites are the robots.txt file and Google Search Console (GSC).

"Indexed though blocked by robots.txt" and simply "Blocked by robots.txt" are two status designations you’ll find within the “Pages” report in your Google Search Console.

The “Pages” report, formerly known as the “Coverage” report, in GSC

While "Blocked by robots.txt" denotes a normal and intended status, the other is a bit more of a mystery and can warrant investigation to ensure the parts of your site you don’t want in search stay that way. First, though, let’s get acquainted with the core concept of the robots.txt file.

What Is Robots.txt?

Robots.txt is a text file used by websites to communicate with search engine bots like Googlebot. Located in the root directory of a website, the robots.txt file provides instructions about which parts of the site should or should not be crawled by search engine crawlers.

What Does Robots.txt Do?

The primary function of robots.txt is to control the behavior of search engine crawlers, guiding them to areas of the site that the owner wants to have crawled while disallowing access to other areas. This is particularly useful for conserving crawl budget on big sites as well as letting search engines focus on what’s important.

The robots.txt syntax consists of "User-agent" to specify the web crawlers, followed by "Disallow" or "Allow" directives to indicate the URLs or paths that are to be blocked or permitted.
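
For example, a minimal robots.txt file might look like the sketch below. The paths and sitemap URL are purely illustrative; your own directives will depend on your site’s structure.

# Rules for all crawlers
User-agent: *
# Keep crawlers out of the cart and internal search results (illustrative paths)
Disallow: /cart/
Disallow: /search/
# Explicitly allow one path inside an otherwise blocked directory
Allow: /search/help/

# Optional: point crawlers to your sitemap
Sitemap: https://www.example.com/sitemap.xml

For Google’s crawlers, the more specific rule wins, so the Allow line above overrides the broader Disallow for that one path.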

How Does Robots.txt Affect Indexing?

The instructions in the robots.txt file can impact a site's page indexing by search engines like Google, Yahoo, and Bing, though the robots.txt file is not a tool for controlling indexing. If a page is disallowed in the robots.txt file, search engine bots are instructed not to crawl that page. However, since crawling and indexing are not the same and are controlled by different directives, this can lead to complex scenarios like pages being Indexed Though Blocked By Robots.txt.

What Does “Blocked By Robots.txt” Mean in GSC?

In Google Search Console, Blocked By Robots.txt refers to URLs that the robots.txt file has directed Google's crawlers not to crawl. A disallow directive will often keep those URLs out of Google search results on its own, but it is not intended to be enough, by itself, to prevent indexation.

The blocked by robots.txt report lets you know what Google sees and how it respects your search directives.

Is “Blocked By Robots.txt” an Error?

No. At least, it doesn’t have to be. It typically isn’t an error but rather an intended status.

Most commonly, the site owner intends to exclude pages from crawling and search precisely because they aren’t, or shouldn’t be, found by searchers.

This can be handy for dev or staging sites, cart pages, and other areas of a website that searchers just don’t need. It can even help prevent or partially fix malicious attacks via things like site search spam.

It becomes an error if this blocking is unintentional, preventing vital content from being found by search engines. It's not unheard of for entire sites to get blocked due to a mistake in the robots.txt file, particularly when a site moves from staging or development to live. A simple rule like the one below is all it takes to block your entire site.
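
# This blocks every compliant crawler from every URL on the site
User-agent: *
Disallow: /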

When To Fix “Blocked By Robots.txt”

It’s best to fix Blocked By Robots.txt when essential pages are blocked unintentionally, negatively impacting your site's crawlability or visibility in search, especially if it affects entire sections of your site, such as pagination or navigational elements of your content.

The “Blocked by robots.txt” section will appear under the reasons pages aren’t indexed in your GSC “Pages” report.

Using tools like a robots.txt tester or validator can help you determine which lines are causing the blocks and test to see how changes will impact your URL's ability to be crawled.
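
As a quick illustration (with hypothetical paths), a tester can show when a rule matches more than intended, since Disallow rules match by URL prefix:

# Too broad: this prefix also matches /blog/, /blog-news/, and every individual post
Disallow: /blog

# Narrower: only blocks the blog's internal tag pages
Disallow: /blog/tag/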

What Does the Warning Indexed Though Blocked By Robots.txt Mean in GSC?

In GSC, the warning Indexed Though Blocked By Robots.txt means that even though a URL is disallowed in your robots.txt file, Google has found and indexed it anyway. There are a couple of reasons you might see this, and “fixing” the issue is often nuanced and very site- or even CMS-specific.

The section for Indexed though blocked by robots.txt now appears at the bottom of the “Pages” report under “Improve page appearance”

Why Am I Seeing This Warning?

Crawling vs. Indexing: The robots.txt file can instruct search engines not to crawl certain pages, but this doesn't prevent them from being indexed. If a page is discoverable through links on other sites, a search engine may still include the URL in its index despite the disallow directive.

GSC is out of date: GSC is a fantastic tool, but it isn’t always the most up-to-date. If the page was crawled and indexed before the robots.txt file was updated to block it, the page might remain in the index or show up in this report. Checking the URL with the URL inspection tool can confirm if this is still true in (near) real-time.

Page not labeled Noindex: The robots.txt file should not be used to control indexing, only crawling. To truly exclude a page from the index, it has to be given a noindex tag and remain crawlable so Google and others can see and respect that directive (see the example after this list).

Not all bots respect robots.txt: Robots.txt is a web standard, but it's not enforced or enforceable. Google states that its crawlers respect robots.txt, but rumors and examples of bots and spiders that don’t have existed for years. Content that’s accessible and being served by other search engines or scrapers can lead to links back to content you don’t want crawled.
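
To make that noindex point concrete, this is what the robots meta tag looks like in a page’s <head>. As noted above, the page must not be disallowed in robots.txt, or Google will never crawl it and see the tag:

<meta name="robots" content="noindex">

A common approach is to leave the page crawlable until Google recrawls it and drops it from the index, and only then decide whether a robots.txt disallow is still needed.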

Is “Indexed Though Blocked By Robots.txt” an Error or Bad for SEO?

Yes, this situation usually is an error, as pages we don’t intend to be crawled are most often also not things we want in the Google index.

So, is this situation bad for SEO? In our experience at GR0, that answer is often nuanced, and while the honest answer is often “Yes,” the problem is typically not severe.

While most instances where this warning is seen are either out of date or edge cases, there are times when this can serve to alert a site owner or SEO to potential leaks in their system. 

It can illuminate places where content is still linked in the case of a site migration that wasn’t complete or even point to problems in the robots.txt or noindex directive rules you’ve set.

It's usually worth taking a look and exploring the content and value of the URLs that show up in this report to ensure the content either deserves to be blocked or needs the further step of having a noindex tag applied. 

Monitoring GSC reports like this one for “Indexed though blocked by robots.txt” is an important part of site maintenance.

How To Fix URLs That Are “Indexed Though Blocked By Robots.txt”

Fixing URLs that show up in the Indexed Though Blocked By Robots.txt report requires confirming they are up to date and determining the best state, indexed or not indexed, for the content in question.

How To Check URLs for Indexation

• Google Search Console URL Inspection Tool: The most useful tool for determining the (near) real-time status of how Google views a URL is the URL Inspection Tool.

Simply paste the URL you want to check into the search bar at the top of GSC, and voila!

• Site Search Operator: You can use a site search operator directly in Google or other search engines by typing site:yourdomain.com/your-url. If the URL appears in the search results, it's probably indexed. 

We say ‘probably’ because, while the ‘site:’ operator usually has it right, it isn’t intended for diagnostics, as Google’s John Mueller explains via Google Search Central:

“... a site query is not meant to be complete nor used for diagnostics purposes. A site query is a specific kind of search that limits the results to a certain website. It’s basically just the word site, a colon, and then the website’s domain.

While using this query limits the results to a specific website, it’s not meant to be a comprehensive collection of all of the pages from that website. If you’re keen on finding out how many pages Google has indexed from your website, then use Search Console instead.”

- John Mueller, Google Search Advocate

How To Send Google the Right Signals

SEO is all about sending the right signals. A website has many ways to communicate what information is important and worthwhile as well as what isn’t. 

Here's how to ensure you send the right signals to prevent such problems:

Proper Configuration of Robots.txt: Make sure your robots.txt file is correctly configured to include only the necessary blocking directives.

Regularly Review Robots.txt: Periodically review and update your robots.txt file to ensure that it aligns with your current site structure and SEO strategy. There’s no golden rule for when to check your robots.txt, but at GR0, we recommend at least twice per year, depending on the size of your site.

Provide Clear Directives: If you don’t want content on your site in the index, the strongest signal you can send is a noindex directive. One important note: any page you want Google to find a noindex tag on has to be available to be crawled. That means you can’t block it with robots.txt if you want Google to remove it from the index. (For non-HTML files, the header-based equivalent shown after this list works the same way.)

Monitor Indexation: Keeping an eye on your Pages report is key to understanding and staying ahead of how Google is crawling and indexing your site. Looking at warnings about broken links or 404 pages, as well as robots.txt and canonicalization, can point you to potential areas to fix.
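
For content that isn’t HTML, such as PDFs, the same noindex signal can be sent as an HTTP response header instead of a meta tag. How you set it depends on your server or CMS, but the header itself looks like this:

X-Robots-Tag: noindex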

By focusing on these areas, you can send clear, correct signals to Google and prioritize the content you want to get indexed. Routine site maintenance like this reinforces the effectiveness of your SEO efforts by ensuring that Google can correctly interpret and follow the rules you've established. If doing this yourself isn’t your thing, let the GR0 agency be your full-service SEO solution. 

The Bottom Line

The issue of Indexed Though Blocked By Robots.txt vs. Blocked By Robots.txt illustrates two sides of the same coin. The former indicates a usually unintentional situation where a URL is blocked from crawling but still appears in the search engine's index, while the latter simply refers to a URL being blocked from crawling, which is mostly intentional.

Both situations can arise from misconfigurations or intentional settings within the robots.txt file. Proper utilization of SEO tools, regular maintenance, and a clear vision of what you want in the index and what you don’t go a long way. 

For all of your SEO and digital marketing needs, schedule a consultation with GR0 and learn how we can take your success to the next level.

Sources:

About Search Console | Google

robots.txt | Wikipedia

How to write and submit a robots.txt file | Google Search Central

Is your site the victim of internal site search spam? | Yoast

robots.txt Validator and Testing Tool | TechnicalSEO.com

Robots.txt Introduction and Guide | Google Search Central 

How To Use the Site Search Operator | Google Search Central 

Why does a site:query not show all my pages? | YouTube
