What Is a Robots.txt File? A Plain-English Guide

A few lines of plain text decide which of your pages Google crawls — and one stray rule can hide your entire site from search without a single warning. Here's what robots.txt really does, and the mistakes that quietly cost rankings.

Somewhere on your site right now, a search engine bot is making a decision you'll never see. It's deciding which of your pages to crawl and which to skip. Most of the time it gets it right. But sometimes a single misplaced line in a plain text file tells Google to leave your most important pages alone, and you won't find out until your traffic flatlines and you're staring at a "blocked by robots.txt" warning wondering when that happened.

That file is robots.txt, and it's one of the most powerful, most misunderstood, and most frequently broken files on the entire web. A few lines of text, sitting at your root domain, capable of either guiding crawlers gracefully or accidentally hiding your whole site from search. This guide explains exactly what it is, how it works, and the mistakes that quietly cost real sites real rankings.

What is a robots.txt file?

A robots.txt file is a plain text file placed in the root directory of your website — at https://yourdomain.com/robots.txt — that tells search engine crawlers and other automated bots which parts of your site they may and may not access. It's the first thing most well-behaved bots check before they start crawling.

It follows the Robots Exclusion Protocol, a convention that's been around since 1994 and was finally formalized as an official standard (RFC 9309) in 2022. Despite that long history, it's still widely misunderstood, partly because it does less than people assume and partly because it does it in ways that aren't obvious.

The key thing to understand up front: robots.txt is a set of directions, not a lock. Reputable crawlers like Googlebot and Bingbot respect it. Malicious bots, scrapers, and spam crawlers can and do ignore it completely. It's a "please don't" sign, not a security barrier.

What robots.txt actually controls (and what it doesn't)

This is where most of the confusion lives, so it's worth being precise.

What it does

  • Manages crawl access. It tells crawlers which URLs and directories they're allowed to request, helping you keep bots out of admin areas, internal search results, faceted-navigation traps, and other low-value sections.
  • Helps conserve crawl budget. On large sites, steering crawlers away from thousands of useless URLs means they spend their limited crawl resources on pages that matter.
  • Points to your sitemap. A Sitemap: line tells crawlers where to find your XML sitemap, making your important URLs easier to discover.

What it does NOT do

Here's the part that trips up even experienced site owners: robots.txt does not reliably keep pages out of Google's search results.

If you block a page in robots.txt, you're telling Google not to crawl it. But if other sites link to that page, Google can still index the URL — it just can't see the content. The result is an ugly search listing with the URL and no description, often with the note "No information is available for this page."

To actually keep a page out of search results, you use a noindex meta robots tag or X-Robots-Tag HTTP header — and crucially, the page must not be blocked in robots.txt, because Google has to be able to crawl the page to see the noindex instruction. Blocking it in robots.txt actively prevents the noindex from working. This single misunderstanding is behind a huge share of indexing problems. It's also closely tied to how your crawlable pages present themselves in results — something covered in more depth in our guide to finding pages with missing meta descriptions.
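To make the noindex-versus-robots.txt conflict concrete, here's a minimal Python sketch. The helper names (has_noindex, noindex_can_work) are made up for illustration, and the parsing is deliberately simplified: it ignores bot-specific prefixes in X-Robots-Tag values such as "googlebot: noindex".

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots"> or <meta name="googlebot"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() in ("robots", "googlebot"):
            for part in attr.get("content", "").split(","):
                self.directives.add(part.strip().lower())


def has_noindex(html, headers=None):
    """True if the page asks not to be indexed via meta tag or X-Robots-Tag header."""
    parser = RobotsMetaParser()
    parser.feed(html)
    directives = set(parser.directives)
    for name, value in (headers or {}).items():
        if name.lower() == "x-robots-tag":
            directives.update(part.strip().lower() for part in value.split(","))
    # "none" is shorthand for "noindex, nofollow"
    return "noindex" in directives or "none" in directives


def noindex_can_work(html, headers, blocked_by_robots_txt):
    """A noindex only takes effect if the crawler is allowed to fetch the page."""
    return has_noindex(html, headers) and not blocked_by_robots_txt
```

If noindex_can_work returns False for a page you want removed from search, the usual culprit is a Disallow rule covering that URL.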

How a robots.txt file is structured

The syntax is simple. A file is made up of one or more groups, each starting with a User-agent line followed by rules. Here's a basic example:

  • User-agent: * — applies the following rules to all crawlers.
  • Disallow: /admin/ — blocks crawlers from the /admin/ directory.
  • Allow: /admin/public-info.html — carves out an exception within a disallowed directory.
  • Sitemap: https://yourdomain.com/sitemap.xml — points to your sitemap.
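To see how those four lines hang together as a file, here's a rough Python sketch that reads the example into groups and sitemap URLs. The parse_robots function is a hypothetical helper written for this guide, not a library API, and it skips details a production parser would handle (byte-order marks, size limits, unusual line endings).

```python
EXAMPLE = """\
User-agent: *
Disallow: /admin/
Allow: /admin/public-info.html

Sitemap: https://yourdomain.com/sitemap.xml
"""


def parse_robots(text):
    """Return (groups, sitemaps) from the text of a robots.txt file."""
    groups, sitemaps = [], []
    current = None
    reading_agents = False  # True while consecutive User-agent lines are being read
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()  # field names are case-insensitive
        if field == "user-agent":
            if not reading_agents:  # a new group starts here
                current = {"agents": [], "rules": []}
                groups.append(current)
                reading_agents = True
            current["agents"].append(value)
        elif field in ("allow", "disallow"):
            reading_agents = False
            if current is not None:  # rules before any User-agent line are ignored
                current["rules"].append((field, value))
        elif field == "sitemap":
            sitemaps.append(value)  # Sitemap lines apply to the whole file
    return groups, sitemaps


groups, sitemaps = parse_robots(EXAMPLE)
```

Here groups holds one group for "*" with two rules, and sitemaps holds the single sitemap URL.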

The core directives

  • User-agent names the crawler a rule group applies to. * means all bots; you can also target specific ones like Googlebot or Bingbot.
  • Disallow tells the crawler not to access a path. An empty Disallow: means "allow everything." Disallow: / means "block the entire site."
  • Allow explicitly permits a path, useful for overriding a broader Disallow. When both match a URL, the more specific (longer) rule wins. It's supported by Google and Bing and is part of RFC 9309.
  • Sitemap declares the absolute URL of your XML sitemap. You can list more than one.
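How Allow overrides a broader Disallow comes down to rule precedence. Under RFC 9309, the matching rule with the longest path wins, and a tie goes to Allow. Here's a small Python sketch of that logic. The is_allowed function is an illustrative name, and the sketch skips percent-encoding normalization, so treat it as an approximation rather than a reference implementation.

```python
import re


def _pattern_to_regex(pattern):
    """'*' matches any run of characters; a trailing '$' anchors the end of the URL."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_allowed(rules, path):
    """rules: (directive, pattern) pairs from the one group that applies to the bot."""
    best_len, allowed = -1, True  # no matching rule means the path is allowed
    for directive, pattern in rules:
        if not pattern:  # an empty Disallow: or Allow: matches nothing
            continue
        if _pattern_to_regex(pattern).match(path):
            is_allow = directive.lower() == "allow"
            longer = len(pattern) > best_len
            tie_goes_to_allow = len(pattern) == best_len and is_allow
            if longer or tie_goes_to_allow:
                best_len, allowed = len(pattern), is_allow
    return allowed


rules = [("Disallow", "/admin/"), ("Allow", "/admin/public-info.html")]
```

With these rules, /admin/public-info.html is allowed because the Allow path is longer than /admin/, while everything else under /admin/ stays blocked.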

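One directive worth a note: Crawl-delay was once used to slow crawlers down, but Google ignores it entirely. The crawl rate setting that used to live in Google Search Console was retired in early 2024, so Googlebot now sets its own pace based on how your server responds. If you need to slow it down for a short period, Google's documented approach is to return 500, 503, or 429 status codes temporarily. Treating Crawl-delay as a Google control is one of those outdated myths worth dropping.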

The mistakes that quietly break sites

Because the file is so small, people assume it's hard to get wrong. In practice it's one of the most common sources of silent SEO damage. The worst offenders:

Accidentally blocking the whole site

The single most catastrophic robots.txt mistake is shipping this from a staging environment to production:

  • User-agent: *
  • Disallow: /

Those two lines tell every crawler to stay out of your entire site. Developers routinely block staging sites this way and then forget to remove the rule at launch. The site goes live, looks perfect to humans, and is completely invisible to search engines. It can sit like that for weeks before anyone notices the traffic that never arrived. This is precisely the class of error a pre-launch website audit is designed to catch before you go live.
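A check like this is easy to automate. Here's a minimal sketch using Python's standard-library urllib.robotparser. The function name blocks_entire_site is invented for this example, and how you wire it into your deploy pipeline is up to you.

```python
from urllib.robotparser import RobotFileParser


def blocks_entire_site(robots_txt, agent="*"):
    """True if the given robots.txt text stops `agent` from fetching the homepage."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, "/")


STAGING = "User-agent: *\nDisallow: /\n"
PRODUCTION = "User-agent: *\nDisallow: /admin/\n"
```

Running blocks_entire_site against the file you're about to ship, and failing the build when it returns True, turns a weeks-long silent outage into a deploy that simply doesn't go out.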

Blocking CSS and JavaScript

Years ago it was common to disallow resource directories to "tidy up" crawling. Don't. Google renders pages much like a browser does, and if it can't fetch your CSS and JS, it can't see your layout, your mobile responsiveness, or content loaded by scripts. Blocking these resources can hurt how Google understands and ranks your pages.

Using robots.txt to hide sensitive content

Your robots.txt file is public. Anyone can read it at /robots.txt. Listing Disallow: /secret-admin-panel/ doesn't hide that directory — it advertises it to anyone curious enough to look. Sensitive areas need real authentication, not a robots directive.

Trusting it to deindex pages

As covered above, blocking a page doesn't remove it from search and actively prevents noindex from working. If a page is already indexed and you want it gone, leave it crawlable, add noindex, and only consider blocking it later, once it has dropped out.

Case-sensitivity and path errors

Paths in robots.txt are case-sensitive. Disallow: /Folder/ does not block /folder/. A trailing slash, a typo, or a wildcard in the wrong place can either block too much or nothing at all, and the file gives you no error message when it does.
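A quick way to convince yourself is to test a rule against a few URLs. This sketch uses Python's standard-library parser, which does plain case-sensitive prefix matching (it doesn't understand wildcards). The is_blocked helper is a hypothetical name used only for this illustration.

```python
from urllib.robotparser import RobotFileParser


def is_blocked(disallow_path, url_path):
    """Check one URL path against a single Disallow rule for all user agents."""
    parser = RobotFileParser()
    parser.parse(["User-agent: *", f"Disallow: {disallow_path}"])
    return not parser.can_fetch("*", url_path)


checks = {
    "same case": is_blocked("/Folder/", "/Folder/page.html"),
    "different case": is_blocked("/Folder/", "/folder/page.html"),
    "no trailing slash, sibling path": is_blocked("/blog", "/blog-archive/"),
    "trailing slash, bare path": is_blocked("/blog/", "/blog"),
}
```

The second check comes back unblocked because of the case mismatch, and the third shows how dropping the trailing slash from /blog/ also catches /blog-archive/.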

How to check your robots.txt is working

You can't fix what you can't see, and robots.txt problems are invisible by design — the file does its damage silently. A few ways to keep it honest:

  • Read it yourself. Visit https://yourdomain.com/robots.txt in a browser. If you see Disallow: / under User-agent: *, your whole site is blocked. Act immediately.
  • Use Google Search Console. The robots.txt report shows the file Google last fetched, flags parse errors, and lets you confirm Google can read it.
  • Test specific URLs. Search Console's URL Inspection tool tells you whether a given page is blocked from crawling, so you can verify your important pages are reachable.
  • Re-check after every deploy. Migrations, redesigns, and CMS updates are exactly when a stray staging rule sneaks into production. Make a robots.txt check part of your launch routine.
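The re-check step can be scripted. Here's a rough Python sketch that takes the text of a robots.txt file and a list of pages you care about, and returns the ones a given bot can't crawl. The find_blocked name and the URL list are placeholders. One caveat: Python's built-in parser applies rules in file order and doesn't support wildcards, so results can differ from Google's longest-match behavior when Allow and Disallow rules overlap.

```python
from urllib.robotparser import RobotFileParser


def find_blocked(robots_txt, urls, agent="Googlebot"):
    """Return the URLs from `urls` that `agent` is not allowed to fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(agent, url)]


ROBOTS = "User-agent: *\nDisallow: /checkout/\n"
IMPORTANT_URLS = [  # placeholder list; use your own key pages
    "https://yourdomain.com/",
    "https://yourdomain.com/pricing",
    "https://yourdomain.com/checkout/cart",
]

blocked = find_blocked(ROBOTS, IMPORTANT_URLS)
```

To run it against a live site instead of a string, RobotFileParser also offers set_url() and read(), which fetch the file over the network before you call can_fetch().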

Where Steterly fits in

A misconfigured robots.txt is the kind of problem that doesn't trip an alarm. The site loads, the pages look fine, and nothing tells you that search engines have been turned away until rankings quietly slide. That's the same category of invisible decay as broken links, dead images, leftover placeholder text, and missing meta tags: flaws that are obvious to a crawler or a customer long before they're obvious to you.

Steterly is a whole-site quality scanner built to surface exactly that kind of hidden rot. It crawls your site the way a search engine would and reports back on the things that quietly cost you traffic and trust — broken links, missing or broken images, typos, outdated copyright years, placeholder text, missing meta titles and descriptions, and Core Web Vitals issues — so you find them before a visitor or Googlebot does.

You can start with a free scan of up to 50 pages, no credit card required. Create a free account and run a scan to get a clear, prioritized report of what needs attention. Pair that with a manual robots.txt check in Search Console and you've covered both the file that controls crawling and the content the crawlers actually find.

Frequently asked questions

Where is the robots.txt file located?

It must sit in the root directory of your domain, accessible at https://yourdomain.com/robots.txt. Crawlers only look in that exact location, so placing it in a subdirectory means it will be ignored entirely. Each subdomain needs its own robots.txt file as well.
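Put another way, the robots.txt that governs a page is determined by the page's scheme and host, never by its path. Here's a tiny Python sketch of that rule. The robots_txt_url function is a name made up for this example.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the robots.txt URL that applies to the given page URL."""
    parts = urlsplit(page_url)
    # Keep scheme and host (including any port); replace path, query, and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


main_site = robots_txt_url("https://yourdomain.com/blog/some-post?ref=home")
shop_site = robots_txt_url("https://shop.yourdomain.com/cart")
```

The two results point at different files, which is why rules on the main domain do nothing for a shop or blog running on its own subdomain.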

Does robots.txt stop a page from showing up in Google?

No. Blocking a page in robots.txt only stops Google from crawling its content. The URL can still be indexed if other pages link to it, often appearing with no description. To reliably keep a page out of search results, leave it crawlable and use a noindex meta robots tag or X-Robots-Tag header instead.

Do I need a robots.txt file at all?

Not strictly. If you want search engines to crawl everything, you can omit it and crawlers will assume full access. But having one is good practice: it lets you point to your sitemap, steer crawlers away from low-value sections, and conserve crawl budget on larger sites.

Can robots.txt protect private or sensitive content?

No. The file is publicly readable by anyone, so listing a directory in it actually reveals that the directory exists. Reputable bots obey it, but malicious scrapers ignore it completely. Genuinely sensitive content needs password protection or server-side authentication, not a robots.txt rule.

What happens if my robots.txt accidentally blocks my whole site?

If your file contains Disallow: / under User-agent: *, every compliant crawler is told to stay out, and your pages can drop out of search results within days to weeks. This usually happens when a staging-site rule ships to production. Remove the rule immediately and request re-crawling in Google Search Console.

Does Google still support the crawl-delay directive?

No. Google ignores crawl-delay entirely, though some other crawlers like Bing still honor it. The crawl rate setting that used to live in Google Search Console was retired in early 2024. Googlebot now adjusts its pace based on how your server responds, and Google's documented way to slow it down for a short period is to return 500, 503, or 429 status codes.