
Robots.txt Validator & Tester

Validate robots.txt syntax against the Robots Exclusion Protocol (RFC 9309) and test exactly which URLs are ALLOWED or BLOCKED for any user-agent. Detects unknown directives, bad paths, rules before User-agent, Crawl-delay issues, and conflicting Allow/Disallow. Uses Google's longest-match-wins resolution. 100% in-browser, no signup.

robots.txt Source

Test a URL


Batch Test

Paste multiple URL paths (one per line) and test them all at once for the selected user-agent.

How to Use This Tool

  1. Fetch the robots.txt you want to validate. Open https://example.com/robots.txt in a browser, select all, and copy. Or, if you manage the site, grab the file from the server root. The tool accepts any file up to 512 KB, which is more than generous for a real robots.txt (typically under 10 KB).
  2. Paste the content or upload the file. Drop the text into the large textarea on the left, or click Upload robots.txt to pick the file from disk. You can also click Load Sample to see a realistic example with both valid rules and intentional mistakes, so you can see exactly how the validator surfaces common problems.
  3. Click Validate (or press Ctrl/Cmd+Enter in the textarea). The parser walks every line, groups consecutive User-agent declarations into blocks, attaches each Allow/Disallow/Crawl-delay rule to the correct block, and runs syntax checks for: unknown directives, missing colons, paths without a leading /, non-numeric Crawl-delay values, rules orphaned before any User-agent line, multiple User-agent: * blocks, misplaced $ anchors, and empty Disallow (which silently allows everything).
  4. Enter a URL path to test. In the right panel, type only the path (not the full URL) of the page you want to check. For example, to test https://example.com/admin/login.html, enter /admin/login.html. A leading slash is inserted automatically if it is missing.
  5. Pick a user-agent and click Test. Choose from the dropdown of 16 common crawlers (Googlebot, Bingbot, GPTBot, ClaudeBot, Applebot, etc.) or select Custom to type any UA string. The tool picks the single most specific User-agent block in your robots.txt (Google spec — blocks do NOT merge), runs every matching Allow/Disallow through longest-match-wins resolution, and shows ALLOWED (green) or BLOCKED (red) along with the exact rule line and the full list of effective rules.
  6. Test many URLs at once with Batch Test. Paste a list of URL paths (one per line) into the Batch Test textarea, click Test All, and get a table with status per URL. Perfect for auditing a full sitemap against robots.txt before deploy, or for post-migration QA.
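
The syntax checks in step 3 can be pictured with a minimal Python sketch. It is not the tool's actual parser: the function name lint_robots, the KNOWN set, and the message strings are assumptions made for illustration, and only a subset of the listed checks is covered.

```python
import re

# Directive names this sketch treats as known (an illustrative list).
KNOWN = {"user-agent", "allow", "disallow", "crawl-delay", "sitemap", "host"}


def lint_robots(text):
    """Return (line_number, message) pairs for a subset of the checks above."""
    issues = []
    seen_user_agent = False
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if ":" not in line:
            issues.append((number, "missing colon"))
            continue
        name, value = (part.strip() for part in line.split(":", 1))
        key = name.lower()  # directive names are case-insensitive
        if key not in KNOWN:
            issues.append((number, f"unknown directive '{name}'"))
        elif key == "user-agent":
            seen_user_agent = True
        elif key in ("allow", "disallow", "crawl-delay"):
            if not seen_user_agent:
                issues.append((number, "rule before any User-agent line"))
            if key == "crawl-delay":
                if not re.fullmatch(r"\d+(\.\d+)?", value):
                    issues.append((number, "Crawl-delay must be numeric"))
            elif key == "disallow" and value == "":
                issues.append((number, "empty Disallow allows everything"))
            elif value and not value.startswith(("/", "*")):
                issues.append((number, "path should start with /"))
    return issues
```

Calling lint_robots on the pasted file returns one entry per problem line, which is the shape a line-numbered report needs.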

About Robots.txt & the Robots Exclusion Protocol

The Robots Exclusion Protocol (REP) is the de facto standard for telling crawlers which URLs on a site they may fetch and which they must not. It was drafted by Martijn Koster in 1994, implemented almost universally by 1996, and published by the IETF as RFC 9309 in September 2022, co-authored by Koster and engineers at Google. Every mainstream crawler (Googlebot, Bingbot, Yandex, Baidu, DuckDuckGo, Applebot, and the new generation of AI bots like GPTBot, ClaudeBot, CCBot, and PerplexityBot) is designed to read /robots.txt before requesting other resources, and almost all honour what they find there.

The file is plain text, UTF-8 encoded, served at the root path of each host: https://example.com/robots.txt. It contains zero or more records (RFC 9309 calls them groups), conventionally separated by blank lines. Each record is one or more consecutive User-agent: lines followed by Allow:, Disallow:, and optionally Crawl-delay: directives that apply to those user-agents. Two directives are global and apply to the whole file regardless of record: Sitemap: (which points crawlers at your XML sitemap) and the Yandex-specific Host:. Comments begin with # and run to end of line. Directive names and user-agent names are matched case-insensitively; path values are case-sensitive.
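
As a rough illustration of that file structure, the sketch below splits a robots.txt into records and collects the global Sitemap: lines. The function name parse_groups and the dictionary layout are assumptions, not the tool's internals. User-agent tokens are lowercased because they are matched case-insensitively, while path values are left untouched.

```python
def parse_groups(text):
    """Split robots.txt text into records plus file-wide Sitemap URLs."""
    groups, sitemaps = [], []
    current = None
    last_was_agent = False
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # comments run to end of line
        if ":" not in line:
            continue  # blank or malformed lines are skipped in this sketch
        name, value = (part.strip() for part in line.split(":", 1))
        key = name.lower()
        if key == "user-agent":
            if not last_was_agent:
                # A new run of User-agent lines opens a new record.
                current = {"agents": [], "rules": []}
                groups.append(current)
            current["agents"].append(value.lower())
            last_was_agent = True
        elif key == "sitemap":
            sitemaps.append(value)  # global: not attached to any record
        elif key in ("allow", "disallow", "crawl-delay"):
            if current is not None:  # rules before any User-agent are dropped
                current["rules"].append((key, value))
            last_was_agent = False
    return groups, sitemaps
```

Two consecutive User-agent lines end up in the same record, so the rules that follow apply to both crawlers.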

The most important and most misunderstood rule of robots.txt is user-agent matching precedence. When Googlebot visits your robots.txt, it does NOT combine rules from multiple matching blocks. It picks the single most specific User-agent block, measured as the longest case-insensitive prefix match against its UA token, and applies ONLY that block. Other blocks are ignored completely. So if you have User-agent: * with Disallow: /private/ and User-agent: Googlebot with Disallow: /admin/, Googlebot will crawl /private/ (because it does not read the * block) and will only honour Disallow: /admin/. Every SEO audit we run catches at least one site making exactly this mistake: merging Googlebot into * assumptions and leaking crawl access to sensitive paths.
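
A minimal sketch of that selection step follows. It assumes each block is given as a pair of (list of user-agent names, list of rules), which is a layout chosen here for illustration. select_group is a hypothetical name, and when two blocks match equally well the first one is kept, which is a simplification.

```python
def select_group(groups, crawler_token):
    """Return the rules of the single most specific block for a crawler token."""
    token = crawler_token.lower()
    best_len, best_rules = -1, []
    for agents, rules in groups:
        for agent in agents:
            agent = agent.lower()
            if agent == "*":
                match_len = 0  # wildcard block: least specific fallback
            elif token.startswith(agent):
                match_len = len(agent)  # longer prefix means more specific
            else:
                continue
            if match_len > best_len:
                best_len, best_rules = match_len, rules
    return best_rules


blocks = [
    (["*"], [("disallow", "/private/")]),
    (["Googlebot"], [("disallow", "/admin/")]),
]
googlebot_rules = select_group(blocks, "Googlebot")
```

Here googlebot_rules holds only the Disallow: /admin/ rule; the /private/ rule from the * block is not applied to Googlebot.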

Within a block, Allow vs Disallow conflicts are resolved by Google's published longest-match-wins rule (now RFC 9309). Every rule in the block that matches the URL path is a candidate; the rule with the longest pattern wins; if two rules tie on length, Allow wins over Disallow. So Disallow: /admin/ (length 7) blocks /admin/anything, but Allow: /admin/public/ (length 14) selectively unblocks /admin/public/index.html. If no rule in the block matches, the URL is allowed by default. This is the opposite of firewall semantics, and a frequent source of confusion.
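
The resolution rule can be sketched in a few lines of Python. This version handles plain path prefixes only (no wildcards), and is_allowed is a hypothetical helper name, not part of any library.

```python
def is_allowed(rules, path):
    """Longest-match-wins over (kind, pattern) rules; Allow wins length ties."""
    best_len, allowed = -1, True  # no matching rule means allowed by default
    for kind, pattern in rules:
        if not pattern or not path.startswith(pattern):
            continue  # empty patterns and non-matching rules are not candidates
        longer = len(pattern) > best_len
        tie_won_by_allow = len(pattern) == best_len and kind == "allow"
        if longer or tie_won_by_allow:
            best_len, allowed = len(pattern), kind == "allow"
    return allowed


rules = [("disallow", "/admin/"), ("allow", "/admin/public/")]
blocked = not is_allowed(rules, "/admin/login.html")
unblocked = is_allowed(rules, "/admin/public/index.html")
```

With these two rules, /admin/login.html is blocked by the 7-character pattern, while /admin/public/index.html is allowed by the 14-character one.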

Wildcards. Google and most modern crawlers support two wildcards: * matches zero or more characters, $ anchors the pattern to the end of the URL. Common patterns: Disallow: /*.pdf$ (block all PDF files), Disallow: /*? (block all URLs with query strings), Allow: /images/*.jpg (allow JPG images under /images/). Wildcards are extensions to the original 1994 spec; all major crawlers support them now, but putting $ anywhere except the end of a pattern is a bug we flag with a warning.
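
One way to evaluate these patterns is to translate them into regular expressions. This is a sketch under the assumption that $ is only meaningful as the last character of the pattern; pattern_to_regex and matches are illustrative names.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regular expression."""
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    # Escape the literal pieces and join them with ".*" where "*" appeared.
    regex = ".*".join(re.escape(piece) for piece in body.split("*"))
    return re.compile(regex + (r"\Z" if anchored else ""))


def matches(pattern, url_path):
    """True if the pattern matches from the start of the URL path."""
    return pattern_to_regex(pattern).match(url_path) is not None
```

For example, matches("/*.pdf$", "/docs/file.pdf") is true, while matches("/*.pdf$", "/docs/file.pdf?x=1") is false because the $ anchor requires the URL to end in .pdf.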

Common mistakes we catch in production audits: typos in directive names (Dissallow, UserAgent, Craw-delay), which are silently ignored, leaving rules with zero effect; rules placed before any User-agent line, which are orphaned; relative paths (admin/ instead of /admin/), which match nothing; trying to block a file extension without wildcards; assuming Crawl-delay works for Google (it does not; Bing honours it, and support among other crawlers such as Yandex and Baidu has changed over time, so check each crawler's current documentation); blocking CSS/JS files, which prevents Google from rendering your pages correctly and can hurt ranking; duplicating User-agent: * blocks (RFC 9309 says groups for the same user-agent must be combined, but some older parsers use only the first); and putting a UTF-8 BOM at the start of the file, which breaks parsing on some older crawlers.

Robots.txt does not control indexing — only crawling. A URL blocked by Disallow is not fetched, but if Google finds the URL through external links, it can still appear in search results (with no snippet, because the content was never read). This is why you sometimes see "No information is available for this page" listings in Google. To prevent indexing, you need a <meta name="robots" content="noindex"> tag or an X-Robots-Tag: noindex HTTP header — both of which require the crawler to reach the page, so do NOT combine them with a robots.txt block on the same URL. For truly private content, use HTTP authentication or return 404/410.
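
The conflict described above (a noindex signal on a URL that crawlers are not allowed to fetch) can be detected with a short check. The sketch below uses Python's standard html.parser module. The names has_noindex and noindex_is_unreachable are assumptions, the headers argument is assumed to be a plain dict keyed by "X-Robots-Tag", and crawler-specific header forms such as "googlebot: noindex" are not handled.

```python
from html.parser import HTMLParser


class _RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            self.directives.update(d.strip().lower() for d in content.split(","))


def has_noindex(html, headers):
    """True if the page carries noindex in a meta tag or X-Robots-Tag header."""
    parser = _RobotsMetaParser()
    parser.feed(html)
    header = headers.get("X-Robots-Tag", "")
    header_directives = {d.strip().lower() for d in header.split(",")}
    return "noindex" in parser.directives or "noindex" in header_directives


def noindex_is_unreachable(blocked_by_robots, html, headers):
    """True when a noindex signal exists but crawlers cannot fetch the page."""
    return blocked_by_robots and has_noindex(html, headers)
```

A page for which noindex_is_unreachable returns True needs either the robots.txt block or the noindex signal removed, depending on whether the goal is to stop crawling or to stop indexing.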

At EmproIT, our Technical SEO team configures robots.txt for enterprise sites handling thousands of distinct crawl surfaces: faceted navigation, paginated archives, internal search, admin portals, staging environments, API endpoints, CDN assets, and user-generated content. We run automated validation on every deploy — this tool runs the same checks — and monitor Search Console for crawl budget waste. Pair this validator with our Robots.txt Generator (to build correct files from scratch), our Sitemap Validator (to validate the sitemaps referenced from your robots.txt), and our HTTP Header Checker (to verify X-Robots-Tag headers are set correctly on pages you want de-indexed).

Frequently Asked Questions

What is robots.txt syntax?

Robots.txt follows the Robots Exclusion Protocol (REP), published by the IETF as RFC 9309 in 2022. The file is plain text, served at /robots.txt at the root of a host, and contains records, conventionally separated by blank lines. Each record starts with one or more User-agent lines followed by Allow, Disallow, and optionally Crawl-delay lines. Sitemap and Host directives are global. Directive names and user-agent names are matched case-insensitively; paths are case-sensitive. Comments start with # and run to end of line. Our validator checks every one of these syntax rules and flags malformed lines with the exact line number.

How do wildcards (* and $) work in robots.txt?

Google, Bing, and most modern crawlers support two wildcards in Allow/Disallow paths. The asterisk * matches any sequence of zero or more characters. The dollar sign $ anchors the pattern to the end of the URL. Examples: Disallow: /*.pdf$ blocks all URLs ending in .pdf; Disallow: /private/* is equivalent to Disallow: /private/; Allow: /images/*.jpg allows any .jpg under /images/. Wildcards in the middle of a pattern are supported but some older crawlers misinterpret them. Our validator emits a warning when $ appears anywhere except the end. For maximum compatibility, prefer simple path prefixes when wildcards are not strictly required.

How does user-agent matching precedence work?

When a crawler reads robots.txt, it does NOT merge multiple matching blocks. It picks the single most specific User-agent block that matches its token and applies ONLY that block. Specificity is measured as the longest case-insensitive prefix match against the crawler's UA token. If Googlebot visits a file with blocks for User-agent: * and User-agent: Googlebot, it uses ONLY the Googlebot block and ignores the * block entirely. If Googlebot-Image visits the same file, it first looks for Googlebot-Image; if absent, falls back to Googlebot (prefix match); if that is absent, falls back to *. A Googlebot-specific block must repeat every rule from the * block you want Googlebot to honour.

Which wins when Allow and Disallow conflict?

Per Google's published spec (now RFC 9309), the rule with the LONGEST matching path pattern wins, and if two rules tie on length, Allow wins over Disallow. So Disallow: /admin/ blocks /admin/login.html, but Allow: /admin/public/ (longer pattern) unblocks /admin/public/index.html. If you have Allow: /page and Disallow: /page (equal length), Allow wins. This differs from older first-match parsers, where the first matching rule in file order won. Modern crawlers (Google, Bing, Yandex, DuckDuckGo) follow the longest-match-wins rule. Our tester implements the Google spec exactly.

Does Google honour Crawl-delay?

No. Google has publicly stated for over a decade that it ignores Crawl-delay in robots.txt. Googlebot instead adapts its crawl rate automatically to how quickly and reliably your server responds. Bing honours Crawl-delay as a delay in seconds between requests, and Yandex, Baidu, and Seznam have honoured it at various times; support has changed over the years, so confirm against each crawler's current documentation. Typical values range from 1 to 10 seconds. Values above 10 seconds are often silently capped by crawlers that do honour the directive, and our validator emits a warning for Crawl-delay greater than 10. If you need Googlebot to slow down, implement server-side rate limiting or return 429/503 responses; do not rely on Crawl-delay.
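
For the server-side route, a minimal sliding-window limiter might look like the sketch below. The class name CrawlerRateLimiter, its parameters, and the choice of keying on the user-agent string are all assumptions; a production setup would more likely live in the web server or CDN configuration.

```python
import time
from collections import deque


class CrawlerRateLimiter:
    """Allow at most max_requests per window_seconds for each user-agent."""

    def __init__(self, max_requests, window_seconds, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._clock = clock  # injectable so the limiter can be tested
        self._hits = {}

    def status_for(self, user_agent):
        """Return the HTTP status to send: 200 to serve, 429 to push back."""
        now = self._clock()
        hits = self._hits.setdefault(user_agent, deque())
        # Forget requests that have fallen out of the time window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return 429
        hits.append(now)
        return 200
```

A request handler would call status_for with the request's User-Agent header and respond with 429 whenever the limiter says so, letting the crawler back off on its own schedule.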

How do I test whether a URL is blocked by robots.txt?

Paste your robots.txt into the left panel of this validator, then enter the URL path you want to check in the Test a URL panel (just the path, like /admin/login.html, not the full URL). Select the user-agent from the dropdown or enter a custom UA string. Click Test URL. The tool shows ALLOWED or BLOCKED, highlights the matching rule with the line number in your robots.txt, and lists every effective rule for that UA (after falling back from specific to * if needed). For batch testing, paste a list of URL paths (one per line) into the Batch Test textarea and click Test All.

What are common robots.txt mistakes?

The mistakes we see most often: (1) typos in directive names like Dissallow or UserAgent which are silently ignored; (2) rules placed before any User-agent line, which are orphaned and ignored; (3) relative paths like admin/ instead of /admin/; (4) trying to block a file extension without wildcards (use Disallow: /*.pdf$); (5) placing User-agent: * after a specific block and expecting rules to merge — they do not; (6) using robots.txt to hide sensitive content (robots.txt is public); (7) blocking CSS/JS files, which prevents Google from rendering the page; (8) missing or malformed Sitemap: directives. Our validator catches every one of these.

Does robots.txt affect indexing, or just crawling?

Robots.txt controls crawling, not indexing. A URL blocked by Disallow is not fetched by the crawler, but if Google finds the URL through external links, it can still index the URL (without content, showing just the URL and anchor text). This is why you sometimes see "No information is available for this page" in Google results. To prevent indexing, use: (1) a <meta name="robots" content="noindex"> tag on the page (which requires the crawler to reach the page, so do NOT block it in robots.txt); (2) an X-Robots-Tag: noindex HTTP header; (3) HTTP authentication (401/403); (4) 404/410 for permanently removed content. For reliable deindexing, use meta robots or X-Robots-Tag, not robots.txt.

Technical SEO That Controls Crawl Budget

Our Technical SEO team configures correct crawl directives, fixes blocked-resource issues, and ensures your critical pages are indexable by Google.
