
What is a robots.txt file?


A robots.txt file is a plain-text file that tells web crawlers (such as search engine bots) which parts of your website they may and may not crawl. Before a search engine crawls any page on a domain it hasn’t visited before, it checks that domain’s robots.txt file to determine which URLs it is allowed to access and which it should avoid.

File Format and Location

The robots.txt file must be named exactly “robots.txt” (all lowercase) and must be located at the root of your website. For example, for a website at https://www.example.com, the robots.txt file should be accessible at https://www.example.com/robots.txt.

The file must be UTF-8 encoded plain text. You should create it using a text editor like Notepad, TextEdit, vi, or emacs – not with a word processor that might add proprietary formatting or unexpected characters.

Basic Syntax

A robots.txt file consists of one or more groups of rules. Each group typically includes:

  1. User-agent directive – Specifies which crawler the rules apply to
  2. Disallow/Allow directives – Specify which parts of the site can or cannot be accessed

Example of a Basic robots.txt File:

User-agent: *
Disallow: /private/
Allow: /
Sitemap: https://www.example.com/sitemap.xml

This example means:

  • User-agent: * – These rules apply to all crawlers
  • Disallow: /private/ – No crawler should access the /private/ directory or its contents
  • Allow: / – All other parts of the site can be accessed
  • Sitemap: https://www.example.com/sitemap.xml – Provides the location of your sitemap
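The rules above can be checked programmatically. Here is a minimal sketch using Python’s standard-library robots.txt parser, with the example.com placeholder URLs from the example:

```python
# Minimal sketch: parse the example rules with Python's standard-library
# robots.txt parser and check which URLs a crawler may fetch.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /private/
Allow: /
Sitemap: https://www.example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# /private/ is disallowed for every user agent; everything else is allowed.
print(parser.can_fetch("*", "https://www.example.com/private/data.html"))  # False
print(parser.can_fetch("*", "https://www.example.com/about.html"))         # True
```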

Common Directives

  1. User-agent: Identifies which crawler the rules apply to
    • User-agent: * (all crawlers)
    • User-agent: Googlebot (Google’s crawler specifically)
    • User-agent: Bingbot (Bing’s crawler specifically)
  2. Disallow: Tells crawlers which parts of your site they should not access
    • Disallow: / (blocks the entire site)
    • Disallow: /admin/ (blocks the /admin/ directory)
    • Disallow: /*.pdf$ (blocks all PDF files)
  3. Allow: Tells crawlers which parts they can access (especially useful with wildcards)
    • Allow: /public/ (allows access to the /public/ directory)
  4. Sitemap: Indicates the location of your XML sitemap
    • Sitemap: https://www.example.com/sitemap.xml

More Complex Examples

Block All Crawlers from Entire Site:

User-agent: *
Disallow: /

Block One Specific Crawler:

User-agent: BadBot
Disallow: /

User-agent: *
Allow: /
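This per-group behavior can be verified with the same standard-library parser. BadBot is the hypothetical crawler name from the example above; each User-agent group is matched independently:

```python
# Minimal sketch: the BadBot group blocks that crawler everywhere, while
# the wildcard group leaves all other crawlers fully allowed.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: BadBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

print(parser.can_fetch("BadBot", "https://www.example.com/page.html"))        # False
print(parser.can_fetch("SomeOtherBot", "https://www.example.com/page.html"))  # True
```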

Block Multiple Directories:

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /private/

Block File Types:

User-agent: *
Disallow: /*.pdf$
Disallow: /*.xls$
Disallow: /*.doc$
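Note that the `*` and `$` wildcards are extensions supported by Google and other major search engines, not part of the original 1994 robots exclusion standard (Python’s `urllib.robotparser`, for instance, treats them literally). The sketch below illustrates how such patterns match; `wildcard_match` is a hypothetical helper for demonstration, not Google’s actual implementation:

```python
# Illustrative sketch of Google-style wildcard matching: '*' matches any
# run of characters, and a trailing '$' anchors the pattern to the end of
# the URL path. Patterns without '$' match as prefixes.
import re

def wildcard_match(pattern: str, path: str) -> bool:
    # Escape regex metacharacters, then restore the robots.txt wildcard.
    regex = re.escape(pattern).replace(r"\*", ".*")
    if regex.endswith(r"\$"):
        regex = regex[:-2] + "$"  # anchor at the end of the path
    return re.match(regex, path) is not None

print(wildcard_match("/*.pdf$", "/files/report.pdf"))      # True
print(wildcard_match("/*.pdf$", "/files/report.pdf?v=2"))  # False
print(wildcard_match("/admin/", "/admin/settings"))        # True (prefix match)
```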

Important Limitations

It’s crucial to understand that robots.txt is primarily for managing crawler traffic to your site, not for hiding pages from search results. Some important limitations:

  1. The protocol relies on voluntary compliance: well-behaved crawlers respect it, but malicious bots may ignore your robots.txt file or even use it to discover the pages you have disallowed.
  2. A page disallowed in robots.txt can still be indexed if linked to from other sites. While Google won’t crawl the content, it might still find and index a disallowed URL if it’s linked from elsewhere on the web.
  3. To prevent a page from appearing in search results, use a meta robots noindex tag instead of blocking it with robots.txt.
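For example, a page you want kept out of search results would carry this standard tag in its `<head>` (the page must not be blocked in robots.txt, or crawlers will never see the tag):

```html
<!-- Place in the <head> of any page that should be excluded from
     search results. Crawlers must be able to fetch the page to read it. -->
<meta name="robots" content="noindex">
```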

Best Practices

  1. Only use robots.txt for files or pages that search engines should never see, or that can significantly impact crawling, such as login areas, test environments, or sections with extensive faceted navigation.
  2. Monitor your robots.txt file for any issues or changes, as developers sometimes make changes when pushing new code that could inadvertently alter your robots.txt file.
  3. Test your robots.txt file using tools like Google Search Console to make sure it works as intended.
  4. Be aware that Google caches robots.txt files for up to 24 hours (sometimes longer), so changes may not take effect immediately.
  5. Remember that each subdomain needs its own robots.txt file.

By properly configuring your robots.txt file, you can help search engines crawl your site more efficiently and focus on the content that matters most to your visitors.
