Robots.txt is a way of telling search engine crawlers what they can and can’t access. Here, we take you through its benefits and how to use robots.txt on your site.

Before you read this guide, please check out our quick guide on the pages you should block Google from crawling. It will help you improve your crawl efficiency and boost your Google ranking.


What is robots.txt?

A robots.txt file is a plain text file on your web server that tells search engine bots, like Googlebot, whether or not they should access a page.

Why is robots.txt important?

Robots.txt controls how search engine spiders see your site, and consequently improper use of it can impact your ranking.

The robots.txt file is the first thing a search engine bot looks at when it arrives at your site. It does this to determine whether or not it has permission to access a given page or file. If the robots.txt file says it can, the bot continues on to the page.
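To make this concrete, here’s a minimal sketch of that permission check using Python’s built-in urllib.robotparser; the domain and paths are placeholders rather than a real site.

from urllib.robotparser import RobotFileParser

# Placeholder domain and paths, for illustration only
parser = RobotFileParser()
parser.set_url("https://www.example.com/robots.txt")
parser.read()  # download and parse the live robots.txt

# Before requesting a page, a well-behaved bot asks whether it is allowed in
for path in ("/", "/private/report.html"):
    url = "https://www.example.com" + path
    if parser.can_fetch("Googlebot", url):
        print(path, "- crawl allowed")
    else:
        print(path, "- blocked by robots.txt")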

If you have instructions for search engine bots, you need a way to give those instructions to them, and the robots.txt file is how you do it.

Reasons to use robots.txt files

  • You have content you want to block from search engines
  • You have paid links or ads that require special instructions for robots
  • You want to optimise access to your site from bots
  • You are developing a site which is live, but do not want it indexed by search engines yet

Though these scenarios can be controlled by other methods, the robots.txt file is a simple and centralised way of addressing such issues.

Reasons not to use robots.txt files

  • Not having a robots.txt file is error-free and even simpler
  • You don’t have any files you want blocked from search engines

Basic robots.txt commands

These are the four most basic commands to familiarise yourself with when dealing with robots.txt.

Allow full access:
User-agent: *
Disallow:

Block all access:
User-agent: *
Disallow: /

Block one folder:
User-agent: *
Disallow: /folder/

Block one file:
User-agent: *
Disallow: /file.html

Robots.txt files on your site

1. Do I have a robots.txt file?

When it comes to the robots.txt file, you first need to establish whether you have one. You can easily do this by entering your URL into an online robots.txt checker. Alternatively, you can check from any browser by adding “/robots.txt” to the end of a domain name, e.g. www.website.com/robots.txt
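If you prefer to check programmatically, the short Python sketch below simply fetches and prints a site’s robots.txt; swap in your own domain for the example one.

from urllib.request import urlopen
from urllib.error import HTTPError

# Replace the example domain with your own
url = "https://www.example.com/robots.txt"

try:
    with urlopen(url) as response:
        print(response.read().decode("utf-8"))
except HTTPError as error:
    if error.code == 404:
        print("No robots.txt file found at", url)
    else:
        print("Request failed with status", error.code)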

2. If you have one, ensure it’s not blocking important files

You need to make sure that robots.txt isn’t blocking any important files, whether files that aren’t meant to be blocked or files that help Google to understand your pages, as this could affect your site’s ranking. You can check this in Google Search Console, which also offers step-by-step instructions on how to test your robots.txt file.
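As an illustration, a file like the one below (the /assets/ folder name is purely hypothetical) would block the CSS and JavaScript Google needs to render your pages properly, and is exactly the kind of rule you’d want to spot and remove:

# Hypothetical example – blocks the folder holding CSS and JavaScript
User-agent: *
Disallow: /assets/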

3. Do I need a robots.txt file?

In some cases, your site may have no need for a robots.txt file. Review the reasons why you might need one and decide. If you don’t have a robots.txt file, search engine bots will simply have full access to your site, which is normal and very common.

How to make robots.txt files

What a robots.txt should say entirely depends on what outcome you want. All robots.txt instructions result in one of three outcomes:

  • Full allow: All content may be crawled.
  • Full disallow: No content may be crawled.
  • Conditional allow: Directives in the robots.txt determine the ability to crawl certain content.

Allow All – all content may be crawled

Most people want robots to visit everything on their site. If this applies to you, there are three ways to let bots know they’re welcome.

  • Don’t have a robots.txt file. If a bot comes to your site and doesn’t find one, it will simply visit all of your web pages and content
  • Make an empty file called robots.txt. If your robots.txt file has nothing in it, a bot will find the file, see there is nothing to read and so will go on to visit all of your web pages and content
  • Make a file called robots.txt and enter the ‘Allow full access’ command mentioned above. This will send instructions to a bot telling it that it is allowed to visit all of your pages and content

Disallow All – no content may be crawled

This means that search engine bots will not crawl your web pages, so they won’t normally appear in search results. To block all search engine spiders from your site you should enter the following command in your robots.txt:
User-agent: *
Disallow: /

Understanding robots.txt instructions

Here’s a breakdown of each of the words in a robots.txt instruction to help you better understand how to write your own.

User-agent

“User-agent” specifies directions to a particular search engine bot if necessary.

If you want to tell all robots the same thing, you put a “*” after “User-agent:”, making it “User-agent: *”. This basically means “these directions apply to all robots”.

If you want to tell a specific robot something (in this example Googlebot) it would look like this:

User-agent: Googlebot

The above line is saying “these directions only apply to Googlebot”.
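As a hypothetical example of how this works in practice, in the file below the first group applies only to Googlebot and the second to every other robot; a bot obeys the group that most specifically matches its user-agent, so Googlebot would follow only the first one. The folder names are made up for illustration.

# Hypothetical folders, for illustration only
User-agent: Googlebot
Disallow: /drafts/

User-agent: *
Disallow: /drafts/
Disallow: /print/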

Disallow

Disallow tells the robot which files or folders it should not look at. For example, if you wanted to stop search engines from indexing the photos on your site, you could put those photos into one folder and exclude it. To do so, you’d collect the images into a folder called “photos” and then tell the search engines not to crawl it by typing:

User-agent: *
Disallow: /photos

Here, the “User-agent: *” part tells the bot that “this applies to all robots”, whilst the “Disallow: /photos” part says “don’t visit or index my photos folder”.

Allow

Googlebot understands a few more instructions than some other robots. As well as “User-agent” and “Disallow”, Googlebot also uses the “Allow” instruction.

The “Allow:” instruction lets you tell a bot that it’s fine to view a file in a folder that has been “Disallowed” by other instructions. Going back to the “Disallow” photos example, if we had that instruction in place but wanted one of the photos in that folder crawled, we would do this using “Allow”. Let’s say the image was called giraffe.gif; the instructions would therefore be:

User-agent: *
Disallow: /photos
Allow: /photos/giraffe.gif

This instruction lets Googlebot know that it can visit giraffe.gif despite the photo folder being blocked.

You may or may not need robots.txt on your site; this depends entirely on your content and what you want available to bots. If you do have robots.txt on your site, you should always ensure that it’s up to date, and it’s vital that it’s not blocking any pages you don’t want blocked or that might help Google to rank your pages.

Where to now?

If you have one hour to spare, please read and action our Technical SEO Audit checklist. It contains all the elements you need to check to ensure your site is properly set up and able to rank well. We guarantee the steps in this guide will improve your organic traffic. The whole process should take you less than an hour. Happy hunting!
