Robots.txt is a way of telling search engine crawlers what they can and can’t access. Here, we take you through its benefits and how to use robots.txt on your site.
What are robots.txt files?
A robots.txt file is a plain-text file that sits at the root of your web server and tells search engine bots, such as Googlebot, whether they may access a page or file.
Why is robots.txt important?
Robots.txt controls how search engine spiders see your site, so improper use of it can harm your ranking.
The robots.txt file is one of the first things a search engine bot checks when it arrives at your site, because it determines whether the bot has permission to access a given page or file. If the robots.txt file says the bot may enter, it continues on to the page.
If you have instructions for a bot, you need a way to deliver them, and that way is a robots.txt file.
Reasons to use robots.txt files
- You have content you want to block from search engines
- You have paid links or ads that require special instructions for robots
- You want to optimise access to your site from bots
- You are developing a site which is live, but do not want it indexed by search engines yet
Though these scenarios can be controlled by other methods, the robots.txt file is a simple and centralised way of addressing such issues.
Reasons not to use robots.txt files
- Not having one is even simpler, and leaves nothing to get wrong
- You don’t have any files you want blocked from search engines
Primary robots.txt commands
These are the four most basic commands to familiarise yourself with when dealing with robots.txt.
Allow full access:
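An empty Disallow line means nothing is blocked, so every bot may crawl everything:

```
User-agent: *
Disallow:
```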
Block all access:
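A single slash after Disallow blocks the entire site for every bot:

```
User-agent: *
Disallow: /
```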
Block one folder:
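To block a single folder (the folder name here is just an example):

```
User-agent: *
Disallow: /folder/
```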
Block one file:
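And to block a single file (again, the filename is just an example):

```
User-agent: *
Disallow: /file.html
```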
Robots.txt files on your site
1. Do I have a robots.txt file?
When it comes to the robots.txt file, you first need to establish whether you have one. You can do this easily by entering your URL into an online robots.txt checker. Alternatively, you can check from any browser by adding “/robots.txt” to the end of a domain name, e.g. www.website.com/robots.txt
2. If you have one, ensure it’s not blocking important files
You need to make sure that robots.txt isn’t blocking any important files, whether files that aren’t meant to be blocked or files that might help Google understand your pages, as this could affect your site’s ranking. You can check this in Google Search Console, and Google also offers step-by-step instructions on how to test your robots.txt file.
3. Do I need a robots.txt file?
In some cases, your site may have no need for a robots.txt file. Review the reasons why you might need one and decide. If you don’t have a robots.txt file, search engine bots simply get full access to your site, which is normal and very common.
How to make robots.txt files
What a robots.txt should say entirely depends on what outcome you want. All robots.txt instructions result in one of three outcomes:
- Full allow: All content may be crawled.
- Full disallow: No content may be crawled.
- Conditional allow: Directives in the robots.txt determine the ability to crawl certain content.
Full allow – all content may be crawled
Most people want robots to visit everything on their site. If this applies to you, there are three ways to let bots know they’re welcome.
- Don’t have a robots.txt file. If a bot comes to your site and doesn’t find a robots.txt file, it will simply visit all of your web pages and content
- Make an empty file called robots.txt. If your robots.txt file has nothing in it, a bot will find the file, see there is nothing to read and so will go on to visit all of your web pages and content
- Make a file called robots.txt and enter the ‘Allow full access’ command mentioned above. This will send instructions to a bot telling it that it is allowed to visit all of your pages and content
Full disallow – no content may be crawled
This means that search engine spiders will not crawl your web pages, so they won’t generally be shown in search results. To block all search engine spiders from your site, enter the following command in your robots.txt:
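```
User-agent: *
Disallow: /
```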
Here’s a breakdown of each of the words in a robots.txt instruction to help you better understand how to make your own.
“User-agent” specifies directions to a particular search engine bot if necessary.
If you want to tell all robots the same thing, you put a “*” after “User-agent:”, making it “User-agent: *”. This means “these directions apply to all robots”.
If you want to tell a specific robot something (in this example Googlebot) it would look like this:
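```
User-agent: Googlebot
```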
The above line is saying “these directions only apply to Googlebot”.
Disallow instructs the robot on what files they should not look at. For example, if you wanted to stop search engines from indexing photos on your site, you would put those photos into one folder and exclude it. To do so, you’d collect the images into one folder called “photos” and then tell the search engines not to index the folder by typing:
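```
User-agent: *
Disallow: /photos
```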
Here, the “User-agent: *” part tells the bot “this applies to all robots”, whilst the “Disallow: /photos” part says “don’t visit or index my photos folder”.
Googlebot recognises a few more instructions than other robots. As well as “User-agent” and “Disallow”, Googlebot also uses the “Allow” instruction.
The “Allow:” instruction lets you tell a bot that it’s fine to view a file in a folder that has been “Disallowed” by other instructions. Going back to the “Disallow” photos example, if we had that instruction entered but wanted one of the photos in that folder indexed, we would do this using “Allow”. Let’s say the image is called giraffe.gif; the instruction would be:
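```
User-agent: Googlebot
Disallow: /photos
Allow: /photos/giraffe.gif
```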
This instruction lets Googlebot know that it can visit giraffe.gif despite the photo folder being blocked.
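If you want to sanity-check rules like these before relying on them, Python’s standard library ships with a robots.txt parser. Here’s a small sketch using the photos example from above (www.website.com is a placeholder domain). One caveat: Python’s parser applies the first rule that matches, so the “Allow” line is listed before the “Disallow” line here, whereas Google resolves conflicts by preferring the most specific rule regardless of order.

```python
from urllib import robotparser

# The article's example rules: block the photos folder for every bot,
# but allow one specific image. Python's parser applies the first
# matching rule, so the Allow line comes first.
rules = """\
User-agent: *
Allow: /photos/giraffe.gif
Disallow: /photos
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

# www.website.com is a placeholder domain for illustration.
print(rp.can_fetch("Googlebot", "https://www.website.com/"))                    # allowed
print(rp.can_fetch("Googlebot", "https://www.website.com/photos/holiday.jpg"))  # blocked
print(rp.can_fetch("Googlebot", "https://www.website.com/photos/giraffe.gif"))  # allowed
```

In practice you’d point the parser at your live file with `rp.set_url("https://www.website.com/robots.txt")` followed by `rp.read()` instead of parsing inline text.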
Whether you need robots.txt on your site depends entirely on your content and what you want available to bots. If you do have a robots.txt file, always keep it up to date, and make sure it isn’t blocking any pages you don’t want blocked or that might help Google to rank your pages.