Insights into the robots.txt file – Optimizing robots.txt on WordPress

What is a Robots.txt file?

Robots.txt, formally the Robots Exclusion Protocol (REP), is a small text file that bloggers or webmasters create to tell search engine bots what to crawl and what to index on a website. You can also allow only certain bots to crawl and index the site; for example, you can restrict the Semrush and Ahrefs robots from accessing your website.

Note: The important point here is that disallowing Googlebot in robots.txt doesn’t mean it is physically unable to crawl your website; the file is just an instruction not to. It is not a firewall or any kind of password protection. It is like putting a note at the entrance saying ‘Please don’t enter’: most search engines obey it, but you can’t keep back-door bots off your website with robots.txt alone. If your site contains personal data, you can use this file to ask engines not to crawl and index that particular information or web page, but anything truly sensitive should be protected by stronger means.
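For example, to turn away the Semrush and Ahrefs crawlers mentioned above while leaving other bots unrestricted, rules along these lines are a common pattern (SemrushBot and AhrefsBot are the user-agent names those services publish):

User-agent: SemrushBot

Disallow: /

User-agent: AhrefsBot

Disallow: /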

Where to keep the robots.txt file on your website:

The location of the robots.txt file is also extremely important. You have to keep this file in the root directory of your domain, so that it can be fetched from a URL like http://techtipszone.com/robots.txt . If you place the robots.txt file in any other directory, search engines will not be able to find it: they look for it only in the root directory and will not search your site’s other folders for it. If they can’t find it there, they simply assume the website doesn’t have a robots.txt file and start crawling and indexing the entire website.
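Since this post is about WordPress, here is a minimal sketch of what a typical WordPress robots.txt might look like; the sitemap URL is a placeholder you would replace with your own:

User-agent: *

Disallow: /wp-admin/

Allow: /wp-admin/admin-ajax.php

Sitemap: http://your-domain.com/sitemap.xml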

Robots.txt Disallow commands:
Block all crawlers from all website content:

User-agent: *

Disallow: /

The wildcard character * in the User-agent line means the rule applies to all crawlers; Disallow: / then blocks the entire site.

To stop all search engines from crawling the temp folder:

User-agent: *

Disallow: /temp/

To stop Google’s search engine from crawling a particular folder:

User-agent: Googlebot

Disallow: /temp/

We can also use wildcard characters to match folder names and search engine robot names, as shown in the sketch below.
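For instance, major crawlers such as Googlebot understand * (match any sequence of characters) and $ (match the end of the URL), so patterns like the following are possible; support varies by crawler, so treat these as illustrations:

User-agent: *

Disallow: /*.pdf$

Disallow: /private*/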

Similar to disallowing crawling, we can also ask search engines not to index particular directories. (Note that Noindex is an unofficial robots.txt directive, and Google announced in 2019 that it no longer honors it, so don’t rely on it alone.)

User-agent: Googlebot

Disallow: /Directory-name/

Noindex: /Directory-name/

Block a specific web page from being crawled:

User-agent: Googlebot

Disallow: /category/page-url.html

How to generate and check the robots.txt file:

Checking and validating the robots.txt file is very simple. First, check whether the robots.txt file is in the correct location (the root directory) by visiting http://domain-name.com/robots.txt . If it is not present, create one and upload it to your root directory using an FTP client such as FileZilla.

To validate the robots.txt file, we can make use of Google Webmaster Tools.

Go to Google Webmaster Tools; under the Crawl option you will see the robots.txt Tester, as in the image above. Click it to validate the file. If everything is OK, it will report 0 Errors and 0 Warnings. You can also test whether a particular page of your website is blocked from Google using the option below it (see the image): enter your URL there and test whether Googlebot is allowed to crawl it.
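If you prefer to check from your own machine, here is a minimal sketch using Python’s standard-library urllib.robotparser; the domain and paths are placeholders, and this parser applies its own interpretation of the rules, so Google’s tester remains the authoritative check:

from urllib.robotparser import RobotFileParser

# Fetch and parse the live robots.txt (it must sit in the root directory)
rp = RobotFileParser()
rp.set_url("http://techtipszone.com/robots.txt")
rp.read()

# Ask whether a given user-agent may fetch a given URL
print(rp.can_fetch("Googlebot", "http://techtipszone.com/temp/page.html"))
print(rp.can_fetch("*", "http://techtipszone.com/"))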

This is all about the insights of the robots.txt file. If you have any queries, or if you want to add more to this post, please do suggest and comment. Cheers!
