The robots.txt file lets you regulate and restrict the activity of web crawlers, or bots, that visit your website. You must follow specific rules when building and installing a robots.txt file on your web server. These rules govern the format and order of instructions within the file and the location of the file, which must sit in the top-level (root) directory of the site. If you do not follow these rules, you may do more harm than good to your website by excluding good bots capable of driving visitors to your web pages.
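
The placement rule can be illustrated with a short sketch. Crawlers request robots.txt from the top level of the host, never from a subdirectory, so the location follows directly from any page address on the site. The function name robots_url and the example.com addresses below are hypothetical, chosen only for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the address a crawler would request to find robots.txt for page_url."""
    parts = urlsplit(page_url)
    # Keep only the scheme and host (including any port); the path is always
    # /robots.txt, and any query string or fragment is dropped.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


if __name__ == "__main__":
    print(robots_url("http://www.example.com/articles/2006/robots.html"))
```

A robots.txt file stored anywhere else, such as /articles/robots.txt, is simply never requested by a crawler that follows this convention.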

In most cases, if you are not certain how to construct and install robots.txt, it is better to omit the file until you can do the job properly. A missing robots.txt does cause failed requests for the file, and those requests add a slight amount of server overhead, but this rarely has any noticeable effect on the performance of your website. The bots and web crawlers employed by the major search engines are intelligent enough to function without robots.txt, and will not abandon your website if it is not there.

Well-behaved web crawlers and bots will follow the instructions in your robots.txt file. However, do not rely upon robots.txt to screen out bad or malicious bots such as email harvesters or web strippers. Such web denizens are engineered to ignore robots.txt and gain access to your website by any means at their disposal (depending upon the sophistication and tenacity of their programming). You must deny these bots access to your website by other means: directory privilege restrictions, password protection, .htaccess files, changes to the main server configuration file (if you have root access on the server or are a friend of the system administrator), server side includes (SSI), or the good graces and expertise of your web hosting service.

If good bots are capable of performing without robots.txt and bad bots intentionally ignore it, why bother with it at all? The answer is that by carefully employing robots.txt, you can control which directories good bots may view, exclude them from your website completely if desired, and throttle the rate at which they spider your website (for those bots that honor a crawl-delay instruction) to avoid possible bandwidth bottlenecks. Furthermore, while lack of a robots.txt file will not prevent good bots from visiting your website, a friendly bot is much more likely to revisit frequently if it does not encounter problems during its visits.
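
The three uses described above (limiting directories, excluding a bot completely, and throttling) can be sketched with a small sample file checked by Python's standard urllib.robotparser module. The bot names ExampleBot and FriendlyBot, the directory names, and the example.com addresses are all hypothetical. Note also that Crawl-delay is a widely recognized extension rather than a universally honored instruction, so treat the throttling part as a request that some bots may ignore.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Crawl-delay: 10
Disallow: /private/
Disallow: /cgi-bin/
"""


def load_rules(text: str) -> RobotFileParser:
    """Parse robots.txt text the way a well-behaved crawler might."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules(SAMPLE_ROBOTS_TXT)

if __name__ == "__main__":
    site = "http://www.example.com"
    # ExampleBot is excluded from the whole site.
    print(rules.can_fetch("ExampleBot", site + "/index.html"))
    # Every other bot may read public pages but not the listed directories.
    print(rules.can_fetch("FriendlyBot", site + "/index.html"))
    print(rules.can_fetch("FriendlyBot", site + "/private/data.html"))
    # Requested pause, in seconds, between successive fetches.
    print(rules.crawl_delay("FriendlyBot"))
```

Under these sample rules, ExampleBot is refused everywhere, while any other bot may fetch pages outside /private/ and /cgi-bin/ and is asked to wait ten seconds between requests. The more specific ExampleBot record is listed before the catch-all record, which is one reasonable way to respect the ordering rules mentioned earlier.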

Authored by Kenneth L. Anderson.  Original article published 3 January 2006.

