A robots.txt file lets you regulate and restrict the activity of web
crawlers and bots that visit your website. You must follow specific
rules when building and installing the file on your web server. These
rules govern the format and order of the instructions within the file
and the directory level at which the file must be placed. If you do
not follow them, you may do more harm than good to your website by
excluding good bots capable of driving visitors to your web pages.
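
As a rough illustration of the placement rule, crawlers request robots.txt from the top level of the host, not from the directory of the page they happen to be visiting. The sketch below derives that location from any page address. The function name robots_txt_url and the example.com addresses are assumptions made for illustration only.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the address where a crawler would look for robots.txt.

    Hypothetical helper: keeps the scheme and host of page_url and
    replaces the path with /robots.txt at the top level of the site.
    """
    parts = urlsplit(page_url)
    # Drop the original path, query string and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("https://www.example.com/shop/items/index.html?page=2"))
# prints https://www.example.com/robots.txt
```

A copy of the file placed in a subdirectory such as /shop/ would therefore never be requested by a crawler that follows this convention.
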

In most cases, if you are not certain how to construct and install
robots.txt, it is better to omit the file until you can do the job
properly. Without robots.txt on your server, each request for the file
fails, which adds a slight amount of server overhead, but this does not
usually have any adverse effect on the performance of your website. The
bots and web crawlers employed by the major search engines are
intelligent enough to function without robots.txt, and will not abandon
your website if it is not there.

Well-behaved web crawlers and bots will follow the instructions in your
robots.txt file. However, do not rely upon robots.txt to screen out bad
or malicious bots such as email harvesters or web strippers. Such web
denizens are engineered to ignore robots.txt and gain access to your
website by any means at their disposal (depending upon the
sophistication and tenacity of their programming). You must deny these
bots access to your website through techniques such as directory
privilege restrictions, password protection, .htaccess files,
modification of the main server configuration file (if you have root
access on the server or are a friend of the system administrator),
programming server side includes (SSI), or by relying upon the good
graces and expertise of your web hosting service.
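
To make the idea of denying access on the server side concrete, here is a minimal sketch written as Python WSGI middleware. This is a substitute technique chosen for illustration, not one of the methods listed above, and it assumes your site runs behind a WSGI application. The names block_bad_bots, is_blocked and BLOCKED_AGENT_FRAGMENTS, along with the bot names in the list, are hypothetical. Keep in mind that a determined bot can fake its User-Agent header, so a filter like this only stops bots that identify themselves honestly.

```python
# Hypothetical User-Agent fragments to refuse; replace with names
# taken from your own server logs.
BLOCKED_AGENT_FRAGMENTS = ("emailharvester", "sitestripper")


def is_blocked(user_agent):
    """Return True if the User-Agent contains a blocked fragment."""
    agent = (user_agent or "").lower()
    return any(fragment in agent for fragment in BLOCKED_AGENT_FRAGMENTS)


def block_bad_bots(app):
    """Wrap a WSGI application so blocked bots receive 403 Forbidden."""
    def middleware(environ, start_response):
        if is_blocked(environ.get("HTTP_USER_AGENT", "")):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        # Everyone else is passed through to the real application.
        return app(environ, start_response)
    return middleware
```

Unlike robots.txt, which a bot is free to ignore, this check is enforced by the server itself on every request.
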

If good bots are capable of performing without robots.txt and bad bots
intentionally ignore it, why bother with it at all? The answer is that
by carefully employing robots.txt, you can regulate which directories
good bots may view, restrict their access to your website completely if
desired, and, for bots that honor a crawl-delay instruction, throttle
the rate at which they spider your website to avoid possible bandwidth
bottlenecks. Furthermore, while lack of a robots.txt file will not
prevent good bots from visiting your website, a friendly bot is much
more likely to revisit frequently if it does not encounter problems
during its visits.
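
The three uses described above (limiting directories, shutting a bot out completely, and throttling) can be sketched with a small sample file checked by the robots.txt parser in Python's standard library. The bot names ExampleBot and FriendlyBot, the directory names and the example.com addresses are all made up for illustration. Crawl-delay is not honored by every crawler, and the crawl_delay method assumes Python 3.6 or later.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt content; every name here is an assumption.
SAMPLE_ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Crawl-delay: 10
Disallow: /private/
Disallow: /cgi-bin/
"""


def load_rules(text):
    """Parse robots.txt text and return the parser holding its rules."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules(SAMPLE_ROBOTS_TXT)

# A bot with no record of its own falls under the "*" record.
print(rules.can_fetch("FriendlyBot", "https://www.example.com/index.html"))
# prints True
print(rules.can_fetch("FriendlyBot", "https://www.example.com/private/report.html"))
# prints False

# ExampleBot is shut out of the whole site by its own record.
print(rules.can_fetch("ExampleBot", "https://www.example.com/index.html"))
# prints False

# Requested pause, in seconds, between visits for bots using "*".
print(rules.crawl_delay("FriendlyBot"))
# prints 10
```

These checks only show how a cooperative crawler would read the file. Nothing in the file forces a bot to obey it.
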
Authored by Kenneth L. Anderson.
Original article published 3 January 2006.