Tuesday, 11 September 2007

Robots.txt File

A robots.txt file is used to keep web spiders and web robots out of specific parts of a web site. The robots exclusion standard, also called the robots.txt protocol, is a very popular way to hide part of a web site from search engines. You can block a particular folder or file for a particular search engine robot. The file is a request rather than a lock, so it only works with robots that choose to obey it.

To enable this protocol, create a file named robots.txt and upload it to the top-level folder of your site. It is a plain text file, so you can create it with Windows Notepad or any other text editor. First specify the user agent, then the directories and files you want to hide. For example, this hides all files and folders from all robots:

User-agent: *
Disallow: /

To prevent indexing of the cgi-bin and images directories, use this:

User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
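
One way to confirm that rules like these behave as intended is to parse them with Python's standard `urllib.robotparser` module. The sketch below is only an illustration: the `is_allowed` helper, the robot name `AnyBot`, and the `example.com` URLs are made up for this post, and real crawlers may treat edge cases differently from Python's parser.

```python
from urllib.robotparser import RobotFileParser

# The same rules as in the example above.
RULES = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
"""


def is_allowed(rules: str, agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    # Hypothetical paths, used only to exercise the rules.
    for path in ("/index.html", "/cgi-bin/form.pl", "/images/logo.png"):
        print(path, is_allowed(RULES, "AnyBot", "http://example.com" + path))
```

Passing the text `User-agent: *` followed by `Disallow: /` to the same helper should report every path on the site as blocked.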

If you want to stop a specific robot, use the robot's name instead of the wildcard. To stop Google's robot (Googlebot) from accessing myfolder, use this:

User-agent: Googlebot
Disallow: /myfolder/

You can also specify file names instead of folders.

User-agent: *
Disallow: /search.php
Disallow: /download.html

Be very careful if you have directives for all robots and separate directives for one particular robot. For example, Googlebot ignores the general directives if there is a section specifically for Googlebot.
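
Python's `urllib.robotparser` follows the same convention: a group that names the robot is used in preference to the `*` group. The sketch below illustrates the pitfall under that parser. The `/private/` folder, the robot name `OtherBot`, the `allowed` helper, and the `example.com` host are assumptions invented for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file: one group for every robot, one just for Googlebot.
RULES = """\
User-agent: *
Disallow: /private/

User-agent: Googlebot
Disallow: /myfolder/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())


def allowed(agent: str, path: str) -> bool:
    """Check a path on a placeholder host for the given robot name."""
    return parser.can_fetch(agent, "http://example.com" + path)


if __name__ == "__main__":
    # Googlebot matches its own group, so the general /private/ rule
    # no longer applies to it.
    print(allowed("Googlebot", "/private/page.html"))   # True
    print(allowed("Googlebot", "/myfolder/page.html"))  # False
    # A robot without its own group falls back to the * group.
    print(allowed("OtherBot", "/private/page.html"))    # False
```

If /private/ should stay hidden from Googlebot as well, the rule has to be repeated inside the Googlebot section.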

To prevent Yahoo's search robot from accessing a section, use:

User-agent: Slurp
Disallow: /search-engines/

For MSN Search, the robot name is msnbot.

Normally a web robot visits the site, collects information, and stores it in its internal database, usually called the search index. To keep unwanted or secret files out of that index, use a robots.txt file.

Google's URL removal tool also uses robots.txt to remove unwanted URLs. With this tool and a robots.txt file, you can request that URLs be removed from Google's index.

The robots.txt protocol also has other directives, such as Crawl-delay, Visit-time, and Request-rate, but not all robots accept them. For example, Googlebot does not obey the Crawl-delay directive.
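
For robots that do honor these extensions, the values can be read back with Python's `urllib.robotparser` (Python 3.6 or later), which understands Crawl-delay and Request-rate but not Visit-time. This is a minimal sketch: the delay of 10 seconds and the rate of 1 request per 5 seconds are made-up values, not recommendations.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for Yahoo's robot; the numbers are illustrative only.
RULES = """\
User-agent: Slurp
Crawl-delay: 10
Request-rate: 1/5
Disallow: /search-engines/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# Seconds to wait between requests, as declared for this robot.
delay = parser.crawl_delay("Slurp")
# Named tuple with .requests and .seconds fields.
rate = parser.request_rate("Slurp")

if __name__ == "__main__":
    print("crawl delay:", delay)
    print("request rate:", rate.requests, "per", rate.seconds, "seconds")
```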

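It is also possible to stop a page from being indexed with a meta tag. Put a robots meta tag with the value noindex, for example <meta name="robots" content="noindex">, inside the head section of each page you want to hide from search engines.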
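
The tag in question is the robots meta tag, placed in the page's head with a content value such as noindex. As a rough way to check that a page really carries it, the sketch below scans HTML with Python's standard `html.parser`. The `RobotsMetaFinder` class, the `is_noindex` helper, and the sample `PAGE` are hypothetical names and data for this example only.

```python
from html.parser import HTMLParser


class RobotsMetaFinder(HTMLParser):
    """Collect the directives of any <meta name="robots"> tag in a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            # The content value is a comma-separated list, e.g. "noindex, nofollow".
            for part in content.split(","):
                if part.strip():
                    self.directives.add(part.strip().lower())


def is_noindex(html: str) -> bool:
    """Return True if the page asks search engines not to index it."""
    finder = RobotsMetaFinder()
    finder.feed(html)
    return "noindex" in finder.directives


# Sample page carrying the tag (made-up content).
PAGE = (
    '<html><head><meta name="robots" content="noindex, nofollow"></head>'
    "<body>Hidden page</body></html>"
)

if __name__ == "__main__":
    print(is_noindex(PAGE))
```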
