In my last article you read about How to Use Google Analytics to Check My Blog Users and Visitors. That means you have already added the Google Analytics script to your blog or website and have started checking user activity on it. Next comes how to instruct search engine robots and crawlers to index the pages of your website or blog. There may be a few pages which you don’t want to show publicly. Search engine robots and crawlers come back to your website or blog every few days to check whether there are new pages or content to index. They do so because they have to keep their databases up to date, so that they can show current information when someone searches the internet. The robots.txt and sitemap.xml files help search engine robots and crawlers navigate your website or blog easily and get your content indexed.

The robots.txt file is a set of instructions which tells search engine robots and crawlers which pages they may crawl and index and which they may not. For example, if you have paid content which you don’t want people to access unless they pay for it, you can instruct crawlers not to read and index those pages. Similarly, if there is a login area which you don’t want indexed and shown in search engines, you can restrict it with the help of robots.txt. You can even disallow media files from being indexed. Keep in mind that robots.txt is only a request which well-behaved crawlers follow; it does not block access by itself, so truly private content still needs a login or a paywall. The robots.txt file needs to be placed in the root of your website.

A robots.txt file looks like this:

# /robots.txt file for 
# mail for constructive criticism

User-agent: webcrawler
Disallow:

User-agent: lycra
Disallow: /

User-agent: *
Disallow: /tmp
Disallow: /logs

The first two lines in the robots.txt content above are comments, i.e. lines starting with #. The next group is an instruction for the crawler named “webcrawler”: nothing is disallowed for it, so it can index everything on the website. The next group tells the crawler named “lycra” that it cannot index anything on the website. The last group applies to all the remaining crawlers, i.e. User-agent: *, and disallows everything under “/tmp” and “/logs”, while the rest of the website can still be indexed.
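
If you want to check how a crawler would read these rules, here is a minimal sketch using Python's standard urllib.robotparser module. It assumes the “webcrawler” group carries an empty Disallow: line, as the explanation above describes. The crawler name “examplebot” and the page paths are made up for illustration only.

```python
from urllib.robotparser import RobotFileParser

# The rules from the example above, without the comment lines.
ROBOTS_TXT = """\
User-agent: webcrawler
Disallow:

User-agent: lycra
Disallow: /

User-agent: *
Disallow: /tmp
Disallow: /logs
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text held in memory (no network request is made)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# "examplebot" and the paths below are hypothetical, chosen only for this demo.
checks = [
    ("webcrawler", "/tmp/cache.html"),
    ("lycra", "/index.html"),
    ("examplebot", "/tmp/cache.html"),
    ("examplebot", "/logs/access.log"),
    ("examplebot", "/index.html"),
]

for agent, path in checks:
    verdict = "allowed" if parser.can_fetch(agent, path) else "blocked"
    print(agent, path, verdict)
```

With these rules, “webcrawler” is allowed everywhere, “lycra” is blocked everywhere, and any other crawler is blocked only for paths starting with /tmp or /logs. To test your live file instead, RobotFileParser also offers set_url() and read() to download it from your website.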

Sitemap.xml is an XML file which lists the URLs of your website or blog that you want search engines to know about. By reading this list, search engine crawlers and robots can navigate to these URLs and read their content. Besides this, we can add additional information about each page in sitemap.xml, like how often the content of a URL changes, when the page was last updated, and what its crawling priority is compared to your other pages. These days you can easily create a sitemap.xml file for your website online and place it in the root of your website. If you are using WordPress for your website or blog, there are many plugins available which will automatically keep the sitemap.xml file updated for you.
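
To give an idea of what such a file contains, here is a small sketch in Python that builds a sitemap with the standard xml.etree.ElementTree module. The tag names (urlset, url, loc, lastmod, changefreq, priority) and the namespace come from the sitemaps.org protocol; the example.com URLs, the dates and the helper name build_sitemap are assumptions made up for this illustration, not something a particular tool or plugin produces.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
# Optional per-page hints; only <loc> is required for each URL.
OPTIONAL_TAGS = ("lastmod", "changefreq", "priority")


def build_sitemap(pages):
    """Return sitemap XML as text for a list of page dictionaries."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for page in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = page["loc"]
        for tag in OPTIONAL_TAGS:
            if tag in page:
                ET.SubElement(url, tag).text = str(page[tag])
    xml_bytes = ET.tostring(urlset, encoding="utf-8", xml_declaration=True)
    return xml_bytes.decode("utf-8")


# Hypothetical pages, used only to show the shape of the output.
pages = [
    {
        "loc": "https://example.com/",
        "lastmod": "2024-01-15",
        "changefreq": "daily",
        "priority": "1.0",
    },
    {"loc": "https://example.com/about", "changefreq": "monthly"},
]

print(build_sitemap(pages))
```

Saving the printed text as sitemap.xml in the root of the website gives crawlers the list of URLs along with the optional hints. Crawlers treat these hints as suggestions, so they may still choose their own crawl schedule.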

Both robots.txt and sitemap.xml are very important parts of any website or blog, because both help search engine robots and crawlers do their jobs easily. We should add these files to our website or blog for better search engine indexing. 🙂 🙂 🙂