Friday, January 26, 2007
I'm often asked about how Google and search engines work. One key question is: how does Google know what parts of a website the site owner wants to have show up in search results? Can publishers specify that some parts of the site should be private and non-searchable? The good news is that those who publish on the web have a lot of control over which pages should appear in search results.
The key is a simple file called robots.txt that has been an industry standard for many years. It lets a site owner control how search engines access their web site. With robots.txt you can control access at multiple levels -- the entire site, through individual directories, pages of a specific type, down to individual pages. Effective use of robots.txt gives you a lot of control over how your site is searched, but it's not always obvious how to achieve exactly what you want. This is the first of a series of posts on how to use robots.txt to control access to your content.
What does robots.txt do?
The web is big. Really big. You just won't believe how vastly hugely mind-bogglingly big it is. I mean, you might think it's a lot of work maintaining your website, but that's just peanuts to the whole web. (with profound apologies to Douglas Adams)
Search engines like Google read through all this information and create an index of it. The index allows a search engine to take a query from users and show all the pages on the web that match it.
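To make the idea of an index concrete, here is a toy sketch in Python. It is only an illustration of the general idea, not a description of how Google's index is built: the function names `build_index` and `search`, and the sample pages, are made up for this example.

```python
from collections import defaultdict


def build_index(pages: dict[str, str]) -> dict[str, set[str]]:
    """Map each lowercase word to the set of page URLs that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Return the pages that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    # Start with the pages for the first word, then narrow down.
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


if __name__ == "__main__":
    # Hypothetical pages, used only for illustration.
    pages = {
        "/robots-guide.html": "a guide to robots txt",
        "/search-guide.html": "a guide to search engines",
    }
    index = build_index(pages)
    print(search(index, "robots guide"))
```

A page that is never read by the crawler never makes it into a structure like this, which is why controlling crawler access controls what can show up in results.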
To do this, Google has a set of computers that continually crawl the web. They have a list of all the websites that Google knows about, and they read all the pages on each of those sites. Together these machines are known as Googlebot. In general, you want Googlebot to access your site so your web pages can be found by people searching on Google.
However, you may have a few pages on your site you don't want in Google's index. For example, you might have a directory that contains internal logs, or you may have news articles that require payment to access. You can exclude pages from Google's crawler by creating a text file called robots.txt and placing it in the root directory of your site. The robots.txt file contains a list of the pages that search engines shouldn't access. Creating a robots.txt file is straightforward, and it gives you a sophisticated level of control over how search engines can access your web site.
In addition to the robots.txt file -- which allows you to concisely specify instructions for a large number of files on your web site -- you can use the robots META tag for fine-grain control over individual pages on your site. To implement this, simply add specific META tags to HTML pages to control how each individual page is indexed. Together, robots.txt and META tags give you the flexibility to express complex access policies relatively easily.
A simple example
Here is a simple example of a robots.txt file.
User-Agent: Googlebot
Disallow: /logs/
The User-Agent line specifies that the next section is a set of instructions just for the Googlebot. All the major search engines read and obey the instructions you put in robots.txt, and you can specify different rules for different search engines if you want to. The Disallow line tells Googlebot not to access files in the logs sub-directory of your site. The contents of the pages you put into the logs directory will not show up in Google search results.
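If you want to check how rules like these are interpreted, Python's standard library includes a robots.txt parser. The sketch below feeds it the two-line example (a Googlebot section that disallows /logs/) and asks whether a given URL may be fetched. The helper name `is_allowed`, the host www.example.com, the file paths, and the crawler name "OtherBot" are all made up for illustration, and the standard-library parser is not necessarily identical to how any particular search engine reads the file.

```python
from urllib.robotparser import RobotFileParser

# The example rules from this post, as they would appear in robots.txt.
ROBOTS_TXT = """\
User-Agent: Googlebot
Disallow: /logs/
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    site = "http://www.example.com"  # placeholder host
    for agent in ("Googlebot", "OtherBot"):  # "OtherBot" is hypothetical
        for path in ("/logs/2007-01-26.txt", "/index.html"):
            print(agent, path, is_allowed(agent, site + path))
```

Because the section is addressed only to Googlebot, a crawler with a different name is not affected by the Disallow line in this sketch. To address every crawler, a section can use `User-Agent: *` instead.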
Preventing access to a file
If you have a news article on your site that is only accessible by registered users, you'll want it excluded from Google's results. To do this, simply add a META tag to the HTML file, so it starts something like:
<html>
<meta name="googlebot" content="noindex">
This stops Google from indexing this file. META tags are particularly useful if you have permission to edit the individual files but not the site-wide robots.txt. They also allow you to specify complex access-control policies on a page-by-page basis.
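To see how a crawler might act on such a tag, here is a minimal Python sketch that scans a page for robots META tags and reports whether indexing is permitted. It is only an illustration under stated assumptions, not Googlebot's actual logic: the class `RobotsMetaParser` and the function `may_index` are made-up names, and the sketch treats a tag named "robots" as addressed to every crawler.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from META tags addressed to a given crawler."""

    def __init__(self, crawler_name: str = "googlebot"):
        super().__init__()
        # Assumption: a tag named "robots" applies to every crawler.
        self.names = {"robots", crawler_name.lower()}
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None when an attribute has no value.
        attr = {key: (value or "") for key, value in attrs}
        if attr.get("name", "").lower() in self.names:
            for directive in attr.get("content", "").split(","):
                self.directives.add(directive.strip().lower())


def may_index(html: str, crawler_name: str = "googlebot") -> bool:
    """Return False if the page carries a noindex directive for this crawler."""
    parser = RobotsMetaParser(crawler_name)
    parser.feed(html)
    return "noindex" not in parser.directives


if __name__ == "__main__":
    page = '<html>\n<meta name="googlebot" content="noindex">\n<body>Members only</body></html>'
    print(may_index(page))
```

Unlike robots.txt, this check can only happen after the page has been fetched, since the instruction lives inside the page itself.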
You can find out more about robots.txt at http://www.robotstxt.org and at Google's Webmaster help center, which contains lots of helpful information, including:
- How to create a robots.txt file
- Descriptions of each user-agent that Google uses
- How to use pattern matching
- How often we recrawl your robots.txt file
There is also a useful list of the bots used by the major search engines: http://www.robotstxt.org/wc
Coming soon: a post detailing the use of robots and META tags, and another on specific examples for common cases.
Update: Added a sentence to paragraph 9 on access-control policies.