Proposal for Host Canonization via Robots.txt
This is a proposal to indicate a preferred host name (e.g. domain with or
without "www") for search engine robots by adding a "Canonize-host" entry to
the robots.txt file.
For example:
User-agent: *
Canonize-host: www.example.com
or
User-agent: *
Canonize-host: example.com
or
User-agent: *
Canonize-host: 10.20.30.40
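As a rough sketch of how a robot might read such an entry, the Python function below scans robots.txt text and returns the "Canonize-host" value that applies to a given robot. The function name find_canonize_host and the exact-name matching of User-agent values are assumptions made for this illustration; they are not part of the proposal.

```python
from typing import Optional


def find_canonize_host(robots_txt: str, user_agent: str = "*") -> Optional[str]:
    """Return the Canonize-host value that applies to user_agent, if any.

    Hypothetical helper: a record naming the robot exactly wins over the
    "*" record. Real robots may match User-agent values more loosely.
    """
    wanted = user_agent.lower()
    current_agents = []      # agents named by the record being read
    in_agent_lines = False   # True while reading consecutive User-agent lines
    wildcard_host = None     # first value found under "User-agent: *"

    for raw_line in robots_txt.splitlines():
        line = raw_line.split("#", 1)[0].strip()   # drop trailing comments
        if not line or ":" not in line:
            continue
        field, value = line.split(":", 1)
        field, value = field.strip().lower(), value.strip()

        if field == "user-agent":
            if not in_agent_lines:
                current_agents = []   # a new record starts here
            current_agents.append(value.lower())
            in_agent_lines = True
            continue

        in_agent_lines = False
        if field == "canonize-host" and value:
            if wanted != "*" and wanted in current_agents:
                return value          # record addressed to this robot
            if "*" in current_agents and wildcard_host is None:
                wildcard_host = value

    return wildcard_host
```

Called on the first example above, find_canonize_host would return "www.example.com"; on a file without the entry it returns None, so a robot can fall back to its current behavior.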
Rationale
It has always been common practice to make a web site accessible both
with and without a "www" host name. This remains the way sites are almost
always configured by default by an ISP under managed hosting plans. While
this may help usability (both www.example.com and its shorter form,
example.com, work when typed in a browser's address field), it causes
several problems as soon as different URLs pointing to the same site are
published on the web, regardless of the site maintainer's preference for
one specific form.
Known issues include:
- Discrepancy between search engine result URLs and URL preference of
site maintainers
- Inconsistencies within search engine results (some pages on a site
listed with "www", others without)
- The same site is listed with a different "page rank" depending on
whether it is reached with or without "www"
- Failed matches between search engine results and categorization
schemes
- Difficulty for a search engine in accurately determining whether
different URLs pointing to the same IP address (e.g. HTTP 1.1 virtual
hosts) are actually meant to point to the same web site or not (after
all, the content itself can change between crawls)
- No clarity about the number of actual sites indexed by search engines
(are different uncanonized URLs pointing to the same web site counted
multiple times?)
Solutions to this are limited in part because:
- Inconsistent incoming links (e.g. with and without "www") are not
under the control of a site's maintainer
- While HTTP redirects could be used to express a preference (e.g. by
permanently redirecting accesses to example.com to www.example.com, or
vice versa), not all managed hosting providers give the customer access
to such configuration options
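For maintainers who do have that level of access, the redirect approach might look like the minimal WSGI sketch below. PREFERRED_HOST and the function name canonical_redirect_app are hypothetical values chosen for the example, and a real deployment would more likely set this up in the web server configuration.

```python
PREFERRED_HOST = "www.example.com"  # assumption: the maintainer's preferred form


def canonical_redirect_app(environ, start_response):
    """Answer with a 301 to the preferred host, or serve a placeholder page."""
    # The Host header may carry a port ("example.com:80"); compare names only.
    host = environ.get("HTTP_HOST", "").split(":")[0].lower()

    if host != PREFERRED_HOST:
        scheme = environ.get("wsgi.url_scheme", "http")
        path = environ.get("PATH_INFO", "") or "/"
        query = environ.get("QUERY_STRING", "")
        location = scheme + "://" + PREFERRED_HOST + path
        if query:
            location += "?" + query
        start_response("301 Moved Permanently",
                       [("Location", location), ("Content-Length", "0")])
        return [b""]

    body = b"Served from the preferred host\n"
    start_response("200 OK", [("Content-Type", "text/plain"),
                              ("Content-Length", str(len(body)))])
    return [body]
```

The sketch only makes the point of the item above concrete: the redirect requires control over how requests are answered, which a robots.txt entry does not.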
Resorting to robots.txt to solve this problem comes naturally for several
reasons:
- Robots.txt provides a method "for encoding instructions to
visiting robots"
- Robots.txt is widely supported by robots
- Robots.txt is always accessible to a site's maintainer
- Robots.txt is already site-centered (one robots.txt per site)
- Martijn Koster's "A Method for Web Robots Control" Internet Draft
allows for extensions to the robots.txt format
("extension = token : *space value [comment] CRLF")
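One practical consequence of the extension rule is that a parser which does not know a token can simply skip it. As an illustration only, Python's standard urllib.robotparser module ignores fields it does not recognize, so a file carrying the proposed entry still yields the usual Disallow decisions. The Disallow line and the URLs below are assumptions added for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content mixing the proposed entry with a
# conventional Disallow rule.
ROBOTS_LINES = [
    "User-agent: *",
    "Canonize-host: www.example.com",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(ROBOTS_LINES)  # the unknown Canonize-host field is skipped

# The existing access rules are unaffected by the extra line.
private_allowed = parser.can_fetch("*", "http://www.example.com/private/page.html")
public_allowed = parser.can_fetch("*", "http://www.example.com/index.html")
```

Here private_allowed is False and public_allowed is True, the same result the file would give without the extra line. Other parsers may behave differently.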
While this discussion centers on the presence or lack of the "www" host
name, which is a very practical and frequent issue, the aim is to propose a
flexible solution that can be applied to other situations as well.
Conclusion
In consideration of the above, this document proposes an extension token
named "Canonize-host", which allows the maintainer of a web site to
indicate the preferred host name that robots should use to access and
index the site.
More specifically:
- Robots should interpret and follow this preference in the same way
as they would process a permanent HTTP redirect (status 301)
- Search engines and web categorization systems ("directories") should
consider the preference as a request to update their host name records,
if required
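The two points above could be sketched as follows. The sketch assumes a robot keeps a mapping from each host name on which it found a "Canonize-host" entry to the preferred value, and rewrites URLs through that mapping much as it would after a permanent redirect. The names apply_host_preference and host_map are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def apply_host_preference(url: str, host_map: dict) -> str:
    """Rewrite url to the preferred host recorded in host_map, if any.

    host_map maps a lower-case host name to the Canonize-host value read
    from that host's robots.txt (an assumed data structure). Any user
    name or password embedded in the URL is dropped on rewrite.
    """
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    preferred = host_map.get(host)
    if preferred is None or preferred.lower() == host:
        return url  # no preference known, or already in the preferred form

    netloc = preferred
    if parts.port is not None:
        netloc = f"{preferred}:{parts.port}"   # keep an explicit port
    return urlunsplit((parts.scheme, netloc, parts.path,
                       parts.query, parts.fragment))


# Example: both spellings collapse to one record in an index.
host_map = {"example.com": "www.example.com"}
seen = {
    apply_host_preference("http://example.com/page.html", host_map),
    apply_host_preference("http://www.example.com/page.html", host_map),
}
```

After this runs, seen holds the single URL http://www.example.com/page.html, which is how a search engine or directory could avoid listing the same page under two host names.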
Feedback
Any feedback is most certainly appreciated.
Michael C. Battilana