Having recently discovered a client using supposed ‘regular expressions’ (really wildcard matching) in their robots.txt file to meet my SEO recommendations on duplicate file exclusion, I simply had to put something down in writing about this. It seems Google is advising webmasters, via Google's tools, to use wildcard matching with asterisks to exclude files by URL variable, e.g. Disallow: *ref=68*. Whilst this is a very neat way of excluding files from Google (and MSN, it seems), it is still not an accepted method under the official robots.txt protocol and will leave a webmaster with problems on other search engines.
I know Google doesn’t like to admit that other search engines still exist, but they do, so this solution can only be applied under a specific user agent in robots.txt. The remaining spiders and crawlers will still need the usual standard ‘exclusion by folder’ method, which means you need to find a more suitable place to store files that shouldn’t be indexed regardless.
Presumably a robots.txt file that excludes via both methods to cover all search engines should list the specific user agents first, before the catch-all, so that global instructions don't override the specific commands:
User-Agent: Googlebot
Disallow: *ref=68*

User-Agent: *
Disallow: /references/
In my opinion, I wouldn’t bother with the first entry at all if I had to solve the problem for the standard robots.txt exclusion protocol anyway. The standard's own documentation is quite clear on the point:
Note also that globbing and regular expression are not supported in either the User-agent or Disallow lines. The ‘*’ in the User-agent field is a special value meaning “any robot”. Specifically, you cannot have lines like “User-agent: *bot*”, “Disallow: /tmp/*” or “Disallow: *.gif”.
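To make the difference concrete, here is a rough Python sketch of the two matching styles discussed above. It is my own simplification, not any crawler's actual code: the function names and the example URL are made up for illustration, and the wildcard version only understands ‘*’.

```python
import re


def prefix_match(rule: str, url_path: str) -> bool:
    """Standard-protocol style: the rule is a literal prefix of the URL path.

    A '*' in the rule has no special meaning here; it is just a character.
    """
    return url_path.startswith(rule)


def wildcard_match(rule: str, url_path: str) -> bool:
    """Wildcard style (assumed, simplified): '*' matches any run of characters.

    The pattern is still anchored at the start of the URL path.
    """
    # Escape each literal chunk, then join the chunks with '.*'
    pattern = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.match(pattern, url_path) is not None


# Hypothetical duplicate-content URL carrying the 'ref' variable
url = "/article.php?id=12&ref=68"

print(prefix_match("*ref=68*", url))    # False: a standard parser sees no match
print(wildcard_match("*ref=68*", url))  # True: a wildcard-aware crawler blocks it
print(prefix_match("/references/", "/references/page.html"))  # True
```

A crawler that only does prefix matching never excludes the ref=68 URL, which is why the folder-based rule is still needed for everyone else.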
Related Links
Google Support on Pattern Matching
Robots.txt
Copyright © Matt Peskett 2007.