Firetop

Matt Peskett ‘at work and at play’

September 4, 2008

Robots.txt Doesn’t Support Regular Expression

by @ 12:13 pm. Blogged under Web Technology, Search Optimisation

Having recently discovered a client using supposed ‘regular expressions’ (really wildcard matching) in their robots.txt file to meet my SEO recommendations on duplicate file exclusion, I simply had to put something down in writing. It seems Google is advising webmasters, via Google's tools, to use wildcard matching with asterisks to exclude files by URL variable, e.g. Disallow: *ref=68*. Whilst this is a very convenient way to exclude files from Google (and, it seems, MSN), it is still not part of the official robots.txt protocol and will leave a webmaster with problems on other search engines.

I know Google doesn’t like to admit that other search engines still exist, but they do. This solution can therefore only apply to a specific user agent record in robots.txt; the remaining spiders and crawlers still need the standard ‘exclusion by folder’ method. That means you need to find a more suitable way of storing files which shouldn’t be indexed regardless.
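To make the difference concrete, here is a minimal Python sketch of the two matching styles. The function names and the sample URL are my own inventions for illustration. The wildcard matcher only approximates the behaviour Google documents (‘*’ matches any run of characters, a trailing ‘$’ anchors the end of the URL); it is not Google's actual implementation.

```python
import re


def prefix_match(rule: str, path: str) -> bool:
    """Original protocol: the Disallow value is a literal path prefix."""
    return path.startswith(rule)


def wildcard_match(rule: str, path: str) -> bool:
    """Approximation of the wildcard extension: '*' matches any run of
    characters and a trailing '$' anchors the end of the URL."""
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape the literal pieces and join them with '.*' for each asterisk.
    pattern = ".*".join(re.escape(part) for part in rule.split("*"))
    if anchored:
        pattern += "$"
    return re.match(pattern, path) is not None


url = "/article.php?id=12&ref=68"  # hypothetical URL carrying the ref variable

print(prefix_match("*ref=68*", url))    # False: no path starts with a literal '*'
print(wildcard_match("*ref=68*", url))  # True: the asterisks absorb the rest
print(prefix_match("/references/", "/references/list.html"))  # True
```

A crawler that only does prefix matching treats the asterisks as ordinary characters, so the wildcard rule never excludes anything for it.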

Presumably a robots.txt file that excludes via both methods, to cover all search engines, should list the specific user agents before the catch-all so that global instructions do not override the specific ones:

User-Agent: Googlebot
Disallow: *ref=68*

User-Agent: *
Disallow: /references/
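As a rough check of how this file is read, the sketch below feeds it to Python's standard-library urllib.robotparser. Treating that parser as a stand-in for a real crawler is an assumption on my part, and ‘OtherBot’ and the paths are made-up names for illustration.

```python
from urllib.robotparser import RobotFileParser

# The example file from above, verbatim.
ROBOTS_TXT = """\
User-Agent: Googlebot
Disallow: *ref=68*

User-Agent: *
Disallow: /references/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# "OtherBot" is a hypothetical crawler with no record of its own,
# so it falls through to the catch-all record.
print(parser.can_fetch("OtherBot", "/references/list.html"))  # False
print(parser.can_fetch("OtherBot", "/index.html"))            # True

# Googlebot has its own record, so the catch-all folder rule
# does not apply to it.
print(parser.can_fetch("Googlebot", "/references/list.html"))  # True
```

Note the last result: once a crawler has its own record, it no longer inherits the catch-all rules, so anything Googlebot should also stay out of has to be repeated in its record.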

In my opinion, I wouldn’t bother with the first entry if I had to solve the problem for the standard robots.txt exclusion protocol anyway. As the robots.txt documentation notes:

Note also that globbing and regular expression are not supported in either the User-agent or Disallow lines. The ‘*’ in the User-agent field is a special value meaning “any robot”. Specifically, you cannot have lines like “User-agent: *bot*”, “Disallow: /tmp/*” or “Disallow: *.gif”.
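If you want to audit a client's file for this, a small checker along these lines could flag the offending lines. It is only a sketch: the function name is my own, and it looks solely at asterisks in User-agent and Disallow values, which is the case described in the quote above.

```python
def nonstandard_lines(robots_txt: str) -> list[tuple[int, str]]:
    """Return (line number, line) pairs that rely on pattern matching."""
    flagged = []
    for number, raw in enumerate(robots_txt.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        field, sep, value = line.partition(":")
        if not sep:
            continue  # blank or malformed line, nothing to check
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent" and "*" in value and value != "*":
            # A lone '*' is the special "any robot" value and is fine.
            flagged.append((number, raw))
        elif field == "disallow" and "*" in value:
            flagged.append((number, raw))
    return flagged


sample = """\
User-agent: *
Disallow: /tmp/
User-agent: *bot*
Disallow: /tmp/*
Disallow: *.gif
"""

for number, line in nonstandard_lines(sample):
    print(number, line)
```

Run against the sample, it reports lines 3 to 5 and leaves the plain ‘User-agent: *’ and ‘Disallow: /tmp/’ lines alone.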

Related Links
Google Support on Pattern Matching
Robots.txt

(Powered by WordPress) Copyright © Matt Peskett 2007.
Registered Firetop Ltd Office - 27 Old Gloucester Street, London, WC1N 3XX. Company No: 4854392 - VAT: 821 4717 45.
