Robots.txt Error
Been a while since I’ve had time to post anything. I do have some interesting observations regarding my move from www.belchfire.net to themes.belchfire.net as it relates to search engine listings. But that’s not what I want to talk about just now.
Everything was going swimmingly until last month. Then, for some inexplicable reason, all search engines but Google and the new beta MSN search dropped all my pages but one. At first I thought it was due to removing the mod_rewrite redirect rules I had originally put in place when I changed URLs (I have since removed them and now run both sites in tandem). Turns out I had an error in the robots.txt file on both sites that basically told all search engines not to index anything. Badness…
Fixed the error and re-submitted to the major search engines, and it looks like they’re indexing pages again now. Whew. I was getting worried. Being indexed by only one search engine, even if it is Google, leaves one at risk of losing considerable traffic if anything changes with their systems or algorithms. So take note: check your logs (grep by User-Agent) and ensure your robots.txt file, if you have/use one, is formatted correctly. Read the Web Robots Pages for proper rule formatting. I also recommend the Robots.txt Validator at http://www.sxw.org.uk/computing/robots/check.html for error checking.
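For the log check, a small Python sketch can do the same job as a grep by User-Agent. It assumes the Apache "combined" log format, where the User-Agent is the last quoted field on each line; the sample lines and the function name count_user_agents are made up for illustration.

```python
import re
from collections import Counter

# In the Apache "combined" log format the User-Agent is the last quoted field.
UA_PATTERN = re.compile(r'"([^"]*)"\s*$')


def count_user_agents(lines):
    """Return a Counter mapping each User-Agent string to its request count."""
    counts = Counter()
    for line in lines:
        match = UA_PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Hypothetical sample lines; in practice, iterate over open("access.log").
sample_log = [
    '192.0.2.1 - - [10/Jan/2005:10:00:00 +0000] "GET /robots.txt HTTP/1.1" 200 120 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"',
    '192.0.2.2 - - [10/Jan/2005:10:00:05 +0000] "GET /index.php HTTP/1.1" 200 5120 "-" "msnbot/0.3"',
    '192.0.2.1 - - [10/Jan/2005:10:01:00 +0000] "GET /index.php HTTP/1.1" 200 5120 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"',
]

# Print the busiest User-Agents first.
for agent, hits in count_user_agents(sample_log).most_common():
    print(hits, agent)
```

If a spider you expect to see is missing from that list for weeks, that is a hint something is keeping it out.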
On a side note, you don’t need a robots.txt file. Generally you would want one for exclusions: defining pages or directories on your site that you do not wish search engine spiders to index. An interesting observation is that Google and the new beta MSN search would seem to ignore the robots.txt file, or at least may ignore it if they find formatting errors. A robots.txt file is a plain-text file you put in the root of your web folder (world-readable, usually chmod 644) that contains rules such as:
User-agent: *
Disallow: admin.php
Disallow: /admin/
Disallow: /private/

User-agent: [Ww]eb[Bb]andit
Disallow: /

User-agent: Alexibot
Disallow: /
User-agent: * means ‘all user-agents’. Disallow indicates that the User-agent specified above is not to index, in this example, any file named admin.php and the entire directories /admin/ and /private/. The other two User-agent rules name specific User-agents and tell them not to index any files (Disallow: /). Repeat as desired.

The format of your robots.txt file should adhere to at least the 1994 Standard for Robots Exclusion. Note that under that standard the field names (User-agent and Disallow) are case-insensitive, and a case-insensitive match on the User-agent value is recommended, so it is valid to enter Googlebot or googlebot. Disallow paths, being URL paths, are case-sensitive. The standard also does not support globbing or regular expressions in either field, so a pattern such as [Ww]eb[Bb]andit is taken literally, and a Disallow value is matched as a prefix of the URL path, so it should begin with a slash (/admin.php rather than admin.php). I write case-sensitive value fields by choice (an old Un*x habit that is hard to break).

Oh, note that no User-agent is required to read and honour anything in a robots.txt file. In fact many don’t. And since a User-agent can easily be spoofed, even disallowing specific User-agents may be ineffective. But it’s generally a good thing to have. Here’s my robots.txt file for themes.belchfire.net and the ‘bad’ User-agents I block:
http://themes.belchfire.net/oldrobots/robots.txt
Per my most recent post, do not use this robots.txt file on your site!
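Before publishing any robots.txt, it is worth testing the rules locally. Here is a minimal sketch using Python's standard-library urllib.robotparser. The rules are adapted from the example earlier in the post, with a leading slash on /admin.php, and the helper names build_parser and allowed are my own for illustration. It shows how this one parser reads the rules; real spiders may behave differently.

```python
from urllib.robotparser import RobotFileParser

SITE = "http://themes.belchfire.net"

RULES = """\
User-agent: *
Disallow: /admin.php
Disallow: /admin/
Disallow: /private/

User-agent: Alexibot
Disallow: /
"""


def build_parser(robots_txt):
    """Parse robots.txt text from a string instead of fetching it over HTTP."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = build_parser(RULES)


def allowed(agent, path, parser=rules):
    """Return True if `agent` may fetch `path` according to `parser`."""
    return parser.can_fetch(agent, SITE + path)


print(allowed("Googlebot", "/index.php"))        # ordinary page
print(allowed("Googlebot", "/admin/login.php"))  # inside a disallowed directory
print(allowed("alexibot", "/index.php"))         # agent names match case-insensitively

# Without the leading slash, this parser never matches the rule.
no_slash = build_parser("User-agent: *\nDisallow: admin.php\n")
print(allowed("Googlebot", "/admin.php", no_slash))
```

The last check is the interesting one: with this parser, a Disallow value that lacks the leading slash silently blocks nothing.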
A User-agent is the name any browser, spider or script identifies itself with. It is passed along with the HTTP request to the web server, which then records it in the web logs. There are ‘good’ User-agents, such as Mozilla, Gecko, Opera and search engines such as Yahoo Slurp and Googlebot, and there are ‘bad’ User-agents such as email harvesters and site rippers. It is those known bad User-agents I list in my robots.txt file and instruct (perhaps in futility) to not index my sites. For User-agents that spoof, you have to take other measures, assuming you can even identify them, such as blocking the referring URL or IP.
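To see why spoofing is so easy, note that the User-Agent is just a header the client fills in itself. A minimal Python sketch, assuming the standard-library urllib.request and a made-up agent name (ExampleRipper); nothing is actually sent over the network here:

```python
from urllib.request import Request

URL = "http://themes.belchfire.net/"

# The client chooses whatever User-Agent string it likes.
honest = Request(URL, headers={"User-Agent": "ExampleRipper/1.0"})  # hypothetical name
spoofed = Request(URL, headers={"User-Agent": "Googlebot/2.1"})

# urllib stores header names capitalised as "User-agent" internally.
print(honest.get_header("User-agent"))
print(spoofed.get_header("User-agent"))
```

The server only ever sees the string the client chose to send, which is why a User-agent rule in robots.txt is a request and not a lock.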
Oh, that brings me to the potential misuse of robots.txt: if you list files or folders you don’t want indexed and a ‘bad’ User-agent reads your robots.txt file, it could simply ignore your Disallow fields and go ahead and index the very files and folders you explicitly wish to disallow. So for anything you don’t want indexed, it’s probably a good idea to password-protect those areas of your web site, just in case!
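To make the point concrete, here is a short Python sketch of how little effort it takes to turn a robots.txt file into a list of "interesting" paths. The sample text and the function name disallowed_paths are assumptions for illustration only.

```python
def disallowed_paths(robots_txt):
    """Collect every non-empty Disallow value, regardless of User-agent."""
    paths = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        field, _, value = line.partition(":")
        if field.strip().lower() == "disallow" and value.strip():
            paths.append(value.strip())
    return paths


# Hypothetical robots.txt content; anyone can fetch the real file from a site's root.
SAMPLE = """\
User-agent: *
Disallow: /admin/     # staff only
Disallow: /private/
"""

print(disallowed_paths(SAMPLE))
```

Anything that shows up in that list is public knowledge, so it needs real protection and not just a Disallow line.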