
Invision Robots.txt - Do you need one - What do you put in one?

Saturday, April 7th, 2007

Invision Power Board does not ship with a populated robots.txt file, or a robots file at all. Not sure why that is. I suppose they leave that up to you. For the uninformed, a robots.txt file is a plain-text file you drop into your webroot, defining which pages you don’t want search engines (or spiders) to crawl and index. Good search engines and spiders look for and adhere to the rules found in your robots.txt file. Bad bots don’t care. So be aware of this.

So do you need a robots.txt file?

Well, there are thousands of IPB boards out there with no robots file and it doesn’t seem to hurt them. Indeed, I didn’t have a robots file for my board on Belchfire.net for years. Then while looking for ways to increase my AdSense revenues (and/or CTR and eCPM), I remembered reading something about Google and other search engines penalizing one for duplicate content, or dropping your pages into their Supplemental Index, which I found I was in after learning what it was (more on that later).

Not so much duplicate content on your own site (within the same domain), but content mirrored on another site or under a different domain (www.domain.tld vs domain.tld, which many of us do for our own domains). I also found when I checked my web stats that some of the pages people were following from Google had really odd URLs: a topic’s Linear and Outline (or Threaded) views, Print this topic, and most commonly, the Lo-fi version. I hate it when people search Google and click through to my site only to get the Lo-fi version, devoid of my ads, styling/branding and pretty much everything else! Maybe the Lo-fi version was a good idea at the time, or for handhelds or something, but I don’t see any benefit to it any longer, and what’s the need if Google/Yahoo/MSN is indexing your regular versions already? I’m sure you’ve experienced it yourself; you search Google, click on a SERP and you get the dang lo-fi version. When that happens to me, more often than not, I’ll click the Full-Version link so I can see links, images, profiles and everything else.

So for these reasons I decided to take steps to limit this type of thing, to mitigate the risk that Google decides I have duplicate content and washes my pages (or site) out of their database altogether, and maybe to get out of the Supplemental Index I found I was in. How? First, by creating a custom robots.txt file for IPB. I started with the Lo-fi link, added the optional topic view URLs and went from there. Here’s what I ended up with:

User-agent: *
Disallow: /lofiversion
Disallow: /index.php?act=Login
Disallow: /index.php?act=Search
Disallow: /index.php?act=Shoutbox
Disallow: /index.php?act=Reg
Disallow: /index.php?act=Msg
Disallow: /index.php?act=Mail
Disallow: /index.php?act=Forward
Disallow: /index.php?act=Track
Disallow: /index.php?act=Post
Disallow: /index.php?act=Print
Disallow: /index.php?act=ST
Disallow: /index.php?act=boardrules
Disallow: /index.php?act=Help
Disallow: /index.php?act=Stats
Disallow: /index.php?act=Members
Disallow: /index.php?act=Online
Disallow: /index.php?act=calendar
Disallow: /index.php?act=SR
Disallow: /index.php?act=ICQ
Disallow: /index.php?act=MSN
Disallow: /index.php?act=AOL
Disallow: /index.php?act=AIM
Disallow: /index.php?act=SC
Disallow: /index.php?act=task
Disallow: /index.php?act=findpost
Disallow: /index.php?act=UserCP
Disallow: /index.php?act=report
Disallow: /index.php?act=buddy
Disallow: /index.php?act=legends
Disallow: /index.php?CODE=
Disallow: /index.php?act=attach
Disallow: /index.php?&&CODE=
Disallow: /index.php?&debug=1
Disallow: /index.php?act=Profile
Disallow: /index.php?showuser
Disallow: /index.php?s=
Disallow: /*&view=getnewpost$
Disallow: /*&view=getlastpost$
Disallow: /*&mode=linear$
Disallow: /*&mode=threaded$
Disallow: /*&mode=linearplus$
Disallow: /*&p=
Disallow: /*&pid=

I had a few other URLs, but due to the limitations of rule formatting in a robots file (no standard regex support, for one), and not knowing what each particular search spider understands, I trimmed it to the above. Note that the * and $ wildcards in the last few rules are extensions that not every spider honors. I confirmed compliance with robots formatting rules via New Robots.txt Syntax Checker and the Web Robots Pages. Copy the above into a file named robots.txt and drop it into your root IPB folder (which needs to be the root of your domain for spiders to find it). It must be named robots.txt of course, and be world readable. Good stuff.
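If you want to sanity check which URLs a rule set like the one above would block, here is a minimal Python sketch. It assumes the wildcard semantics Google documents for robots.txt (`*` matches any run of characters, a trailing `$` anchors the end of the URL, everything else is a literal prefix match); other spiders may read these rules differently or ignore the wildcards. The helper names `rule_to_regex` and `is_disallowed` and the sample paths are made up for illustration, and only a handful of the rules are included.

```python
import re

def rule_to_regex(rule: str) -> "re.Pattern[str]":
    """Translate one Disallow path into a compiled regex.

    Assumed semantics: '*' matches any run of characters, a trailing '$'
    anchors the end of the URL, and all other characters are literal.
    """
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape each literal chunk, then rejoin the chunks with '.*' for each '*'.
    body = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile(body + ("$" if anchored else ""))

def is_disallowed(url_path: str, rules: list) -> bool:
    """True if any rule matches from the start of the path (prefix match)."""
    return any(rule_to_regex(rule).match(url_path) for rule in rules)

# A small subset of the rules, copied from the robots.txt above.
RULES = [
    "/lofiversion",
    "/index.php?act=Print",
    "/*&mode=threaded$",
    "/*&p=",
]

if __name__ == "__main__":
    # Hypothetical board URLs, for illustration only.
    for path in (
        "/lofiversion/index.php/t123.html",
        "/index.php?act=Print&f=2&t=10",
        "/index.php?showtopic=10&mode=threaded",
        "/index.php?showtopic=10",
    ):
        print(path, "->", "blocked" if is_disallowed(path, RULES) else "allowed")
```

Because of the trailing `$`, a rule like `/*&mode=threaded$` only blocks URLs that end in `&mode=threaded`; a URL with further parameters after it would still be crawlable under these assumed semantics.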

Next, since I’m also generating daily Google Sitemap files via the Python sitemap generation script (using my weblogs via the accesslog method), matching filter actions needed to be added to my sitemap gen config.xml file. Here are the rules if you use Google Sitemaps via the Python generator:

<filter action="drop" type="regexp" pattern="lofiversion" />
<filter action="drop" type="regexp" pattern="index\.php\?act=Login" />
<filter action="drop" type="regexp" pattern="index\.php\?act=Search" />
<filter action="drop" type="regexp" pattern="index\.php\?act=Shoutbox" />
<filter action="drop" type="regexp" pattern="index\.php\?act=Reg" />
<filter action="drop" type="regexp" pattern="index\.php\?act=Msg" />
<filter action="drop" type="regexp" pattern="index\.php\?act=Mail" />
<filter action="drop" type="regexp" pattern="index\.php\?act=Forward" />
<filter action="drop" type="regexp" pattern="index\.php\?act=Track" />
<filter action="drop" type="regexp" pattern="index\.php\?act=Post" />
<filter action="drop" type="regexp" pattern="index\.php\?act=Print" />
<filter action="drop" type="regexp" pattern="index\.php\?act=ST" />
<filter action="drop" type="regexp" pattern="index\.php\?act=boardrules" />
<filter action="drop" type="regexp" pattern="index\.php\?act=Help" />
<filter action="drop" type="regexp" pattern="index\.php\?act=Stats" />
<filter action="drop" type="regexp" pattern="index\.php\?act=Members" />
<filter action="drop" type="regexp" pattern="index\.php\?act=Online" />
<filter action="drop" type="regexp" pattern="index\.php\?act=calendar" />
<filter action="drop" type="regexp" pattern="index\.php\?act=SR" />
<filter action="drop" type="regexp" pattern="index\.php\?act=ICQ" />
<filter action="drop" type="regexp" pattern="index\.php\?act=MSN" />
<filter action="drop" type="regexp" pattern="index\.php\?act=AOL" />
<filter action="drop" type="regexp" pattern="index\.php\?act=AIM" />
<filter action="drop" type="regexp" pattern="index\.php\?act=SC" />
<filter action="drop" type="regexp" pattern="index\.php\?act=task" />
<filter action="drop" type="regexp" pattern="index\.php\?act=findpost" />
<filter action="drop" type="regexp" pattern="index\.php\?act=UserCP" />
<filter action="drop" type="regexp" pattern="index\.php\?act=report" />
<filter action="drop" type="regexp" pattern="index\.php\?act=buddy" />
<filter action="drop" type="regexp" pattern="index\.php\?act=legends" />
<filter action="drop" type="regexp" pattern="index\.php\?CODE=" />
<filter action="drop" type="regexp" pattern="index\.php\?act=attach" />
<filter action="drop" type="regexp" pattern="debug=1" />
<filter action="drop" type="regexp" pattern="index\.php\?act=Profile" />
<filter action="drop" type="regexp" pattern="index\.php\?showuser" />
<filter action="drop" type="regexp" pattern="index\.php\?s=" />
<filter action="drop" type="regexp" pattern="view=getnewpost" />
<filter action="drop" type="regexp" pattern="view=getlastpost" />
<filter action="drop" type="regexp" pattern="mode=linear" />
<filter action="drop" type="regexp" pattern="mode=threaded" />
<filter action="drop" type="regexp" pattern="mode=linearplus" />
<filter action="drop" type="regexp" pattern="p=" />
<filter action="drop" type="regexp" pattern="pid=" />

Took some testing and trial-and-error before the script would stop erroring out. But works great. Feel free to use the above in your sitemap config file.
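One thing worth checking before you rely on filters like these, assuming the generator treats type="regexp" patterns as Python regular expressions applied with a search: in a regex an unescaped `.` matches any character and an unescaped `?` makes the preceding character optional, so a pattern meant to match a literal `index.php?act=...` needs `\.` and `\?`. The sketch below is not the generator's own code; `DROP_PATTERNS`, `keep_url` and the sample URLs are hypothetical names and data for testing patterns by hand.

```python
import re

# A few drop patterns, with '.' and '?' escaped so they match literally.
DROP_PATTERNS = [
    r"lofiversion",
    r"index\.php\?act=Print",
    r"mode=threaded",
    r"pid=",
]

def keep_url(url: str, patterns=DROP_PATTERNS) -> bool:
    """Return True if no drop pattern is found anywhere in the URL."""
    return not any(re.search(pattern, url) for pattern in patterns)

# Hypothetical URLs as they might appear in an access log.
urls = [
    "http://www.domain.tld/index.php?showtopic=10",
    "http://www.domain.tld/index.php?act=Print&f=2&t=10",
    "http://www.domain.tld/lofiversion/index.php/t10.html",
    "http://www.domain.tld/index.php?showtopic=10&mode=threaded",
]

kept = [url for url in urls if keep_url(url)]

# The unescaped form does not match the literal URL: 'p?' consumes the
# final 'p' of 'php' optionally, and nothing in the pattern matches the '?'.
unescaped_hit = re.search(r"index.php?act=Print", "/index.php?act=Print")

if __name__ == "__main__":
    print(kept)
    print("unescaped pattern matched:", unescaped_hit is not None)
```

Running a few real URLs from your own logs through a check like this is a quick way to confirm each pattern drops what you expect and nothing more.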

Finally, I set up a mod_rewrite rule in my Apache config to redirect all entry points to my domain (belchfire.net, www.belchfire.net and in my specific case, themes.belchfire.net) to themes.belchfire.net. If interested, and your host supports mod_rewrite, the rule looks like this:

RewriteEngine On
RewriteCond %{HTTP_HOST} !^www\.domain\.tld$ [NC]
RewriteRule ^(.*)$ http://www.domain.tld/$1 [R=301,L]

I expect you don’t have a ‘themes’ subdomain or alias to your site, so just go with www, and substitute ‘domain’ with your domain name and ‘tld’ with your domain’s extension (.com, .net, etc.). Drop into a .htaccess in your IPB root folder or append to existing. Test and be happy.
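To make the intent of the redirect explicit, here is the same decision expressed as a small Python function: any request whose host is not the one canonical hostname gets a 301 to the same path on the canonical host. This is only a sketch of the behaviour described above, not something Apache runs; `CANONICAL_HOST` and `canonical_redirect` are hypothetical names, and it ignores details such as a port number in the Host header.

```python
from typing import Optional

# Placeholder; substitute your own canonical hostname.
CANONICAL_HOST = "www.domain.tld"

def canonical_redirect(host: str, path: str) -> Optional[str]:
    """Return the 301 target URL, or None if the host is already canonical.

    Hostnames are compared case-insensitively, like the [NC] flag
    in a RewriteCond.
    """
    if host.lower() == CANONICAL_HOST:
        return None
    return f"http://{CANONICAL_HOST}{path}"

if __name__ == "__main__":
    for host in ("domain.tld", "www.domain.tld", "themes.domain.tld"):
        print(host, "->", canonical_redirect(host, "/index.php?showtopic=10"))
```

The point of collapsing everything onto one hostname is that search engines then see a single URL per page instead of one copy per entry point.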

So what happened?

Well, right away I noticed I was only submitting ~15000 URLs to Google instead of my previous ~45000. Second, the amount of data transfer Google was chewing up indexing my pages dropped from 4.5GB monthly to 2.5GB. Saves me some bandwidth charges ;) Third, I got out of Google’s Supplemental Index!

Okay, about Google’s Supplemental Index. If Google determines you have duplicate content, orphaned pages, or dynamic URLs with too many query parameters, it will index a very low number of pages in its regular search results, and drop everything else into its Supplemental Index. More on Google Supplemental Index. Bottom line is, you don’t want to be in the Supplemental Index.

One day when checking my indexed pages in Google, it showed only 10 results, and then a note saying:

“In order to show you the most relevant results, we have omitted some entries very similar to the 10 already displayed. If you like, you can repeat the search with the omitted results included.”

Wasn’t a GoogleDance or some change to the algorithms. Most of my pages were just in this secondary index. No wonder my traffic was stagnating! After doing the above I’m no longer in the Supplemental index and now show a good 110,000 pages in the main index.

Fourth, Awstats showed a drop in unique visitors from ~25000 a day to ~17000, yet my page views remained constant. Google Analytics reported no drop in visits or page views. This seems to indicate that a lot of inconsequential bots and spiders stopped hitting the disallowed URLs, and that Google Analytics is much more accurate than Awstats at telling human visitors from machines.

Finally, and most bonus, my AdSense revenues doubled! I kid you not. I went from a paltry $25-$30 a day on average to well over $60 a day. Again, not a spike or just a n00b AdWords client losing his shirt on his first campaign. It’s been months now at the same level. And since then I’ve been able to quadruple my previous AdSense revenues by tweaking my ad types, frequency and placement, upwards of $150 a day spiking to almost $200 on weekends. I’ll share those tweaks with you all in another article.

So the question remains, do you need a robots.txt file? Hell no, especially if you don’t generate any income from your board. But if you do rely on the revenues generated from ads on your board, or rely on search engines to drive traffic to your site, a robots.txt file such as mine, crafted to reduce the duplicate content being indexed, can potentially increase your indexed pages. Or at least ensure you don’t get dropped into a search engine’s Supplemental Index. And ultimately, potentially increase your ad revenues. Obviously a custom robots.txt file was not the only step I took. And more quantifiable results could have been obtained had I set up a new domain and a new board with some posts, then tested each of these steps in stages over the course of a few months and on each of the major search engines. But I have no other explanation, and certainly no complaints about the additional eyes on my site and the extra money ;)

Oh, if you are not using Google Sitemaps, I highly recommend you look into it. There are modules on Invisionize, or you can use the standalone Python script like I do. Just requires access to your raw weblogs and a host that supports Python and user-CRONTAB entries (to run automatically). Yahoo now recognizes those sitemaps.  Can also help kick start a brand new board or site.