
Free Search Engines - Hosted search software for your web site

Tuesday, June 19th, 2007

I have to admit that I’m a bit of a control freak.  Okay, a BIG control freak.  I prefer to host and maintain all of my sites, software and systems myself.  So when looking for a better search for my web site, I went looking for something free/open source that I could install and run on my web server.  I found several I wanted to share with you.  But first, some background.

One thing I don’t like about Invision Power Board and most any forum or CMS I’ve ever used (and I’ve used a great many) is that they all have different search functions for different components.  One search for the forums, another for the download manager, another for the gallery, etc.  It’s worse if you’re using any third-party modules, which might share the user tables of your CMS/forum, but not much else.  The output of each varies, and while perhaps moderately useful, they lack the advanced features, speed and relevance of true search engines like Google or Yahoo that (let’s face it) we’ve all come to expect.  Not to mention that you get zero control over the results returned.  Hence my desire for a separate, stand-alone system that would not rely on the internal search features of the board or its modules.

In addition, pages can get stale.  Have you ever searched for something current and gotten search results for a page from 5 years ago?  Yeah, I hate that.  Tip:  When searching Google, try ‘search term inurl:2007’.  Helps a lot!

But I digress.  As this blog is IPB and SEO focused, let me also mention the additional revenue potential of a clean, structured search and results page with strategically placed advertisements.  With the included search functions in IPB, I wasn’t making any money.  Now I’m making an additional $5-$15 a day, just off my search page.  Every bit helps, no?  Plus, your members and visitors will appreciate relevant search results augmented with extra functionality like synonyms, related pages and suggestions for misspellings.  At the end, check out my method of sending 404s to the search page to capture that lost traffic!

So there’s the background.  Since this is not a full review or comparison of each search engine listed, it’s up to you to decide which search engine to run.  The contenders (in no particular order):

  • mnoGoSearch - Been around for a while (previously under the name udm search) and easy to compile, install and configure.  One required config file and a few optional ones (pre-populated with default values) for things like stopwords and such.  Can exclude by string or by regex.  Strings are easiest and won’t slow indexing down, but if indexing URLs with variables, you have to use regex, which is NOT all that easy to get working.  Otherwise, fast indexing, good results and easy results page customization (inclusion of ads, etc.) via the search.htm template file.  Several third-party modules are available to add functionality (such as click-tracking).  CGI/MySQL.  I’m using mnoGoSearch presently but that will likely change by the time you read this.
  • Sphider - My personal favourite.  I was using this before mnoGoSearch and loved it.  PHP/MySQL with a web-based admin interface.  When I upgraded my IPB board from 2.1.7 to 2.2.2, Sphider reported many of my pages as having 0 (zero) size, so it indexed them with the title ‘untitled document’ and no body.  I stopped trying to figure out why after spending many hours troubleshooting.  Gave up and changed to mnoGoSearch, which had no trouble.  Likely something in IPB or the skins I’m using that it didn’t like.  Possibly header info - who knows.  Maybe it will work for your site.  It is as easy to install, set up and configure as mnoGoSearch.  Easier actually, as it has separate templates for the search form, results, header and footer.  Requires no compiling, as it is a PHP script.  And it has the easiest include/exclude mechanism of the bunch.  No need for regex.  Just enter any string from a URL you want/don’t want indexed and it figures it out.  Plus, you can see search stats via the admin interface.  This is a big bonus.  If you find that people aren’t finding your content easily, knowing the top search terms gives you an edge: you can use that info to restructure your site, link to frequently searched content, create related new content and so forth.  So try this first.  And if you do and are indexing an IPB site, here are my exclude rules if you want them:
    • s=
      act=rssout
      act=Login
      act=usercp
      act=Forward
      act=Print
      act=Reg
      act=Help
      act=Members
      act=Online
      act=calendar
      act=Stats
      act=Msg
      act=Post
      act=post
      act=Mail
      act=announce
      act=Search
      act=SF
      act=mod
      act=findpost
      blogs
      automodule=blog
      showuser=
      author=
      view=
      mode=
      modfilter=
      do=
      st=
      lofiversion
      prune_day
      sitemap
      CategoryID=
      req=
  • Perlfect Search - I did use Perlfect Search a couple of years ago and at first couldn’t remember why I stopped.  Might have had something to do with it being Perl.  Seems to me I had much trouble getting it installed due to all the Perl modules it required, and I seem to recall it being either slow at indexing or a resource hog.  Sorry I can’t be more specific, but if you like Perl and don’t have a site with a gazillion pages to index, it’s worth a look.  Results are stored in static dbm files so no database is required.  Oh, come to think of it, I think that was the reason I abandoned it.  Search results took a looooong time to generate as it had to read the entire dbm files on each search (and I have a lot of pages).
  • Nutch - Now part of the Apache/Lucene project.  Java.  I don’t do Java so have never tried it.  But many seem to like it.  If you’re already running Tomcat or something, it might make sense for you.
  • iSearch - Also never tried it.  PHP/MySQL.  Excludes content within pages by wrapping that content in specific comments, which was too restrictive for me personally.  If looking for something PHP/MySQL, I’d skip it and try Sphider instead.
  • Swish-E and Swish++ - I used Swish-E in combination with MHonArc mail-to-HTML converter and had a fantastic system for a mail list archive search.  Swish-E is the original, Swish++ is a fork/rewrite of the original.  Can’t say if it’s any better or what differences there may be between them.  I’d try Swish-E first.  Based on my positive experience with that mail list archive search (indexed 90,000+ messages and added ~500 a day), I may try Swish-E again.  I do recall that the setup and configuration for Swish-E was difficult, but once configured it was great.
  • IBM OmniFind Yahoo! Edition - I just chanced upon this today, which prompted me to write this short article.  So far I am very impressed with this piece of software, as expected.  Commercial-grade software with the speed, polish and features one would expect from power-houses like IBM and Yahoo.  Excellent web-based and command prompt administration, easy rules, the easiest install of the bunch (seriously, it couldn’t get any easier), an API, plus sample code for customized search form and results.  You can even set the interval between spidering pages so as not to overload your server!
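To make the string-versus-regex exclude distinction from the mnoGoSearch and Sphider entries concrete, here is a minimal Python sketch.  It is not the code or rule syntax of either engine; the function names and the example.com URLs are made up for illustration, and the rule strings are a small subset of my list above.  It assumes matching is case-sensitive, which the list hints at by carrying both act=Post and act=post.

```python
import re

# A few of the exclude strings from the list above (subset, for brevity).
EXCLUDE_SUBSTRINGS = ["s=", "act=Login", "act=Print", "showuser=", "st=", "lofiversion"]

# Regex alternative: only match s=, st= or do= when they are real query
# parameters, i.e. directly preceded by '?' or '&'.
EXCLUDE_PARAM_RE = re.compile(r"[?&](?:s|st|do)=")


def should_index(url, excludes=EXCLUDE_SUBSTRINGS):
    """Plain substring rules: index only if no exclude string occurs in the URL."""
    return not any(rule in url for rule in excludes)


def has_excluded_param(url):
    """Regex rule: True if the URL carries one of the excluded parameters."""
    return EXCLUDE_PARAM_RE.search(url) is not None


if __name__ == "__main__":
    for url in (
        "http://example.com/index.php?showtopic=123",
        "http://example.com/index.php?showtopic=123&st=20",
        "http://example.com/index.php?act=Login",
        "http://example.com/index.php?topics=5",
    ):
        print(url, should_index(url), has_excluded_param(url))
```

The last URL shows the trade-off: the plain string rule s= also knocks out topics=5 because the string appears inside it, while the regex only fires on a real s=, st= or do= parameter.  Strings are simpler to write; regex is more precise but fiddlier, which matches my experience above.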

Now the caveats.  Most of the above will require that you have root access on a dedicated server, or that you have a virtual dedicated server at minimum.  If you have shared hosting, you’re not entirely screwed.  Try Sphider, iSearch or Perlfect Search.  Sphider and iSearch are standard PHP/MySQL with web-based admin interfaces.  Perlfect Search is a Perl script so you can just drop it in your /cgi-bin/ directory (it does require certain Perl modules, so check the requirements and confirm that your host has them installed or would be willing to install any that are missing).  So long as you don’t have 100,000 pages to index and only index weekly or so, these should do you fine.  As for the others, apart from needing a dedicated server in order to compile, install, configure and run them, server load can get quite high.  Even if you could install one of them on a shared host, you could eat up shared resources and risk your host disabling or cancelling your account as a result.

Now here’s the cool part.  Most folks don’t set up a custom Error 404 document in Apache.  This is supported on any dedicated host and most shared hosts allow it also.  Basically, defining a custom Error 404 document in your web server admin lets requests for missing pages be answered with a static page, URL or CGI, rather than showing your visitor the generic web server 404 message.  By defining a custom Error 404 document that displays your search page from one of the above free search engines, you not only trap traffic that would otherwise be lost, but you can also post any banner ads you like, such as Google AdSense, and potentially increase your banner revenues.  Plus, you enable the visitor to search (obviously) for the page they were looking for or find similar pages on your site.  Your visitors will appreciate this and will likely return to your site more often.

A couple of things to know first.  Defining a custom Error 404 document in Apache using a local path (one starting with a slash) will correctly return a 404 response status in the HTTP headers sent back to the browser.  This is important for Google and other search engines that may be spidering your site.  This is the best method.  You can instead specify a full URL (http://...) to a file on a local or remote host, but then Apache sends the client a redirect to that URL, so the missing page WILL NOT return a 404 response.  You want to avoid this.  So you have to be mindful of this AND of how Apache treats requests for a file ending in .cgi, for example, versus one ending in .php or .html.  If, for example, you decide to go with mnoGoSearch, whose search script is named search.cgi, Apache will pass the requested (missing) URL to mnoGoSearch’s search.cgi script, buggering up the search page output and causing mnoGoSearch to spit out its own ‘no results’ message.  This will just cause confusion.  This will not happen if using Sphider, for example, whose search page is search.php.  How I avoid this problem with mnoGoSearch is by creating a static search.html page that contains nothing but an inline frame with the actual search engine search form loaded in it.  No variables are passed along and Apache won’t think your error document is a handler script (.cgi).
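If you want to confirm which of those two behaviours your own setup has, a quick check is to request a page that doesn’t exist and look at the status code that comes back.  Below is a rough Python sketch using only the standard library.  The helper names (status_for, verdict) are my own invention, and the labels it returns are just shorthand for the cases described above.

```python
import sys
import urllib.error
import urllib.request


class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Stop urllib from silently following 3xx responses."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Returning None means "do not follow"; urllib then raises HTTPError.
        return None


def status_for(url):
    """Return the HTTP status code the server sends for `url`."""
    opener = urllib.request.build_opener(NoRedirect)
    try:
        with opener.open(url, timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 3xx (not followed), 4xx and 5xx responses all arrive here.
        return err.code


def verdict(status):
    """Classify the status returned for a page that should be missing."""
    if status == 404:
        return "ok"        # custom page served with a real 404 status
    if 300 <= status < 400:
        return "redirect"  # typical of ErrorDocument given as a full URL
    if status == 200:
        return "soft-404"  # spiders will treat the missing page as a real page
    return "other"


if __name__ == "__main__":
    # Pass one or more URLs of pages that do not exist on your site.
    for url in sys.argv[1:]:
        code = status_for(url)
        print(url, code, verdict(code))
```

Run it against a made-up path on your own site.  Anything other than "ok" means spiders are not being told the page is gone.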

Anyways, I won’t hold your hand with all this.  Read Apache’s error document directives and if what I describe above is an issue, you’ll know it and can then try either a different search engine or my static HTML file with the frame trick.  And don’t limit your new search engine’s use to just your error document handler.  Link it into your main site or even replace your current system’s search function(s) with your chosen engine!

Examples:

Missing document:  http://themes.belchfire.net/this_file_does_not_exist/foo.bar
Apache custom 404 ErrorDocument definition (httpd.conf and/or VirtualHost):  ErrorDocument 404 /search.html
search.html source:
<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
<title>Search Belchfire</title>
</head>
<body><iframe name="search" src="/cgi-bin/search.cgi" marginwidth="1" marginheight="1"
height="100%" width="100%" scrolling="auto" border="0" frameborder="0">
Your browser does not support inline frames or is currently configured not to
display inline frames.
</iframe>
</body>
</html>

Real URL to mnoGoSearch script:  http://themes.belchfire.net/cgi-bin/search.cgi

Update:

I finally figured out my problem with Sphider, so I have switched back.  I do prefer Sphider out of the lot.  As it turns out, I had upgraded PHP on my server from 4.3 to 5.1, and when I did that my php.ini file was replaced.  In the 5.1 php.ini file, output_buffering was enabled, so Sphider wasn’t getting the Content-Length header before the body.  It must have figured the documents were zero size and so just skipped them.  Turning output_buffering Off in php.ini and restarting Apache fixed it and Sphider was happy again.  I forgot to mention that the other reason I like Sphider is that it has great web-based stats, which I really missed with the others.