Last night and this morning I installed htdig as the new internal search engine for this site. Back in August I mentioned that I would start using Google because the solution I was trying had stopped working with any reliability. ht://Dig is open source and originated here in San Diego at SDSU.
Total time for installation and customization was about 5 hours. This is valuable information in case I ever need to install an htdig search engine for a client. There are lots of small details in doing this installation. I downloaded the installation as a tar.gz file, then decompressed that to a suitable location (cgi-bin). Then I had to do configure, make, make install. Installing unix software is always an adventure. This site runs FreeBSD (see: colophon), and I was delighted that it went pretty smoothly.
Then I was ready to start running it. This got tricky, but it was manageable, as I was able to tweak the conf/htdig.conf file to do what I like. rundig is the key to indexing a site. At first I had broken images, but otherwise it was working properly. Initially it indexes the htdig site itself. Like any web robot, it goes out and looks at a site just as a browser would. This put my mind at ease, as I was not sure how it would deal with databased content, or the fact that the pages on my site are very include() driven. I was also concerned that because it is a local search engine, it would index files I don’t want indexed. The perl search engine I had originally installed had this problem. It would find older versions of files and garbage files that had become garbage for a reason.
As I got it working, and pointed it at artlung.com, I found a problem. The indexing process was taking far too long. Seems I had an infinite loop happening! The culprit was my accessibility slideshow from 1999. The [next] and [previous] links did not give any thought to whether they should actually show or not. I had written the PHP for that when I really knew very little PHP, and I ended up with the search engine indexing not just /words/accessibility/?i=0 to /words/accessibility/?i=10, but iteratively visiting the “next” and “previous” links like crazy: ?i=-1, ?i=-2, ?i=-3, and on until I stopped it at ?i=-115. That would have been 115 versions of the “previous” page that were no different from the “first” page. The PHP I had written in 1999 was smart enough to handle bad values for $i, but not smart enough to realize that there were no “previous” pages for those pages. The “next” links had the same problem. The htdig indexer was not smart enough to know that it was indexing hundreds of nearly identical pages. The solution was to fix the slideshow code so that it would not produce spurious links like that. After that fix, it was indexed properly and quickly. This is probably another reason that many search engines simply won’t touch pages with querystrings.
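Here is a minimal sketch of the kind of fix described above, written in Python rather than the original PHP. Only the URL pattern and the 0 to 10 range come from the slideshow itself. The function names, the constants, and the choice to clamp out-of-range values to the nearest valid slide are illustrative assumptions, not the site's actual code.

```python
from typing import Dict, Optional

# Assumed bounds: the slideshow in this post ran from ?i=0 to ?i=10.
FIRST_SLIDE = 0
LAST_SLIDE = 10
BASE_URL = "/words/accessibility/?i="


def clamp_slide(i: int) -> int:
    """Coerce an out-of-range slide number back into the valid range."""
    return max(FIRST_SLIDE, min(LAST_SLIDE, i))


def nav_links(i: int) -> Dict[str, Optional[str]]:
    """Return the [previous] and [next] URLs for slide i.

    A link is None when there is no such slide, so the template can
    simply leave it out and a crawler has nothing spurious to follow.
    """
    i = clamp_slide(i)
    return {
        "previous": f"{BASE_URL}{i - 1}" if i > FIRST_SLIDE else None,
        "next": f"{BASE_URL}{i + 1}" if i < LAST_SLIDE else None,
    }
```

With something like this in place, a request for ?i=-115 renders the first slide with no [previous] link at all, so a robot following links runs out of new URLs after the last real slide.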
The next problem I had was that it was showing bad search results for certain pages. Example: I searched for the word “Zappa” and got far more results than I would have expected. Granted, I am a Frank Zappa fan, but why would the bio page come up in a result for that? Turns out the indexer found the entry for my Frank Zappa piece inside the bottom <select> box. So the search engine was indeed finding an instance of the word “Zappa,” but not a useful one. The solution is to not include the bottom navigation in the pages served to the search engine. I also did the same thing with the blog, so that the archived pages don’t show the outbound links to the indexer. In this way you don’t get each and every instance in navigation when you use the search engine. I suspect there will be more tweaks like this.
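One way to leave navigation out of the pages served to the indexer is to check the requesting User-Agent. This is only a sketch in Python, not the PHP the site runs. It assumes the indexer identifies itself with a User-Agent containing "htdig", and the function names are made up for illustration. The post does not say how the indexer was actually detected.

```python
# Assumption: the indexer's User-Agent header contains this token.
INDEXER_TOKEN = "htdig"


def is_indexer(user_agent: str) -> bool:
    """Guess whether the request comes from the search indexer."""
    return INDEXER_TOKEN in user_agent.lower()


def render_page(body: str, navigation: str, user_agent: str) -> str:
    """Return the page content, omitting navigation for the indexer.

    Regular visitors get body plus navigation; the indexer gets only
    the body, so words that appear solely in menus are not indexed.
    """
    if is_indexer(user_agent):
        return body
    return body + "\n" + navigation
```

Serving the indexer a page without its menus means a search for “Zappa” should only match pages whose own content mentions the word.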
Next I began playing with the look and feel to match the rest of artlung.com. I used my preexisting styles and made graphical widgets for search results. They are pretty cute, actually. Buttons should look like they want to be clicked, and these even have a bevel! You can’t go wrong with a subtle bevel. I added the ht://Dig logo in a way to my liking (a banner along the bottom). I ended up changing several pages in htdig’s common directory: footer.html, header.html, long.html, and nomatch.html. I also edited the template_map variable to point to my own long.html file.
Once I got it working to my satisfaction I reran rundig manually for the last time, and swapped out the old Google include for the new htdig include on the search page. With a highly include-driven site it’s very easy to make these kinds of changes.
The last thing I did was add a unix cron job to reindex the site daily. The indexing process takes about 2 minutes, and the best part about it is that I shouldn’t have to think about it. And it should always be up to date (plus or minus 24 hours, I suppose).
With luck, this search engine will work well. If you have comments or questions, feel free to ask!