Search Engine Upgrade to ht://Dig
Last night and this morning I installed htdig as the new internal search engine for this site. Back in August I mentioned that I would start using Google because the solution I was trying had stopped working with any reliability. ht://Dig is open source and originated here in San Diego at SDSU.
Total time for installation and customization was about 5 hours. This is valuable information in case I ever need to install an htdig search engine for a client. There are lots of small details in doing this installation. I downloaded the installation as a tar.gz file, then decompressed that to a suitable location (cgi-bin). Then I had to do configure, make, make install. Installing unix software is always an adventure. This site runs FreeBSD (see: colophon), and I was delighted that it went pretty smoothly.
Then I was ready to start running it. This got tricky, but it stayed manageable because I was able to tweak the conf/htdig.conf file to do what I like. rundig is the key to indexing a site. At first I had broken images, but otherwise it was working properly. The default configuration initially indexes the htdig site itself. Just like any web robot, it goes out and looks at that site just as a browser would. This put my mind at ease, as I was not sure how it would deal with databased content, or the fact that the pages on my site are very include() driven. I was also concerned that because it is a local search engine, it would index files I don’t want indexed. The Perl search engine I had originally installed had this problem. It would find older versions of files and garbage files that had become garbage for a reason.
As I got it working and pointed it at artlung.com, I found a problem. The indexing process was taking far too long. It seems I had an infinite loop happening! The culprit was my accessibility slideshow from 1999. The [next] and [previous] links did not give any thought to whether they should actually show or not. I had written the PHP for that when I knew very little PHP, and I ended up with the search engine indexing not just /words/accessibility/?i=0 to /words/accessibility/?i=10, but iteratively visiting the “next” and “previous” links like crazy: ?i=-1, ?i=-2, ?i=-3, and on until I stopped it at ?i=-115. That would have been 115 versions of the “previous” page, none any different from the “first” page. The PHP I had written in 1999 was smart enough to handle bad values for $i, but not smart enough to realize that there were no “previous” pages for those pages. The “next” links had the same problem. The htdig indexer was not smart enough to know that it was indexing hundreds of nearly identical pages. The solution was to fix the slideshow code so that it would not produce spurious links like that. After that fix, the site was indexed properly and quickly. This is probably another reason that many search engines simply won’t touch pages with querystrings.
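For the curious, here is a rough sketch of the kind of fix described above. It is written in Python rather than the PHP the slideshow actually uses, and it is not the site's real code: the function names slide_links and render_nav are made up for illustration, and the only details taken from the post are the 0 to 10 slide range and the /words/accessibility/?i= URL pattern. The idea is simply to clamp a bad index and to emit a [previous] or [next] link only when such a slide exists.

```python
# Hypothetical sketch, not the original 1999 PHP.
# Assumes slides are numbered 0..10, as in the URLs mentioned in the post.

def slide_links(i, first=0, last=10):
    """Return (current, previous_or_None, next_or_None) for slide index i."""
    # Clamp out-of-range values so ?i=-115 behaves like the first slide.
    current = max(first, min(last, i))
    # Only offer a neighbour when one actually exists.
    prev_i = current - 1 if current > first else None
    next_i = current + 1 if current < last else None
    return current, prev_i, next_i


def render_nav(i, base="/words/accessibility/"):
    """Build the navigation HTML, omitting links that would lead nowhere new."""
    _, prev_i, next_i = slide_links(i)
    parts = []
    if prev_i is not None:
        parts.append(f'<a href="{base}?i={prev_i}">[previous]</a>')
    if next_i is not None:
        parts.append(f'<a href="{base}?i={next_i}">[next]</a>')
    return " ".join(parts)
```

With something like this, a robot that lands on ?i=0 (or on any bad value that gets clamped to it) finds no [previous] link to follow, so the crawl ends after the real slides instead of wandering off into negative numbers.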
The next problem I had was that it was showing bad search results for certain pages. Example: I searched for the word “Zappa” and got far more results than I would have expected. Granted, I am a Frank Zappa fan, but why would the bio page come up in a result for that? Turns out the indexer found the entry inside the bottom
Tags: bad search results, htdig search engine, htdig site, html, include-driven site, local search engine, Perl, perl search engine, PHP, san-diego, search page, search engine, search engines, search results, unix, unix software, unix system, web robot
