How it works

ht://Dig Copyright © 1995-2002 The ht://Dig Group
Please see the file COPYING for license information.


The system performs three major tasks, which should be run in the following order:

1.   Digging

Before you can search, a database of all the documents that need to be searched has to be created.

2.   Merging

Merging consists of two processes:
  1. Converting the databases of all documents to specialized databases for simple, fast searching.
  2. Merging changed information into previously existing databases.
Even though this task could be performed at the same time as the Digging, it is a separate process for efficiency reasons. This also allows for more control over the processes implemented when merging.

3.   Searching

Finally, the databases that were created in the previous steps can be used for actual searches. Normally, searches are invoked by a CGI program (Common Gateway Interface; a program running on the web server) that gets its input from the user through an HTML form.

Digging

Digging is the first step in creating a search database. This system uses the word "digging" where other systems say "harvesting" or "gathering". In the ht://Dig system, the htdig program performs this information-gathering stage. It acts like a regular web user, except that it follows the hyperlinks it comes across. It does not follow all of them, only those within the domain it needs to gather information on.
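The link-following rule described above can be sketched in a few lines. This is only an illustration, not htdig's actual code: the function name links_to_follow and the allowed_hosts parameter are made up for the example, and the sketch assumes that "within the domain" means a simple host-name check.

```python
from urllib.parse import urljoin, urlparse


def links_to_follow(page_url, hrefs, allowed_hosts):
    """Resolve hrefs against page_url and keep only links on allowed hosts.

    allowed_hosts is a hypothetical stand-in for whatever limits the
    real configuration places on the dig.
    """
    keep = []
    for href in hrefs:
        absolute = urljoin(page_url, href)
        parts = urlparse(absolute)
        if parts.scheme in ("http", "https") and parts.hostname in allowed_hosts:
            # Drop the fragment so page.html#a and page.html#b count as one document.
            keep.append(absolute.split("#", 1)[0])
    return keep
```

Links to other hosts and non-HTTP links such as mailto: addresses are dropped, so the dig stays inside the site it was asked to cover.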
Each document it visits is examined, and all the unique words in it are extracted and stored, except those that are too short, too long, or excluded by the configuration.
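A minimal sketch of this word-extraction step is shown below. It assumes that words are runs of letters and digits and that matching is case-insensitive. The names extract_words, min_len, max_len and excluded, and the default length limits, are illustrative and are not taken from htdig itself.

```python
import re


def extract_words(text, min_len=3, max_len=12, excluded=frozenset()):
    """Return the set of unique lowercase words worth indexing.

    min_len, max_len and excluded are hypothetical stand-ins for the
    limits that the configuration would supply.
    """
    words = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        if len(word) < min_len or len(word) > max_len:
            continue  # too short or too long
        if word in excluded:
            continue  # explicitly excluded word
        words.add(word)
    return words
```

Because the result is a set, a word that occurs many times in a document is stored only once.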

The digging process will create at least two files. The first one is the list of all the words and the second one is a database of URLs and information about the URLs. Other files may be created for a list of all URLs seen, all images seen, ASCII versions of the databases, etc.


Merging

Once the digging process is complete, the data must be converted into something the search engine can actually use. The htmerge program does this.

The term "merge" is used because data from several databases is gathered together and merged into several other databases. The source databases include the databases created by the latest "dig" but also any previously merged databases. The latest dig produces a database with information on new pages and on changes to previously existing pages. This information is merged with the unchanged information to create up-to-date databases.
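The idea can be sketched as follows, with each database modelled as a plain dictionary that maps a URL to a record. This is a simplification for illustration only; the real databases are on-disk files, and the function name merge_databases is hypothetical.

```python
def merge_databases(previous, latest_dig):
    """Combine an earlier merged database with the results of a new dig.

    Both arguments are dicts mapping URL -> record (an assumed model,
    not the actual file format).
    """
    merged = dict(previous)    # start from the previously merged information
    merged.update(latest_dig)  # add new pages, overwrite changed pages
    return merged
```

Records that the latest dig did not touch are carried over unchanged, and the input dictionaries are left as they were.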

There are other, optional, tasks which are categorized under the merge phase:

Expiration notification:
The ht://Dig system includes a handy reminder service, "htnotify." This allows HTML authors to add some ht://Dig specific meta information to HTML documents. This meta information is used to email authors after a specified date. This is very useful for maintaining lists that mark new items with those annoying 'new' graphics. (Hint: things really aren't all that 'new' anymore after 6 months!)

Fuzzy word index creation:
This allows searches to use "fuzzy" algorithms to match words. The htfuzzy program can create indexes for several different algorithms.
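To give a feel for what a fuzzy index is, the sketch below uses Soundex, a well-known phonetic algorithm, chosen here purely as an example. The text above does not say which algorithms htfuzzy supports, and the function names soundex and build_fuzzy_index are illustrative.

```python
# Map each consonant to its Soundex digit.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for letter in letters:
        CODES[letter] = digit


def soundex(word):
    """Return the four-character American Soundex code of word."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for c in letters[1:]:
        digit = CODES.get(c, "")
        if digit and digit != prev:
            code += digit
        if c not in "hw":  # h and w do not separate equal codes; vowels do
            prev = digit
    return (code + "000")[:4]


def build_fuzzy_index(words):
    """Group words by Soundex code so similar-sounding words share a key."""
    index = {}
    for word in words:
        index.setdefault(soundex(word), set()).add(word)
    return index
```

A search for one word can then look up its code in the index and also match every other word stored under the same code.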

Searching

Searching is where all the information gathered and organized during the dig and merge stages gets put to use. The htsearch program performs the actual searches. This CGI program takes its input from the HTML "search form" on the website, performs the search, and produces the HTML output (or the "failed search" page) that users see.
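Conceptually, the lookup at the heart of a search could look like the sketch below. It assumes a word index that maps each word to the set of URLs containing it and a URL database that maps each URL to a record with a title, loosely mirroring the two files produced by the dig. Requiring every query word to match is an assumption made for the example, and none of the names come from htsearch.

```python
def search(word_index, url_db, query):
    """Return (url, title) pairs for documents containing every query word.

    word_index: dict mapping word -> set of URLs (assumed layout)
    url_db:     dict mapping URL -> {"title": ...} (assumed layout)
    """
    terms = query.lower().split()
    if not terms:
        return []
    matches = None
    for term in terms:
        urls = word_index.get(term, set())
        # Intersect so that only documents containing all terms remain.
        matches = set(urls) if matches is None else matches & urls
    return [(url, url_db[url]["title"]) for url in sorted(matches)]
```

An empty result list corresponds to the "failed search" case, for which the CGI program would produce a different page.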


Last modified: $Date: 2002/01/28 03:56:10 $