ht://Dig Copyright © 1995-2002 The ht://Dig Group. Please see the file COPYING for license information.
The system performs three major tasks, which should be run in the following order: digging, merging, and searching.
Digging is the first step in creating a search database. ht://Dig uses the word "digging" where other systems say "harvesting" or "gathering". In the ht://Dig system, the program htdig performs this information-gathering stage. In this process, the program acts like a regular web user, except that it follows the hyperlinks it comes across. (Strictly speaking, it does not follow all of them, only those within the domain it has been told to gather information on.)
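The link-following rule described above can be illustrated with a minimal Python sketch. This is not htdig's actual code: the function name `links_to_follow` and the `allowed_prefixes` parameter are hypothetical, and matching on URL prefixes is an assumption about how "within the domain" might be decided.

```python
from urllib.parse import urljoin, urlparse


def links_to_follow(page_url, hrefs, allowed_prefixes):
    """Resolve hrefs found on page_url and keep only those in the allowed area.

    allowed_prefixes is a hypothetical list of URL prefixes that stands in
    for "the domain it needs to gather information on".
    """
    kept = []
    for href in hrefs:
        # Turn relative links into absolute URLs.
        absolute = urljoin(page_url, href)
        # Drop the fragment so "#section" anchors do not create duplicates.
        absolute = absolute.split("#", 1)[0]
        # Skip non-web links such as mailto:.
        if urlparse(absolute).scheme not in ("http", "https"):
            continue
        # Keep the link only if it lies inside the area being dug.
        if any(absolute.startswith(prefix) for prefix in allowed_prefixes):
            if absolute not in kept:
                kept.append(absolute)
    return kept


found = links_to_follow(
    "http://www.example.com/docs/index.html",
    ["intro.html", "/about.html#team", "http://other.example.org/", "mailto:a@example.com"],
    ["http://www.example.com/"],
)
print(found)
```

In this example only the two links under `http://www.example.com/` survive; the external site and the `mailto:` link are ignored.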
Each document it visits is examined, and all the unique words in it are extracted and stored, except those that the configuration specifies as too short, too long, or to be excluded.
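As a rough illustration of this filtering, the following Python sketch extracts the unique words from a piece of text. The function name, the tokenizing rule, and the length limits shown are all assumptions chosen for the example; they are not taken from htdig or its configuration.

```python
import re


def extract_words(text, min_len=3, max_len=32, excluded=frozenset()):
    """Return the set of unique words in text that pass the filters.

    min_len, max_len and excluded are illustrative stand-ins for the
    "too short", "too long" and "to be excluded" settings.
    """
    words = set()
    # Assumed tokenizing rule: lowercase runs of letters and digits.
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        if len(token) < min_len or len(token) > max_len:
            continue  # too short or too long
        if token in excluded:
            continue  # explicitly excluded word
        words.add(token)
    return words


sample = "The cat sat on the extraordinarily long mat. The cat!"
result = extract_words(sample, min_len=3, max_len=10, excluded={"the"})
print(sorted(result))
```

Here "on" is dropped as too short, "extraordinarily" as too long, and "the" because it is on the exclusion list; "cat" is stored once even though it appears twice.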
The digging process will create at least two files. The first one is the list of all the words and the second one is a database of URLs and information about the URLs. Other files may be created for a list of all URLs seen, all images seen, ASCII versions of the databases, etc.
Once the digging process is complete, the data must be converted into something the search engine can actually use. The htmerge program does this.
The term "merge" is used because data from several databases is gathered together and merged into several other databases. The source databases include those created by the latest dig as well as any previously merged databases. The latest dig produces a database that describes new pages and changes to previously existing pages; this new information is merged with the unchanged information to create up-to-date databases.
Other, optional tasks are also categorized under the merge phase.
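Setting the optional tasks aside, the core idea of merging new and changed records with unchanged ones can be sketched in Python. This is a simplified model, assuming each database is a dictionary keyed by URL; the real htmerge works on its own database files, and the function name here is hypothetical.

```python
def merge_databases(previous, latest_dig):
    """Combine a previously merged database with the results of the latest dig.

    Both arguments are assumed to map URL -> document information.
    Records from the latest dig win, so new pages are added and changed
    pages are overwritten, while untouched pages are carried over as-is.
    """
    merged = dict(previous)    # start from the unchanged information
    merged.update(latest_dig)  # add new pages, replace changed ones
    return merged


previous = {"/a.html": "old A", "/b.html": "old B"}
latest = {"/b.html": "new B", "/c.html": "new C"}
merged = merge_databases(previous, latest)
print(merged)
```

The result keeps `/a.html` unchanged, takes the updated record for `/b.html`, and adds `/c.html`. The input dictionaries are left unmodified.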
Searching is where all the information gathered and organized during the dig and merge stages is put to use. The htsearch program performs the actual searches. This CGI program takes its input from the HTML "search form" on the website, performs the search, and produces the HTML output (or the "failed search" output) that users see.
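The flow from query to HTML output can be sketched as follows. This Python example is only a toy model: the word index structure, the rule that every query word must match, and the HTML produced are assumptions for illustration and do not reflect htsearch's actual matching or templates.

```python
from html import escape


def search(word_index, query):
    """Return the URLs containing every word in query.

    word_index is assumed to map word -> set of URLs.
    """
    terms = query.lower().split()
    if not terms:
        return set()
    matches = set(word_index.get(terms[0], ()))
    for term in terms[1:]:
        matches &= set(word_index.get(term, ()))
    return matches


def render_results(query, urls):
    """Produce an HTML fragment: a result list, or a no-match message."""
    if not urls:
        return "<p>No matches for <b>%s</b>.</p>" % escape(query)
    items = "".join(
        '<li><a href="{0}">{0}</a></li>'.format(escape(url, quote=True))
        for url in sorted(urls)
    )
    return "<ul>" + items + "</ul>"


index = {"dig": {"/dig.html", "/intro.html"}, "merge": {"/intro.html"}}
hits = search(index, "dig merge")
page = render_results("dig merge", hits)
failed = render_results("zzz", search(index, "zzz"))
print(page)
print(failed)
```

Escaping the query and URLs before placing them in the output matters in any program that echoes form input back to a browser.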