Index Routines

An index routine is the part of an index definition in the database description (.dsc) file that defines how the Open SiteSearch Database Builder software creates database index terms and how it acts on the data (e.g., handling punctuation or extracting codes). Index routines create index terms in one of two formats: keyword indexes, where each word is its own index entry, or phrase indexes, where the contents of an entire field are a single index entry. Database Builder provides a wide range of index routines for both keyword and phrase indexes. The words() and phrase2() routines are the most commonly used and are the foundation of several other routines. These two core routines are described below.

Core Keyword and Phrase Routines

words()

Removes all punctuation except single hyphens ('-') and ampersands ('&'). It deletes and collapses all diacritics, escape sequences, and underscores ('_'). SGML tags in the data are ignored and are not indexed as terms. Finally, this routine replaces special characters with searchable characters, as follows:

Special Character	Index Character(s)
Diphthong (AE or ae)	ae
Diphthong (OE or oe)	oe
Hooked o (upper & lower)	o
Crossed d (upper & lower)	d
Slashed o (upper & lower)	o
Eth	d
Icelandic thorn	th
Turkish I	i
Hooked u (upper & lower)	u
Slashed l (upper & lower)	l
Script l (upper & lower)	l
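
As a rough illustration of the normalization described above, the following Python sketch approximates words(). It is not the Database Builder implementation. The function name words_sketch, the partial SPECIAL mapping, the lowercasing step, and the choice to delete removed punctuation without inserting a space are all assumptions made for the example, and escape sequences are not handled.

```python
import re
import unicodedata

# Partial, illustrative subset of the special-character table (assumption).
SPECIAL = {
    "Æ": "ae", "æ": "ae",
    "Œ": "oe", "œ": "oe",
    "Ø": "o", "ø": "o",
    "Đ": "d", "đ": "d",
    "Ð": "d", "ð": "d",
    "Þ": "th", "þ": "th",
    "ı": "i",
    "Ł": "l", "ł": "l",
}


def words_sketch(text: str) -> list[str]:
    """Approximate the keyword terms that words() is described as producing."""
    # Drop SGML tags so they never become index terms.
    text = re.sub(r"<[^>]*>", " ", text)
    # Replace special characters with their searchable equivalents.
    text = "".join(SPECIAL.get(ch, ch) for ch in text)
    # Strip diacritics: decompose, then drop the combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Delete underscores without inserting a space.
    text = text.replace("_", "")
    # Only single hyphens are kept; treat longer runs as a word break (assumption).
    text = re.sub(r"-{2,}", " ", text)
    # Remove all remaining punctuation except hyphens and ampersands.
    text = re.sub(r"[^\w\s&-]", "", text)
    # Assumption: terms are case-folded before indexing.
    return text.lower().split()


print(words_sketch("Crop-tending & <i>Ægir</i> café_menu!"))
```

In this sketch a hyphenated word stays a single keyword and a standalone ampersand survives as its own term, which follows from the rule that both characters are retained.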

phrase2()

Creates a single phrase index term, up to 72 characters in length, from the input data. By default, it eliminates punctuation except for embedded hyphens ('-') and ampersands ('&'), and it replaces slashes ('/') and double hyphens ('- -') with a blank. It deletes and collapses all diacritics, escape sequences, and underscores ('_'). Special characters are replaced in the same manner as described for the words() routine. It does not index the leading articles 'a', 'an', and 'the'.

You can use the phrase2() routine with a custom -ztable to accommodate data with diacritic characters. The pippin utility uses the -ztable during the database build process to determine how it should handle diacritics prior to indexing. If you specify the routine with the parameter 1, that is, phrase2(1), pippin uses the -ztable. If you specify it with no parameters, that is, phrase2(), pippin does not use the -ztable.

Example for an author phrase index:

index(10): phrase2(1) from(/* 006 */6/1\ ) with(fldid, pos)
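
The default phrase2() handling (no -ztable) can be approximated with the short Python sketch below. It is an illustration only, not the pippin implementation. The name phrase2_sketch is hypothetical, "embedded" is assumed to mean a hyphen with a word character on both sides, and diacritic and special-character handling is omitted for brevity.

```python
import re

MAX_LEN = 72  # maximum term length given in the description
ARTICLES = ("a ", "an ", "the ")


def phrase2_sketch(text: str) -> str:
    """Approximate the single phrase term that phrase2() is described as producing."""
    # Slashes and double hyphens ('--' or '- -') become blanks.
    text = text.replace("/", " ")
    text = re.sub(r"-\s?-", " ", text)
    # Underscores are deleted.
    text = text.replace("_", "")
    # Remove punctuation other than hyphens and ampersands.
    text = re.sub(r"[^\w\s&-]", "", text)
    # Keep only embedded hyphens (assumption: a word character on both sides).
    text = re.sub(r"(?<!\w)-|-(?!\w)", "", text)
    # Collapse runs of whitespace.
    text = " ".join(text.split())
    # Drop one leading article, compared case-insensitively.
    lowered = text.lower()
    for article in ARTICLES:
        if lowered.startswith(article):
            text = text[len(article):]
            break
    # Truncate to the maximum phrase length.
    return text[:MAX_LEN].rstrip()


print(phrase2_sketch("The Rise/Fall of Rock-n-Roll -- a history!"))
```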

Keyword Routines

The keyword routines below are grouped according to similar functionality to help you decide which routine you need to include in the index definition.

Using punctuation/characters as delimiters

adddelim()	medlinewords()
esdate()	nohypwd()
gxauthr()	wrddelim()

Collapsing punctuation and special characters into a single index

cpunct()	govtdoc()
ddc()	repnum()

Extracting substrings using special punctuation

dblpost()	substr1(param)
nssubst()	uggeocl()
phrbhyp()	ugsbjcl()
substr(param)

Creating an index based on dates and numbers

ercyear()	numrang()
nsnumbr()	pubdate(1)

Indexing class numbers (MARC)

ddc()

lcclass()

udc()

Indexing standard numbers (MARC)

isbn()

musicpb()

Indexing for special situations

itoa()

padzero(param)

Phrase Routines

The following phrase routines are grouped according to similar functionality to help you decide which routine you need to include in the index definition.

Merging two or more subfields into a single phrase

authdat()	combad()
combab(param)	comball()

Extracting substrings using special punctuation

mwphrase()

parenphrase()

uptoparen()

Translating data based on pre-defined values (MARC code, SGML data, and languages)

greekphrase()

marcla(1)

sgmlphrases()

Indexing map coordinate data

coords(param)

rangec(param)

Description of Index Routines

Routine

Phrase or Keyword Routine

Description

adddelim()

Keyword

Identical to the words() routine, but adddelim() allows you to include additional delimiters to separate words in an index. Additional delimiters must be defined in a separate file, which is invoked with the -z option of pippin. See The Pippin Utility for information on the -z option.

authdat()

Phrase

Identical to phrase2() routine except as follows:

Operates only on MARC subfields 'a' (personal author) and 'd' (date), which it combines into a single term of up to 72 characters. Subfield 'd' is added only if it is numeric.
Adds two spaces between subfields 'a' and 'd' and does not remove commas (',') from data.

Note:

This routine is a specialized version of combad() that was designed to create author/date terms. You should use the combad() routine in most cases.

combab(param)

Phrase

Identical to phrase2() routine except as follows:

Operates only on subfields 'a' and 'b', which it combines into a single term of up to 72 characters.
param is the number of 'b' subfields to include in the term. If param is 1, the first 'b' subfield is used. If param is omitted, all 'b' subfields found are used.

combad()

Phrase

Identical to phrase2() routine except as follows:

Operates only on subfields 'a' and 'd', which it combines into a single term of up to 72 characters.

comball()

Phrase

Identical to phrase2() routine except as follows:

Combines the text from all subfields of the specified tag into a single index term of up to 72 characters.

coords(param)

Phrase

Creates a phrase index term from map coordinate data. This routine is intended specifically for geographical reference data. The parameter specifies the coordinate direction, using the numbering scheme listed below:

'1' is north
'2' is south
'3' is east
'4' is west

Skipping the first character in the field, the parameter is stored in 2 bytes of data for north and south coordinates or 3 bytes of data for east and west coordinates.

cpunct()

Keyword

Removes all non-alphanumeric characters, including spaces, from the data and collapses what remains into a single index term.
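
A minimal Python sketch of this collapsing behavior is shown below. The function name cpunct_sketch is hypothetical, and the code illustrates the description rather than the actual routine.

```python
def cpunct_sketch(field: str) -> str:
    """Keep only alphanumeric characters, producing one collapsed term."""
    return "".join(ch for ch in field if ch.isalnum())


print(cpunct_sketch("AB-12 / c.3"))
```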

dblpost()

Keyword

Includes each portion of a hyphenated term in the index as a separate word. For example, if the input is 'crop-tending', two terms are returned: 'crop' and 'tending'. Only terms with hyphens are indexed; all other terms are ignored.
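
The 'crop-tending' example can be sketched in Python as follows. The name dblpost_sketch is hypothetical, and the sketch assumes that words in the field are separated by whitespace.

```python
def dblpost_sketch(field: str) -> list[str]:
    """Return the parts of hyphenated words; non-hyphenated words are skipped."""
    terms = []
    for word in field.split():
        if "-" in word:
            # Each non-empty portion of the hyphenated word is its own term.
            terms.extend(part for part in word.split("-") if part)
    return terms


print(dblpost_sketch("crop-tending methods"))
```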

ddc()

Keyword

Deletes and collapses all punctuation in the field except periods ('.') to create a single index term. Designed for Dewey Decimal classification.

ercyear()

Keyword

Retrieves a four-character year term from any of the following formats:

YYMMDD
MMDDYY
YYYY <text>
YY <text>
<text> YY
<text> YYYY
[YYYY]
<text> [YYYY]
[YYYY] <text>

esdate()

Keyword

Identical to words() except for the following:

Hyphens ('-') are not valid characters and are collapsed from terms.

govtdoc()

Keyword

Deletes and collapses all punctuation and excludes any text within parentheses to create a single index term.

Note:

This routine has been designed especially for government document numbers.

greekphrase()

Phrase

Identical to sgmlphrases() with the following additions:

Looks for Greek character symbols.
Translates Greek symbols to a text description of the Greek letter (e.g., 'alpha').

gxauthr()

Keyword

Identical to words() except for the following:

The period ('.') becomes a term delimiter, or separator.

isbn()

Keyword

Indexes only valid alphanumeric data. Data within parentheses is not indexed. Designed for the ISBN.

itoa()

Keyword

Creates a searchable keyword index term from binary data.

lcclass()

Keyword

Formats the term in the standard LC class number search format aaa####.###.a###, where 'a' is an alphabetic character and '#' is a digit. Designed for the LC class number.

marcla(1)

Phrase

Retrieves the language from the fixed 008 field in a MARC record. It converts the three-character code to the actual text of the language. For example, if the code is 'fre', the index term generated is 'french'.

This is a phrase index routine since some languages are more than one word. The parameter indicates that the data should be pulled from bytes 36-38 of the 008 field; otherwise, the data is pulled from bytes 38-40.
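
The lookup described above might be sketched in Python as follows. This is an illustration under stated assumptions: the byte numbers are read as 1-based positions, the LANGUAGES table is a tiny hypothetical subset of the language codes, and marcla_sketch is not a real SiteSearch function.

```python
from typing import Optional

# Hypothetical, partial code-to-language table for illustration only.
LANGUAGES = {"eng": "english", "fre": "french", "ger": "german"}


def marcla_sketch(field_008: str, use_param: bool = True) -> Optional[str]:
    """Translate the language code in an 008 field to its language name."""
    # 1-based bytes 36-38 with the parameter, 38-40 without it (assumption).
    start = 35 if use_param else 37
    code = field_008[start:start + 3]
    # Unknown codes yield no term in this sketch.
    return LANGUAGES.get(code)


sample_008 = " " * 35 + "fre" + "  "
print(marcla_sketch(sample_008))
```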

medlinewords()

Keyword

Identical to words() except for the following:

Periods ('.') are retained as valid characters for terms.

musicpb()

Keyword

Collapses punctuation, spaces, and embedded alphabetic characters for MARC music publisher numbers into a standard searching form. A comma-space combination or a space-parenthesis combination is a delimiter for an index term. Any non-numeric data following the numeric portion of the term is ignored. If a double hyphen ('- -') is found, a range of numbers is generated for the terms between the beginning and end points of the range.

mwphrase()

Phrase

Identical to phrase2() except for the following:

Extracts phrase terms only from fields that have more than 1 word (multi-word phrase).

nohypwd()

Keyword

Identical to words() except for the following:

Hyphens ('-') become term delimiters, or separators.

nsnumbr()

Keyword

Indexes only the numeric portion of the terms.

numrang()

Keyword

Generates index terms for a range of numbers. Given two numbers joined by a hyphen, it generates terms for all the numbers between them.
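
A small Python sketch of range expansion is given below. The description does not say whether the endpoints themselves are indexed, so this sketch assumes they are. The name numrang_sketch is hypothetical.

```python
def numrang_sketch(value: str) -> list[str]:
    """Expand 'start-end' into one term per number (endpoints included by assumption)."""
    start_text, _, end_text = value.partition("-")
    start, end = int(start_text), int(end_text)
    return [str(n) for n in range(start, end + 1)]


print(numrang_sketch("1995-1998"))
```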

nssubst()

Keyword

Indexes terms up to the first occurrence of a hyphen ('-'), slash ('/'), or blank (' ').
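
In Python, this kind of truncation at the first delimiter could be sketched as shown below. The name nssubst_sketch is hypothetical, and the sketch only illustrates the rule as described.

```python
import re


def nssubst_sketch(field: str) -> str:
    """Return the data up to the first hyphen, slash, or blank."""
    return re.split(r"[-/ ]", field, maxsplit=1)[0]


print(nssubst_sketch("QA76/9 rev"))
```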

padzero(param)

Keyword

Pads a term on the left with zeros. The parameter value is the number of bytes to pad.

parenphrase()

Phrase

Identical to phrase2() except for the following:

Only indexes the data within the first occurrence of parentheses.
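
The extraction step can be sketched in Python as follows. The name parenphrase_sketch is hypothetical, the phrase2() normalization that would follow is omitted, and returning an empty string when no parentheses are present is an assumption.

```python
import re


def parenphrase_sketch(field: str) -> str:
    """Return the text inside the first pair of parentheses, if any."""
    match = re.search(r"\(([^)]*)\)", field)
    # Assumption: a field without parentheses produces no term.
    return match.group(1).strip() if match else ""


print(parenphrase_sketch("Paris (France) (city)"))
```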

phrbhyp()

Keyword

Creates a term from the data up to the first occurrence of a hyphen ('-').

pubdate(1)

Keyword

Retrieves the publication date from the fixed 008 field in a MARC record and creates a keyword index term. Any fill characters are converted to zero in the term. The parameter indicates that the date should be pulled from the bytes 8-11 of the 008 field; otherwise, bytes 10-13 are used.
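
As a hedged illustration, the date extraction might look like the Python sketch below. It assumes the byte numbers are 1-based and that the fill character is the MARC fill character '|'. The name pubdate_sketch is hypothetical.

```python
def pubdate_sketch(field_008: str, use_param: bool = True) -> str:
    """Extract the four-character publication date from an 008 field."""
    # 1-based bytes 8-11 with the parameter, 10-13 without it (assumption).
    start = 7 if use_param else 9
    year = field_008[start:start + 4]
    # Assumption: '|' is the fill character; it is converted to zero.
    return year.replace("|", "0")


print(pubdate_sketch("850101s1984    nyu"))
```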

rangec(param)

Phrase

Generates a range of map coordinate terms given two endpoints as input. The parameter value specifies the type of coordinate, from the following options:

'1' is north
'2' is south
'3' is east
'4' is west

repnum()

Keyword

Deletes and collapses all punctuation in the field to create a single index term.

sgmlphrases()

Phrase

Identical to phrase2() except for the following:

Identifies SGML tags embedded in the data and removes the tags. The SGML tags are not indexed.

substr(param)

Keyword

Creates an index term from the first n characters in the field. The parameter value specifies the number of characters to include.

substr1(param)

Keyword

Creates an index term by skipping the first n characters in the field. The parameter value specifies the number of characters to skip.
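
The difference between substr(param) and substr1(param) can be shown with Python slicing. Both function names below are hypothetical stand-ins for the routines, and the sketch ignores any further normalization the routines may apply.

```python
def substr_sketch(field: str, n: int) -> str:
    """Like substr(n): keep only the first n characters."""
    return field[:n]


def substr1_sketch(field: str, n: int) -> str:
    """Like substr1(n): skip the first n characters and keep the rest."""
    return field[n:]


print(substr_sketch("QA76.9", 2), substr1_sketch("QA76.9", 2))
```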

udc()

Keyword

Creates a single term for the Universal Decimal Classification (UDC) field. It retains all alphanumerics, periods ('.'), and hyphens ('-').

uggeocl()

Keyword

Creates terms only from data enclosed in parentheses. Data that is not enclosed in parentheses is ignored.

ugsbjcl()

Keyword

Extracts data up to the first blank (' ') or parenthesis. All other terms are ignored.

uptoparen()

Phrase

Identical to phrase2() except for the following:

Indexes the data only up to the first occurrence of a parenthesis.

wrddelim()

Keyword

Identical to words() except for the following:

Allows you to specify a new set of delimiters used to separate words in the index.
Ignores the standard delimiters used in the words() routine and uses the delimiters you define, plus the blank space (' ') character.
Delimiter definitions must be listed in the Delimiters section of the -z option definitions file used by pippin in the database build process. Refer to The Pippin Utility for more information about pippin and the -z option.
