Index Routines
An index routine is the part of an index definition in the database
description (.dsc) file that defines how the Open SiteSearch Database Builder
software creates database index terms and how it acts on the data (for
example, by handling punctuation or extracting codes). Index routines create
index terms in one of two formats: keyword indexes, where each word is its own
index entry, or phrase indexes, where the contents of an entire field form a
single index entry. Database Builder provides a wide range of index routines
for both keyword and phrase indexes. The words() and phrase2() routines are
the most commonly used and are the foundation of several other routines. The
syntax and behavior of these two core routines are described below.
Core Keyword and Phrase Routines
words()
Removes all punctuation except single hyphens ('-') and ampersands ('&'). It
deletes and collapses all diacritics, escape sequences, and underscores ('_').
SGML tags in the data are ignored and are not treated as index terms. Finally,
this routine replaces the following special characters with searchable
characters:
Special Character | Index Character(s)
Diphthong (AE or ae) | ae
Diphthong (OE or oe) | oe
Hooked o (upper & lower) | o
Crossed d (upper & lower) | d
Slashed o (upper & lower) | o
Eth | d
Icelandic thorn | th
Turkish I | i
Hooked u (upper & lower) | u
Slashed l (upper & lower) | l
Script l (upper & lower) | l
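The behavior described for words() can be approximated with the following Python sketch. It is an illustration only, not the Database Builder implementation. It assumes Unicode input, assumes that removed punctuation acts as a word separator, and assumes that terms are folded to lowercase; none of these details is stated above. The name SPECIAL and the Unicode code points chosen for each table entry are likewise assumptions made for the example.

```python
import re
import unicodedata

# Replacements from the special-character table. The Unicode code points
# chosen for each entry are an assumption made for this sketch.
SPECIAL = {
    "Æ": "ae", "æ": "ae",      # diphthong AE
    "Œ": "oe", "œ": "oe",      # diphthong OE
    "Ơ": "o", "ơ": "o",        # hooked o
    "Đ": "d", "đ": "d",        # crossed d
    "Ø": "o", "ø": "o",        # slashed o
    "Ð": "d", "ð": "d",        # eth
    "Þ": "th", "þ": "th",      # Icelandic thorn
    "İ": "i", "ı": "i",        # Turkish I
    "Ư": "u", "ư": "u",        # hooked u
    "Ł": "l", "ł": "l",        # slashed l
    "ℓ": "l",                  # script l
}


def words(text: str) -> list[str]:
    """Return keyword terms, roughly as the words() description reads."""
    text = re.sub(r"<[^>]*>", " ", text)                 # ignore SGML tags
    text = "".join(SPECIAL.get(ch, ch) for ch in text)   # special characters
    # Delete diacritics: decompose, then drop the combining marks.
    text = unicodedata.normalize("NFD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = text.replace("_", "")                         # collapse underscores
    text = re.sub(r"-{2,}", " ", text)                   # only single hyphens survive
    # Assumption: all other punctuation becomes a word separator.
    text = re.sub(r"[^\w\s&-]", " ", text)
    # Assumption: terms are folded to lowercase.
    return text.lower().split()


print(words("Crop-tending & <i>AT&T</i>"))
print(words("Þórr's café_au--lait"))
```

Under these assumptions, 'crop-tending' and 'at&t' each survive as a single term, while 'café_au' collapses to 'cafeau'.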
phrase2()
Creates a single phrase index term, up to 72 characters in length, from the
input data. By default, it eliminates punctuation except for embedded hyphens
('-') and ampersands ('&'), and it replaces slashes ('/') and double hyphens
('- -') with a blank. It deletes and collapses all diacritics, escape
sequences, and underscores ('_'). Special characters are replaced in the same
manner as described for the words() routine. It does not index the leading
articles 'a', 'an', and 'the'.
You can use the phrase2() routine with a custom -ztable to accommodate data
with diacritic characters. The pippin utility uses the -ztable during the
database build process to determine how it should handle diacritics prior to
indexing. If you specify the phrase2() routine with the parameter 1, that is,
phrase2(1), pippin uses the -ztable. If you specify the phrase2() routine with
no parameters, that is, phrase2(), pippin does not use the -ztable.
Example for an author phrase index:
index(10): phrase2(1)
from(/* 006 */6/1\ ) with(fldid, pos)
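To make the phrase2() rules concrete, the following Python sketch approximates the default behavior (no -ztable). It is a rough model, not the actual routine. It assumes Unicode input and lowercase folding, assumes that punctuation other than slashes and double hyphens is dropped without leaving a blank, and omits the special-character replacements and escape-sequence handling. MAX_LEN and LEADING_ARTICLES are names invented for the example.

```python
import re
import unicodedata

MAX_LEN = 72                              # maximum term length stated above
LEADING_ARTICLES = ("a", "an", "the")     # leading articles that are not indexed


def phrase2(text: str) -> str:
    """Return one phrase term, roughly as the phrase2() description reads."""
    # Delete diacritics: decompose, then drop the combining marks.
    text = unicodedata.normalize("NFD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = text.replace("_", "")                    # collapse underscores
    # Slashes and double hyphens ('--' or '- -') become a blank.
    text = re.sub(r"/|-\s?-", " ", text)
    # Assumption: remaining punctuation is dropped without leaving a blank.
    text = re.sub(r"[^\w\s&-]", "", text)
    # Keep only embedded hyphens (a word character on both sides).
    text = re.sub(r"(?<!\w)-|-(?!\w)", " ", text)
    # Assumption: terms are folded to lowercase.
    terms = text.lower().split()
    if terms and terms[0] in LEADING_ARTICLES:
        terms = terms[1:]                           # skip a leading article
    return " ".join(terms)[:MAX_LEN].rstrip()


print(phrase2("The Lord of the Rings / J.R.R. Tolkien"))
print(phrase2("An étude--in C_minor"))
```

Only an article in the first position is dropped, so 'the' inside the title is retained.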
Keyword Routines
The keyword routines
below are grouped according to similar functionality to help you decide
which routine you need to include in the index definition.
Using punctuation/characters as delimiters
Collapsing punctuation and special characters into a single index
Extracting substrings using special punctuation
Creating an index based on dates and numbers
Indexing class numbers (MARC)
Indexing standard numbers (MARC)
Indexing for special situations
Phrase Routines
The following
phrase routines are grouped according to similar functionality to help
you decide which routine you need to include in the index definition.
Merging two or more subfields into a single phrase
Extracting substrings using special punctuation
Translating data based on pre-defined values (MARC code, SGML data, and languages)
Indexing map coordinate data
Description of Index Routines
Routine
|
Phrase or Keyword Routine
|
Description
|
adddelim()
|
Keyword
|
Identical
to the words() routine, but adddelim()
allows you to include additional delimiters to separate words in
an index. Additional delimiters must be defined in a separate file,
which is invoked with the -z option of pippin.
See The Pippin Utility for information
on the -z option.
|
authdat()
|
Phrase
|
Identical to the phrase2() routine except as follows:
- Operates only on MARC subfields 'a' (personal author) and 'd' (date),
which it combines into a single term of up to 72 characters. Subfield
'd' is added only if it is numeric.
- Adds two spaces between subfields 'a' and 'd' and does not remove
commas (',') from the data.
Note: |
This routine is a specialized version of combad() that was designed to
create author/date terms. You should use the combad() routine in most cases.
|
|
combab(param)
|
Phrase
|
Identical to the phrase2() routine except as follows:
- Operates
only on subfields 'a' and 'b', which it combines into a single
term of up to 72 characters.
- param
is the number of 'b' subfields to include in the term. If param
is 1, the first 'b' subfield is used. If param is
omitted, all 'b' subfields found are used.
|
combad()
|
Phrase
|
Identical to the phrase2() routine except as follows:
- Operates
only on subfields 'a' and 'd', which it combines into a single
term of up to 72 characters.
|
comball()
|
Phrase
|
Identical to the phrase2() routine except as follows:
- Combines
the text from all subfields of the specified tag into a single
index term of up to 72 characters.
|
coords(param)
|
Phrase
|
Creates a phrase index term from map coordinate data. This routine is
designed specifically for geographical reference data. The parameter
specifies the coordinate type, using the numbering scheme listed below:
- '1' is
north
- '2' is
south
- '3' is
east
- '4' is
west
Skipping
the first character in the field, the parameter is stored in 2 bytes
of data for north and south coordinates or 3 bytes of data for east
and west coordinates.
|
cpunct()
|
Keyword
|
Collapses
all non-alphanumeric characters, including spaces, from the data
to create a single index term.
|
dblpost()
|
Keyword
|
Includes each portion of a hyphenated term in the index as a separate word.
For example, if the input is 'crop-tending', two terms are returned: 'crop'
and 'tending'. Only terms with hyphens are indexed; all other terms are
ignored.
|
ddc()
|
Keyword
|
Deletes
and collapses all punctuation in the field except periods ('.')
to create a single index term. Designed for Dewey Decimal classification.
|
ercyear()
|
Keyword
|
Retrieves a four-character year term from any of the following formats:
- YYMMDD
- MMDDYY
- YYYY <text>
- YY <text>
- <text> YY
- <text> YYYY
- [YYYY]
- <text> [YYYY]
- [YYYY] <text>
|
esdate()
|
Keyword
|
Identical
to words() except for the following:
- Hyphens
('-') are not valid characters and are collapsed from terms.
|
govtdoc()
|
Keyword
|
Deletes and collapses all punctuation and excludes any text within
parentheses to create a single index term.
Note: |
This
routine has been designed especially for government document
numbers.
|
|
greekphrase()
|
Phrase
|
Identical
to sgmlphrases() with the following additions:
- Looks for Greek character symbols.
- Translates Greek symbols to a text description of the Greek letter (e.g.,
'alpha').
|
gxauthr()
|
Keyword
|
Identical
to words() except for the following:
- The period ('.') becomes a term delimiter, or separator.
|
isbn()
|
Keyword
|
Creates an index term that contains only valid alphanumeric data. Data within
parentheses is not indexed. Designed for ISBNs.
|
itoa()
|
Keyword
|
Creates
a searchable keyword index term from binary data.
|
lcclass()
|
Keyword
|
Formats the term into the standard LC class number search format,
aaa####.###.a###, where 'a' is an alphabetic character and '#' is a digit.
Designed for the LC class number.
|
marcla(1)
|
Phrase
|
Retrieves the language from the fixed 008 field in a MARC record. It converts
the three-character code to the actual text of the language. For example, if
the code is 'fre', the index term generated is 'french'.
This is a phrase index routine because some language names are more than one
word. The parameter indicates that the data should be pulled from bytes 36-38
of the 008 field; otherwise, the data is pulled from bytes 38-40.
|
medlinewords()
|
Keyword
|
Identical
to words() except for the following:
- Periods
('.') are retained as valid characters for terms.
|
musicpb()
|
Keyword
|
Collapses punctuation, spaces, and embedded alpha characters for MARC music
publisher numbers into a standard searching form. A comma-space combination
or a space-parenthesis combination is a delimiter for an index term. Any
non-numeric data following the numeric portion of the term is ignored. If a
double hyphen ('- -') is found, a range of numbers is generated for the terms
between the beginning and end points of the range.
|
mwphrase()
|
Phrase
|
Identical
to phrase2() except for the following:
- Extracts phrase terms only from fields that contain more than one word
(multi-word phrases).
|
nohypwd()
|
Keyword
|
Identical
to words() except for the following:
- Hyphens
('-') become term delimiters, or separators.
|
nsnumbr()
|
Keyword
|
Indexes
only the numeric portion of the terms.
|
numrang()
|
Keyword
|
Generates index terms for a range of numbers. Given two numbers joined by a
hyphen, it generates a term for each number in the range that the hyphenated
pair defines.
|
nssubst()
|
Keyword
|
Indexes
terms up to the first occurrence of a hyphen ('-'), slash ('/'),
or blank (' ').
|
padzero(param)
|
Keyword
|
Pads a term on the left with zeros. The parameter value is the number of
bytes to pad.
|
parenphrase()
|
Phrase
|
Identical
to phrase2() except for the following:
- Only
indexes the data within the first occurrence of parentheses.
|
phrbhyp()
|
Keyword
|
Creates
a term from the data up to the first occurrence of a hyphen ('-').
|
pubdate(1)
|
Keyword
|
Retrieves the publication date from the fixed 008 field in a MARC record and
creates a keyword index term. Any fill characters are converted to zeros in
the term. The parameter indicates that the date should be pulled from bytes
8-11 of the 008 field; otherwise, bytes 10-13 are used.
|
rangec(param)
|
Phrase
|
Generates a range of map coordinate terms given two endpoints as input. The
parameter value specifies the type of coordinate, from the following options:
- '1' is
north
- '2' is
south
- '3' is
east
- '4' is
west
|
repnum()
|
Keyword
|
Deletes
and collapses all punctuation in the field to create a single index
term.
|
sgmlphrases()
|
Phrase
|
Identical
to phrase2() except for the following:
- Identifies SGML tags embedded in the data and removes the tags. The SGML
tags are not indexed.
|
substr(param)
|
Keyword
|
Creates an index term from the first n characters in the field. The parameter
value specifies the number of characters to extract.
|
substr1(param)
|
Keyword
|
Creates an index term by skipping the first n characters in the field. The
parameter value specifies the number of characters to skip.
|
udc()
|
Keyword
|
Creates
a single term for the universal decimal field. It retains all alphanumerics,
periods ('.'), and hyphens ('-').
|
uggeocl()
|
Keyword
|
Creates terms only from data enclosed in parentheses. Data that is not
enclosed in parentheses is ignored.
|
ugsbjcl()
|
Keyword
|
Extracts
data up to the first blank (' ') or parenthesis. All other terms
are ignored.
|
uptoparen()
|
Phrase
|
Identical
to phrase2() except for the following:
- Indexes
the data only up to the first occurrence of a parenthesis.
|
wrddelim()
|
Keyword
|
Identical
to words() except for the following:
- Allows
you to specify a new set of delimiters used to separate words
in the index.
- Ignores
the standard delimiters used in the words() routine
and uses the delimiters you define, plus the blank space (' ')
character.
- Delimiter
definitions must be listed in the Delimiters
section of the -z option definitions file used by
pippin in the database build process. Refer to The
Pippin Utility for more information about pippin and
the -z option.
|
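Several of the simpler routines in the table above can be modeled in a few lines each. The Python sketch below covers dblpost(), numrang(), substr(param), and substr1(param). It is only an approximation under stated assumptions: input is split on whitespace, numrang() includes both endpoints, and none of the normalization performed by the real routines is applied. The Python function names mirror the routine names for readability; they are not part of any Database Builder API.

```python
def dblpost(text: str) -> list[str]:
    """Index each part of a hyphenated word; ignore words without hyphens."""
    terms = []
    for word in text.split():            # assumption: whitespace separates words
        if "-" in word:
            terms.extend(part for part in word.split("-") if part)
    return terms


def numrang(text: str) -> list[str]:
    """Expand 'start-end' into one term per number in the range."""
    start, end = (int(part) for part in text.split("-", 1))
    # Assumption: both endpoints are included in the generated terms.
    return [str(n) for n in range(start, end + 1)]


def substr(text: str, n: int) -> str:
    """Create a term from the first n characters of the field."""
    return text[:n]


def substr1(text: str, n: int) -> str:
    """Create a term by skipping the first n characters of the field."""
    return text[n:]


print(dblpost("manual crop-tending methods"))
print(numrang("1995-1998"))
print(substr("QA76.9", 2), substr1("QA76.9", 2))
```

substr() and substr1() are complementary: with the same parameter, one keeps the leading characters of the field and the other keeps the rest.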
See Also
Index Definitions
Creating a Database Description (.dsc) File
Database Description (.dsc) File: Structure and Syntax
Database Description (.dsc) File Example