Pears-Newton
Index Routine Comparison
Contents
Introduction
Document Conventions
Newton Index Routines Covered in This Document
Phrase Indexes
Keyword Indexes
Introduction
This document explains how to set up Pears index definitions within a database description configuration file to create indexes analogous to the most frequently used Newton index routines. It also shows how to set up index definitions in a WebZ database configuration file that use Pears index routines as query normalizers.
You can often use the Pears general-purpose index routines (ORG.oclc.pears.IndexRoutines.Phrase and ORG.oclc.pears.IndexRoutines.Words) with optional parameters to obtain results comparable to a Newton index routine. In other cases, you emulate a Newton index routine with a Pears index routine written for a specific purpose, such as ORG.oclc.pears.IndexRoutines.PublicationDate or ORG.oclc.pears.IndexRoutines.Numbers.
Besides showing how to implement Newton index routines in Pears, this document demonstrates the flexibility of Pears index routines. For Newton index routines not included here, the examples illustrate how to use the optional parameters of a Pears index routine to create a similar index in Pears.
Document
Conventions
- The index definitions in this document do not include variables that
a particular index routine may use in other circumstances.
- The
WebZ index definition sections in this document include only the variables
related to using a Pears index routine to perform query normalization
for a specific index. They do not contain other required or optional
parameters in the [index_definition]
section.
- indxpkg
refers to the fully qualified class package name for Pears index
routines, ORG.oclc.pears.IndexRoutines. For example, indxpkg.Phrase
stands for ORG.oclc.pears.IndexRoutines.Phrase. You would use the fully
qualified class package name for the value of the routine parameter
in a dbnamedesc.ini file and the filter parameter in a dbname.ini
file.
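As a small illustration of the indxpkg convention, the following Python sketch expands the shorthand into the fully qualified class name you would type in a configuration file. The helper name expand_indxpkg is hypothetical and exists only for this illustration; it is not part of Pears or WebZ.

```python
# Fully qualified package name that "indxpkg" stands for in this document.
INDXPKG = "ORG.oclc.pears.IndexRoutines"


def expand_indxpkg(value: str) -> str:
    """Expand the document's 'indxpkg.' shorthand (hypothetical helper)."""
    prefix = "indxpkg."
    if value.startswith(prefix):
        return INDXPKG + "." + value[len(prefix):]
    # Values that do not use the shorthand are returned unchanged.
    return value


print(expand_indxpkg("indxpkg.Phrase"))  # ORG.oclc.pears.IndexRoutines.Phrase
```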
Newton Index Routines Covered in This Document
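Phrase Indexes: combad(), greekphrase(), marcla(), phrase2(), sgmlphrase(), uptoparen()
Keyword Indexes: adddelim(), cpunct(), ddc(), esdate(), govtdoc(), gxauthr(), isbn(), lcclass(), medlinewords(), nohypwd(), nsnumbr(), nssubst(), numrang(), padzero(param), phrbhyp(), pubdate(1), repnum(), substr(param), substr1(param), udc(), ugsbjcl(), wrddelim()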
Each index routine has a section that:
- briefly describes
the index routine
- indicates how
to set up a Pears index definition in the dbnamedesc.ini
configuration file
- indicates
how to set up the query normalization portion of the index's [index_definition]
section in the database's dbname.ini configuration file
- where applicable,
provides notes or other information
Phrase
Indexes
combad()
|
Description:
|
|
Identical
to Newton phrase2() routine or Pears Phrase
class except that it operates only on subfields 'a' and 'd', which
it combines into a single term of up to 72 characters. |
|
Pears
Index Definition
(in dbnamedesc.ini)
|
WebZ
Index Definition
(in dbname.ini)
|
Notes
|
routine=indxpkg.Phrase
subfield* = 1
subfield* = 4
maxlength = 72
joinFieldsWith = \u0020 |
filter=indxpkg.Phrase
maxlength=72
|
The
Pears Phrase index class can create index terms greater than
72 characters long. The 72-character limit used here demonstrates
how to replicate this limit in Pears. |
|
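The joining and truncation described for combad() can be sketched in Python as follows. This is only an illustration of the behavior described above, not the Pears implementation: the function name and the dictionary representation of subfields are assumptions, and the sketch does not apply the punctuation normalization that the Phrase class also performs.

```python
def combad_term(subfields: dict, max_length: int = 72) -> str:
    """Join subfields 'a' and 'd' with a space and truncate (illustrative only)."""
    # Only subfields 'a' and 'd' contribute; missing or empty ones are skipped.
    parts = [subfields[code] for code in ("a", "d") if subfields.get(code)]
    return " ".join(parts)[:max_length]


record = {"a": "Twain, Mark,", "d": "1835-1910.", "t": "Adventures of Tom Sawyer"}
print(combad_term(record))  # Twain, Mark, 1835-1910.
```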
greekphrase()
|
Description:
|
|
Same as
the Pears Phrase class,
except that it:
- removes
angle-brackets "<>" and their enclosed character
strings
- looks
for specified Unicode Greek characters and substitutes their
text equivalents
|
|
Pears
Index Definition
(in dbnamedesc.ini)
|
WebZ
Index Definition
(in dbname.ini)
|
Notes
|
routine=indxpkg.Phrase
stripHTML=true
replace=replaceGreek
[replaceGreek]
\u03B1 = alpha
\u03B2 = beta
\u03B3 = gamma
... |
filter=indxpkg.Phrase
stripHTML=true
replace=replaceGreek
[replaceGreek]
\u03B1 = alpha
\u03B2 = beta
\u03B3 = gamma
... |
See
note (2) under phrase2().
Since
Pears indexes and stores the Unicode representation of Greek
characters, it may no longer be necessary to replace Greek
characters with text equivalents.
|
|
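The two extra steps that greekphrase() adds can be sketched in Python as follows. The function name is hypothetical, the mapping covers only the three characters listed in the [replaceGreek] example, and the regular expression is an assumed approximation of what stripHTML does.

```python
import re

# Partial mapping that mirrors the first entries of [replaceGreek].
REPLACE_GREEK = {"\u03b1": "alpha", "\u03b2": "beta", "\u03b3": "gamma"}


def greekphrase_sketch(text: str) -> str:
    """Strip <...> strings, then spell out the mapped Greek letters."""
    without_tags = re.sub(r"<[^>]*>", "", text)
    return "".join(REPLACE_GREEK.get(ch, ch) for ch in without_tags)


print(greekphrase_sketch("<i>\u03b2</i>-carotene"))  # beta-carotene
```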
marcla()
|
Description:
|
|
Retrieves the language code from the fixed 008 field in a MARC record and converts the 3-character code to the name of the language. For example, if the code is 'fre', it generates the index term 'french'.
If marcla contains the parameter (1), it pulls the data from bytes 36-38 of the 008 field; otherwise, it pulls the data from bytes 38-40.
|
|
Pears
Index Definition
(in dbnamedesc.ini)
|
WebZ
Index Definition
(in dbname.ini)
|
Notes
|
routine=indxpkg.MarcLanguage
tagpath=8
startOffset=35
OR
(if the Newton routine had a parameter)
startOffset=37
|
Not
applicable because this is a restrictor index.
|
startOffset
begins counting bytes with position 0. |
|
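Because startOffset counts from 0 while the description counts bytes from 1, a short Python sketch may help show how the two line up. The function name, the three-entry language table, and the fallback for unknown codes are assumptions made for this illustration only.

```python
# Hypothetical excerpt of a code-to-language table; a real table is much larger.
LANGUAGES = {"fre": "french", "eng": "english", "ger": "german"}


def marc_language(field_008: str, start_offset: int = 35) -> str:
    """Return the language name for the 3-character code at start_offset."""
    code = field_008[start_offset:start_offset + 3].lower()
    # Assumption: fall back to the raw code when it is not in the table.
    return LANGUAGES.get(code, code)


# Build a 40-character 008 field with 'fre' in bytes 36-38 (offsets 35-37).
field = "x" * 35 + "fre" + "  "
print(marc_language(field))  # french
```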
phrase2()
|
Description:
|
|
Creates a single phrase index term of up to 72 characters from the input data. By default, it removes punctuation except for embedded hyphens ('-') and ampersands ('&'), and it replaces slashes ('/') and double hyphens ('- -') with a blank. It deletes and collapses all diacritics, escape sequences, and underscores ('_'), and replaces special characters (see [replaceC] section). It does not index the leading articles 'a', 'an', and 'the'. |
|
Pears
Index Definition
(in dbnamedesc.ini)
|
WebZ
Index Definition
(in dbname.ini)
|
Notes
|
routine=indxpkg.Phrase
collapse=~`@#$%^*()_+={
[}]|\<,>.?\u0309\u0300
\u0301\u0302\u0303\u0304
\u0306\u0307\u0308\u030C
\u030A\uFE20\uFE21\u0315
\u030B\u0310\u0327\u0328
\u0323\u0324\u0325\u0333
\u0332\u0326\u031C\u032E
\u0313
replaceChar=replaceC
[replaceC]
/ = \u0020
\u00C6 = AE
\u00E6 = AE
\u0152 = OE
\u0153 = OE
\u01A0 = o
\u01A1 = o
\u0110 = d
\u0111 = d
\u00D8 = o
\u00F8 = o
\u00DE = th
\u00FE = th
\u0131 = i
\u01AF = u
\u01B0 = u
\u0141 = l /* el */
\u0142 = l /* El */
\u2113 = l /* El */
|
filter=indxpkg.Phrase
collapse=~`@#$%^*()_+={
[}]|\<,>.?\u0309\u0300
\u0301\u0302\u0303\u0304
\u0306\u0307\u0308\u030C
\u030A\uFE20\uFE21\u0315
\u030B\u0310\u0327\u0328
\u0323\u0324\u0325\u0333
\u0332\u0326\u031C\u032E
\u0313
replaceChar=replaceC
[replaceC]
/ = \u0020
\u00C6 = AE
\u00E6 = AE
\u0152 = OE
\u0153 = OE
\u01A0 = o
\u01A1 = o
\u0110 = d
\u0111 = d
\u00D8 = o
\u00F8 = o
\u00DE = th
\u00FE = th
\u0131 = i
\u01AF = u
\u01B0 = u
\u0141 = l /* El */
\u0142 = l /* El */
\u2113 = l /* El */ |
(1)
The Pears Phrase index routine can create index terms greater
than 72 characters long. This example does not include this
restriction.
(2)
These collapse and replace parameters approximate the normalization
performed in the Newton phrase2() (and words()) routines.
Many
of the collapses and replaces may no longer be desirable because
of the Unicode support built into Pears and WebZ in SiteSearch
4.2.0.
Therefore,
you may want to keep some diacritics and special characters
in the indexes and remove them from the collapse and replace
lists shown in this example.
|
|
sgmlphrase()
|
Description:
|
|
Identical
to Newton phrase2() routine except that it also removes all text
within angle brackets and the angle brackets themselves (< and
>).
|
|
Pears
Index Definition
(in dbnamedesc.ini)
|
WebZ
Index Definition
(in dbname.ini)
|
Notes
|
routine=indxpkg.Phrase
stripHTML=true |
filter=indxpkg.Phrase
stripHTML=true
|
|
|
uptoparen()
|
Description:
|
|
Identical
to Newton phrase2() routine except that it indexes the data only up
to the first occurrence of a left parenthesis. |
|
Pears
Index Definition
(in dbnamedesc.ini)
|
WebZ
Index Definition
(in dbname.ini)
|
Notes
|
routine=indxpkg.Phrase
indexUpTo*=( |
filter=indxpkg.Phrase
indexUpTo*=(
|
|
|
Keyword
Indexes
|
adddelim()
|
|
Description:
|
|
Identical
to the Newton words() routine or the Pears Words class, but adddelim()
allows you to include additional delimiters to separate words in an
index. |
|
|
|
Pears
Index Definition
(in dbnamedesc.ini)
|
WebZ
Index Definition
(in dbname.ini)
|
Notes
|
routine=indxpkg.Words
extraDelimiters=AA |
filter=indxpkg.Words
extraDelimiters=AA |
AA
represents the additional delimiters. |
|
cpunct()
|
Description:
|
|
Collapses
all non-alphanumeric characters, including spaces, from the data to
create a single index term. |
|
Pears
Index Definition
(in dbnamedesc.ini)
|
WebZ
Index Definition
(in dbname.ini)
|
Notes
|
routine=indxpkg.Words
OccurrenceRoutine= \
ORG.oclc.pears.Bartlett.wordfield
collapse=~`@#$%^&*()_=+[]\
{}|;':"< >?,/.- |
filter=indxpkg.Words
collapse=~`@#$%^&*()_=+[]\
{}|;':"< >?,/.-
|
To
use a space as a collapse character, insert the space between
two other characters, such as the < and > in this example,
or use the Unicode designation (\u0020) for a space.
|
|
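The effect of collapsing every non-alphanumeric character can be sketched in Python as follows. The function name is hypothetical, and the sketch relies on Python's notion of an alphanumeric character rather than the explicit collapse list shown above; it also leaves case unchanged.

```python
def cpunct_sketch(text: str) -> str:
    """Drop every non-alphanumeric character, including spaces."""
    return "".join(ch for ch in text if ch.isalnum())


print(cpunct_sketch("QA 76.73 .J38 (2nd ed.)"))  # QA7673J382nded
```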
ddc()
|
Description:
|
|
Designed
for Dewey Decimal classification. Deletes and collapses all punctuation
in the field except periods ('.') to create a single index term. |
|
Pears
Index Definition
(in dbnamedesc.ini)
|
WebZ
Index Definition
(in dbname.ini)
|
Notes
|
routine=indxpkg.Phrase
collapse=?!,"':;&_-< >[]
|
filter=indxpkg.Phrase
collapse=?!,"':;&_-< >[] |
To
use a space as a collapse character, insert the space between
two other characters, such as the < and > in this example,
or use the Unicode designation (\u0020) for a space.
|
|
esdate()
|
Description:
|
|
Identical
to Newton words() routine or Pears Words class except that hyphens
('-') are not valid characters and are collapsed from terms. |
|
Pears
Index Definition
(in dbnamedesc.ini)
|
WebZ
Index Definition
(in dbname.ini)
|
Notes
|
routine=indxpkg.Words
OccurrenceRoutine= \
ORG.oclc.pears.Bartlett.wordfield
collapse=- |
filter=indxpkg.Words
collapse=-
|
|
|
govtdoc()
|
Description:
|
|
Designed
for indexing government document numbers. Deletes and collapses all
punctuation and excludes any text within parentheses to create a single
index term. |
|
Pears
Index Definition
(in dbnamedesc.ini)
|
WebZ
Index Definition
(in dbname.ini)
|
Notes
|
routine=indxpkg.PhraseMinusBoundPhrases
bounds=()
collapse=?!,"':;&_-.< >[]
|
filter=indxpkg.Phrase
bounds=()
collapse=?!,"':;&_-.< >[]
|
|
|
gxauthr()
|
Description:
|
|
Identical
to Newton words() routine except that periods ('.') become term delimiters,
or separators. |
|
Pears
Index Definition
(in dbnamedesc.ini)
|
WebZ
Index Definition
(in dbname.ini)
|
Notes
|
routine=indxpkg.Words
OccurrenceRoutine= \
ORG.oclc.pears.Bartlett.wordfield
extraDelimiters=.
removeDelimiters=\u0020
|
filter=indxpkg.Words
extraDelimiters=.
removeDelimiters=\u0020 |
|
|
isbn()
|
Description:
|
|
Designed
for ISBNs. The indexed term contains only valid alphanumeric
data. Does not index data within parentheses. |
|
Pears
Index Definition
(in dbnamedesc.ini)
|
WebZ
Index Definition
(in dbname.ini)
|
Notes
|
routine=indxpkg.WordsMinusBoundPhrases
collapse=~`@#$%^*()_+={[}]
|\<,>.?-&
bounds=() |
filter=indxpkg.Words
collapse=~`@#$%^*()_+=
{[}]|\<,>.?-& |
Use this routine with caution. In WebZ, the query parser treats parentheses as query delimiters and removes them before it passes the query to the filter as separate words. As a result, the filter does not remove the parenthetical data from a query entered with parentheses, and the query fails to match on that data because it was removed during indexing.
|
|
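The mismatch described in the note can be illustrated with a short Python sketch. Everything here is an assumption made for illustration: the function name, the regular expression used to drop parenthetical text, and the simplified model of the query parser as a whitespace split performed after the parentheses have been removed.

```python
import re


def isbn_term(text: str) -> str:
    """Remove (...) spans, then keep only alphanumeric characters."""
    without_parens = re.sub(r"\([^)]*\)", "", text)
    return "".join(ch for ch in without_parens if ch.isalnum())


# Indexing side: the parenthetical qualifier is dropped.
indexed = isbn_term("0-13-110362-8 (pbk.)")
print(indexed)  # 0131103628

# Query side: the parentheses are already gone when the filter runs,
# so the qualifier survives as a separate word that matches nothing.
query_words = [isbn_term(word) for word in "0-13-110362-8 pbk.".split()]
print(query_words)  # ['0131103628', 'pbk']
```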
lcclass()
|
Description:
|
|
Designed
for the LC class number. Formats the term to a standard LC class number
search format of: aaa####.###.a###, where 'a' is an alphabetic character
and '#' is a numeral. Pads alphabetic characters with underscores
(_) and digits with zeros (0). |
|
Pears
Index Definition
(in dbnamedesc.ini)
|
WebZ
Index Definition
(in dbname.ini)
|
Notes
|
routine=indxpkg.LCClass
|
filter=indxpkg.LCClass
|
|
|
medlinewords()
|
Description:
|
|
Identical
to Newton words() routine or Pears Words class except that it retains
periods ('.') as valid characters in index terms. |
|
Pears
Index Definition
(in dbnamedesc.ini)
|
WebZ
Index Definition
(in dbname.ini)
|
Notes
|
routine=indxpkg.Words
OccurrenceRoutine= \
ORG.oclc.pears.Bartlett.wordfield
removeDelimiters=. |
filter=indxpkg.Words
removeDelimiters=.
|
|
|
nohypwd()
|
Description:
|
|
Identical
to Newton words() routine and Pears Words class except that hyphens
('-') become term delimiters, or separators. |
|
Pears
Index Definition
(in dbnamedesc.ini)
|
WebZ
Index Definition
(in dbname.ini)
|
Notes
|
routine=indxpkg.Words
OccurrenceRoutine= \
ORG.oclc.pears.Bartlett.wordfield
extraDelimiters=- |
filter=indxpkg.Words
extraDelimiters=- |
|
|
nsnumbr()
|
Description:
|
|
Indexes
only the numeric portion of the terms. |
|
Pears
Index Definition
(in dbnamedesc.ini)
|
WebZ
Index Definition
(in dbname.ini)
|
Notes
|
routine=indxpkg.Numbers |
filter=indxpkg.Numbers
|
|
|
nssubst()
|
Description:
|
|
Indexes
terms up to the first occurrence of a hyphen ('-'), slash ('/'), or
blank (' '). |
|
Pears
Index Definition
(in dbnamedesc.ini)
|
WebZ
Index Definition
(in dbname.ini)
|
Notes
|
routine=indxpkg.Words
indexUpTo*=\u0020
indexUpTo*=-
indexUpTo*=/ |
filter=indxpkg.Words
indexUpTo*=\u0020
indexUpTo*=-
indexUpTo*=/ |
See
note (2) for phrase2().
|
|
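The effect of several indexUpTo* values can be sketched in Python as follows. The function name and the sample values are hypothetical, and the sketch covers only the truncation step, not any further normalization that the Words class performs.

```python
def index_up_to(text: str, stops: str) -> str:
    """Return text up to (not including) the first character found in stops."""
    for position, ch in enumerate(text):
        if ch in stops:
            return text[:position]
    # No stop character found: the whole value is kept.
    return text


print(index_up_to("AD-A123 456", " -/"))  # AD
print(index_up_to("N84/12345", " -/"))    # N84
```

The same helper with stops set to "(" or "-" would mirror the uptoparen() and phrbhyp() examples.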
numrang()
|
Description:
|
|
Generates
index terms for a range of numbers. Given a hyphenated combination
of numbers, it generates the terms for all the numbers between the
hyphens.
For example,
if a field contains the number range "1960-1963", it generates
the terms "1960", "1961", "1962",
and "1963".
|
|
Pears
Index Definition
(in dbnamedesc.ini)
|
WebZ
Index Definition
(in dbname.ini)
|
Notes
|
routine=indxpkg.YearRange
mustContain=- |
filter=indxpkg.PublicationDate |
The
query term does not need to be converted into a range, since
the YearRange index creates index terms for each number in
the range.
|
|
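The range expansion described for numrang() can be sketched in Python as follows. The function name is hypothetical, and returning an empty list for values without a hyphen is an assumption meant to mirror the mustContain=- setting.

```python
def expand_year_range(value: str) -> list:
    """Expand 'start-end' into one term per number, inclusive."""
    if "-" not in value:
        return []  # assumption: values without a hyphen produce no terms
    start_text, _, end_text = value.partition("-")
    start, end = int(start_text), int(end_text)
    return [str(number) for number in range(start, end + 1)]


print(expand_year_range("1960-1963"))  # ['1960', '1961', '1962', '1963']
```

Because every year in the range becomes its own index term, a query for a single year such as 1962 can match without any range handling on the query side.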
padzero(param)
|
Description:
|
|
Pads
a term on the left with zeros. In Newton, you specify the number
of bytes to pad as the parameter, such as padzero(3)
for 3 bytes. |
|
Pears
Index Definition
(in dbnamedesc.ini)
|
WebZ
Index Definition
(in dbname.ini)
|
Notes
|
routine=indxpkg.Numbers
zeropad=n |
filter=indxpkg.Numbers
zeropad=n |
n
is the number of zeros to pad.
|
|
phrbhyp()
|
Description:
|
|
Creates
a term from the data in a field up to the first occurrence of a
hyphen ('-').
|
|
Pears
Index Definition
(in dbnamedesc.ini)
|
WebZ
Index Definition
(in dbname.ini)
|
Notes
|
routine=indxpkg.Phrase
indexUpTo*=- |
filter=indxpkg.Phrase
indexUpTo*=-
|
|
|
pubdate(1)
|
Description:
|
|
Retrieves
the publication date from the fixed 008 field of a MARC record and
creates a keyword index term. Converts any fill characters to zero
in the term.
The parameter
(1) indicates that the date should be pulled from bytes 8-11
of the 008 field; otherwise, bytes 10-13 are used.
|
|
Pears
Index Definition
(in dbnamedesc.ini)
|
WebZ
Index Definition
(in dbname.ini)
|
Notes
|
routine=indxpkg.PublicationDate
startOffset=7
OR
startOffset=9
|
filter=indxpkg.PublicationDate
|
|
|
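The extraction and fill-character conversion described for pubdate() can be sketched in Python as follows. The function name and sample field are hypothetical, and treating the vertical bar as the fill character is an assumption based on MARC conventions; adjust fill_chars if your data uses something else.

```python
def publication_date(field_008: str, start_offset: int = 7,
                     fill_chars: str = "|") -> str:
    """Take 4 characters at start_offset and turn fill characters into '0'."""
    raw = field_008[start_offset:start_offset + 4]
    return "".join("0" if ch in fill_chars else ch for ch in raw)


# 40-character 008 field with the date in bytes 8-11 (offsets 7-10).
field = "850101s" + "19||" + " " * 29
print(publication_date(field))  # 1900
```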
repnum()
|
Description:
|
|
Deletes
and collapses all punctuation in the field to create a single index
term. |
|
Pears
Index Definition
(in dbnamedesc.ini)
|
WebZ
Index Definition
(in dbname.ini)
|
Notes
|
routine=indxpkg.Phrase
collapse=?!,$&:;'"_-<>[ ]. |
filter=indxpkg.Phrase
collapse=?!,$&:;'"_-<>[ ].
|
To
use a space as a collapse character, insert the space between
two other characters, such as the [ and ] in this example,
or use the Unicode designation (\u0020) for a space.
|
|
substr(param)
|
Description:
|
|
Creates
an index term from the first n characters in the field. In
Newton, the parameter (param) value specifies the number of
characters to include in the substring. |
|
Pears
Index Definition
(in dbnamedesc.ini)
|
WebZ
Index Definition
(in dbname.ini)
|
Notes
|
routine=indxpkg.Words
OccurrenceRoutine= \
ORG.oclc.pears.Bartlett.wordfield
maxLength=n |
filter=indxpkg.Words
maxLength=n |
n
is the number of characters to include in the index term.
|
|
substr1(param)
|
Description:
|
|
Creates
an index term by skipping the first n
characters in the field. In Newton, the parameter (param) value
specifies the number of characters to skip. |
|
Pears
Index Definition
(in dbnamedesc.ini)
|
WebZ
Index Definition
(in dbname.ini)
|
Notes
|
routine=indxpkg.Words
OccurrenceRoutine= \
ORG.oclc.pears.Bartlett.wordfield
startOffset=n
|
filter=indxpkg.Words
startOffset=n |
n
is the number of characters to skip.
|
|
udc()
|
Description:
|
|
Creates
a single term for the universal decimal number field. It retains all
alphanumeric characters, periods ('.'), and hyphens. |
|
Pears
Index Definition
(in dbnamedesc.ini)
|
WebZ
Index Definition
(in dbname.ini)
|
Notes
|
routine=indxpkg.Words
collapse=~`@#$%^&*()_=+[]\
{}|;':"< >?,/ |
filter=indxpkg.Words
collapse=~`@#$%^&*()_=+[]\
{}|;':"< >?,/
|
To
use a space as a collapse character, insert the space between
two other characters, such as the < and > in this example,
or use the Unicode designation (\u0020) for a space.
|
|
ugsbjcl()
|
Description:
|
|
Extracts
data up to the first blank (' ') or left parenthesis. Ignores the
remainder of the field.
|
|
Pears
Index Definition
(in dbnamedesc.ini)
|
WebZ
Index Definition
(in dbname.ini)
|
Notes
|
routine=indxpkg.Phrase
indexUpTo*=\u0020
indexUpTo*=( |
filter=indxpkg.Phrase
indexUpTo*=\u0020
indexUpTo*=(
|
|
|
wrddelim()
|
Description:
|
|
Identical
to Newton words() routine except that it:
- allows
you to specify a new set of delimiters used to separate words
in the index
- ignores
the standard delimiters used in the words() routine and uses the
delimiters you define, plus the blank space (' ') character
|
|
Pears
Index Definition
(in dbnamedesc.ini)
|
WebZ
Index Definition
(in dbname.ini)
|
Notes
|
routine=indxpkg.Words
OccurrenceRoutine= \
ORG.oclc.pears.Bartlett.wordfield
delimiters=\u0020
extraDelimiters=. |
filter=indxpkg.Words
delimiters=\u0020
extraDelimiters=. |
In
this example, the only delimiters defined are a space and
a period.
|
|
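Splitting on a custom delimiter set, as in the wrddelim() example with a space and a period, can be sketched in Python as follows. The function name is hypothetical, and the sketch shows only the word splitting, not any other normalization performed by the Words class.

```python
import re


def split_words(text: str, delimiters: str = " .") -> list:
    """Split text on any delimiter character and drop empty pieces."""
    pattern = "[" + re.escape(delimiters) + "]+"
    return [word for word in re.split(pattern, text) if word]


print(split_words("U.S. Geological-Survey"))  # ['U', 'S', 'Geological-Survey']
```

Because the hyphen is not in the delimiter set here, "Geological-Survey" stays together as one word; adding "-" to delimiters would split it, which is the behavior described for nohypwd().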
See Also
Pears
Database Description Configuration File
Creating a New Pears Database
Making a Pears Database Available through
WebZ
Database Configuration Files
Sections and Variables
Pears Index Routines
|