
Pears-Newton Index Routine Comparison

Contents

Introduction
Document Conventions
Newton Index Routines Covered in This Document
Phrase Indexes
Keyword Indexes


Introduction

This document explains how to set up Pears index definitions within a database description configuration file to create indexes analogous to the most frequently used Newton index routines. It also shows how to set up index definitions in a WebZ database configuration file that use Pears index routines as query normalizers.

You can often use the Pears general-purpose index routines (ORG.oclc.pears.IndexRoutines.Phrase and ORG.oclc.pears.IndexRoutines.Words) with optional parameters to obtain results comparable to a Newton index routine. In other cases, you use a Pears index routine written for a specific purpose, such as ORG.oclc.pears.IndexRoutines.PublicationDate or ORG.oclc.pears.IndexRoutines.Numbers, to emulate a Newton index routine.

Besides demonstrating how to implement Newton index routines in Pears, this document shows the flexibility of Pears index routines. For Newton index routines not included here, the examples illustrate how to use the optional parameters of a Pears index routine to create a similar index in Pears.


Document Conventions

  • dbnamedesc.ini refers to a Pears database description configuration file. While you can use any name you wish for a database description configuration file, including "desc" in the file name is a convention that helps you differentiate this file from the database's WebZ database configuration file.
  • dbname.ini refers to a WebZ database configuration file for a Pears database.
  • The Pears index definition sections in this document include only the parameters required to replicate a specific Newton index routine. They do not contain the variables required for all index definitions:

      index=uniqueID_number
      tagpath*=BER_tag_path

    They also do not include variables that a particular index routine may use in other circumstances.

  • The WebZ index definition sections in this document include only the variables related to using a Pears index routine to perform query normalization for a specific index. They do not contain other required or optional parameters in the [index_definition] section.
  • indxpkg refers to the fully qualified class package name for Pears index routines, ORG.oclc.pears.IndexRoutines. For example, indxpkg.Phrase stands for ORG.oclc.pears.IndexRoutines.Phrase. You would use the fully qualified class package name for the value of the routine parameter in a dbnamedesc.ini file and the filter parameter in a dbname.ini file.

Return to Contents


Newton Index Routines Covered in This Document

Phrase Indexes

Keyword Indexes 

These links lead to a section for each index routine that:

  • briefly describes the index routine
  • indicates how to set up a Pears index definition in the dbnamedesc.ini configuration file
  • indicates how to set up the query normalization portion of the index's [index_definition] section in the database's dbname.ini configuration file
  • where applicable, provides notes or other information

Phrase Indexes

combad()

Description:

   Identical to Newton phrase2() routine or Pears Phrase class except that it operates only on subfields 'a' and 'd', which it combines into a single term of up to 72 characters.
 

Pears Index Definition (in dbnamedesc.ini):

   routine=indxpkg.Phrase
   subfield* = 1
   subfield* = 4
   maxlength = 72
   joinFieldsWith = \u0020

WebZ Index Definition (in dbname.ini):

   filter=indxpkg.Phrase
   maxlength=72

Notes:

   The Pears Phrase index class can create index terms longer than 72 characters. The 72-character limit used here demonstrates how to replicate the Newton limit in Pears.
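
As a rough illustration of the combad() behavior described above, the following Python sketch joins subfields 'a' and 'd' with a blank and truncates the result to 72 characters. It is not Pears code: the function name combad_term and the dictionary of subfields are hypothetical, and the sketch omits the other normalization that the Phrase class performs.

```python
def combad_term(subfields, max_length=72, joiner=" "):
    """Join subfields 'a' and 'd' into one term and cap its length."""
    # Keep only the two subfields of interest, skipping missing or empty ones.
    parts = [subfields.get(code, "").strip() for code in ("a", "d")]
    term = joiner.join(part for part in parts if part)
    # Truncate to the configured maximum (72 in the example above).
    return term[:max_length]


if __name__ == "__main__":
    print(combad_term({"a": "Twain, Mark,", "d": "1835-1910.", "c": "ignored"}))
```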

Return to Index Routine List


greekphrase()

Description:   

Same as the Pears Phrase class, except that it:

  • removes angle brackets (< and >) and the character strings they enclose
  • replaces specified Unicode Greek characters with their text equivalents
 

Pears Index Definition (in dbnamedesc.ini):

   routine=indxpkg.Phrase
   stripHTML=true
   replace=replaceGreek

   [replaceGreek]
   \u03B1 = alpha
   \u03B2 = beta
   \u03B3 = gamma
   ...

WebZ Index Definition (in dbname.ini):

   filter=indxpkg.Phrase
   stripHTML=true
   replace=replaceGreek

   [replaceGreek]
   \u03B1 = alpha
   \u03B2 = beta
   \u03B3 = gamma
   ...

Notes:

   See note (2) under phrase2().

   Since Pears indexes and stores the Unicode representation of Greek characters, it may no longer be necessary to replace Greek characters with text equivalents.
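
The two greekphrase() steps can be pictured with a short Python sketch. It is only an approximation of the behavior described above, not the Pears implementation: the function name greek_phrase is hypothetical, and the replacement table lists just the three characters shown in the [replaceGreek] example.

```python
import re

# Only the three mappings shown in the [replaceGreek] example; a real table
# would list every Greek character you want to replace.
GREEK_REPLACEMENTS = {"\u03b1": "alpha", "\u03b2": "beta", "\u03b3": "gamma"}


def greek_phrase(text):
    """Drop <...> markup, then spell out selected Greek characters."""
    # Remove each angle bracket pair together with the text it encloses.
    text = re.sub(r"<[^>]*>", "", text)
    for char, name in GREEK_REPLACEMENTS.items():
        text = text.replace(char, name)
    return text


if __name__ == "__main__":
    print(greek_phrase("<i>\u03b2</i>-carotene"))
```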


Return to Index Routine List

marcla()

Description:

  

Retrieves the language from the fixed 008 field in a MARC record and converts the 3-character code to the full name of the language. For example, if the code is 'fre', it generates the index term 'french'.

If marcla contains the parameter (1), it pulls the data from bytes 36-38 of the 008 field; otherwise, it pulls the data from bytes 38-40.

 

Pears Index Definition (in dbnamedesc.ini):

   routine=indxpkg.MarcLanguage
   tagpath=8
   startOffset=35

   OR (to pull the data from bytes 38-40 instead)

   startOffset=37

WebZ Index Definition (in dbname.ini):

   Not applicable because this is a restrictor index.

Notes:

   startOffset begins counting bytes with position 0, so startOffset=35 selects bytes 36-38 and startOffset=37 selects bytes 38-40.
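
The zero-based offset is easy to get wrong, so here is a small Python sketch of the lookup. It is an illustration only: marc_language is a hypothetical name, and the code table holds a few sample entries rather than the full list of MARC language codes.

```python
# A few sample entries; the real list of MARC language codes is much longer.
LANGUAGE_NAMES = {
    "eng": "english",
    "fre": "french",
    "ger": "german",
    "spa": "spanish",
}


def marc_language(field_008, start_offset=35):
    """Return the language name for the 3-character code at start_offset.

    start_offset counts from 0, so 35 selects bytes 36-38 of the field.
    Returns None when the code is not in the table.
    """
    code = field_008[start_offset:start_offset + 3].lower()
    return LANGUAGE_NAMES.get(code)


if __name__ == "__main__":
    sample_008 = "x" * 35 + "fre" + "  "  # placeholder data before the code
    print(marc_language(sample_008))
```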

Return to Index Routine List


phrase2()

Description:

   Creates a single phrase index term, up to 72 characters long, from the input data. By default, it removes punctuation except for embedded hyphens ('-') and ampersands ('&'), and replaces slashes ('/') and double hyphens ('- -') with a blank. It deletes and collapses all diacritics, escape sequences, and underscores ('_'), and replaces special characters (see the [replaceC] section). It does not index the leading articles 'a', 'an', and 'the'.
 

Pears Index Definition (in dbnamedesc.ini):

   routine=indxpkg.Phrase
   collapse=~`@#$%^*()_+={
   [}]|\<,>.?\u0309\u0300
   \u0301\u0302\u0303\u0304
   \u0306\u0307\u0308\u030C
   \u030A\uFE20\uFE21\u0315
   \u030B\u0310\u0327\u0328
   \u0323\u0324\u0325\u0333
   \u0332\u0326\u031C\u032E
   \u0313
   replaceChar=replaceC

   [replaceC]
   / = \u0020
   \u00C6 = ae
   \u00E6 = AE
   \u0152 = oe
   \u0153 = OE
   \u01A0 = o
   \u01A1 = o
   \u0110 = d
   \u0111 = d
   \u00D8 = o
   \u00F8 = o
   \u00DE = th
   \u00FE = th
   \u0131 = i
   \u01AF = u
   \u01B0 = u
   \u0141 = l /* el */
   \u0142 = l /* El */
   \u2113 = l /* El */

WebZ Index Definition (in dbname.ini):

   filter=indxpkg.Phrase
   collapse=~`@#$%^*()_+={
   [}]|\<,>.?\u0309\u0300
   \u0301\u0302\u0303\u0304
   \u0306\u0307\u0308\u030C
   \u030A\uFE20\uFE21\u0315
   \u030B\u0310\u0327\u0328
   \u0323\u0324\u0325\u0333
   \u0332\u0326\u031C\u032E
   \u0313
   replaceChar=replaceC

   [replaceC]
   / = \u0020
   \u00C6 = AE
   \u00E6 = AE
   \u0152 = OE
   \u0153 = OE
   \u01A0 = o
   \u01A1 = o
   \u0110 = d
   \u0111 = d
   \u00D8 = o
   \u00F8 = o
   \u00DE = th
   \u00FE = th
   \u0131 = i
   \u01AF = u
   \u01B0 = u
   \u0141 = l /* El */
   \u0142 = l /* El */
   \u2113 = l /* El */

Notes:

   (1) The Pears Phrase index routine can create index terms greater than 72 characters long. This example does not include this restriction.

   (2) These collapse and replace parameters approximate the normalization performed in the Newton phrase2() (and words()) routines.

   Many of the collapses and replaces may no longer be desirable because of the Unicode support built into Pears and WebZ in SiteSearch 4.2.0.

   Therefore, you may want to keep some diacritics and special characters in the indexes and remove them from the collapse and replace lists shown in this example.
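
To make the phrase2() normalization more concrete, the following Python sketch approximates a few of its steps: replacing slashes and some special characters, dropping diacritics and the collapse characters, removing a leading article, and applying the 72-character limit. It is a simplified illustration under several assumptions, not the Newton or Pears implementation: the name phrase2_like is hypothetical, diacritics are removed through Unicode decomposition rather than an explicit collapse list, only a handful of [replaceC] entries are included, and double hyphens and escape sequences are not handled.

```python
import unicodedata

LEADING_ARTICLES = ("a ", "an ", "the ")
# Punctuation taken from the collapse list above; hyphens and ampersands stay.
COLLAPSE = set("~`@#$%^*()_+={[}]|\\<,>.?")
# A small subset of the [replaceC] entries, for illustration.
REPLACE = {"/": " ", "\u00d8": "o", "\u00f8": "o", "\u0141": "l", "\u0142": "l"}


def phrase2_like(text, max_length=72):
    """Approximate phrase2()-style normalization of one field value."""
    text = "".join(REPLACE.get(ch, ch) for ch in text)
    # Decompose accented letters so the combining marks can be dropped.
    decomposed = unicodedata.normalize("NFD", text)
    text = "".join(
        ch for ch in decomposed
        if not unicodedata.combining(ch) and ch not in COLLAPSE
    )
    text = " ".join(text.split())  # squeeze runs of blanks
    for article in LEADING_ARTICLES:
        if text.lower().startswith(article):
            text = text[len(article):]
            break
    return text[:max_length]


if __name__ == "__main__":
    print(phrase2_like("The Caf\u00e9/Bar (old)"))
```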


Return to Index Routine List

sgmlphrase()

Description:

  

Identical to the Newton phrase2() routine except that it also removes all text within angle brackets and the angle brackets themselves (< and >).

 

Pears Index Definition (in dbnamedesc.ini):

   routine=indxpkg.Phrase
   stripHTML=true

WebZ Index Definition (in dbname.ini):

   filter=indxpkg.Phrase
   stripHTML=true


Return to Index Routine List

uptoparen()

Description:

   Identical to Newton phrase2() routine except that it indexes the data only up to the first occurrence of a left parenthesis.
 

Pears Index Definition (in dbnamedesc.ini):

   routine=indxpkg.Phrase
   indexUpTo*=(

WebZ Index Definition (in dbname.ini):

   filter=indxpkg.Phrase
   indexUpTo*=(


Return to Index Routine List   

Return to Contents


Keyword Indexes

 

adddelim()

 

Description:

   Identical to the Newton words() routine or the Pears Words class, but adddelim() allows you to include additional delimiters to separate words in an index.  
   
 

Pears Index Definition (in dbnamedesc.ini):

   routine=indxpkg.Words
   extraDelimiters=AA

WebZ Index Definition (in dbname.ini):

   filter=indxpkg.Words
   extraDelimiters=AA

Notes:

   AA represents the additional delimiters.

Return to Index Routine List

cpunct()

Description:

   Collapses all non-alphanumeric characters, including spaces, from the data to create a single index term.
 

Pears Index Definition (in dbnamedesc.ini):

   routine=indxpkg.Words
   OccurrenceRoutine= \
   ORG.oclc.pears.Bartlett.wordfield
   collapse=~`@#$%^&*()_=+[]\
   {}|;':"< >?,/.-

WebZ Index Definition (in dbname.ini):

   filter=indxpkg.Words
   collapse=~`@#$%^&*()_=+[]\
   {}|;':"< >?,/.-

Notes:

   To use a space as a collapse character, insert the space between two other characters, such as the < and > in this example, or use the Unicode designation (\u0020) for a space.
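
The effect of collapsing every non-alphanumeric character can be shown in a few lines of Python. This is a sketch of the described result, not of how Pears applies the collapse parameter, and cpunct_term is a hypothetical name.

```python
def cpunct_term(text):
    """Keep only letters and digits, producing a single index term."""
    return "".join(ch for ch in text if ch.isalnum())


if __name__ == "__main__":
    print(cpunct_term("QA 76.9-D3 (rev.)"))
```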

Return to Index Routine List

ddc()

Description:

   Designed for Dewey Decimal classification. Deletes and collapses all punctuation in the field except periods ('.') to create a single index term.
 

Pears Index Definition (in dbnamedesc.ini):

   routine=indxpkg.Phrase
   collapse=?!,"':;&_-< >[]

WebZ Index Definition (in dbname.ini):

   filter=indxpkg.Phrase
   collapse=?!,"':;&_-< >[]

Notes:

   To use a space as a collapse character, insert the space between two other characters, such as the < and > in this example, or use the Unicode designation (\u0020) for a space.

Return to Index Routine List

esdate()

Description:

   Identical to Newton words() routine or Pears Words class except that hyphens ('-') are not valid characters and are collapsed from terms.
 

Pears Index Definition (in dbnamedesc.ini):

   routine=indxpkg.Words
   OccurrenceRoutine= \
   ORG.oclc.pears.Bartlett.wordfield
   collapse=-

WebZ Index Definition (in dbname.ini):

   filter=indxpkg.Words
   collapse=-


Return to Index Routine List

govtdoc()

Description:

   Designed for indexing government document numbers. Deletes and collapses all punctuation and excludes any text within parentheses to create a single index term.
 

Pears Index Definition (in dbnamedesc.ini):

   routine=indxpkg.PhraseMinusBoundPhrases
   bounds=()
   collapse=?!,"':;&_-.< >[]

WebZ Index Definition (in dbname.ini):

   filter=indxpkg.Phrase
   bounds=()
   collapse=?!,"':;&_-.< >[]


Return to Index Routine List


gxauthr()

Description:

   Identical to Newton words() routine except that periods ('.') become term delimiters, or separators.
 

Pears Index Definition (in dbnamedesc.ini):

   routine=indxpkg.Words
   OccurrenceRoutine= \
   ORG.oclc.pears.Bartlett.wordfield
   extraDelimiters=.
   removeDelimiters=\u0020

WebZ Index Definition (in dbname.ini):

   filter=indxpkg.Words
   extraDelimiters=.
   removeDelimiters=\u0020


Return to Index Routine List


isbn()

Description:

   Designed for the ISBN number. The indexed term contains only valid alphanumeric data. Does not index data within parentheses.
 

Pears Index Definition (in dbnamedesc.ini):

   routine=indxpkg.WordsMinusBoundPhrases
   collapse=~`@#$%^*()_+={[}]|\<,>.?-&
   bounds=()

WebZ Index Definition (in dbname.ini):

   filter=indxpkg.Words
   collapse=~`@#$%^*()_+={[}]|\<,>.?-&

Notes:

   Use this routine with caution. In WebZ, the query parser treats parentheses as query delimiters and removes them before it passes the query to the filter as separate words. As a result, a query entered with parentheses keeps its parenthetical data, and that data fails to match because it was removed during indexing.
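
The indexing side of isbn() can be sketched in Python as follows: remove any parenthetical text first, then keep only alphanumeric characters. This illustrates the described result only. The name isbn_term is hypothetical, and the sketch does not reproduce how WordsMinusBoundPhrases or the WebZ query parser work internally.

```python
import re


def isbn_term(text):
    """Drop parenthetical text, then keep only letters and digits."""
    # Remove each (...) group, including the parentheses themselves.
    without_bounds = re.sub(r"\([^)]*\)", "", text)
    return "".join(ch for ch in without_bounds if ch.isalnum())


if __name__ == "__main__":
    print(isbn_term("0-8389-0876-X (pbk.)"))
```

A query normalizer that receives "pbk" as a separate word, with its parentheses already stripped, has no way to know that the word should be discarded, which is the mismatch the caution above describes.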


Return to Index Routine List


lcclass()

Description:

   Designed for the LC class number. Formats the term to a standard LC class number search format of: aaa####.###.a###, where 'a' is an alphabetic character and '#' is a numeral. Pads alphabetic characters with underscores (_) and digits with zeros (0).
 

Pears Index Definition (in dbnamedesc.ini):

   routine=indxpkg.LCClass

WebZ Index Definition (in dbname.ini):

   filter=indxpkg.LCClass

Return to Index Routine List


medlinewords()

Description:

   Identical to Newton words() routine or Pears Words class except that it retains periods ('.') as valid characters in index terms.
 

Pears Index Definition (in dbnamedesc.ini):

   routine=indxpkg.Words
   OccurrenceRoutine= \
   ORG.oclc.pears.Bartlett.wordfield
   removeDelimiters=.

WebZ Index Definition (in dbname.ini):

   filter=indxpkg.Words
   removeDelimiters=.


Return to Index Routine List


nohypwd()

Description:

   Identical to Newton words() routine and Pears Words class except that hyphens ('-') become term delimiters, or separators.
 

Pears Index Definition (in dbnamedesc.ini):

   routine=indxpkg.Words
   OccurrenceRoutine= \
   ORG.oclc.pears.Bartlett.wordfield
   extraDelimiters=-

WebZ Index Definition (in dbname.ini):

   filter=indxpkg.Words
   extraDelimiters=-


Return to Index Routine List


nsnumbr()

Description:

   Indexes only the numeric portion of the terms.
 

Pears Index Definition (in dbnamedesc.ini):

   routine=indxpkg.Numbers

WebZ Index Definition (in dbname.ini):

   filter=indxpkg.Numbers

Return to Index Routine List


nssubst()

Description:

   Indexes terms up to the first occurrence of a hyphen ('-'), slash ('/'), or blank (' ').
 

Pears Index Definition (in dbnamedesc.ini):

   routine=indxpkg.Words
   indexUpTo*=\u0020
   indexUpTo*=-
   indexUpTo*=/

WebZ Index Definition (in dbname.ini):

   filter=indxpkg.Words
   indexUpTo*=\u0020
   indexUpTo*=-
   indexUpTo*=/

Notes:

   See note (2) for phrase2().
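
Several routines in this document (uptoparen(), nssubst(), phrbhyp(), and ugsbjcl()) rely on the same idea: keep the data up to the first occurrence of any listed character. A minimal Python sketch of that idea follows. The function name index_up_to is hypothetical and the code only mirrors the described behavior of the indexUpTo* parameter, not its implementation.

```python
def index_up_to(text, stop_chars):
    """Return the part of text that precedes the first stop character."""
    for position, ch in enumerate(text):
        if ch in stop_chars:
            return text[:position]
    # No stop character found: the whole value is kept.
    return text


if __name__ == "__main__":
    print(index_up_to("AB123-45/6 extra", " -/"))   # nssubst()-style
    print(index_up_to("Geology (General)", " ("))   # ugsbjcl()-style
```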


Return to Index Routine List


numrang()

Description:

  

Generates index terms for a range of numbers. Given a hyphenated combination of numbers, it generates the terms for all the numbers between the hyphens.

For example, if a field contains the number range "1960-1963", it generates the terms "1960", "1961", "1962", and "1963".

 

Pears Index Definition (in dbnamedesc.ini):

   routine=indxpkg.YearRange
   mustContain=-

WebZ Index Definition (in dbname.ini):

   filter=indxpkg.PublicationDate

Notes:

   The query term does not need to be converted into a range, since the YearRange index creates index terms for each number in the range.
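
The range expansion in the example above ("1960-1963") can be sketched in Python as shown below. The name expand_year_range is hypothetical. Returning no terms for input without a hyphen is an assumption about what mustContain=- implies, not documented behavior.

```python
def expand_year_range(value):
    """Expand 'first-last' into one term per number, inclusive."""
    first, hyphen, last = value.partition("-")
    if not hyphen:
        # Assumption: data without a hyphen produces no terms.
        return []
    return [str(number) for number in range(int(first), int(last) + 1)]


if __name__ == "__main__":
    print(expand_year_range("1960-1963"))
```

Because every year in the range becomes its own index term, a query for a single year such as 1962 matches without any range handling on the query side.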


Return to Index Routine List


padzero(param)

Description:

   Pads a term on the left with zeros. In Newton, you specify the number of bytes to pad, such as padzero(3) for 3 bytes.
 

Pears Index Definition (in dbnamedesc.ini):

   routine=indxpkg.Numbers
   zeropad=n

WebZ Index Definition (in dbname.ini):

   filter=indxpkg.Numbers
   zeropad=n

Notes:

   n is the number of zeros to pad.


Return to Index Routine List


phrbhyp()

Description:

  

Creates a term from the data in a field up to the first occurrence of a hyphen ('-').

 

Pears Index Definition (in dbnamedesc.ini):

   routine=indxpkg.Phrase
   indexUpTo*=-

WebZ Index Definition (in dbname.ini):

   filter=indxpkg.Phrase
   indexUpTo*=-


Return to Index Routine List


pubdate(1)

Description:

  

Retrieves the publication date from the fixed 008 field of a MARC record and creates a keyword index term. Converts any fill characters to zero in the term.

The parameter (1) indicates that the date should be pulled from bytes 8-11 of the 008 field; otherwise, bytes 10-13 are used.

 

Pears Index Definition (in dbnamedesc.ini):

   routine=indxpkg.PublicationDate
   startOffset=7

   OR

   startOffset=9

WebZ Index Definition (in dbname.ini):

   filter=indxpkg.PublicationDate
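
The date extraction can be illustrated with a short Python sketch. It assumes a zero-based offset, as noted under marcla(), and it treats every non-digit in the four date bytes as a fill character, which is a simplification. The name publication_date is hypothetical.

```python
def publication_date(field_008, start_offset=7):
    """Return the 4-byte date at start_offset with fill characters zeroed.

    start_offset counts from 0, so 7 selects bytes 8-11 of the field.
    Assumption: any character that is not a digit is a fill character.
    """
    raw = field_008[start_offset:start_offset + 4]
    return "".join(ch if ch.isdigit() else "0" for ch in raw)


if __name__ == "__main__":
    print(publication_date("850101s19uu    xx"))
```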


Return to Index Routine List


repnum()

Description:

   Deletes and collapses all punctuation in the field to create a single index term.
 

Pears Index Definition (in dbnamedesc.ini):

   routine=indxpkg.Phrase
   collapse=?!,$&:;'"_-<>[ ].

WebZ Index Definition (in dbname.ini):

   filter=indxpkg.Phrase
   collapse=?!,$&:;'"_-<>[ ].

Notes:

   To use a space as a collapse character, insert the space between two other characters, such as the [ and ] in this example, or use the Unicode designation (\u0020) for a space.


Return to Index Routine List


substr(param)

Description:

   Creates an index term from the first n characters in the field. In Newton, the parameter (param) value specifies the number of characters to include in the substring.
 

Pears Index Definition (in dbnamedesc.ini):

   routine=indxpkg.Words
   OccurrenceRoutine= \
   ORG.oclc.pears.Bartlett.wordfield
   maxLength=n

WebZ Index Definition (in dbname.ini):

   filter=indxpkg.Words
   maxLength=n

Notes:

   n is the number of characters to include in the index term.


Return to Index Routine List


substr1(param)

Description:

   Creates an index term by skipping the first n characters in the field. In Newton, the parameter (param) value specifies the number of characters to skip.
 

Pears Index Definition (in dbnamedesc.ini):

   routine=indxpkg.Words
   OccurrenceRoutine= \
   ORG.oclc.pears.Bartlett.wordfield
   startOffset=n

WebZ Index Definition (in dbname.ini):

   filter=indxpkg.Words
   startOffset=n

Notes:

   n is the number of characters to skip.


Return to Index Routine List


udc()

Description:

   Creates a single term for the universal decimal number field. It retains all alphanumeric characters, periods ('.'), and hyphens.
 

Pears Index Definition (in dbnamedesc.ini):

   routine=indxpkg.Words
   collapse=~`@#$%^&*()_=+[]\
   {}|;':"< >?,/

WebZ Index Definition (in dbname.ini):

   filter=indxpkg.Words
   collapse=~`@#$%^&*()_=+[]\
   {}|;':"< >?,/

Notes:

   To use a space as a collapse character, insert the space between two other characters, such as the < and > in this example, or use the Unicode designation (\u0020) for a space.


Return to Index Routine List


ugsbjcl()

Description:

  

Extracts data up to the first blank (' ') or left parenthesis. Ignores the remainder of the field.

 

Pears Index Definition (in dbnamedesc.ini):

   routine=indxpkg.Phrase
   indexUpTo*=\u0020
   indexUpTo*=(

WebZ Index Definition (in dbname.ini):

   filter=indxpkg.Phrase
   indexUpTo*=\u0020
   indexUpTo*=(


Return to Index Routine List


wrddelim()

Description:

  

Identical to Newton words() routine except that it:

  • allows you to specify a new set of delimiters used to separate words in the index
  • ignores the standard delimiters used in the words() routine and uses the delimiters you define, plus the blank space (' ') character
 

Pears Index Definition (in dbnamedesc.ini):

   routine=indxpkg.Words
   OccurrenceRoutine= \
   ORG.oclc.pears.Bartlett.wordfield
   delimiters=\u0020
   extraDelimiters=.

WebZ Index Definition (in dbname.ini):

   filter=indxpkg.Words
   delimiters=\u0020
   extraDelimiters=.

Notes:

   In this example, the only delimiters defined are a space and a period.
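
The effect of replacing the standard delimiter set can be seen in a small Python sketch that splits on a space and a period only, as in this example. It illustrates the described behavior under the assumption that empty pieces are discarded. The name split_words is hypothetical and the code is not the Pears Words implementation.

```python
def split_words(text, delimiters=" ."):
    """Split text into words on the given delimiter characters only."""
    words, current = [], []
    for ch in text:
        if ch in delimiters:
            if current:  # skip empty pieces between adjacent delimiters
                words.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        words.append("".join(current))
    return words


if __name__ == "__main__":
    print(split_words("Smith,J.R. and co-author"))
```

Commas and hyphens stay inside the words here because they are not in the delimiter set, which is the difference from the standard words() behavior that this section describes.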


Return to Index Routine List   

Return to Contents


See Also

Pears Database Description Configuration File
Creating a New Pears Database
Making a Pears Database Available through WebZ
Database Configuration Files – Sections and Variables
Pears Index Routines

 
