Pears
System Overview
Contents
Introduction
Document Conventions
Features
Pears in SiteSearch 4.2.0
Process Description
Pears Components
Introduction
Pears
is a new database engine shipped with SiteSearch 4.2.0. The OCLC
Office of Research developed Pears as a replacement for the Newton
database engine used to build databases for many OCLC products and shipped
with SiteSearch 4.0.x/4.1.x. (Database Builder for OCLC
SiteSearch 4.2.0 includes both Pears and Newton.) The Pears
source code is available from the OCLC Office of Research under an
Open Source license.
The SiteSearch
FTP site includes several sample Pears databases. To keep the installers
to a reasonable size, the SiteSearch 4.2.0 Database Builder installers
do not include these databases.
This
document introduces you to Pears's features, describes the relationship
of Pears with other SiteSearch components, and describes the Pears database
building process and the elements that comprise Pears.
Features
This table summarizes
the major features of Pears. A companion document, Pears-Newton
Comparison, compares and contrasts the features of Pears and Newton.
Feature
|
Description
|
Java-based
|
Pears is
written in Java, which offers advantages such as:
- Portability: building a Pears database on one machine and using it on
any machine that supports the Java standard without any porting
work.
- Customization: modifying the Pears software by extending the base classes.
- Error handling: errors generate exceptions, which propagate up
the call stack until a class can handle the exception.
Many Pears classes also include an option for continuing when
a recoverable error occurs.
|
Internationalization/
Unicode support |
Pears converts
the data in all records to Unicode characters. The Unicode
Standard provides the capacity to encode all of the characters
used for the written languages of the world. Thus, Pears databases
can contain records in non-Roman languages and/or records with mathematical
and other technical symbols.
Making databases
with Unicode characters available to patrons requires a search engine
that can process queries with Unicode characters and Web browsers
that support the display of Unicode. In SiteSearch 4.2.0, WebZ can
handle Unicode. Versions 4.x and higher of the Netscape® Navigator
and Communicator and Microsoft® Internet Explorer browsers also
support Unicode.
|
Flexible
record indexing |
Pears
supports:
- An unlimited
number of indexes per database
- Custom
index routines
- Adding
and/or removing indexes in an existing database without dumping,
reindexing, and reloading the database
- Index
terms up to 2000 characters long (while longer index terms are
possible, OCLC does not recommend them)
|
Record storage
format |
Pears stores
database records in ASN.1/BER (ISO 8824) format, which supports arbitrarily
large records with complex hierarchical structures containing both
text and binary data. |
Record export |
With the
appropriate record handler, you can export records from a Pears database.
|
Customizable
record handlers (data conversion utilities) |
Pears provides
record handlers that import and export records in a number of input
formats. In addition, you can create customized record handlers
to handle additional input formats.
|
Physical
database file |
Pears stores
a database in a single physical database (.pdb)
file. |
Partitioned
databases |
Pears supports
partitioned databases, where a single logical database is distributed
across multiple physical database files. Pears allows you to partition
a single set of input records into physical partitions of a fixed
size or divide them equally across a specified number of partitions.
You can
also extend the Pears partitioning functionality by creating a class
with your own partitioning criteria.
|
Database
size |
Pears supports
databases of up to two billion records (using a partitioned database). |
Embeddability |
It is possible
to embed Pears within another application to allow single-record updates
to Pears databases or to use the Pears search engine to query Pears
databases. In SiteSearch 4.2.0, the Record Builder application and
WebZ, respectively, demonstrate this feature. |
Single configuration
file |
A single
Pears database description configuration file
includes all information necessary to import, index, and store records
in a Pears database. |
Single database
update and creation utility |
Pears's Bartlett
utility manages the database build process. It calls other Java classes
that convert input data records to BER format, index the records,
and store the records, index terms, and postings lists in the database. |
Fail-safe
updates
|
Pears is
designed so that a database update that fails does not corrupt the database.
Pears uses a temporary journal file to store
database updates and only modifies the physical database at the
end of the build process.
|
Journaling
|
Pears
supports maintaining journal files that contain all the changes
to a database during an update (adding, modifying, or deleting records).
You can subsequently use these journal files to apply these transactions
to a database backup if necessary.
|
Relevance
ranking |
Pears databases
keep track of the frequency of occurrence of terms in a record and within
a database. This allows records to be ranked on the significance of
a patron's search term within the records retrieved in response to
a search. |
Restrictors |
Pears supports
the creation of restrictors, which can speed up some searches significantly.
Restrictors associate certain information about a record (such as
its publication date or language) with all index terms extracted from
that record. For example, patrons can use a language restrictor to
limit a search to only records of a certain language. |
Proximity
information |
For keyword
indexes, Pears can store information about a term's proximity
to another term in the same record. This allows patrons to include
WITH or NEAR proximity operators when searching a Pears database.
|
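As a rough illustration of how stored word positions make proximity searching possible, the sketch below records each term's positions in a record and tests whether two terms occur within a given distance of each other. It is not Pears code: the function names, the whitespace tokenizer, and the distance rule are assumptions made for this example, and it does not attempt to reproduce the exact semantics of the WITH and NEAR operators.

```python
from collections import defaultdict


def term_positions(text):
    """Map each lowercased word to the list of positions where it occurs."""
    positions = defaultdict(list)
    for position, word in enumerate(text.lower().split()):
        positions[word].append(position)
    return dict(positions)


def within_distance(positions, first, second, max_distance, ordered=False):
    """Return True if `first` and `second` occur within `max_distance` words.

    With ordered=True, `first` must come before `second`.
    """
    for p in positions.get(first, []):
        for q in positions.get(second, []):
            gap = q - p
            if ordered and gap <= 0:
                continue
            if 0 < abs(gap) <= max_distance:
                return True
    return False
```

Because the positions are stored when the record is indexed, a proximity test of this kind needs only the postings for the two terms and never has to reread the record text.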
Pears and SiteSearch
SSDOT
SiteSearch
includes a Pears version of the SiteSearch
Database Operations Tool (SSDOT). SSDOT provides a menu-driven interface
to the Pears Bartlett utility and automates many common database building
and database administration tasks.
WebZ
Pears databases
are local SiteSearch databases that you can make
available to patrons through the WebZ interface.
Record Builder
You can perform
single-record updates of Pears databases with Database Builder's Record
Builder application. Record Builder also supports single-record import
and export for Pears databases.
SiteSearch Newton Databases
It is possible
to convert existing local Newton databases
to Pears databases. SSDOT for Pears
handles some of the steps in the conversion process, but you must create
a Pears database description configuration
file for the database.
Process Description
This section provides
a high-level overview of the database building process. Links lead to
descriptions of various elements involved in the Pears system. (See Creating
a New Pears Database for step-by-step procedures for building a Pears
database.) A companion document, Pears Database
Build Process, presents a graphical view of this process.
The database building
process begins when you issue a command for Bartlett
to create a Pears database. You can do this from the command line or through
the Pears version of SSDOT. The inputs
that you provide to Pears are an input data file,
a database description configuration file,
and for SGML/XML and delimited text input records, a .tags
file.
Step
|
Action
|
|
Bartlett
reads the information from the [DB] section of the database
description configuration file to determine the name and location
of the database's physical database (.pdb) file.
For a new database, Bartlett creates a .pdb file. |
|
Bartlett
obtains the database's input record type (input_record_type)
from the [DB] section of the database description configuration
file and loads the appropriate record handler.
|
|
The record
handler checks the database description configuration file for its
[Handleinput_record_type] section. This section contains
the record handler's input parameters, such as collapse characters,
or for SGML/XML and delimited text records, the name and location
of the database's .tags file. It may also
refer to a record filter.
|
|
The record
handler reads the records from the input file, one at a time, and
converts them to BER records, which get passed to Bartlett. During
this process, the record filter (if used) determines whether to
exclude input records from the BER record stream.
For SGML/XML
and delimited text records, the record handler uses the Tags file
to determine the BER record structure (tag paths) for the records.
For MARC records, it uses the MARC fields and subfields in the records
to create appropriate BER tag paths.
|
|
Bartlett
stores the BER records in the database. |
|
Bartlett
evaluates each record in the input stream, one field at a time,
to determine which index routines
to call for indexing the data from each field.
Bartlett
compares a field's BER tag path to the tag paths in the [index_definition]
sections in the database description configuration file. When it
finds a match, it calls the appropriate index routine, which indexes
the field. One field may be referenced in more than one index definition.
After Bartlett
calls all the index routines associated with a field, it moves on
to the next field in the record and repeats the above steps.
Bartlett
repeats this field-by-field evaluation for every record in the input
stream and calls the appropriate index routines.
|
|
Bartlett
checks the database description configuration file for [index_definition]
sections without tagpath references, which have not yet been processed.
These sections define restrictor indexes or global stopwords. Bartlett
calls the classes specified in these sections. After each class completes
its task, Bartlett modifies the index terms and posting lists accordingly. |
|
Bartlett
adds the index terms and posting lists to the database. (For a large
database, Bartlett may add the index terms and posting lists several
times during the build process.) |
|
Bartlett
copies new records and their associated index terms and postings lists
to the end of the .pdb file. |
|
If the database
updates contain revisions to existing records, Bartlett now copies
these revisions to the .pdb file. |
|
Once Bartlett
knows that it updated the .pdb file successfully, it deletes its temporary
journal file. |
Note: |
|
The last
three steps of the database build process illustrate Pears's provisions
for preventing corruption of a database's .pdb file.
In
steps
and ,
Bartlett writes database updates to a temporary
journal file rather than the .pdb file.
If the update
in step
fails (most likely because of a shortage of disk space), Bartlett
stops the update. Any existing data in the .pdb file remains intact.
You can subsequently apply the updates in the temporary journal
file to the database or run the update again.
|
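The field-by-field indexing pass in the build process can be pictured with the following sketch. It is an illustration only, not Bartlett's implementation: records are modelled as lists of (tag path, value) pairs, the tag path strings are invented, and each index definition is a plain dictionary with hypothetical `id`, `tagpaths`, and `routine` entries.

```python
def keyword_routine(value):
    """Hypothetical index routine: one lowercased term per word."""
    return value.lower().split()


def phrase_routine(value):
    """Hypothetical index routine: the whole field as a single term."""
    return [" ".join(value.lower().split())]


def index_records(records, index_definitions):
    """Build {index_id: {term: [record numbers]}} from the records.

    Every field is compared with every index definition, so one field
    can feed more than one index.
    """
    indexes = {definition["id"]: {} for definition in index_definitions}
    for record_number, fields in enumerate(records):
        for tag_path, value in fields:
            for definition in index_definitions:
                if tag_path not in definition["tagpaths"]:
                    continue
                postings = indexes[definition["id"]]
                for term in definition["routine"](value):
                    record_list = postings.setdefault(term, [])
                    if record_number not in record_list:
                        record_list.append(record_number)
    return indexes
```

The inner loop over all index definitions mirrors the statement that one field may be referenced in more than one index definition.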
Pears Components
Bartlett
Bartlett manages
the process of creating and maintaining Pears databases. Bartlett is a
command-line utility with many options. The Pears
version of SSDOT provides a menu-driven interface for performing the
most common database administration tasks that Bartlett performs.
Bartlett uses
the information in the database description configuration file to call
the appropriate record handler and index routines required to complete
the database update. It also manages other Pears utility classes that
process database records, create index terms, and create posting lists.
Database Description Configuration File (dbname_desc.ini)
A database description
configuration file contains information that Bartlett requires to create
or update a Pears database, including:
- initialization
parameters
- the database
name
- name and
location for its physical database file
- block
size
- input record
format
- record handler
for processing the input records and its input parameters
- database indexes
and special characteristics of these indexes such as restrictors and
stopwords
For more information:
Pears Database Description Configuration
File
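To make the layout of such a file more concrete, the sketch below parses a made-up description file with Python's standard `configparser` module. Only the [DB] section name, the `input_record_type` key, and the `tagpath` key are taken from this document. Every other section name, key, and value is hypothetical, and the real file may use a different syntax, so treat this as a picture of the idea rather than a reference for the actual format.

```python
import configparser

# Hypothetical file contents. Only [DB], input_record_type, and tagpath
# come from this document; every other name and value is invented.
SAMPLE_DESC_INI = """
[DB]
name = sample
pdb_file = /data/sample.pdb
input_record_type = SGML

[HandleSGML]
tags_file = /data/sample.tags

[title_index]
index_id = 1
tagpath = 1/4
routine = KeywordRoutine
"""


def load_description(text):
    """Parse a description file and return the pieces a build would need."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    db = parser["DB"]
    record_type = db["input_record_type"]
    # The record handler section is named after the input record type.
    handler_section = "Handle" + record_type
    handler_params = {}
    if parser.has_section(handler_section):
        handler_params = dict(parser[handler_section])
    # Treat every remaining section that has a tagpath as an index definition.
    index_sections = [
        name for name in parser.sections()
        if name not in ("DB", handler_section) and "tagpath" in parser[name]
    ]
    return {
        "pdb_file": db["pdb_file"],
        "record_type": record_type,
        "handler_params": handler_params,
        "index_sections": index_sections,
    }
```

Calling `load_description(SAMPLE_DESC_INI)` returns a dictionary that names the physical database file, the input record type, the record handler parameters, and the index definition sections.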
Input Data Files
Pears accepts
input data files with records in the following formats:
- SGML/XML (requires
a Tags file)
- Delimited text
(requires a Tags file)
- USMARC
- ChinaMarc
- Unimarc
- Newton databases
- Pears databases
.tags File
A .tags file associates
the fields in SGML/XML or delimited text records to the BER tag paths
for storing the records in a Pears database. It is similar, but not identical,
to a Newton .dtd file.
Pears requires
a .tags file for importing or exporting records in SGML/XML format. For
importing delimited text records, Pears creates a .tags file for you if
it doesn't exist. This .tags file generally works well, but it doesn't
allow you to use some options available when you create your own .tags
file.
For more information:
The Pears Tags (.tags) File
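The role a .tags file plays for delimited text can be sketched as a simple lookup from field names to tag paths. The example below is an assumption-laden illustration: the dictionary `tag_map` stands in for the .tags file, the tag path strings are invented, and the function name and the `|` delimiter are hypothetical. It does not reproduce the real .tags syntax or the BER encoding step.

```python
def delimited_to_fields(line, field_names, tag_map, delimiter="|"):
    """Convert one delimited text record into (tag path, value) pairs.

    field_names gives the column order; tag_map plays the role of the
    .tags file by mapping each field name to a tag path. Empty columns
    and columns without a mapping are skipped.
    """
    fields = []
    for name, value in zip(field_names, line.split(delimiter)):
        value = value.strip()
        if value and name in tag_map:
            fields.append((tag_map[name], value))
    return fields
```

For example, with `tag_map = {"title": "1/4", "author": "1/5"}` and the columns `["title", "author", "year"]`, the unmapped `year` column is dropped and the other two columns are emitted under their tag paths.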
Pears Database File (.pdb File)
Pears uses a single
physical file to store the database records, index terms, postings lists,
and logical indirection associated with a database. There are logical
regions allocated to each of these parts of the database file.
In a partitioned
Pears database, several physical .pdb files comprise a single logical
database.
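The two built-in partitioning strategies (partitions of a fixed size, or an equal split across a given number of partitions) come down to simple arithmetic. The sketch below assumes that size is measured in records, which this document does not state, and the function names are hypothetical rather than part of Pears.

```python
def partition_fixed_size(record_count, partition_size):
    """Return records per partition when each holds at most partition_size."""
    if partition_size <= 0:
        raise ValueError("partition_size must be positive")
    full, remainder = divmod(record_count, partition_size)
    return [partition_size] * full + ([remainder] if remainder else [])


def partition_equally(record_count, partition_count):
    """Spread records across partition_count partitions as evenly as possible."""
    if partition_count <= 0:
        raise ValueError("partition_count must be positive")
    base, extra = divmod(record_count, partition_count)
    # The first `extra` partitions take one additional record.
    return [base + 1 if i < extra else base for i in range(partition_count)]
```

With ten input records, `partition_fixed_size(10, 4)` gives partitions of 4, 4, and 2 records, while `partition_equally(10, 4)` gives 3, 3, 2, and 2.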
Record Handlers
Record handlers
are classes in the ORG.oclc.RecordHandler package that perform data conversion.
All record handlers import data into a Pears database; many record handlers
also export records from a Pears database. Each record handler operates
on a specific form of input data records. A record handler's name indicates
the type of input data it converts. For example, HandleSGML operates on
SGML or XML data records.
When creating
or updating a Pears database, a record handler converts raw input
data from an external file to a stream of BER records. It also converts
characters in the raw record to Unicode in the BER record.
When exporting
records from a Pears database, a record handler converts BER records
stored in a Pears database to a specified record format and adds them
to an external file. For example, if you import records in SGML format,
you can subsequently export them in SGML format.
For more information:
Pears Record Handlers
Record Filters
Record filters
filter out (exclude) records from an input file and prevent them from
being added to a Pears database. Record filters belong to the ORG.oclc.RecordHandler
class package. Using a record filter is optional. You may specify a record
filter as one of the input parameters to a record handler in the [Handleinput_record_type]
section of the database description configuration file.
If you specify
a record filter, the record handler passes each record from the input
file to the record filter. The record filter uses a filtering criterion
to determine whether to discard the record from the input stream or to
pass the record on to Bartlett for further processing.
For example,
the FilterByTagPresence class excludes records based on the presence or
absence of a specified field. You can extend FilterByTagPresence to exclude
records based on the value of a specified field.
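The filtering idea can be sketched in a few lines. The functions below are not the FilterByTagPresence class or its API. They are hypothetical stand-ins that model a record as a list of (tag path, value) pairs, with invented tag paths, and show presence-based exclusion alongside the value-based variation mentioned above.

```python
def filter_by_tag_presence(records, tag_path, exclude_when_present=True):
    """Yield only the records that pass the filter.

    Each record is a list of (tag path, value) pairs. A record is dropped
    when the tag path is present (or, with exclude_when_present=False,
    when it is absent).
    """
    for record in records:
        present = any(path == tag_path for path, _ in record)
        if present == exclude_when_present:
            continue  # discard the record from the input stream
        yield record


def filter_by_tag_value(records, tag_path, excluded_value):
    """Variation that drops records whose field has a specific value."""
    for record in records:
        if any(path == tag_path and value == excluded_value
               for path, value in record):
            continue
        yield record
```

Both functions are generators, so records flow through one at a time, in the same way that a record handler passes each record to its filter before handing it on.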
Index Routines
Index routines
belong to the ORG.oclc.pears.IndexRoutines class package. They play two
roles: record indexing and query normalization.
When creating
a Pears database, an index routine creates index terms from specified
field(s) in a database record. The index routine obtains its input parameters
from variables specified in an index definition in the database description
configuration file. These input parameters vary, but always include a
unique index identification number, the BER tag path(s) to the field(s)
to be indexed, and the name of the index routine. Optional parameters
include characters to be collapsed (removed) when creating an index term,
delimiters that separate fields within a record, and the maximum length
of an index term.
When searching
a Pears database, an index routine may also serve as a query normalizer
in WebZ. A query normalizer allows you to manipulate patrons' search terms
to better match specific database indexes. For a given index, you generally
use the same Java class as both an index routine and a query normalizer.
For more information:
Pears Index Routines
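The following sketch shows why the same routine is useful both for indexing and for query normalization. It is a simplified stand-in, not one of the Pears index routines: the function name, parameter names, and lowercasing behavior are assumptions, and only the general ideas of collapse characters, delimiters, and a maximum term length come from the description above.

```python
def make_keyword_routine(collapse_chars="", delimiters=" ", max_term_length=2000):
    """Return a function that turns field text into a list of index terms."""
    def routine(text):
        # Remove collapse characters, then treat every delimiter as a space.
        for char in collapse_chars:
            text = text.replace(char, "")
        for char in delimiters:
            text = text.replace(char, " ")
        return [word.lower()[:max_term_length] for word in text.split()]
    return routine
```

If the routine built with `make_keyword_routine(collapse_chars="'-", delimiters=" ,")` indexes the field text `O'Neill, Co-operative`, applying the same routine to a patron's query `CO-OPERATIVE` produces a term that matches the stored index term exactly.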
Temporary Journal File
During a database
update, Bartlett adds BER records, index terms, and postings lists to
a temporary journal file that "shadows" the .pdb file. Bartlett
only accesses the .pdb file to add new records to the end of the file
and to copy updates from the journal file to the Pears database. After
the update completes successfully, Bartlett deletes the temporary journal
file.
If a database
update fails for any reason, the .pdb file remains intact and uncorrupted.
Should an update fail when adding new records to a file, you can reapply
the changes in the journal file to the Pears database. If a database update
fails during record conversion or indexing, only the temporary journal
file can become corrupted. You can correct the problem that caused the
failure, delete the temporary journal file, and run the update again.
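The general pattern of staging changes in a journal and touching the main file only at the end can be sketched as follows. This is a deliberately small illustration using plain text files and hypothetical function names. It does not reflect the .pdb or journal file formats, and unlike a real database engine it does nothing to protect against an append that is only partially written.

```python
import os


def write_journal(journal_path, new_records):
    """Stage new records in a journal file; the database file is not touched."""
    with open(journal_path, "w", encoding="utf-8") as journal:
        for record in new_records:
            journal.write(record + "\n")


def apply_journal(database_path, journal_path):
    """Append the journaled records to the database file, then delete the journal.

    If the append raises an error, the journal is left in place so the
    changes can be applied again later.
    """
    with open(journal_path, encoding="utf-8") as journal:
        staged = journal.read()
    with open(database_path, "a", encoding="utf-8") as database:
        database.write(staged)
    os.remove(journal_path)
```

A failure inside `write_journal` leaves the database file exactly as it was, which corresponds to the case in which only the temporary journal file can be damaged.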
See Also
Pears-Newton Comparison