Pears System Overview

 

Contents

Introduction
Document Conventions
Features
Pears in SiteSearch 4.2.0
Process Description
Pears Components


Document Conventions


Introduction

Pears is a new database engine shipped with SiteSearch 4.2.0. The OCLC Office of Research developed Pears as a replacement for the Newton database engine, which is used to build databases for many OCLC products and shipped with SiteSearch 4.0.x/4.1.x. (Database Builder for OCLC SiteSearch 4.2.0 includes both Pears and Newton.) The Pears source code is available from the OCLC Office of Research under an Open Source license.

The SiteSearch FTP site includes several sample Pears databases. To keep the installers to a reasonable size, the SiteSearch 4.2.0 Database Builder installers do not include these databases.

This document introduces Pears's features, describes how Pears relates to other SiteSearch components, and describes the Pears database building process and the elements that make up Pears.


Features

This table summarizes the major features of Pears. A companion document, Pears-Newton Comparison, compares and contrasts the features of Pears and Newton.

Feature

Description

Java-based

Pears is written in Java, which offers advantages such as:

  • Portability – Building a Pears database on one machine and using it on any machine that supports the Java standard without any porting work.
  • Customization – Modifying the Pears software by extending the base classes.
  • Error handling – Errors generate exceptions, which move up the Java class hierarchy until a class can handle the exception. Many Pears classes also include an option for continuing when a recoverable error occurs.
Internationalization/Unicode support

Pears converts the data in all records to Unicode characters. The Unicode Standard provides the capacity to encode all of the characters used for the written languages of the world. Thus, Pears databases can contain records in non-Roman languages and/or records with mathematical and other technical symbols.

Making databases with Unicode characters available to patrons requires a search engine that can process queries with Unicode characters and Web browsers that support the display of Unicode. In SiteSearch 4.2.0, WebZ can handle Unicode. Versions 4.x and higher of the Netscape® Navigator and Communicator and Microsoft® Internet Explorer browsers also support Unicode.

Flexible record indexing

 Pears supports:

  • An unlimited number of indexes per database
  • Custom index routines
  • Adding and/or removing indexes in an existing database without dumping, reindexing, and reloading the database
  • Index terms up to 2000 characters long (while longer index terms are possible, OCLC does not recommend them)
Record storage format

Pears stores database records in ASN.1/BER (ISO 8824) format, which supports arbitrarily large records with complex hierarchical structures containing both text and binary data.

Record export

With the appropriate record handler, you can export records from a Pears database.

Customizable record handlers (data conversion utilities)

Pears provides record handlers that import and export records in a number of input formats. In addition, you can create customized record handlers to handle additional input formats.

Physical database file

Pears stores a database in a single physical database (.pdb) file.

Partitioned databases

Pears supports partitioned databases, where a single logical database is distributed across multiple physical database files. Pears allows you to partition a single set of input records into physical partitions of a fixed size or divide them equally across a specified number of partitions.

You can also extend the Pears partitioning functionality by creating a class with your own partitioning criteria.

Database size

Pears supports databases of up to two billion records (using a partitioned database).

Embeddability

It is possible to embed Pears within another application to allow single-record updates to Pears databases or to use the Pears search engine to query Pears databases. In SiteSearch 4.2.0, the Record Builder application and WebZ, respectively, demonstrate this feature.

Single configuration file

A single Pears database description configuration file includes all information necessary to import, index, and store records in a Pears database.

Single database update and creation utility

Pears's Bartlett utility manages the database build process. It calls other Java classes that convert input data records to BER format, index the records, and store the records, index terms, and postings lists in the database.

Fail-safe updates

Pears is designed so that database updates fail without corrupting the database. Pears uses a temporary journal file to store database updates and only modifies the physical database at the end of the build process.

Journaling

Pears supports maintaining journal files that contain all the changes to a database during an update (adding, modifying, or deleting records). You can subsequently use these journal files to apply these transactions to a database backup if necessary.

Relevance ranking

Pears databases keep track of how frequently terms occur within a record and within a database. This allows records to be ranked by the significance of a patron's search term within the records retrieved in response to a search.

Restrictors

Pears supports the creation of restrictors, which can speed up some searches significantly. Restrictors associate certain information about a record (such as its publication date or language) with all index terms extracted from that record. For example, patrons can use a language restrictor to limit a search to records in a certain language.

Proximity information

For keyword indexes, Pears can store information about a term's proximity to other terms in the same record. This allows patrons to include WITH or NEAR proximity operators when searching a Pears database.
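
As an illustration of how stored word positions make proximity operators possible, here is a minimal sketch. It is not the Pears implementation: the class name ProximitySketch, the 0-based word positions, and the unordered "within N words" test are all assumptions made for this example, and the exact semantics of WITH and NEAR in SiteSearch are not modeled.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical positional keyword index for one record (illustration only). */
final class ProximitySketch {
    // term -> 0-based word positions at which the term occurs in the record
    private final Map<String, List<Integer>> positions = new HashMap<>();

    ProximitySketch(String text) {
        int pos = 0;
        // Split on anything that is not a letter or digit, so Unicode text works too.
        for (String word : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (word.isEmpty()) {
                continue; // a leading separator produces one empty token
            }
            positions.computeIfAbsent(word, k -> new ArrayList<>()).add(pos++);
        }
    }

    /** True if a and b occur within maxDistance words of each other, in either order. */
    boolean near(String a, String b, int maxDistance) {
        List<Integer> pa = positions.getOrDefault(a.toLowerCase(Locale.ROOT), Collections.emptyList());
        List<Integer> pb = positions.getOrDefault(b.toLowerCase(Locale.ROOT), Collections.emptyList());
        for (int i : pa) {
            for (int j : pb) {
                if (i != j && Math.abs(i - j) <= maxDistance) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

Without the stored positions, an index could only report that both terms occur somewhere in the record, which is why proximity information has to be captured at build time.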

Pears and SiteSearch

SSDOT

SiteSearch includes a Pears version of the SiteSearch Database Operations Tool (SSDOT). SSDOT provides a menu-driven interface to the Pears Bartlett utility and automates many common database building and database administration tasks.

WebZ

Pears databases are local SiteSearch databases that you can make available to patrons through the WebZ interface.

Record Builder

You can perform single-record updates of Pears databases with Database Builder's Record Builder application. Record Builder also supports single-record import and export for Pears databases.

SiteSearch Newton Databases

It is possible to convert existing local Newton databases to Pears databases. SSDOT for Pears handles some of the steps in the conversion process, but you must create a Pears database description configuration file for the database.


Process Description

This section provides a high-level overview of the database building process. Links lead to descriptions of various elements involved in the Pears system. (See Creating a New Pears Database for step-by-step procedures for building a Pears database.) A companion document, Pears Database Build Process, presents a graphical view of this process.

The database building process begins when you issue a command for Bartlett to create a Pears database. You can do this from the command line or through the Pears version of SSDOT. The inputs that you provide to Pears are an input data file, a database description configuration file, and, for SGML/XML and delimited text input records, a .tags file.

Step

Action

Step 1

Bartlett reads the information from the [DB] section of the database description configuration file to determine the name and location of the database's physical database (.pdb) file. For a new database, Bartlett creates a .pdb file.

Step 2

Bartlett obtains the database's input record type (input_record_type) from the [DB] section of the database description configuration file and loads the appropriate record handler.

Step 3

The record handler checks the database description configuration file for its [Handleinput_record_type] section, where input_record_type is the input record type obtained in Step 2. This section contains the record handler's input parameters, such as collapse characters or, for SGML/XML and delimited text records, the name and location of the database's .tags file. It may also refer to a record filter.

Step 4

The record handler reads the records from the input file, one at a time, and converts them to BER records, which get passed to Bartlett. During this process, the record filter (if used) determines whether to exclude input records from the BER record stream.

For SGML/XML and delimited text records, the record handler uses the Tags file to determine the BER record structure (tag paths) for the records. For MARC records, it uses the MARC fields and subfields in the records to create appropriate BER tag paths.

Step 5

Bartlett stores the BER records in the database.

Step 6

Bartlett evaluates each record in the input stream, one field at a time, to determine which index routines to call for indexing the data from each field.

Bartlett compares a field's BER tag path to the tag paths in the [index_definition] sections in the database description configuration file. When it finds a match, it calls the appropriate index routine, which indexes the field. One field may be referenced in more than one index definition.

After Bartlett calls all the index routines associated with a field, it moves on to the next field in the record and repeats the above steps.

Bartlett repeats this field-by-field evaluation for every record in the input stream and calls the appropriate index routines.

Step 7

Bartlett checks the database description configuration file for [index_definition] sections without tagpath references, which have not yet been processed. These sections define restrictor indexes or global stopwords. Bartlett calls the classes specified in these sections. After each class completes its task, Bartlett modifies the index terms and postings lists accordingly.

Step 8

Bartlett adds the index terms and postings lists to the database. (For a large database, Bartlett may add the index terms and postings lists several times during the build process.)

Step 9

Bartlett copies new records and their associated index terms and postings lists to the end of the .pdb file.

Step 10

If the database updates contain revisions to existing records, Bartlett now copies these revisions to the .pdb file.

Step 11

Once Bartlett knows that it updated the .pdb file successfully, it deletes its temporary journal file.

Note:   

The last three steps of the database build process illustrate Pears's provisions for preventing corruption of a database's .pdb file.

In Step 5 and Step 7, Bartlett writes database updates to a temporary journal file rather than the .pdb file.

If the update in Step 9 fails (most likely because of a shortage of disk space), Bartlett stops the update. Any existing data in the .pdb file remains intact. You can subsequently apply the updates in the temporary journal file to the database or run the update again.
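
The field-by-field dispatch described in Step 6 can be sketched roughly as follows. This is an illustration under stated assumptions, not Bartlett's code: the class IndexDispatchSketch, the IndexDef type, the keyword and phrase routines, and tag paths written as strings such as "1/1" are all hypothetical, and a record is simplified to a map from tag path to field value instead of a BER structure.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch of matching field tag paths against index definitions. */
final class IndexDispatchSketch {

    /** Stand-in for one [index_definition] section: id, tag path, and routine. */
    static final class IndexDef {
        final int id;
        final String tagPath;
        final Function<String, List<String>> routine;

        IndexDef(int id, String tagPath, Function<String, List<String>> routine) {
            this.id = id;
            this.tagPath = tagPath;
            this.routine = routine;
        }
    }

    /** Example routine: one lowercase term per whitespace-separated word. */
    static List<String> keyword(String value) {
        return Arrays.asList(value.trim().toLowerCase(Locale.ROOT).split("\\s+"));
    }

    /** Example routine: the whole field as a single lowercase term. */
    static List<String> phrase(String value) {
        return Collections.singletonList(value.trim().toLowerCase(Locale.ROOT));
    }

    /** Returns index id -> terms for one record given as tag path -> field value. */
    static Map<Integer, List<String>> indexRecord(Map<String, String> record, List<IndexDef> defs) {
        Map<Integer, List<String>> terms = new LinkedHashMap<>();
        for (Map.Entry<String, String> field : record.entrySet()) { // one field at a time
            for (IndexDef def : defs) {                             // a field may match several definitions
                if (def.tagPath.equals(field.getKey())) {
                    terms.computeIfAbsent(def.id, k -> new ArrayList<>())
                         .addAll(def.routine.apply(field.getValue()));
                }
            }
        }
        return terms;
    }
}
```

In this sketch a title field referenced by both a keyword definition and a phrase definition produces terms in two indexes, which mirrors the statement that one field may be referenced in more than one index definition.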



Pears Components

Bartlett

Bartlett manages the process of creating and maintaining Pears databases. Bartlett is a command-line utility with many options. The Pears version of SSDOT provides a menu-driven interface for performing the most common Bartlett database administration tasks.

Bartlett uses the information in the database description configuration file to call the appropriate record handler and index routines required to complete the database update. It also manages other Pears utility classes that process database records, create index terms, and create posting lists.


Database Description Configuration File (dbname_desc.ini)

A database description configuration file contains information that Bartlett requires to create or update a Pears database, including:

  • initialization parameters
    • the database name
    • name and location for its physical database file
    • block size
  • input record format
  • record handler for processing the input records and its input parameters
  • database indexes and special characteristics of these indexes such as restrictors and stopwords

For more information: Pears Database Description Configuration File
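
To show how a sectioned configuration file of this kind can drive a build, here is a minimal sketch of reading sections and looking up the input record type. Only the [DB] section name and the input_record_type variable come from this document. The key = value syntax, the comment markers, the value SGML, and the tags_file key are assumptions made for illustration, so consult Pears Database Description Configuration File for the real format.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical reader for an INI-style description file (illustration only). */
final class DescIniSketch {

    /** Parses text into section name -> (key -> value), preserving order. */
    static Map<String, Map<String, String>> parse(String text) {
        Map<String, Map<String, String>> sections = new LinkedHashMap<>();
        Map<String, String> current = null;
        for (String raw : text.split("\\R")) {
            String line = raw.trim();
            // Assumption: blank lines and lines starting with # or ; are ignored.
            if (line.isEmpty() || line.startsWith("#") || line.startsWith(";")) {
                continue;
            }
            if (line.startsWith("[") && line.endsWith("]")) {
                String name = line.substring(1, line.length() - 1);
                current = sections.computeIfAbsent(name, k -> new LinkedHashMap<>());
            } else if (current != null) {
                int eq = line.indexOf('=');
                if (eq > 0) {
                    current.put(line.substring(0, eq).trim(), line.substring(eq + 1).trim());
                }
            }
        }
        return sections;
    }

    public static void main(String[] args) {
        String sample = "[DB]\n"
                + "input_record_type = SGML\n"   // value is an assumed example
                + "\n"
                + "[HandleSGML]\n"
                + "tags_file = books.tags\n";    // hypothetical key name
        Map<String, Map<String, String>> ini = parse(sample);
        String type = ini.get("DB").get("input_record_type");
        // The handler section name is "Handle" followed by the record type.
        System.out.println("Handler section: Handle" + type + " -> " + ini.get("Handle" + type));
    }
}
```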


Input Data Files

Pears accepts input data files with records in the following formats:

  • SGML/XML (requires a Tags file)
  • Delimited text (requires a Tags file)
  • USMARC
  • ChinaMarc
  • Unimarc
  • Newton databases
  • Pears databases

.tags File

A .tags file associates the fields in SGML/XML or delimited text records with the BER tag paths used to store the records in a Pears database. It is similar, but not identical, to a Newton .dtd file.

Pears requires a .tags file for importing or exporting records in SGML/XML format. For importing delimited text records, Pears creates a .tags file for you if it doesn't exist. This .tags file generally works well, but it doesn't allow you to use some options available when you create your own .tags file.

For more information: The Pears Tags (.tags) File


Pears Database File (.pdb File)

Pears uses a single physical file to store the database records, index terms, postings lists, and logical indirection associated with a database. There are logical regions allocated to each of these parts of the database file.

In a partitioned Pears database, several physical .pdb files comprise a single logical database.
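
The two built-in partitioning strategies mentioned under Features (partitions of a fixed size, or an equal split across a given number of partitions) come down to simple arithmetic. The sketch below assumes that "size" is measured in records and that records are numbered from 0. Both are assumptions, and PartitionSketch is a hypothetical class, not part of Pears.

```java
/** Hypothetical arithmetic for assigning records to partitions (illustration only). */
final class PartitionSketch {

    /** Fixed-size strategy: which partition holds the record with this 0-based number. */
    static int byFixedSize(long recordNumber, long recordsPerPartition) {
        return (int) (recordNumber / recordsPerPartition);
    }

    /** Equal-split strategy: how many records each of partitionCount partitions receives. */
    static long[] equalSizes(long totalRecords, int partitionCount) {
        long[] sizes = new long[partitionCount];
        long base = totalRecords / partitionCount;
        long extra = totalRecords % partitionCount; // the first 'extra' partitions get one more
        for (int i = 0; i < partitionCount; i++) {
            sizes[i] = base + (i < extra ? 1 : 0);
        }
        return sizes;
    }
}
```

A custom partitioning class, as the Features section allows, would replace this arithmetic with its own criterion, for example routing records by a field value.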


Record Handlers

Record handlers are classes in the ORG.oclc.RecordHandler package that perform data conversion. All record handlers import data into a Pears database; many record handlers also export records from a Pears database. Each record handler operates on a specific form of input data records. A record handler's name indicates the type of input data it converts. For example, HandleSGML operates on SGML or XML data records.

When creating or updating a Pears database, a record handler converts raw input data from an external file to a stream of BER records. It also converts characters in the raw record to Unicode in the BER record.

When exporting records from a Pears database, a record handler converts BER records stored in a Pears database to a specified record format and adds them to an external file. For example, if you import records in SGML format, you can subsequently export them in SGML format.

For more information: Pears Record Handlers
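
As a rough illustration of the import and export roles described above, the following sketch converts one delimited text record into named fields and back. It is not a class from the ORG.oclc.RecordHandler package: DelimitedHandlerSketch, its method names, and the use of a plain map in place of a BER record are assumptions for this example. The only real API used is the standard Java charset support, which performs the conversion to Unicode.

```java
import java.nio.charset.Charset;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical delimited-text handler; a Map stands in for a BER record. */
final class DelimitedHandlerSketch {
    private final String[] fieldNames;
    private final String delimiter;
    private final Charset charset;

    DelimitedHandlerSketch(String[] fieldNames, String delimiter, Charset charset) {
        this.fieldNames = fieldNames.clone();
        this.delimiter = delimiter;
        this.charset = charset;
    }

    /** Import: decode raw bytes to Unicode text, then split into named fields. */
    Map<String, String> importRecord(byte[] raw) {
        String[] values = new String(raw, charset).split(Pattern.quote(delimiter), -1);
        Map<String, String> record = new LinkedHashMap<>();
        for (int i = 0; i < fieldNames.length && i < values.length; i++) {
            record.put(fieldNames[i], values[i]);
        }
        return record;
    }

    /** Export: join the fields in their original order and encode to bytes. */
    byte[] exportRecord(Map<String, String> record) {
        StringBuilder line = new StringBuilder();
        for (int i = 0; i < fieldNames.length; i++) {
            if (i > 0) {
                line.append(delimiter);
            }
            line.append(record.getOrDefault(fieldNames[i], ""));
        }
        return line.toString().getBytes(charset);
    }
}
```

Because the same class handles both directions, a record imported in one format can be written back out in that format, which is the round trip described in the export paragraph.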


Record Filters

Record filters filter out (exclude) records from an input file and prevent them from being added to a Pears database. Record filters belong to the ORG.oclc.RecordHandler class package. Using a record filter is optional. You may specify a record filter as one of the input parameters to a record handler in the [Handleinput_file_type] section of the database description configuration file.

If you specify a record filter, the record handler passes each record from the input file to the record filter. The record filter uses a filtering criterion to determine whether to discard the record from the input stream or to pass the record on to Bartlett for further processing.

For example, the FilterByTagPresence class excludes records based on the presence or absence of a specified field. You can extend FilterByTagPresence to exclude records based on the value of a specified field.
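
The idea of filtering on the presence or absence of a field can be sketched as below. This is not the actual FilterByTagPresence class or its API. The interface RecordFilterSketch, the class TagPresenceFilterSketch, the keep method, and the representation of a record as a map of field names to values are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical filter contract: return true to pass the record on for processing. */
interface RecordFilterSketch {
    boolean keep(Map<String, String> record);
}

/** Hypothetical presence/absence filter (illustration only). */
final class TagPresenceFilterSketch implements RecordFilterSketch {
    private final String tag;
    private final boolean excludeWhenPresent;

    /**
     * @param tag                the field to test for
     * @param excludeWhenPresent true to drop records that contain the field,
     *                           false to drop records that lack it
     */
    TagPresenceFilterSketch(String tag, boolean excludeWhenPresent) {
        this.tag = tag;
        this.excludeWhenPresent = excludeWhenPresent;
    }

    @Override
    public boolean keep(Map<String, String> record) {
        boolean present = record.containsKey(tag);
        return present != excludeWhenPresent;
    }

    /** Applies a filter to an input stream of records, as a record handler would. */
    static List<Map<String, String>> apply(RecordFilterSketch filter, List<Map<String, String>> input) {
        List<Map<String, String>> kept = new ArrayList<>();
        for (Map<String, String> record : input) {
            if (filter.keep(record)) {
                kept.add(record);
            }
        }
        return kept;
    }
}
```

A value-based filter would follow the same shape, testing the content of the field in keep instead of only its presence.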


Index Routines

Index routines belong to the ORG.oclc.pears.IndexRoutines class package. They play two roles – record indexing and query normalization.

When creating a Pears database, an index routine creates index terms from specified field(s) in a database record. The index routine obtains its input parameters from variables specified in an index definition in the database description configuration file. These input parameters vary, but always include a unique index identification number, the BER tag path(s) to the field(s) to be indexed, and the name of the index routine. Optional parameters include characters to be collapsed (removed) when creating an index term, delimiters that separate fields within a record, and the maximum length of an index term.

When searching a Pears database, an index routine may also serve as a query normalizer in WebZ. A query normalizer allows you to manipulate patrons' search terms to better match specific database indexes. For a given index, you generally use the same Java class as both an index routine and a query normalizer.

For more information: Pears Index Routines
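
The reason one class usually serves as both index routine and query normalizer is that both sides must produce identical terms. The sketch below illustrates that idea only. KeywordRoutineSketch is hypothetical, and its lowercasing step is an assumption not stated in this document. The collapse characters and the maximum term length correspond to the optional parameters described above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical keyword routine used for both indexing and query normalization. */
final class KeywordRoutineSketch {
    private final String collapseChars;
    private final int maxTermLength;

    KeywordRoutineSketch(String collapseChars, int maxTermLength) {
        this.collapseChars = collapseChars;
        this.maxTermLength = maxTermLength;
    }

    /** Shared normalization: lowercase (assumed), collapse characters, truncate. */
    String normalize(String term) {
        StringBuilder out = new StringBuilder();
        for (char c : term.toLowerCase(Locale.ROOT).toCharArray()) {
            if (collapseChars.indexOf(c) < 0) {
                out.append(c); // keep only characters that are not collapsed
            }
        }
        return out.length() > maxTermLength ? out.substring(0, maxTermLength) : out.toString();
    }

    /** Indexing role: one normalized term per whitespace-separated word. */
    List<String> indexTerms(String fieldValue) {
        List<String> terms = new ArrayList<>();
        for (String word : fieldValue.trim().split("\\s+")) {
            String term = normalize(word);
            if (!term.isEmpty()) {
                terms.add(term);
            }
        }
        return terms;
    }
}
```

With collapse characters of apostrophe, hyphen, and period, a field containing "O'Brien's" is indexed under the same term that a patron's query for "OBRIENS" normalizes to, so the two match.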


Temporary Journal File

During a database update, Bartlett adds BER records, index terms, and postings lists to a temporary journal file that "shadows" the .pdb file. Bartlett only accesses the .pdb file to add new records to the end of the file and to copy updates from the journal file to the Pears database. After the update completes successfully, Bartlett deletes the temporary journal file.

If a database update fails for any reason, the .pdb file remains intact and uncorrupted. Should an update fail when adding new records to a file, you can reapply the changes in the journal file to the Pears database. If a database update fails during record conversion or indexing, only the temporary journal file can become corrupted. You can correct the problem that caused the failure, delete the temporary journal file, and run the update again.
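
The journal-first pattern described above can be sketched with standard Java file operations. This is a simplified illustration, not Bartlett's implementation: JournalUpdateSketch and its file layout are hypothetical, updates are treated as an opaque byte array, and only the append of new data is modeled (revisions to existing records and the .pdb file's internal regions are not).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical journal-then-append update (illustration only). */
final class JournalUpdateSketch {

    static void update(Path database, Path journal, byte[] newData) throws IOException {
        // 1. Stage every change in the journal; the database file is not touched yet.
        Files.write(journal, newData,
                StandardOpenOption.CREATE,
                StandardOpenOption.TRUNCATE_EXISTING,
                StandardOpenOption.WRITE);

        // 2. Append the staged changes to the end of the database file.
        //    Existing bytes are never overwritten by an append.
        Files.write(database, Files.readAllBytes(journal),
                StandardOpenOption.CREATE,
                StandardOpenOption.WRITE,
                StandardOpenOption.APPEND);

        // 3. Delete the journal only after the append has succeeded.
        //    If step 2 throws, the journal survives and can be reapplied.
        Files.delete(journal);
    }
}
```

If the append in step 2 throws an exception, for instance because the disk is full, the journal file is still present, so the staged changes can be applied again once the problem is corrected.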



See Also

Pears-Newton Comparison

