
Planning a SiteSearch Database

When creating a SiteSearch database, it is important to begin with an overall database design. A good design will help ensure that your database is well structured and useful to your users. Included below are some guidelines for designing a database. Use these guidelines to analyze your source data, your users, and your database indexes.

Requirements

As you begin the planning procedures described below, you should have the following elements:

  • the database source data, and
  • the Open SiteSearch Database Builder 4.0.x/4.1.x software or the entire OCLC SiteSearch 4.0.x/4.1.x suite.

Analyzing Your Source Data

Procedure

When designing a SiteSearch database, it is important to know as much as you can about your source data. This helps you develop a good indexing scheme for your database. Follow the steps listed below to examine your source data in greater detail.

1. Review data conversion options.

You must convert source records to ASN.1/BER format before adding them to a SiteSearch database. The Database Builder software includes two utilities for converting data: marcconv converts records stored in USMARC format, and sgmlconv converts records that have SGML markup. OCLC also provides conversion programs for many commercial databases for an additional fee. To find out whether a program is available for a specific database, contact SiteSearch Support at (800) 848-5878, extension 6414.

If your data is not in MARC or SGML, one option is to use Perl or another scripting language to add SGML tags, using the existing field structure of your source data as a guide. You may also write your own ASN.1/BER conversion program or contract with OCLC for this service.
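As an illustration of the scripting approach, the following is a minimal sketch in Python rather than Perl. It assumes a hypothetical pipe-delimited source file with three fields; the tag names (record, title, author, pubdate) and the delimiter are invented for the example and are not required by sgmlconv. It also assumes that the &amp;, &lt;, and &gt; entities are acceptable in your SGML; adjust the escaping to match your own DTD.

```python
from html import escape

# Hypothetical field layout: the order of fields in each pipe-delimited line.
FIELD_TAGS = ["title", "author", "pubdate"]


def record_to_sgml(line, field_tags=FIELD_TAGS, delimiter="|"):
    """Wrap each delimited field of one source line in an SGML element."""
    values = line.rstrip("\n").split(delimiter)
    parts = ["<record>"]
    for tag, value in zip(field_tags, values):
        value = value.strip()
        if value:  # skip empty fields rather than emit empty elements
            # Escape &, < and > so field text cannot be mistaken for markup.
            parts.append(f"<{tag}>{escape(value, quote=False)}</{tag}>")
    parts.append("</record>")
    return "\n".join(parts)


print(record_to_sgml("Moby-Dick|Melville, Herman|1851-10-18"))
```

The field list drives the tagging, so the same function can be reused for another source file by changing FIELD_TAGS and the delimiter.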

2. Make a list of all the data elements in your source data.

Review any documentation from the database producer, and examine the data itself. You may find instances where you want to break the data into more discrete elements than those provided by the producer. For instance, a date field might be broken into year, month, and day to create more useful indexes. You may also find cases where you want to expand coded values into their full equivalent expressions.
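The two transformations mentioned above (splitting a field into discrete elements and expanding codes) can be sketched as a preprocessing step applied before conversion. This Python sketch assumes dates in YYYY-MM-DD form and uses a small, hypothetical language code table; real formats and code values would come from the producer's documentation.

```python
# Hypothetical code table; real values come from the producer's documentation.
LANGUAGE_CODES = {"eng": "English", "fre": "French", "ger": "German"}


def split_date(value):
    """Split a YYYY-MM-DD date into separate year, month, and day elements."""
    year, month, day = value.split("-")
    return {"year": year, "month": month, "day": day}


def expand_code(code, table=LANGUAGE_CODES):
    """Return the full expression for a code, or the code itself if unknown."""
    return table.get(code, code)


record = {"pubdate": "1851-10-18", "language": "eng"}
elements = split_date(record["pubdate"])
elements["language"] = expand_code(record["language"])
print(elements)
```

Unknown codes are passed through unchanged so that no source data is lost; you might prefer to log them for review instead.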

Note:

If you have USMARC source data and are using the marcconv utility for the data conversion, steps 2 and 3 are unnecessary because marcconv creates an ASN.1/BER structure that follows the MARC structure.

3. Create an outline, or schema, of your source data to help you visualize the structural relationship of the data fields. This outline is also a useful reference tool when defining your database indexes in the database description (.dsc) file as part of the database creation process. You can use the first column of the Source Data Planning Template to create this outline and map it to your source data. For more information about the data, review other sources, such as vendor documentation or the raw data files themselves.
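If you prefer to draft the outline in a script before filling in the template, one possible approach is to hold the field structure as a nested mapping and print it with indentation. This Python sketch is only a planning aid: the field names are hypothetical, and the output is not a .dsc file or any other SiteSearch format.

```python
# Hypothetical record structure: nested dicts are groups, None marks a leaf field.
SCHEMA = {
    "record": {
        "title": None,
        "author": {"name": None, "affiliation": None},
        "pubdate": {"year": None, "month": None, "day": None},
    }
}


def outline(schema, depth=0):
    """Return the schema as a list of indented outline lines."""
    lines = []
    for name, children in schema.items():
        lines.append("  " * depth + name)
        if children:  # recurse into groups; leaves have no children
            lines.extend(outline(children, depth + 1))
    return lines


print("\n".join(outline(SCHEMA)))
```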

4. Before deciding on how to process the data in SiteSearch, take some time to consider your end users and how they will approach the data. You may want to interview end users and/or support staff who will work with the end users. Use the following questions as guidelines:

  • Who is the primary audience?
  • How knowledgeable is the audience about the subject?
  • What kinds of searches will they create? How sophisticated could searches become?
  • What results does the audience expect? How important are precision and recall to the audience?
  • Will there be other audiences using the database? What are their needs?

If the database is available from other sources, you may also want to review those implementations.

Defining Your Indexes

Procedure

Indexes are the most important part of a good database; they allow the Newton search engine to match the user's query to the contents of the document. Indexes are lists of words or phrases that are extracted from the database records according to rules that you specify in the database description (.dsc) file. The usefulness of your database is a direct result of the indexes you design.

5. Decide what indexes you want to build. Decide which fields to include. Remember that an index can contain many fields, and a field can occur in many indexes. It's particularly important to remember that if you do not index a field, the user cannot retrieve the record using that field. Consider

  • your source data,
  • your end user, and
  • the fields in your source data that are important to your end user.

You may record your indexing decisions in the Source Data Planning Template.

6. Decide on other properties of your index, such as:

  • whether to filter out frequently occurring words such as "a", "an", and "the" from the index,
  • whether to allow search terms to be pluralized, and
  • whether the index terms are created as words or phrases.
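To see how these properties change what ends up in an index, the following Python sketch builds index terms from one field value in two ways: as individual words with frequently occurring words filtered out, and as a single phrase. It illustrates the concepts only. It does not reproduce how Newton builds indexes, and the function names and tokenizing rules are assumptions made for the example.

```python
import re

STOPWORDS = {"a", "an", "the"}  # the example words named in the list above


def word_terms(value, stopwords=STOPWORDS):
    """Index a field as individual words, dropping stopwords."""
    words = re.findall(r"[a-z0-9]+", value.lower())
    return [w for w in words if w not in stopwords]


def phrase_term(value):
    """Index a field as one phrase: lowercased, with whitespace normalized."""
    return " ".join(value.lower().split())


title = "The Call of the Wild"
print(word_terms(title))   # word index terms
print(phrase_term(title))  # phrase index term
```

A word index lets a user find this record by searching for "wild" alone, while a phrase index matches only the complete value, which suits fields such as exact titles or subject headings.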

7. Decide on restrictors. Restrictors allow users to narrow a large search result efficiently. Look for elements that would allow users to focus a large search result on the most useful records. Each database is different, but the following list includes examples of the fields most commonly used as restrictors.

  • Year of publication allows users to focus on a particular range of years.
  • Language allows users to include only those documents they can understand.
  • Format allows users to limit the search to items that suit their needs.
  • Full text allows users to search databases in which only some records include full text and to focus on just the full-text items.
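The effect of a restrictor can be pictured as a filter applied to a result set. This Python sketch narrows a list of hits by an inclusive year range and a language; the record layout and the restrict function are hypothetical and show only the idea, not how SiteSearch applies restrictors internally.

```python
def restrict(records, year_range=None, language=None):
    """Narrow a result set by optional year range (inclusive) and language."""
    results = []
    for rec in records:
        if year_range is not None:
            low, high = year_range
            if not (low <= rec["year"] <= high):
                continue  # outside the requested range of years
        if language is not None and rec["language"] != language:
            continue  # not in the requested language
        results.append(rec)
    return results


# Hypothetical search result.
hits = [
    {"title": "Report A", "year": 1995, "language": "English"},
    {"title": "Report B", "year": 1998, "language": "French"},
    {"title": "Report C", "year": 1999, "language": "English"},
]

for rec in restrict(hits, year_range=(1996, 1999), language="English"):
    print(rec["title"])
```

Fields with a small, well-defined set of values, such as year or language, tend to make the most useful restrictors because every record can be tested against them quickly.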

See Also

Creating a New SiteSearch Database
Source Data Planning Template
Creating a Database Description File

