Open SiteSearch 4.1.1 Final API Specification: Class IdentifyCopyright

ORG.oclc.resources.html
Class IdentifyCopyright

java.lang.Object
  |
  +--ORG.oclc.resources.html.IdentifyCopyright

public class IdentifyCopyright
extends Object

Takes an HTML page as a String and locates the most likely copyright statement. Assuming the most commonly observed pattern, Copyright DateRange CopyrightOwner, the routine splits the statement into the date (if present) and the publisher (assumed to be the copyright owner). The routine could be more sophisticated, for example by checking whether the date is reasonable or by handling non-standard characters, but it works well enough for capturing text that a human will then check. Accessor methods return the cleaned date, publisher, and copyright (date + publisher).
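To make the Copyright DateRange CopyrightOwner pattern concrete, here is a minimal sketch that splits such a statement with a regular expression. It is not the Open SiteSearch implementation: the class name `CopyrightPatternSketch`, the method `parse`, the accepted markers, and the rule that the owner ends at the first period are all assumptions chosen for illustration, and the input is assumed to be already free of HTML tags.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not the code behind IdentifyCopyright.
class CopyrightPatternSketch {
    private static final Pattern STATEMENT = Pattern.compile(
        "(?:(?:copyright|\\(c\\)|\\u00A9)\\s*)+"   // one or more markers: "Copyright", "(c)", or the copyright sign
        + "(\\d{4}(?:\\s*-\\s*\\d{4})?)?"          // optional year or year range
        + "\\s*,?\\s*"                             // optional comma between date and owner
        + "([^.\\n]*)",                            // owner: assumed to run to the first period or line end
        Pattern.CASE_INSENSITIVE);

    /**
     * Returns {date, publisher} for the first statement found,
     * or null when the text contains no copyright marker.
     * The date is an empty string when the statement has none.
     */
    static String[] parse(String text) {
        Matcher m = STATEMENT.matcher(text);
        if (!m.find()) {
            return null;
        }
        String date = m.group(1) == null ? "" : m.group(1).trim();
        String publisher = m.group(2).trim();
        return new String[] { date, publisher };
    }
}
```

A sketch like this only proposes candidates; as the description notes, the captured text is meant to be checked by a human.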

Field Summary
`String`	`copyright`
`String`	`date`
`String`	`publisher`

Constructor Summary
`IdentifyCopyright(String text)` Constructor based on text.

Method Summary
`static String`	`cleanTags(String s)` routine to remove all html tagging leaving only visible text
`String`	`getCopyright()`
`String`	`getDate()` Accessor method for date
`String`	`getPublisher()` Accessor method for publisher
`int`	`indexOfAlphanum(String text)` Looks for the first alphanumeric character (should prob. change name to indexOfLetterOrDigit)
`String`	`removeBracketed(String s)` Removes html tagging but keeps spacing
`static String`	`trimNonCharOrDigit(String s)` Trims non-alphanumeric characters from the end
`int`	`YearBreak(String text)`

Methods inherited from class java.lang.Object

clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Detail

date

public String date

publisher

public String publisher

copyright

public String copyright

Constructor Detail

IdentifyCopyright

public IdentifyCopyright(String text)

Constructor based on text. Beginning at is fine. If the document is large, a good strategy is to provide the last couple of thousand characters first and, if the publisher is not found, to provide the first couple of thousand characters.
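The tail-first strategy for large documents can be sketched as below. The class `TailFirstSketch`, the `WINDOW` size of 2000 characters, and the `extractPublisher` parameter are hypothetical names introduced for this example. The extractor stands in for constructing an `IdentifyCopyright` on the slice and calling `getPublisher()`; treating a null or empty result as "not found" is an assumption, since this page does not say what the accessor returns in that case.

```java
import java.util.function.Function;

// Hypothetical helper; not part of the Open SiteSearch API.
class TailFirstSketch {
    // "A couple of thousand characters"; the exact size is an assumption.
    static final int WINDOW = 2000;

    /**
     * Tries the end of the page first, then falls back to the beginning.
     * extractPublisher stands in for something like
     *   s -> new IdentifyCopyright(s).getPublisher()
     */
    static String findPublisher(String page, Function<String, String> extractPublisher) {
        int n = page.length();

        // Copyright notices usually sit in the footer, so look at the tail first.
        String tail = page.substring(Math.max(0, n - WINDOW));
        String publisher = extractPublisher.apply(tail);
        if (publisher != null && !publisher.isEmpty()) {
            return publisher;
        }

        // Nothing in the tail: try the head of the page instead.
        String head = page.substring(0, Math.min(n, WINDOW));
        return extractPublisher.apply(head);
    }
}
```

Passing the extractor as a function keeps the sketch independent of the real class and makes the fallback order easy to exercise on its own.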

Method Detail

getPublisher

public String getPublisher()

Accessor method for publisher

Returns: the publisher

getDate

public String getDate()

Accessor method for date

Returns: the date

getCopyright

public String getCopyright()

YearBreak

public int YearBreak(String text)

removeBracketed

public String removeBracketed(String s)

Removes html tagging but keeps spacing

cleanTags

public static String cleanTags(String s)

routine to remove all html tagging leaving only visible text
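One common way to reduce a page to its visible text is to replace every `<...>` run with a space and then collapse whitespace. The sketch below shows that approach. It is an assumption about what such a routine might do, not the actual body of `cleanTags`; the names `CleanTagsSketch` and `stripTags` are hypothetical, and the sketch does not handle comments, scripts, or character entities.

```java
// Hypothetical illustration; not the implementation of cleanTags.
class CleanTagsSketch {
    static String stripTags(String html) {
        // Replace each tag with a space so adjacent words do not run together.
        String noTags = html.replaceAll("<[^>]*>", " ");
        // Collapse the runs of whitespace left behind and trim the ends.
        return noTags.replaceAll("\\s+", " ").trim();
    }
}
```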

trimNonCharOrDigit

public static String trimNonCharOrDigit(String s)

Trims non-alphanumeric characters from the end
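A trailing trim of this kind can be sketched with `Character.isLetterOrDigit`. The sketch assumes that only the end of the string is trimmed, which is one reading of the description; the names `TrimSketch` and `trimTrailingNonAlnum` are hypothetical.

```java
// Hypothetical illustration; not the implementation of trimNonCharOrDigit.
class TrimSketch {
    static String trimTrailingNonAlnum(String s) {
        int end = s.length();
        // Walk backwards past anything that is neither a letter nor a digit.
        while (end > 0 && !Character.isLetterOrDigit(s.charAt(end - 1))) {
            end--;
        }
        return s.substring(0, end);
    }
}
```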

indexOfAlphanum

public int indexOfAlphanum(String text)

Looks for the first AlphaNumeric Character (should prob. change name to indexOfLetterOrDigit)
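A scan for the first letter or digit might look like the sketch below, which uses the name suggested in the note above. Returning -1 when nothing is found is an assumption that mirrors `String.indexOf`; this page does not state what `indexOfAlphanum` returns in that case. The class name `IndexSketch` is hypothetical.

```java
// Hypothetical illustration; not the implementation of indexOfAlphanum.
class IndexSketch {
    static int indexOfLetterOrDigit(String text) {
        for (int i = 0; i < text.length(); i++) {
            if (Character.isLetterOrDigit(text.charAt(i))) {
                return i;
            }
        }
        // Assumption: signal "not found" the same way String.indexOf does.
        return -1;
    }
}
```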