Open SiteSearch 4.1.1
Final

ORG.oclc.resources.html
Class IdentifyCopyright

java.lang.Object
  |
  +--ORG.oclc.resources.html.IdentifyCopyright

public class IdentifyCopyright
extends Object

Takes an HTML page as a String and locates the most likely copyright statement. Assuming the most commonly observed pattern, Copyright DateRange CopyrightOwner, the routine splits the statement and assigns the date (if present) and the publisher (assumed to be the copyright owner). The routine could be more sophisticated, for example by checking whether the date is reasonable or by handling non-standard characters, but it works well enough for capturing text that a human will then check. Accessor methods return the cleaned date, publisher, and copyright (date + publisher).
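The Copyright DateRange CopyrightOwner pattern described above can be sketched with a regular expression. This is an illustration only, not the IdentifyCopyright source: the class name CopyrightPatternSketch, the method split, and the regular expression itself are assumptions, and the sketch takes the first match rather than ranking candidates for the "most likely" statement.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; names and regex are assumptions, not the library's code.
class CopyrightPatternSketch {
    private static final Pattern STATEMENT = Pattern.compile(
        "copyright\\s*(?:\\(c\\)|\u00a9)?\\s*"                   // keyword and optional (c) or copyright sign
        + "((?:19|20)\\d{2}(?:\\s*[-,]\\s*(?:19|20)\\d{2})*)?"   // optional year or year range
        + "[\\s,]*([^.\\n]*)",                                   // owner, up to the first period or newline
        Pattern.CASE_INSENSITIVE);

    /**
     * Returns {date, publisher} for the first statement found in tag-free text,
     * using an empty string for a missing part, or null if nothing matches.
     */
    static String[] split(String visibleText) {
        Matcher m = STATEMENT.matcher(visibleText);
        if (!m.find()) {
            return null;
        }
        String date = m.group(1) == null ? "" : m.group(1).trim();
        String publisher = m.group(2).trim();
        return new String[] { date, publisher };
    }
}
```

Because the owner is cut at the first period, a name such as "Example, Inc." loses its final period; that is the kind of rough edge the description above leaves to human review.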


Field Summary
 String copyright
           
 String date
           
 String publisher
           
 
Constructor Summary
IdentifyCopyright(String text)
          Constructor based on text.
 
Method Summary
static String cleanTags(String s)
          routine to remove all html tagging leaving only visible text
 String getCopyright()
           
 String getDate()
          Accessor method for date
 String getPublisher()
          Accessor method for publisher
 int indexOfAlphanum(String text)
          Looks for the first alphanumeric character.
 String removeBracketed(String s)
          Removes html tagging but keeps spacing
static String trimNonCharOrDigit(String s)
          trim end of non alphanumeric characters
 int YearBreak(String text)
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

date

public String date

publisher

public String publisher

copyright

public String copyright
Constructor Detail

IdentifyCopyright

public IdentifyCopyright(String text)
Constructor based on text. Beginning at is fine. If the document is large, a good strategy is to provide the last couple of thousand characters first and, if the publisher is not found, then provide the first couple of thousand characters.
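The tail-first strategy for large documents might look like the following sketch. The class TailFirstSketch, the WINDOW value of 2000, and the extractPublisher function are assumptions made for illustration; the function stands in for constructing an IdentifyCopyright on a slice and calling getPublisher(). The sketch also assumes that a missing publisher comes back as null or an empty string, which the documentation does not state.

```java
import java.util.function.Function;

// Hypothetical sketch of the "last couple thousand characters first" strategy.
class TailFirstSketch {
    // "A couple of thousand characters"; the exact value is an assumption.
    static final int WINDOW = 2000;

    static String findPublisher(String page, Function<String, String> extractPublisher) {
        // Try the end of the page first, where copyright footers usually sit.
        String tail = page.substring(Math.max(0, page.length() - WINDOW));
        String publisher = extractPublisher.apply(tail);
        if (publisher != null && !publisher.isEmpty()) {
            return publisher;
        }
        // A short page has already been scanned in full.
        if (page.length() <= WINDOW) {
            return publisher;
        }
        // Fall back to the start of the page.
        return extractPublisher.apply(page.substring(0, WINDOW));
    }
}
```

With the class documented here, the function argument would presumably be something like `text -> new IdentifyCopyright(text).getPublisher()`.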
Method Detail

getPublisher

public String getPublisher()
Accessor method for publisher
Returns:
the publisher

getDate

public String getDate()
Accessor method for date
Returns:
the date

getCopyright

public String getCopyright()

YearBreak

public int YearBreak(String text)

removeBracketed

public String removeBracketed(String s)
Removes html tagging but keeps spacing

cleanTags

public static String cleanTags(String s)
routine to remove all html tagging leaving only visible text
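A minimal sketch of tag removal, assuming a simple regular-expression approach; the actual implementation is not shown in this documentation. The class TagStripSketch and both method names are hypothetical. The second method illustrates one reading of "keeps spacing" (as in removeBracketed): each tag is replaced by a space so that words on either side do not run together.

```java
// Hypothetical sketch; not the library's cleanTags or removeBracketed.
class TagStripSketch {
    /** Drops every <...> run, keeping only the text between tags. */
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", "");
    }

    /** Replaces every <...> run with one space so adjacent words stay separated. */
    static String tagsToSpaces(String html) {
        return html.replaceAll("<[^>]*>", " ");
    }
}
```

This sketch does not handle comments, script bodies, or character entities, and an unclosed "<" is left in place.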

trimNonCharOrDigit

public static String trimNonCharOrDigit(String s)
trim end of non alphanumeric characters
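One way to read this description is that trailing characters that are neither letters nor digits are removed. The sketch below follows that reading; the class TrimSketch and the method name are hypothetical, and the real method may also trim the start of the string.

```java
// Hypothetical sketch of trimming trailing non-alphanumeric characters.
class TrimSketch {
    static String trimTrailingNonAlphanumeric(String s) {
        int end = s.length();
        // Walk back from the end until a letter or digit is found.
        while (end > 0 && !Character.isLetterOrDigit(s.charAt(end - 1))) {
            end--;
        }
        return s.substring(0, end);
    }
}
```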

indexOfAlphanum

public int indexOfAlphanum(String text)
Looks for the first alphanumeric character (should probably be renamed indexOfLetterOrDigit).
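A search of this kind can be sketched with Character.isLetterOrDigit, which is what the suggested name points to. The class IndexSketch is hypothetical, and returning -1 when no such character exists is an assumption borrowed from String.indexOf; the documentation does not say what the real method returns in that case.

```java
// Hypothetical sketch; the return value for "not found" is an assumption.
class IndexSketch {
    static int indexOfLetterOrDigit(String text) {
        for (int i = 0; i < text.length(); i++) {
            if (Character.isLetterOrDigit(text.charAt(i))) {
                return i;
            }
        }
        return -1; // no letter or digit in the text
    }
}
```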
