opennlp.tools.tokenize
Class TokenizerME

java.lang.Object
  extended by opennlp.tools.tokenize.TokenizerME
All Implemented Interfaces:
Tokenizer
Direct Known Subclasses:
Tokenizer, Tokenizer, Tokenizer, Tokenizer

public class TokenizerME
extends java.lang.Object

A Tokenizer for converting raw text into separated tokens. It uses Maximum Entropy to make its decisions. The features are loosely based on Jeff Reynar's UPenn thesis "Topic Segmentation: Algorithms and Applications", which is available from his homepage.

Version:
$Revision: 1.21 $, $Date: 2008/03/05 16:45:13 $
Author:
Tom Morton

Field Summary
static java.util.regex.Pattern alphaNumeric
          Alpha-Numeric Pattern
 
Constructor Summary
TokenizerME(opennlp.maxent.MaxentModel mod)
          Constructs a tokenizer that uses the given maxent model to make its decisions.
 
Method Summary
 double[] getTokenProbabilities()
          Returns the probabilities associated with the most recent call to tokenize() or tokenizePos().
 void setAlphaNumericOptimization(boolean opt)
          Sets whether the tokenizer ignores tokens that contain only alpha-numeric characters.
 java.lang.String[] tokenize(java.lang.String s)
          Tokenize a string.
 Span[] tokenizePos(java.lang.String d)
          Tokenizes the string.
static opennlp.maxent.GISModel train(opennlp.maxent.EventStream evc)
          Trains the TokenizerME; use this to create a new model.
static void train(opennlp.maxent.EventStream evc, java.io.File output, java.lang.String encoding)
          Trains the TokenizerME; use this to create a new model.
 boolean useAlphaNumericOptimization()
          Returns the value of the alpha-numeric optimization flag.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

alphaNumeric

public static final java.util.regex.Pattern alphaNumeric
Alpha-Numeric Pattern

Constructor Detail

TokenizerME

public TokenizerME(opennlp.maxent.MaxentModel mod)
Constructs a tokenizer that uses the given maxent model to make its decisions.

Parameters:
mod - The maxent model used to make tokenization decisions.

Method Detail

getTokenProbabilities

public double[] getTokenProbabilities()
Returns the probabilities associated with the most recent call to tokenize() or tokenizePos().

Returns:
The probability of each token returned by the most recent call to tokenize() or tokenizePos(). If not applicable, an empty array is returned.
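
Because the probabilities describe the most recent call, they are presumably read immediately after tokenize() or tokenizePos() and paired with the tokens by index. The sketch below shows one possible use, flagging tokens whose probability falls under a threshold. The class name TokenConfidenceSketch, the helper lowConfidenceTokens, the threshold, and the sample values are all hypothetical; none of them is part of OpenNLP.

```java
import java.util.ArrayList;
import java.util.List;

public class TokenConfidenceSketch {
    // Hypothetical helper: returns the tokens whose probability is below the threshold.
    // tokens stands in for the result of tokenize(), probs for getTokenProbabilities().
    static List<String> lowConfidenceTokens(String[] tokens, double[] probs, double threshold) {
        List<String> flagged = new ArrayList<String>();
        // An empty array means no probabilities are available, so nothing is flagged.
        if (probs.length == 0) {
            return flagged;
        }
        if (tokens.length != probs.length) {
            throw new IllegalArgumentException("expected one probability per token");
        }
        for (int i = 0; i < tokens.length; i++) {
            if (probs[i] < threshold) {
                flagged.add(tokens[i]);
            }
        }
        return flagged;
    }

    public static void main(String[] args) {
        // Illustrative values only; real values would come from a TokenizerME instance.
        String[] tokens = {"Mr.", "Smith", "arrived", "."};
        double[] probs = {0.62, 1.0, 1.0, 0.97};
        System.out.println(lowConfidenceTokens(tokens, probs, 0.9));
    }
}
```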

tokenizePos

public Span[] tokenizePos(java.lang.String d)
Tokenizes the string.

Parameters:
d - The string to be tokenized.
Returns:
A span array containing individual tokens as elements.
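
A span identifies a token by its character offsets in the input rather than by its text, so the token strings can be recovered by slicing the original string. The sketch below illustrates that relationship with a hypothetical stand-in class, SimpleSpan, instead of the library's Span type. It assumes each span is a half-open range (start inclusive, end exclusive), and the offsets are written by hand for illustration.

```java
import java.util.Arrays;

public class SpanSketch {
    // Hypothetical stand-in for the library's Span type.
    // Assumption: a span covers the characters from start (inclusive) to end (exclusive).
    static final class SimpleSpan {
        final int start;
        final int end;

        SimpleSpan(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    // Recovers the token strings by slicing the original text at each span.
    static String[] spansToStrings(SimpleSpan[] spans, String text) {
        String[] tokens = new String[spans.length];
        for (int i = 0; i < spans.length; i++) {
            tokens[i] = text.substring(spans[i].start, spans[i].end);
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "Hello, world";
        // Hand-written offsets; a real call to tokenizePos(text) would supply these.
        SimpleSpan[] spans = {
            new SimpleSpan(0, 5), new SimpleSpan(5, 6), new SimpleSpan(7, 12)
        };
        System.out.println(Arrays.toString(spansToStrings(spans, text)));
    }
}
```

Keeping offsets instead of strings lets a caller map each token back to its position in the source text, for example to highlight it.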

train

public static opennlp.maxent.GISModel train(opennlp.maxent.EventStream evc)
                                     throws java.io.IOException
Trains the TokenizerME; use this to create a new model.

Parameters:
evc - The event stream of training events.
Returns:
the new model
Throws:
java.io.IOException

train

public static void train(opennlp.maxent.EventStream evc,
                         java.io.File output,
                         java.lang.String encoding)
                  throws java.io.IOException
Trains the TokenizerME; use this to create a new model.

Parameters:
evc - The event stream of training events.
output - The file to which the trained model is written.
Throws:
java.io.IOException

setAlphaNumericOptimization

public void setAlphaNumericOptimization(boolean opt)
Sets whether the tokenizer ignores tokens that contain only alpha-numeric characters.

Parameters:
opt - set to true to use the optimization, false otherwise.
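
One plausible reading of this optimization is that a whitespace-delimited chunk made up only of letters and digits is kept as a single token without consulting the model, while any other chunk is examined for split points. The sketch below illustrates that check only. The regular expression is an assumption about what the alphaNumeric field matches, and the class AlphaNumericSketch and method canSkipModel are hypothetical names, not part of the library.

```java
import java.util.regex.Pattern;

public class AlphaNumericSketch {
    // Assumed pattern: one or more ASCII letters or digits and nothing else.
    static final Pattern ALPHA_NUMERIC = Pattern.compile("^[A-Za-z0-9]+$");

    // Returns true when a chunk could be kept whole without asking the model.
    static boolean canSkipModel(String chunk) {
        return ALPHA_NUMERIC.matcher(chunk).matches();
    }

    public static void main(String[] args) {
        // Split on whitespace first; only chunks with other characters need a decision.
        for (String chunk : "Mr. Smith paid $40 in 2008".split("\\s+")) {
            String verdict = canSkipModel(chunk) ? "kept whole" : "needs the model";
            System.out.println(chunk + " -> " + verdict);
        }
    }
}
```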

useAlphaNumericOptimization

public boolean useAlphaNumericOptimization()
Returns the value of the alpha-numeric optimization flag.

Returns:
true if the tokenizer should use alpha-numeric optimization, false otherwise.

tokenize

public java.lang.String[] tokenize(java.lang.String s)
Description copied from interface: Tokenizer
Tokenize a string.

Specified by:
tokenize in interface Tokenizer
Parameters:
s - The string to be tokenized.
Returns:
The String[] with the individual tokens as the array elements.


Copyright 2008 Jason Baldridge, Gann Bierner, and Thomas Morton. All Rights Reserved.