opennlp.tools.tokenize
Class TokenizerME
java.lang.Object
opennlp.tools.tokenize.TokenizerME
- All Implemented Interfaces:
- Tokenizer
- Direct Known Subclasses:
- Tokenizer, Tokenizer, Tokenizer, Tokenizer
public class TokenizerME
- extends java.lang.Object
A Tokenizer for converting raw text into separated tokens. It uses
Maximum Entropy to make its decisions. The features are loosely
based on Jeff Reynar's UPenn thesis "Topic Segmentation:
Algorithms and Applications", which is available from his
homepage.
- Version:
- $Revision: 1.21 $, $Date: 2008/03/05 16:45:13 $
- Author:
- Tom Morton
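The class description above can be illustrated with a small, self-contained sketch. It is not OpenNLP's implementation: the class name SplitDecisionSketch and the toyDecider rule are hypothetical, and a plain predicate stands in for the maximum entropy model. The sketch assumes the general scheme of splitting on whitespace first and then asking a decision function, at each character position inside a chunk, whether a token boundary belongs there.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiPredicate;

final class SplitDecisionSketch {

    // Splits the text on whitespace, then asks the decider about every
    // interior character position of each chunk. The decider is a
    // hypothetical stand-in for the trained model's split/no-split decision.
    static List<String> tokenize(String text, BiPredicate<String, Integer> splitAt) {
        List<String> tokens = new ArrayList<>();
        for (String chunk : text.trim().split("\\s+")) {
            if (chunk.isEmpty()) {
                continue; // happens only for empty or all-whitespace input
            }
            int start = 0;
            for (int i = 1; i < chunk.length(); i++) {
                if (splitAt.test(chunk, i)) {
                    tokens.add(chunk.substring(start, i));
                    start = i;
                }
            }
            tokens.add(chunk.substring(start));
        }
        return tokens;
    }

    // Toy rule (an assumption for illustration only): split wherever a
    // letter or digit sits next to a character that is neither.
    static boolean toyDecider(String chunk, int i) {
        boolean left = Character.isLetterOrDigit(chunk.charAt(i - 1));
        boolean right = Character.isLetterOrDigit(chunk.charAt(i));
        return left != right;
    }

    public static void main(String[] args) {
        for (String token : tokenize("Hello, world!", SplitDecisionSketch::toyDecider)) {
            System.out.println(token);
        }
    }
}
```

A trained model would replace toyDecider with a decision based on features of the surrounding characters, which is where the maximum entropy model comes in.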
Field Summary
static java.util.regex.Pattern alphaNumeric
Alpha-Numeric Pattern
Constructor Summary
TokenizerME(opennlp.maxent.MaxentModel mod)
Class constructor which takes the maxent model used to make tokenization decisions.
Method Summary
double[] getTokenProbabilities()
Returns the probabilities associated with the most recent call to tokenize() or tokenizePos().
void setAlphaNumericOptimization(boolean opt)
Controls whether the tokenizer skips model evaluation for tokens which contain only alpha-numeric characters.
java.lang.String[] tokenize(java.lang.String s)
Tokenize a string.
Span[] tokenizePos(java.lang.String d)
Tokenizes the string.
static opennlp.maxent.GISModel train(opennlp.maxent.EventStream evc)
Trains a TokenizerME model; use this to create a new model.
static void train(opennlp.maxent.EventStream evc, java.io.File output, java.lang.String encoding)
Trains a TokenizerME model; use this to create a new model.
boolean useAlphaNumericOptimization()
Returns the value of the alpha-numeric optimization flag.
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
alphaNumeric
public static final java.util.regex.Pattern alphaNumeric
- Alpha-Numeric Pattern
TokenizerME
public TokenizerME(opennlp.maxent.MaxentModel mod)
- Class constructor which takes the maxent model used to make tokenization decisions.
- Parameters:
mod - The maxent model used to make tokenization decisions.
getTokenProbabilities
public double[] getTokenProbabilities()
- Returns the probabilities associated with the most recent
call to tokenize() or tokenizePos().
- Returns:
- The probability of each token returned by the most recent
call to tokenize() or tokenizePos(). If not applicable, an empty array is
returned.
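As a hedged illustration of how the returned array might be used, the sketch below finds the least confident token. It assumes the probabilities are index-aligned with the tokens of the most recent call, as the description above suggests. The class and method names are hypothetical, and the arrays in main are made-up stand-ins for tokenizer output, not real results.

```java
final class TokenProbabilitySketch {

    // Returns the index of the lowest probability, or -1 for an empty array
    // (the "not applicable" case described above).
    static int leastConfident(double[] probabilities) {
        int lowest = -1;
        for (int i = 0; i < probabilities.length; i++) {
            if (lowest < 0 || probabilities[i] < probabilities[lowest]) {
                lowest = i;
            }
        }
        return lowest;
    }

    public static void main(String[] args) {
        // Stand-in values; a real caller would take these from
        // tokenize(...) and getTokenProbabilities().
        String[] tokens = { "do", "n't", "stop" };
        double[] probabilities = { 0.99, 0.62, 0.97 };
        int i = leastConfident(probabilities);
        if (i >= 0) {
            System.out.println("Least confident token: " + tokens[i]);
        }
    }
}
```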
tokenizePos
public Span[] tokenizePos(java.lang.String d)
- Tokenizes the string.
- Parameters:
d - The string to be tokenized.
- Returns:
- A span array containing individual tokens as elements.
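The relationship between the spans returned here and the strings returned by tokenize() can be sketched as follows. The Offsets class is a hypothetical stand-in for Span, and the sketch assumes that a span stores an inclusive start offset and an exclusive end offset into the original string. The offsets in main are written by hand for illustration.

```java
final class SpanSketch {

    // Hypothetical stand-in for Span: start is inclusive, end is exclusive
    // (an assumption, not confirmed by this page).
    static final class Offsets {
        final int start;
        final int end;

        Offsets(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    // Cuts the text covered by each pair of offsets out of the original string.
    static String[] toStrings(Offsets[] spans, String text) {
        String[] tokens = new String[spans.length];
        for (int i = 0; i < spans.length; i++) {
            tokens[i] = text.substring(spans[i].start, spans[i].end);
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "Mr. Smith left.";
        Offsets[] spans = {
            new Offsets(0, 3), new Offsets(4, 9), new Offsets(10, 14), new Offsets(14, 15)
        };
        for (String token : toStrings(spans, text)) {
            System.out.println(token);
        }
    }
}
```

Working with offsets rather than strings keeps each token tied to its position in the input, which is useful when annotations have to be mapped back onto the original text.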
train
public static opennlp.maxent.GISModel train(opennlp.maxent.EventStream evc)
throws java.io.IOException
- Trains a TokenizerME model; use this to create a new model.
- Parameters:
evc - The event stream supplying the training events.
- Returns:
- the new model
- Throws:
java.io.IOException
train
public static void train(opennlp.maxent.EventStream evc,
java.io.File output,
java.lang.String encoding)
throws java.io.IOException
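- Trains a TokenizerME model; use this to create a new model.
- Parameters:
evc - The event stream supplying the training events.
output - The file to which the trained model is written.
encoding -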
- Throws:
java.io.IOException
setAlphaNumericOptimization
public void setAlphaNumericOptimization(boolean opt)
- Controls whether the tokenizer skips model evaluation for tokens which contain only alpha-numeric characters.
- Parameters:
opt - set to true to use the optimization, false otherwise.
useAlphaNumericOptimization
public boolean useAlphaNumericOptimization()
- Returns the value of the alpha-numeric optimization flag.
- Returns:
- true if the tokenizer should use alpha-numeric optimization, false otherwise.
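A minimal sketch of what the alpha-numeric optimization flag might control is shown below. Both the regular expression and the canSkipModel helper are assumptions made for illustration; the actual value of the alphaNumeric field and the tokenizer's internal logic are not given on this page. The idea sketched is that, with the flag on, a whitespace-delimited chunk made up only of letters and digits is kept as a single token without being examined further.

```java
import java.util.regex.Pattern;

final class AlphaNumericSketch {

    // Assumed pattern; the real alphaNumeric field may be defined differently.
    static final Pattern ALPHA_NUMERIC = Pattern.compile("^[A-Za-z0-9]+$");

    // Hypothetical helper: true when the chunk can be kept whole without
    // asking the model about possible split points.
    static boolean canSkipModel(String chunk, boolean optimizationEnabled) {
        return optimizationEnabled && ALPHA_NUMERIC.matcher(chunk).matches();
    }

    public static void main(String[] args) {
        System.out.println(canSkipModel("abc123", true));  // purely alpha-numeric
        System.out.println(canSkipModel("don't", true));   // contains an apostrophe
        System.out.println(canSkipModel("abc123", false)); // optimization switched off
    }
}
```

The trade-off is speed against coverage: skipping the model is cheaper, but it means a purely alpha-numeric chunk is never split, even if the model would have chosen to split it.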
tokenize
public java.lang.String[] tokenize(java.lang.String s)
- Description copied from interface: Tokenizer
- Tokenize a string.
- Specified by:
tokenize in interface Tokenizer
- Parameters:
s - The string to be tokenized.
- Returns:
- The String[] with the individual tokens as the array
elements.
Copyright 2008 Jason Baldridge, Gann Bierner, and Thomas Morton. All Rights Reserved.