If you ever need the token n-grams of a given string, you can use the helpers provided by Lucene's analysis classes.
First, build your own Analyzer. This can be done with a ShingleMatrixFilter configured with the parameters you need, e.g.:
public class NGramAnalyzer extends Analyzer {
    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // Chain: tokenize -> build shingles of 2 to 3 tokens -> lowercase -> drop stop words
        return new StopFilter(true,
                new LowerCaseFilter(
                        new ShingleMatrixFilter(
                                new StandardTokenizer(Version.LUCENE_20, reader), 2, 3, ' ')),
                StopAnalyzer.ENGLISH_STOP_WORDS_SET);
    }
}
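The numeric parameters of the ShingleMatrixFilter (2 and 3 here) state the minimum and maximum shingle size, and the last argument (' ') is the spacer character placed between the tokens of a shingle. "Shingle" is another name for a token n-gram.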
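To make the minimum and maximum sizes concrete, here is a minimal plain-Java sketch, with no Lucene involved, that builds the same kind of word shingles from a list of tokens. The class name ShingleSketch and its shingles method are made up for this illustration; this is not how ShingleMatrixFilter is implemented, and the order in which Lucene emits its shingles may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of Lucene.
class ShingleSketch {

    // Returns every run of minSize..maxSize consecutive tokens, joined by the spacer.
    static List<String> shingles(List<String> tokens, int minSize, int maxSize, char spacer) {
        List<String> result = new ArrayList<>();
        for (int start = 0; start < tokens.size(); start++) {
            StringBuilder shingle = new StringBuilder();
            // Grow the shingle one token at a time, up to maxSize or the end of the input.
            for (int size = 1; size <= maxSize && start + size <= tokens.size(); size++) {
                if (size > 1) {
                    shingle.append(spacer);
                }
                shingle.append(tokens.get(start + size - 1));
                // Only keep shingles that have reached the minimum size.
                if (size >= minSize) {
                    result.add(shingle.toString());
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("an", "easy", "way", "to");
        for (String s : shingles(tokens, 2, 3, ' ')) {
            System.out.println(s);
        }
    }
}
```

With a minimum of 2 and a maximum of 3, the four tokens above yield "an easy", "an easy way", "easy way", "easy way to" and "way to". Single tokens are left out because they are below the minimum size.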
You can then use the Analyzer as follows:
public static void main(String[] args) {
    try {
        String str = "An easy way to write an analyzer for tokens bi-gram (or even tokens n-grams) with lucene";
        Analyzer analyzer = new NGramAnalyzer();
        TokenStream stream = analyzer.tokenStream("content", new StringReader(str));
        // Not used below; register it only if you also need start/end offsets.
        OffsetAttribute offsetAttribute = stream.addAttribute(OffsetAttribute.class);
        TermAttribute termAttribute = stream.addAttribute(TermAttribute.class);
        stream.reset();
        // Each successful incrementToken() call advances to the next shingle.
        while (stream.incrementToken()) {
            // term() returns the token text itself.
            String term = termAttribute.term();
            System.out.println(term);
        }
    } catch (IOException ie) {
        System.out.println("IO Error " + ie.getMessage());
    }
}